How to train your speech recognition model?

Although many machine learning tools are open source and freely available, they often have a steep learning curve. This is a barrier for communities who want to use these tools to create beneficial outcomes. Democratising technology (making it accessible to more people) is not just about “pushing it to GitHub”; it must also be easy to use. This challenge is the focus of an emerging field called developer experience, which borrows heavily from user experience and design thinking. One of its key tenets is reducing “mean time to hello world”: the time a developer must invest in a piece of software or hardware to achieve a first working result.

This was exactly the challenge faced recently by PhD candidate Kathy Reid, who works part-time for tech giant Mozilla. Mozilla’s DeepSpeech toolkit uses machine learning to create speech recognition models. It is intended for use across all languages and is part of broader efforts to bring voice technology to many under-served languages and the people who speak them. However, DeepSpeech has a notoriously steep learning curve, and developers often give up on it out of frustration.

To address this issue, Kathy interviewed several developers who work with DeepSpeech to understand the key hurdles they faced when first using the software for speech recognition. She then worked with two experienced developers to structure explanatory information, tutorials and code examples into an accessible, step-by-step “PlayBook”. The DeepSpeech PlayBook was first released in early February and has already been used hundreds of times, according to analytics provided by GitHub. Developers who have used the PlayBook have since made improvements and suggested additions, harnessing the collective tacit knowledge of the DeepSpeech community.

This work complements Kathy’s PhD research, which investigates the metadata applied to the voice datasets that machine learning uses to create speech recognition models. Currently, many speech recognition models recognise men better than women, or some accents better than others. This is because the data used to build models is often skewed towards particular ages, genders or accents. By better labelling voice datasets, we may be able to influence machine learning practitioners to build more equitable speech recognition models. This is of particular importance as voice assistants, and other voice-enabled cyber-physical systems, go to scale across the world.
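To make the idea of skew concrete, here is a minimal sketch of how a practitioner might check the balance of a labelled voice dataset before training. It is not taken from the PlayBook or from DeepSpeech; the field names (`gender`, `accent`), the `label_shares` helper and the sample rows are all hypothetical, chosen only for illustration.

```python
from collections import Counter


def label_shares(rows, field):
    """Return the share of clips for each value of a metadata field.

    Clips with a missing or empty label are grouped under "unlabelled",
    since gaps in the metadata are themselves worth noticing.
    """
    counts = Counter(row.get(field) or "unlabelled" for row in rows)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: count / total for value, count in counts.items()}


# Hypothetical metadata rows; a real dataset would be read from its own files.
clips = [
    {"path": "clip_001.wav", "gender": "male", "accent": "australian"},
    {"path": "clip_002.wav", "gender": "male", "accent": "american"},
    {"path": "clip_003.wav", "gender": "male", "accent": ""},
    {"path": "clip_004.wav", "gender": "female", "accent": "australian"},
]

for field in ("gender", "accent"):
    print(field, label_shares(clips, field))
```

In this made-up sample, three of the four clips are labelled as male and one clip has no accent label at all. That is the kind of imbalance and missing metadata that better labelling would make visible.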

You can see the PlayBook here.
