Are you listening to me? The inner workings of speech recognition

These days, having a device understand your voice is no far-flung dream. Whether it crafts a text message you dictate, takes notes in a meeting, or responds to your never-ending questions about the weather, speech recognition has become increasingly embedded in interactive technology.

Speech recognition is pervasive in modern society, but how does it actually work? Here, we take a deep dive into the complex system of speech recognition technology with Kathy Reid, a PhD candidate at The Australian National University’s School of Cybernetics. To understand where we are now, we must first understand the rich history of speech recognition, complete with pitfalls and equity issues, and how we need to steer the technology to work better for more people.

With a predicted eight billion digital voice assistants in existence in 2023, the rate of growth suggests it won’t be long until they outnumber the humans on Earth. Substantial progress has been made since the first iterations of speech recognition, when the machines took up several cubic metres of space and could only recognise nine words.

However, there is still a long way to go in ensuring accurate and equitable speech recognition software.

Voice recognition is a complex system that faces a unique set of difficult and persistent systemic challenges. If you are interested in understanding the complex systems engineered in our society through a cybernetic lens, learn more about the School of Cybernetics learning experiences.

Transforming Complex Systems with Cybernetics examines these ideas and gives participants a unique opportunity to learn from industry experts, including School Director, Professor Genevieve Bell.


How did we get here?

“Put simply, speech recognition takes our spoken voice and turns it into written words through a process of transformation,” Reid explains. But this process is not as easy as it sounds. The history of speech recognition has been a slow burn to get to where we are today.

Audrey, the first machine capable of recognising human speech, was invented in 1952 by Bell Labs in the US. Taking up almost half a room, Audrey could only recognise the spoken numbers one to nine. Ten years later, in the 1960s, IBM unveiled a much smaller machine called the Shoebox. Developments in technology meant that this iteration could recognise 16 words: the digits zero to nine, and six control words including ‘plus’, ‘minus’ and ‘total’.

The Shoebox worked by splitting the spoken words into phonemes, the smallest units of sound that distinguish one word from another. Once the machine recognised the phonemes, they needed to be put back together as words, Reid explains. “That meant the Shoebox operated using two models – one acoustic model to recognise the phonemes, and another language model to convert the phonemes into words.”

Several other iterations of the technology followed, including the Harpy speech recognition system, which was the first to use a probabilistic guess at which words are most likely to follow each other, similar to what we now see in systems like ChatGPT. Harpy operated slowly, taking around 13 seconds to recognise one second of spoken audio, but it was momentous at the time. It wasn’t until much later that development progressed rapidly.
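The idea of guessing which word is likely to follow another can be sketched with a toy bigram model. This is only an illustration of the general principle, not a reconstruction of how Harpy itself worked; the tiny corpus and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each word follows another in the training sentences."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for previous, current in zip(words, words[1:]):
            follows[previous][current] += 1
    return follows


def next_word_probabilities(follows, previous):
    """Turn raw follow-counts into probabilities for the word after `previous`."""
    counts = follows.get(previous.lower(), Counter())
    total = sum(counts.values())
    if total == 0:
        return {}  # the model has never seen this word followed by anything
    return {word: count / total for word, count in counts.items()}


# Hypothetical toy corpus standing in for real training text.
corpus = [
    "what is the weather today",
    "what is the time",
    "what is the weather tomorrow",
]
model = train_bigrams(corpus)
probabilities = next_word_probabilities(model, "the")
print(probabilities)
```

In this toy corpus, "weather" follows "the" twice and "time" once, so the model rates "weather" as the likelier next word.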

Modern speech recognition

“Speech recognition changed forever in around 2016 with what is called end-to-end speech recognition,” Reid says. Unlike past systems, end-to-end speech recognition doesn’t split speech into phonemes only to reassemble them into words. It transforms spoken audio straight into words.

At the time, this was a revolutionary change for the accuracy, speed and diversity of words comprehended by speech recognition software. “End-to-end speech recognition relies on deep learning algorithms. It is trained on both spoken voice data and written transcripts, meaning the quality of the input, or training material, impacts the quality of the output,” Reid explains.

Instead of recognising phonemes, end-to-end systems recognise characters, such as the 26 letters in the English alphabet. This is connected to a language model that determines which string of characters produces a word using a probabilistic approach, usually based on how frequently the word occurs in the training data.
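As a rough sketch of how a frequency-based language model might settle on a word, the example below combines a made-up acoustic score for each candidate character string with a word probability derived from counts. The counts, the candidate scores, the add-one smoothing and all names are assumptions for illustration; real systems are considerably more sophisticated.

```python
import math

# Hypothetical word counts standing in for the training transcripts.
WORD_COUNTS = {"weather": 50, "whether": 30, "wether": 1}


def word_probability(word, counts=WORD_COUNTS):
    """Frequency-based probability, with add-one smoothing for unseen strings."""
    total = sum(counts.values())
    # The extra +1 in the denominator reserves one bucket for unknown words.
    return (counts.get(word, 0) + 1) / (total + len(counts) + 1)


def pick_word(candidates, counts=WORD_COUNTS):
    """Choose the candidate with the best combined acoustic and language score.

    `candidates` maps a character string to the (assumed) probability the
    acoustic side assigned to it. Log probabilities are added, which is
    equivalent to multiplying the two probabilities.
    """
    def score(item):
        text, acoustic_probability = item
        return math.log(acoustic_probability) + math.log(word_probability(text, counts))

    best_text, _ = max(candidates.items(), key=score)
    return best_text


# The character output slightly favours the rare spelling "wether"...
candidates = {"wether": 0.5, "weather": 0.3, "whether": 0.2}
# ...but the word frequencies pull the final choice towards a common word.
best_word = pick_word(candidates)
print(best_word)
```

The same mechanism hints at why training data matters: a word that rarely appears in the transcripts starts at a disadvantage, however clearly it was spoken.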

Can you hear me?

While voice recognition software will listen to everyone, there are equity issues in which voices are recognised accurately. “Often when we have an acoustic model that isn’t trained well on accents or different languages, it will show in the character error rate. So then when the language model puts the characters together, there will be a poor word error rate,” Reid explains.
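Character error rate and word error rate are commonly computed from the edit distance between a reference transcript and the system's output. The sketch below assumes that standard definition (errors divided by the length of a non-empty reference); the function names and sample sentences are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: fewest substitutions, insertions and deletions."""
    previous_row = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current_row = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous_row[j - 1] + (ref_item != hyp_item)
            insertion = current_row[j - 1] + 1
            deletion = previous_row[j] + 1
            current_row.append(min(substitution, insertion, deletion))
        previous_row = current_row
    return previous_row[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance over characters, divided by the reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def word_error_rate(reference, hypothesis):
    """Edit distance over whole words, divided by the reference word count."""
    reference_words = reference.split()
    return edit_distance(reference_words, hypothesis.split()) / len(reference_words)


reference = "what is the weather"
hypothesis = "what is the wether"  # one character dropped

cer = character_error_rate(reference, hypothesis)
wer = word_error_rate(reference, hypothesis)
print(f"CER: {cer:.3f}, WER: {wer:.3f}")
```

A single missing character out of 19 makes one word in four wrong, which illustrates how small character-level errors can compound into a much poorer word error rate.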

“In speech recognition at the moment, some of the main equity issues stem from where the data is coming from. Even now, we still have massive discrepancies in recognition between majority accents and minority accents.”

The difficulty with speech recognition systems is the volume and diversity of data required to train models accurately. The current lack of such data is causing significant equity issues.

“A lot of the data we have is captured from devices or platforms that are used by particular speakers,” Reid says. “For example, the data collected from an iPhone is skewed towards a more affluent population given the cost of the equipment.”

Further, there is a lack of data created for the specific purpose of training speech recognition systems. One of the most widely used datasets for speech recognition is called LibriSpeech. “LibriSpeech was never created as a speech recognition dataset, even though it has now found a life of its own,” Reid explains.

Originally, it was a dataset created from the LibriVox project, where volunteers read out-of-copyright works aloud to provide free audiobooks as part of the free and open-source software movement. It became a convenient training set because accurate transcripts already existed for the recorded audio.

While the LibriVox data had a well-balanced gender split, with many speech samples from male-identifying and female-identifying contributors, most of the speakers spoke English with an American accent. “On this, we have lots of data for North American speakers, but much less from people who speak African American English or Australian Aboriginal English. Because we don’t have that data, it’s much harder to train models that work well for people who speak with those accents,” Reid explains.

This presents vast challenges for speakers with minority accents. Voice recognition systems need more diverse, well-documented training data to ensure all speakers can be equally understood. “In other commonly used datasets, the majority of the contributors are young, so in their 20s or 30s or 40s, not so much in the older demographics,” Reid explains.

“As we have an ageing population and we progressively use speech recognition technology to help people stay at home and age well, there is an equity problem in the lack of voice data in this demographic.”

Where are we going?

To improve recognition of accents and different languages, there is now an increased push for projects like Common Voice, a publicly available voice dataset powered by the voices of volunteer contributors from around the world. Such projects counter the problem that most voice datasets are currently owned by companies, which stifles potential innovation. By asking individuals to share their voices, these projects hope to collect a more culturally and linguistically diverse dataset to improve speech recognition for all.

“Speech recognition development is a problem of more data, better data, more languages and more variance,” Reid says. Despite advances towards equity, there are still many challenges ahead in the realm of speech recognition and other technologies that use voice data.

With the rise of generative artificial intelligence, less than 30 seconds of speech is needed to clone a voice. VALL-E from Microsoft can closely simulate a person’s voice from only a 3-second audio sample. Once it learns a specific voice, VALL-E can synthesise audio of the person saying anything, including changes in emotional tone.

“Two years ago it took 30 minutes of audio to clone a voice. Four years ago, it would have taken 30 hours,” Reid explains. This raises further questions about the rapid development of artificial intelligence, and how we can maintain a personal connection with something so intrinsic to our being as our voice.

Voice recognition has developed substantially over the past 70 years. The improvement in accuracy and speed is astonishing, but there is still a lot of ground to cover to create equitable software systems for all. While voice recognition systems are listening, they are not hearing us all.

More information on all the learning experiences is available through our education page. Register now to take part in one, or all, of these rewarding offerings.


You are on Aboriginal land.