Johns Hopkins Medicine researchers, others find that commercial AI speech recognition systems have error rates of up to 23.31%, compared with advertised error rates of 2% to 3%
Thanks to advances in artificial intelligence and machine learning, speech recognition technologies such as Amazon Alexa, Google Assistant, Microsoft Cortana, and Apple’s Siri on the iPhone are changing the way we interact with our devices, homes, cars, and jobs.
The first speech recognition system was developed in 1952, when Bell Laboratories designed the “Audrey” system, which could recognize a single voice speaking digits aloud. Unlike the speech recognition systems we have today, these first systems focused on numbers, not words. It took another decade before IBM introduced “Shoebox,” which understood and responded to 16 words in English.
Speech recognition, also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text, is a capability that enables a software program to convert human speech into written text. Although it’s commonly confused with voice recognition, speech recognition focuses on turning spoken words into text, while voice recognition only seeks to identify an individual speaker’s voice. With AI and machine learning, ASR is now everywhere; it transcribes meetings, takes dictation for emails, helps to manage smart appliances, and much more.
Today, speech recognition technologies have become much more sophisticated, thanks again to advances in artificial intelligence and natural language processing (NLP). In 2017, Google said that its speech recognition software had attained a Word Error Rate (WER) of about 4.7%. WER is a common metric used to compare the accuracy of the transcripts produced by speech recognition APIs.
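For readers who want to see how the metric works, WER is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the system’s transcript into the reference transcript, divided by the number of words in the reference. The Python sketch below is an illustration of that standard definition only; it is not the code used by Google or by any study mentioned in this article, and the function name and sample sentences are made up for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper (hypothetical name). Real evaluations usually also
    normalize punctuation, numbers, and spelling before scoring.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one dropped word ("a") and one misheard word ("boston").
reference = "please book a flight to boston on friday"
hypothesis = "please book flight to austin on friday"
print(f"WER: {word_error_rate(reference, hypothesis):.2%}")  # WER: 25.00%
```

Two errors against an eight-word reference give a WER of 25%. Because insertions count as errors too, WER can exceed 100% when the system outputs many extra words.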
However, 70 years after the technology was first introduced, the question remains: how accurate are these speech recognition systems compared to human transcriptionists?
According to research first reported by Wired, human transcriptionists had an error rate of about 4%, while commercially available ASR transcription software had an error rate of 12%. In other words, the error rate of commercial speech recognition systems is three times that of humans.
If you’re still in doubt, new research shows that some automatic speech recognition (ASR) systems might be less accurate than previously thought. A recent study by researchers at Johns Hopkins University, the Poznań University of Technology in Poland, the Wrocław University of Science and Technology, and startup Avaya found that commercial speech recognition systems have an error rate of up to 23.31%.
The authors of the research claimed that the WER was significantly higher than the best reported results and that this could indicate a wider-ranging problem in the field of natural language processing (NLP). A comprehensive benchmark of ASR models cites WER as low as 2% to 3% for commercial speech recognition systems on the market.
But the coauthors of this latest study reject that statistic. They argue that the majority of interactions with ASR systems happen in the context of “chatbot-like interactions,” where people are aware they’re conversing with a machine and thus simplify their commands to short, well-structured phrases rather than speaking with the disfluencies typical of natural conversation.
As part of their research, they evaluated many ASR systems on a dataset of 50 call center conversations from 1,595 agents and 1,261 customers, which spanned 8.5 hours in length, 2.2 hours of which was speech. Depending on the dataset, the ASR systems’ previously published error rates didn’t exceed 15% and dropped as low as 2%. The study’s findings stood in contrast: tested across recorded phone conversations about finance, insurance, telecom, and booking, the systems showed WER as high as 23.31%.
The highest rates they found were on the booking and telecom calls, perhaps because the conversations referred to specific dates and times, money, places, and product and company names. But the WER was above 13.73% in every area they tested.
Although AI-based transcription technology has improved tremendously over the past seven decades, offering cheap and fast results, the error rate still has to fall sharply before an ASR system can reach near-human levels of accuracy: on the figures reported by Wired, from about 12% to about 4%.
In addition, the higher error rate is especially dangerous when AI is used in life-critical industries such as healthcare. AI has made it possible for automatic speech recognition (ASR) systems to deliver transcripts very quickly, but speed is often not the most important factor. When accuracy is mandatory, human-powered transcription still has the upper hand.
“AI cannot handle difficult medical terminology which results in inaccurate transcription,” explains Mindaugas Čaplinskas, the CEO of GoTranscript, a company that transcribes around 5000 hours of medical content per year. “This is unacceptable in the field of healthcare—having someone in-house to review and edit it takes tons of time. Despite companies thinking this option is cheaper, in the end, they spend more time and resources compared to the more efficient alternative of outsourcing it.”
Although ASR still has a long way to go to catch up with human transcriptionists, there is no denying that we have made much progress over the past seven decades. You can read the entire study below.
ASR Error Rates