As advanced as most automatic speech recognition (ASR) systems are, they still struggle to recognize proper names. The sheer volume of names in any given language presents an issue from the outset. The average English speaker has a vocabulary of 20,000 to 35,000 words, yet there are more than 150,000 surnames in common usage in the U.S. alone, and nearly a million names in total — proper names outnumber the words in a typical vocabulary by roughly 10 to 1. So, while today's ASR systems are trained on hundreds of thousands of hours of audio, any individual proper name appears far less often in that audio than common words do. As a result, ASR systems often mistake proper names for more common words, and training-data bias leads them to substitute the names of famous people for less common names.
Names can also be pronounced differently by different people. For example, "Helena" can be pronounced "HEL-eh-nuh" or "Heh-LAY-nuh." Names can also be spelled in many ways: Catherine and Kathryn sound exactly the same, and with only audio input it is impossible for an ASR system to choose between the two spellings unless the context makes a particular name clear. This variability in both phonetics and orthography adds to the ambiguity and complexity that ASR systems face when transcribing names.
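The homophone problem above can be sketched in a few lines of code. The toy function below (a hypothetical illustration, not Parlance's actual technology or any production grapheme-to-phoneme system) collapses common same-sound spellings into a crude phonetic key; because Catherine and Kathryn reduce to the same key, audio alone gives a system no basis for picking one spelling over the other.

```python
def phonetic_key(name: str) -> str:
    """Toy phonetic key: collapse spellings that commonly sound alike.

    A rough sketch for illustration only -- real systems use trained
    grapheme-to-phoneme models or pronunciation lexicons.
    """
    s = name.lower()
    # Merge letters/digraphs that often share a sound ("c" -> "k", etc.)
    for src, dst in [("ph", "f"), ("ck", "k"), ("c", "k"), ("z", "s")]:
        s = s.replace(src, dst)
    # Drop vowels (and "y"), which carry most of the spelling variation
    s = "".join(ch for ch in s if ch not in "aeiouy")
    # Collapse doubled consonants (e.g. "tt" -> "t")
    out = []
    for ch in s:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)

print(phonetic_key("Catherine"))  # -> "kthrn"
print(phonetic_key("Kathryn"))    # -> "kthrn"  (identical: a pure homophone)
```

Since both spellings yield the same key, disambiguation has to come from somewhere other than the audio — for example, matching the spoken name against a directory of known patients or staff.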
All these factors — limited training data, training biases, and heterogeneity — can lead to poor performance of an AI solution. For example, if you said, "Wally Shaw, please," the system might transcribe this as "Wallace Shawn, please" and then act on it, potentially compromising protected health information (PHI). Similarly, doctors' names can be misheard (recognizing "Mike Daly" as "my daily," for example), causing frustration when the system fails to understand and routes you to an operator queue instead. In healthcare, it is essential that ASR systems minimize these kinds of mistakes, yet the speech recognition systems used by most healthcare institutions still regularly make them.
Parlance has spent over 25 years building a set of proprietary tools and technologies that get better performance from ASR and natural language processing (NLP) systems and address these problems. By digitizing the voice channel, Parlance harnesses the power of speech to improve patient experience, optimize business operations, and save money for hundreds of health systems throughout the US, UK, and Canada.
By Will Sadkin