AI recognizes speech better than humans
Voice assistants such as Alexa, Cortana, or Siri can automatically transcribe spoken language and produce translations. These speech recognition systems use artificial neural networks that map acoustic signals to individual syllables and words with the help of reference libraries. The results are now very good when the assistants are addressed directly or when a text is read aloud, but problems often arise in everyday situations. As a study by the Ruhr-Universität Bochum (RUB) recently showed, misrecognized wake words can even cause voice assistants to activate unintentionally.
Conversations between several people still frequently cause problems. According to Alex Waibel of the Karlsruhe Institute of Technology (KIT), “there are pauses, stutters, filler sounds like ‘uh’ or ‘hm,’ and also laughing or coughing when people talk to each other.” In addition, Waibel explains, “words are often pronounced indistinctly.” As a result, even humans struggle to produce an exact transcript of such an informal dialogue; artificial intelligence (AI) has had even greater difficulty.
According to a preprint published on arXiv, scientists working with Waibel have now developed an AI that transcribes everyday conversations faster and more accurately than humans. The new system builds on a technology that translates university lectures between German and English in real time. It uses so-called encoder-decoder networks, which analyze acoustic signals and assign words to them. According to Waibel, “the recognition of spontaneous speech is the most important component in this system, because errors and delays quickly make the translation incomprehensible.”
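The article gives no implementation details of the KIT system. As a drastically simplified illustration of the encoder-decoder idea, the encoder collapses a run of acoustic feature frames into a fixed-size context vector, and the decoder maps that context to the nearest word. The embedding values and vocabulary below are invented for illustration; real systems use learned neural networks for both steps.

```python
def encode(frames):
    """'Encoder' stand-in: average a run of acoustic feature frames
    into a single fixed-size context vector."""
    n = len(frames)
    return [sum(f[i] for f in frames) / n for i in range(len(frames[0]))]

def decode(context, embeddings):
    """'Decoder' stand-in: emit the vocabulary word whose (hypothetical)
    acoustic embedding lies closest to the context vector."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(embeddings, key=lambda w: dist(embeddings[w], context))

# Hypothetical 2-D 'acoustic embeddings' for two words (illustrative values).
embeddings = {"hello": [1.0, 0.0], "world": [0.0, 1.0]}

# An utterance as two frame runs, one per word; values are made up.
utterance = [[[0.9, 0.1], [1.1, -0.1]],
             [[0.1, 0.9], [-0.1, 1.1]]]
transcript = [decode(encode(segment), embeddings) for segment in utterance]
# transcript == ["hello", "world"]
```

The toy averages away all temporal structure; actual encoder-decoder networks instead process the frame sequence step by step and generate the output words autoregressively.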
Accuracy increased and latency decreased
The KIT scientists have now refined the system considerably and, above all, greatly reduced its latency. Waibel and his team used an approach based on the probabilities of certain word combinations and combined it with two additional recognition modules.
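The article does not specify how the word-combination probabilities are used. A common technique that matches the description is to rescore the acoustic model's candidate transcriptions with an n-gram language model; the sketch below combines the two scores with made-up probabilities and candidates, purely for illustration.

```python
import math

# Hypothetical acoustic-model candidates with log-probabilities
# (illustrative values; the second is acoustically slightly more likely).
candidates = [
    ("recognize speech", -2.0),
    ("wreck a nice beach", -1.8),
]

# Toy bigram language model P(word | previous word); values invented.
bigram_logprob = {
    ("<s>", "recognize"): math.log(0.2),
    ("recognize", "speech"): math.log(0.5),
    ("<s>", "wreck"): math.log(0.01),
    ("wreck", "a"): math.log(0.1),
    ("a", "nice"): math.log(0.1),
    ("nice", "beach"): math.log(0.1),
}

def lm_score(sentence):
    """Sum bigram log-probabilities; unseen word pairs get a heavy penalty."""
    words = ["<s>"] + sentence.split()
    return sum(bigram_logprob.get(pair, math.log(1e-6))
               for pair in zip(words, words[1:]))

def rescore(cands, lm_weight=1.0):
    """Pick the candidate maximizing acoustic + weighted language-model score."""
    return max(cands, key=lambda c: c[1] + lm_weight * lm_score(c[0]))[0]
```

Here the language model overrides the slightly better acoustic score of the implausible word sequence: `rescore(candidates)` returns `"recognize speech"`.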
In a standardized test, the new speech recognizer was fed excerpts from a collection of around 2,000 hours of telephone conversations, which it had to transcribe automatically. According to Waibel, “the human error rate here is around 5.5 percent.” The AI, by contrast, achieved an error rate of only 5.0 percent, the first time a machine has transcribed everyday conversations more accurately than humans. The latency, i.e. the delay between the arrival of the signal and the result, is low at an average of 1.63 seconds, but does not quite reach the roughly one-second latency of a human listener.
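Error rates like the 5.5 and 5.0 percent quoted above are typically measured as word error rate (WER): the minimum number of word substitutions, insertions, and deletions needed to turn the system's output into the reference transcript, divided by the reference length. A minimal sketch, assuming the standard edit-distance formulation:

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `word_error_rate("a b c d", "a x c")` is 0.5: one substitution plus one deletion against four reference words. Note that a WER difference of 0.5 percentage points, as in the study, corresponds to roughly one fewer error per 200 words.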
The new system could in the future serve, for example, as a basis for automatic translation or for other scenarios in which computers need to process natural language.