Natural Language Processing (NLP) is not a simple task. For an AI service to understand a language, let alone interpret spoken words and translate them in a contextually and culturally correct way, you must ensure that it has the best possible foundation.
This is why at Pexip our foundation is NVIDIA’s Riva and Maxine technologies, which we use to process spoken audio for quality enhancements, and for speech-to-speech translation and speech-to-text services.
In this blog post, we will dive deeper into how Riva and Maxine technologies are applied to solve real-time language translation in video conferencing.
Let’s start with the challenges of machine translation between languages in real time.
The challenges can be broadly divided into four subcategories:
Audio capture and transmission
Capturing clear audio can be quite challenging! How many times have you been on the phone, trying to catch what the other person is saying? Background noise, whether it's coming from a busy street or a car, can significantly hinder communication. When capturing audio, elements such as poor acoustics, background noise, and low-quality equipment need to be contended with. These issues become even more complex when the audio is intended for machine interpretation. Mumbling or rambling speech, along with the nuances of distinctive accents and dialects, is difficult for the machine to capture and interpret.
Moreover, the complexity escalates when the recorded audio must be transmitted over the internet, where issues like network packet loss, jitter, and latency, along with various signal processing challenges, can arise. This task becomes even more daunting given the diversity of languages the system may encounter. The audio could be in any language, each with its own unique set of phonetics, idioms, and nuances, further complicating the process.
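To make packet loss and jitter a little more concrete, here is a minimal Python sketch of how a receiver might estimate both from the packets it sees. The `Packet` fields and function names are hypothetical and chosen for illustration only. The jitter update follows the smoothed interarrival estimate described in RFC 3550, and this is not a description of how Pexip or NVIDIA measure network quality.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    seq: int        # sender sequence number (hypothetical field)
    sent_ms: float  # sender timestamp in milliseconds
    recv_ms: float  # arrival time at the receiver in milliseconds


def loss_ratio(packets):
    """Fraction of packets missing between the lowest and highest sequence number seen."""
    if not packets:
        return 0.0
    seqs = {p.seq for p in packets}
    expected = max(seqs) - min(seqs) + 1
    return 1.0 - len(seqs) / expected


def interarrival_jitter(packets):
    """Smoothed jitter estimate in the style of RFC 3550: J += (|D| - J) / 16."""
    jitter = 0.0
    ordered = sorted(packets, key=lambda p: p.recv_ms)
    for prev, cur in zip(ordered, ordered[1:]):
        # D compares the spacing at the receiver with the spacing at the sender.
        d = (cur.recv_ms - prev.recv_ms) - (cur.sent_ms - prev.sent_ms)
        jitter += (abs(d) - jitter) / 16.0
    return jitter
```

A stream whose packets arrive with exactly the spacing they were sent with yields a jitter of zero. Any packet that arrives early or late pushes the estimate up, and a receiver typically has to absorb that variation with buffering, which in turn adds latency.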
Speech-to-text (Automatic Speech Recognition or ASR for short)
When the audio reaches the AI service, ASR must be performed to start understanding what has been said. And what better challenge to talk about here than homophones, homonyms, and homographs! You might not know these terms, but simply put:
- A homophone refers to two words that sound the same but differ in spelling, such as eight/ate, know/no, or pour/poor.
- A homonym refers to words that both sound the same and are spelled the same, but have different meanings, such as can/can, tire/tire, or ring/ring.
- Lastly, a homograph refers to words that are spelled identically but pronounced differently, such as bass/bass, bow/bow, or tear/tear.
For a machine to accurately translate and interpret the meaning of such words in a sentence, it must understand the context, the cultural references, and the rest of the words in the sentence. This process is far from straightforward.
On top of this, there are challenges related to language comprehension, which arise when the AI incorrectly predicts words in a sentence. This could be, for instance, a sentence that is clear in context, but contains some unexpected words for the given scenario.
These challenges make it difficult for a machine to actually understand what is being said.
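The role of context in picking between homophones can be illustrated with a deliberately tiny Python sketch. The cue table and the `pick_spelling` function are hypothetical and hand-written for this example. A real ASR system learns such associations from data with a language model, so treat this as an illustration of the idea rather than of how Riva works.

```python
# Hypothetical context cues for each spelling. A real system would learn
# these associations from data instead of using a hand-written table.
CONTEXT_CUES = {
    "ate": {"dinner", "lunch", "pizza", "we", "hungry"},
    "eight": {"o'clock", "people", "at", "number", "meeting"},
}


def pick_spelling(candidates, context_words):
    """Return the candidate whose cue set overlaps most with the context.

    On a tie, the candidate listed first wins.
    """
    context = {w.lower() for w in context_words}
    return max(candidates, key=lambda c: len(CONTEXT_CUES.get(c, set()) & context))


print(pick_spelling(["ate", "eight"], "we ___ pizza for dinner".split()))
print(pick_spelling(["ate", "eight"], "the meeting starts at ___ o'clock".split()))
```

The same sound is resolved to a different spelling depending on the surrounding words, which is why a sentence containing unexpected words for the scenario can lead the recognizer astray.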
Translation
In translation between languages, it is not only the word-by-word machine translation that is important. Context is equally important, as are cultural references. This is where interpretation differs from translation.
Context ambiguity, as with the various homophones, homonyms, and homographs, makes translation hard. When a sentence could have different meanings based on its context or when its interpretation may differ across languages, this highlights the intricate nature of language and the challenges it poses for accurate translation.
British-Canadian computer scientist Geoffrey Hinton illustrates the difficulty of translating between languages with the English sentence “The trophy wouldn't fit in the suitcase because it was too big”, in what is known as the Trophy/Suitcase problem. When we speak, we understand from context why the trophy does not fit in the suitcase. But a machine translating this sentence cannot tell from the grammar alone whether “it” refers to the suitcase or the trophy, and in a language where the pronoun must agree with the noun it replaces, the translation depends on getting that right.
Then come the nuances of cultural references and idioms. Every language has its set of idioms and expressions, which often lose their meaning and context when translated literally into another language.
Lastly, there’s the issue of language pair limitations. Does the AI service actually support the language pairs that are needed? It is easy to think that translation is often needed between (some language) and English, but what about (some language) and (some other language)?
NVIDIA already supports more than 60 language pairs in its most comprehensive services, which makes it a good fit for most translation scenarios.
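The language pair question can be sketched as a small routing check in Python. Everything here is an assumption made for illustration: the `find_route` function, the language codes, and the `SUPPORTED` set are hypothetical and do not reflect the pairs that NVIDIA actually supports. The sketch first looks for a direct pair and otherwise tries to pivot through English.

```python
def find_route(source, target, supported_pairs, pivot="en"):
    """Return the list of translation hops, or None if no route exists."""
    if source == target:
        return []  # nothing to translate
    if (source, target) in supported_pairs:
        return [(source, target)]  # direct pair available
    if (source, pivot) in supported_pairs and (pivot, target) in supported_pairs:
        return [(source, pivot), (pivot, target)]  # two hops via the pivot
    return None


# Hypothetical set of supported (source, target) pairs.
SUPPORTED = {("de", "en"), ("en", "de"), ("en", "ja"), ("ja", "en"), ("de", "fr")}

print(find_route("de", "fr", SUPPORTED))  # direct
print(find_route("de", "ja", SUPPORTED))  # via English
print(find_route("fr", "ja", SUPPORTED))  # no route in this toy set
```

Pivoting widens coverage, but each extra hop adds latency and gives errors a chance to compound, which is why direct support for a pair is generally preferable in a real-time setting.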
Text-to-speech
The fourth category is interesting. Interacting with a machine is usually an emotionless experience, unless you are Joaquin Phoenix in the movie “Her”, where his character falls in love with his AI operating system (voiced by Scarlett Johansson).
Not only do machines often speak using synthetic voices, but they also have challenges pronouncing special phrases, such as names. And a machine voice has yet to be able to understand and portray emotions effectively. Is it a sad voice you are hearing, or a joyful one?
Using NVIDIA Riva and Maxine for audio processing and translation
Inside the Pexip AIMS (AI Media Server) are three components. Two of them are from NVIDIA, Riva and Maxine:
- NVIDIA Maxine is a platform for deploying AI features that enhance audio and video and add augmented reality effects in real time. At Pexip, we use Maxine specifically to remove background noise and enhance the audio we receive, preparing it for ASR and translation by Riva. Working with Maxine’s new BNR 2.0 (Background Noise Reduction 2.0), we see significant improvements in intelligibility and speech quality, which makes Automatic Speech Recognition with Riva much easier further along the AI workflow pipeline.
- NVIDIA® Riva then provides multilingual speech and translation AI capabilities for real-time conversations, and includes automatic speech recognition (ASR), text-to-speech (TTS), and neural machine translation (NMT) applications. At Pexip, we use Riva to perform ASR, TTS, and NMT.
- The third component in the AIMS is unique to Pexip: PULSE (Pexip Unified Libraries for Secure Engagement) manages multiple simultaneous input and output data or media streams from the network to AIMS.
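The order of processing described above (noise removal, then ASR, then NMT, then TTS) can be sketched as a simple chain of stages in Python. All function names and the toy data below are hypothetical stand-ins written for this example. None of them are NVIDIA or Pexip APIs, and the real services operate on audio streams rather than lists of strings.

```python
def run_pipeline(payload, stages):
    """Pass the payload through each stage in order and return the final result."""
    for stage in stages:
        payload = stage(payload)
    return payload


# Hypothetical stand-ins for each step, operating on toy data.
def denoise(frames):
    # Stand-in for noise removal: drop frames tagged as noise.
    return [f for f in frames if f != "<noise>"]


def recognize(frames):
    # Stand-in for ASR: join the remaining "frames" into a transcript.
    return " ".join(frames)


def translate(text):
    # Stand-in for NMT: toy word-by-word dictionary lookup.
    toy_dictionary = {"hello": "hola", "world": "mundo"}
    return " ".join(toy_dictionary.get(word, word) for word in text.split())


def synthesize(text):
    # Stand-in for TTS: return bytes in place of synthesized audio.
    return text.encode("utf-8")


result = run_pipeline(
    ["hello", "<noise>", "world"],
    [denoise, recognize, translate, synthesize],
)
print(result)  # b'hola mundo'
```

Keeping each step behind the same call shape makes the ordering explicit: cleaner audio goes into recognition, the transcript goes into translation, and only the translated text is spoken. For speech-to-text output, the last stage would simply be left out.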