Alexa, What’s The Technology Behind You?

June 29, 2018

Speech recognition first took off in the early 1990s, when significant DARPA funding was invested in research projects at top universities. At the time, however, there was simply not enough data for the technology to grow on. That changed over the last decade, as digital advances made large volumes of data available for training models. One of the biggest milestones in voice recognition is Amazon’s Alexa, which uses complex machine learning pipelines to revolutionize the way we conduct everyday tasks.

Last year, Amazon opened up the technology behind Alexa, called Lex. The system combines natural language understanding with automatic speech recognition and can be used by developers who want to build their own conversational applications, such as chatbots. According to Amazon, the technology can serve a variety of purposes, especially in web and mobile applications.


So How Does This Technology Work?

1. Signal Processing

It all starts with signal processing, which allows the device to make sense of the audio input by cleaning up the signal. This is one of the most difficult challenges in far-field audio. The primary objective is to recognise the target signal, which can only be done by first identifying and minimizing ambient noise such as the TV or the dishwasher. These issues are handled through beamforming, which uses the device’s seven microphones to identify the direction of the signal so the device can focus on it. In addition, acoustic echo cancellation subtracts the audio the device itself is playing, so that only the important signal is left.
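To make the beamforming idea concrete, here is a minimal delay-and-sum sketch in Python. Everything here is invented for illustration: a real device estimates the per-microphone delays adaptively from the audio itself, rather than receiving them as inputs.

```python
import numpy as np

def delay_and_sum(mic_signals, delays):
    """Align each microphone's signal by its estimated delay, then average.

    mic_signals: array of shape (n_mics, n_samples)
    delays: per-microphone delay (in samples) of the target source
    Averaging in-phase copies of the voice reinforces it, while the
    uncorrelated noise at each microphone partially cancels out.
    """
    n_mics, n_samples = mic_signals.shape
    out = np.zeros(n_samples)
    for sig, d in zip(mic_signals, delays):
        out += np.roll(sig, -d)  # shift so the target signal lines up
    return out / n_mics

# Toy example: the same "voice" arrives at three mics with different
# delays, plus independent noise at each mic.
rng = np.random.default_rng(0)
voice = np.sin(np.linspace(0, 20 * np.pi, 1000))
delays = [0, 3, 7]
mics = np.stack([np.roll(voice, d) + 0.5 * rng.standard_normal(1000)
                 for d in delays])
enhanced = delay_and_sum(mics, delays)
```

After summing, the enhanced signal is measurably closer to the clean voice than any single microphone's recording, which is exactly the property that lets the device focus on the speaker.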


2. Wake Word Detection

Once the signal has been cleaned, the next task is wake word detection: determining whether the user has said one of the words the device is programmed to wake on, such as ‘Alexa’. This is very important, because voice commands picked up from surrounding conversations could result in accidental purchases and angry customers. The detector also needs to handle differences in pronunciation, and it must do all this quickly with limited on-device CPU power, so it requires both high accuracy and low latency.
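One common way to trade a little latency for fewer false accepts is to require the detector to fire over several consecutive audio frames, not just one. The sketch below assumes a hypothetical per-frame classifier has already produced scores; the threshold and frame counts are invented for the example.

```python
def detect_wake_word(frame_scores, threshold=0.8, min_consecutive=3):
    """Return the frame index where the wake word is accepted, or None.

    frame_scores: per-frame probabilities from a (hypothetical) small
    on-device classifier. Requiring several consecutive high scores
    rejects brief accidental matches, at the cost of a few frames
    of extra latency.
    """
    run = 0
    for i, score in enumerate(frame_scores):
        run = run + 1 if score >= threshold else 0
        if run >= min_consecutive:
            return i
    return None

# A one-frame spike (e.g. a similar-sounding word on TV) is rejected,
# while a sustained match is accepted on its third high-scoring frame.
assert detect_wake_word([0.1, 0.9, 0.2, 0.1]) is None
assert detect_wake_word([0.1, 0.85, 0.9, 0.95, 0.3]) == 3
```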


3. Audio To Text Conversion

If the wake word is detected, the signal is sent to speech recognition software in the cloud, which converts the audio into text. Unlike wake word detection, which is a binary classification problem, speech recognition is a sequence-to-sequence problem: the software has to consider every word in the English language to map the input audio to the desired output text. That search space is huge, and the cloud is the only platform capable of scaling it sufficiently. The input is not a one-word query; it can be any possible question, so the context of the whole utterance is needed.
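The role of context can be sketched with a toy decoder: each audio segment yields several candidate words with acoustic scores, and a small language model of word pairs nudges the decoder toward sequences that make sense. All scores and the tiny vocabulary below are invented; production systems use trained models over the full language and search with beams rather than greedily.

```python
# Invented bigram probabilities: how likely word B is after word A.
BIGRAM = {
    ("turn", "on"): 0.9, ("turn", "own"): 0.05,
    ("on", "the"): 0.9, ("own", "the"): 0.2,
}

def decode(candidates, start="<s>"):
    """Greedy left-to-right decoding combining acoustic and context scores.

    candidates: one list per audio segment of (word, acoustic_score) pairs.
    Each choice is scored as acoustic_score * P(word | previous word),
    with a small floor for unseen word pairs.
    """
    prev, out = start, []
    for options in candidates:
        best = max(options,
                   key=lambda wa: wa[1] * BIGRAM.get((prev, wa[0]), 0.01))
        out.append(best[0])
        prev = best[0]
    return out

# "own" scores slightly higher acoustically, but the context "turn ..."
# makes "on" the better overall choice.
candidates = [[("turn", 0.9)],
              [("on", 0.45), ("own", 0.5)],
              [("the", 0.9)]]
```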


4. Natural Language Understanding (NLU)

Natural Language Understanding (NLU) converts the text into a meaningful representation. Say you ask for the weather in New York: the intent would be ‘get weather’ and the slot, or entity, would be ‘New York’. Problems pop up with cross-domain intent classification. For example, ‘play remind me’ (a request to play something called “Remind Me”) is very different from ‘remind me to go play’, yet the two could easily be confused. There are also many commonly used words that sound the same but have completely different meanings: ‘by’ can be misinterpreted as ‘buy’, leading to unwanted consequences. Out-of-domain utterances that don’t make any sense are also discarded at this stage, which prevents the device from acting on commands mistakenly picked up from televisions and the like.
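The intent-and-slot idea, including the ‘play remind me’ ambiguity, can be illustrated with a deliberately crude rule-based parser. Real NLU systems are trained statistical models, and all intent names below are made up for the example; the point is only that word order and position carry the context that disambiguates the two readings.

```python
def parse(utterance):
    """Map an utterance to an (intent, slot) pair using word order.

    Rules fire on where a keyword appears, not merely whether it
    appears, which is what separates 'play remind me' (a media title)
    from 'remind me to go play' (a reminder task).
    """
    words = utterance.lower().split()
    if words[:1] == ["play"]:
        return ("PlayMedia", " ".join(words[1:]))    # play <title>
    if words[:2] == ["remind", "me"]:
        return ("SetReminder", " ".join(words[2:]))  # remind me <task>
    if "weather" in words and "in" in words:
        city = " ".join(words[words.index("in") + 1:])
        return ("GetWeather", city)
    return ("OutOfDomain", None)  # nonsense utterances are discarded

# Word order disambiguates the two readings from the text above:
assert parse("play remind me") == ("PlayMedia", "remind me")
assert parse("remind me to go play") == ("SetReminder", "to go play")
assert parse("what is the weather in New York")[0] == "GetWeather"
```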

Researchers are constantly working to improve the speech recognition software. Further improvements will see Alexa hold a conversation better, remember what a person has said previously, and apply that knowledge to subsequent interactions. Get into a discussion with our technology experts and analyse the future of Alexa.

Get in Touch