Intro
Smart speakers with Amazon’s Alexa and Google’s Assistant have been welcomed with open arms into US households. A recent report reveals that there are now over 57 million smart speaker users in the US alone. People appreciate the ease of use and the nearly immediate access to the information they seek. We ask, they answer; we command, they do; and so the story goes. However, by now most of us have witnessed some strange behavior from our smart speakers. Once in a while they light up and listen without being prompted. Almost at random, they interrupt our conversation and blurt out some useless information. Most of us tend to accept this behavior as normal, keep calm, and carry on. Vocalize.ai, a software tools and research company, decided to dig a bit deeper into this phenomenon. We set out to answer the following questions: Why do smart speakers wake up without a user request? How often does it occur? Is one smart speaker more susceptible than another? Is this a privacy issue? Is this a potential Achilles’ heel for the voice-first movement?
Wake Words
Understanding how smart speakers operate is the first step. In general, smart speakers are always listening, continuously scanning for their wake word (e.g. “Alexa” or “Hey Google”). This detection happens locally on the smart speaker itself. When the device determines that the wake word was heard, it starts recording all audio and sends the recording to the cloud for processing. In a perfect world the smart assistant would never wake up until it hears the wake word. In reality, however, the virtual assistant has to deal with many voice and audio environment variables. For example, when an adult says “Alexa” it sounds very different than when a child says “Alexa.” Also consider saying “Hey Google” while standing next to the device as compared to shouting “Hey Google” from across the room. The wake word detection software has to be robust enough to handle gender, age, accent, level, distance, reflections, etc. As it turns out, this is no easy task, and sometimes the virtual assistant gets it wrong.
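To make that flow concrete, here is a minimal sketch in Python of the listen-locally-then-stream behavior described above. Everything in it is hypothetical: score_frame stands in for whatever on-device detector a given smart speaker runs, and send_to_cloud for the vendor’s upload path; neither name reflects an actual Amazon or Google API.

```python
# Hypothetical sketch of the always-listening wake word loop described above.
import random

WAKE_THRESHOLD = 0.8  # assumed detector confidence needed to wake up


def score_frame(frame: bytes) -> float:
    """Placeholder for the on-device model: returns a wake word confidence."""
    return random.random()  # stand-in; a real detector would score the audio


def send_to_cloud(recording: list) -> None:
    """Placeholder for uploading captured audio for full recognition."""
    print(f"uploading {len(recording)} frames to the cloud")


def listen_loop(frames) -> None:
    recording, awake = [], False
    for frame in frames:
        if not awake:
            # Local, continuous scan: nothing leaves the device yet.
            if score_frame(frame) >= WAKE_THRESHOLD:
                awake = True  # wake word detected (or a false positive!)
        else:
            recording.append(frame)  # everything after waking is captured
    if recording:
        send_to_cloud(recording)


listen_loop(b"..." for _ in range(100))  # simulate 100 audio frames
```

The important property of this architecture is the hand-off: the scan is local, but once the device decides it heard the wake word, audio leaves the home.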
False Positives
With regard to “getting it wrong,” there are two main failure modes: false negatives and false positives. A false negative is when we say “Alexa” and the smart speaker doesn’t respond. In this instance the virtual assistant rejected our attempt to wake it up, and we simply need to try again (maybe a little louder or a little clearer). The other failure mode, the false positive, is more concerning. In the false positive scenario, without our request and sometimes without our knowledge, the device wakes up. Once awake, the smart speaker starts to record whatever it hears and sends it off to the cloud. Earlier this year a stunning example of false positive behavior made headlines when an Amazon Alexa smart speaker recorded a family’s conversation and sent it to a random person in their contacts.
Wake word detection provides a clear-cut example of the Goldilocks principle. If the wake word detection algorithm is too cold, it will reject valid attempts and frustrate users (too many false negatives). If the wake word detection algorithm is too hot, it will wake up when it should not and potentially record any sounds in our home or business (looking at you, Marriott). The ultimate goal is a wake word algorithm that is just right.
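The squeeze between the two failure modes can be shown with a few lines of simulation. The score distributions and thresholds below are invented purely for illustration, not measured from any real device; the point is only that lowering the detection threshold trades false negatives for false positives, and vice versa.

```python
# Toy illustration of the Goldilocks trade-off with simulated detector scores.
import random

random.seed(0)
wake_scores = [random.gauss(0.75, 0.10) for _ in range(10_000)]   # genuine "Alexa"s
noise_scores = [random.gauss(0.45, 0.10) for _ in range(10_000)]  # other speech

for threshold in (0.50, 0.60, 0.70, 0.80):
    false_neg = sum(s < threshold for s in wake_scores) / len(wake_scores)
    false_pos = sum(s >= threshold for s in noise_scores) / len(noise_scores)
    print(f"threshold {threshold:.2f}: "
          f"false negatives {false_neg:6.1%}, false positives {false_pos:6.1%}")
```

Running the sketch makes the dilemma obvious: the “too cold” thresholds barely ever miss a wake word but routinely fire on background speech, while the “too hot” ones stay quiet at the cost of ignoring real requests.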
Reality Check
Before digging much deeper into false positive performance, Vocalize.ai partnered with Voicebot.ai to conduct a survey of 328 smart speaker users. We asked these users two main questions. First, what is the primary smart speaker that you use (Alexa, Google, Siri)? Second, how often does the smart speaker unexpectedly wake up (a false positive)? The results were quite surprising. Over half of the users (53%) observed a false positive at least once a week. Nearly a third of the users (30%) reported false positives at least once a day. Finally, 16% of users claimed false positive experiences many times per day. If we extrapolate these percentages to the installed base of 57 million US users, it equates to at least 30,000,000 false wake-ups per week. Clearly there is work to do on improving this user experience and protecting privacy.
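For readers who want to check the arithmetic, here is the extrapolation, assuming (conservatively) exactly one false wake-up per affected user per week:

```python
# Back-of-the-envelope extrapolation of the survey percentages to the
# reported US installed base of 57 million smart speaker users.
US_USERS = 57_000_000

shares = {
    "at least weekly": 0.53,
    "at least daily": 0.30,
    "many times per day": 0.16,
}
for label, share in shares.items():
    print(f"{label}: {share:.0%} of {US_USERS:,} users = {share * US_USERS:,.0f}")

# At one false wake per affected user per week, the weekly share alone
# implies >= 30 million false wake-ups per week across the US.
```

Since the daily and many-times-per-day groups contribute far more than one event per week each, 30 million is very much a floor, not an estimate.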
Back to School
There is no standard method for evaluating wake word performance. Most smart speaker companies have their own guarded, secret procedures and tests du jour. However, our market data indicates that the status quo is not sufficient. So where do we start? Vocalize.ai has had success leveraging audiology procedures for speech recognition evaluations, and we decided to continue down that path. Over the summer we were hosted by the audiology department at Washington University in St. Louis. A brainstorming session with Dr. Nancy Tye-Murray and Dr. Brent Spehar yielded a novel approach. Tye-Murray and Spehar, with funding from the NIH, study audiovisual speech perception and how humans distinguish homophenes and auditory neighbors during ongoing conversation. Homophenes are words that look the same on the lips, as a lip reader would see them. For example, in “Hey Google” the word “Who” is a homophene of “Hey.” Auditory neighbors are words that differ by one phoneme (sound). For example, “Way” is an auditory neighbor of “Hey.” Using their proprietary software, we created a test procedure based on homophenes and auditory neighbors of the wake words. These types of words, with the restriction that any homophene tested was also an auditory neighbor of the wake word, were chosen to provide a challenging and repeatable method for evaluating false positive performance.
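As a rough illustration of the auditory neighbor idea (not the proprietary software used in the study), here is a sketch that finds lexicon words exactly one phoneme away from a target. The phoneme transcriptions and the tiny lexicon are invented for the example:

```python
# Sketch: find auditory neighbors (edit distance of one phoneme) of a target.
# Transcriptions are informal ARPAbet-style tuples, assumed for illustration.
TARGET = ("HH", "EY")  # "Hey"

LEXICON = {
    "hey": ("HH", "EY"),
    "way": ("W", "EY"),
    "who": ("HH", "UW"),
    "say": ("S", "EY"),
    "google": ("G", "UW", "G", "AH", "L"),
}


def is_neighbor(a: tuple, b: tuple) -> bool:
    """True if a and b differ by exactly one phoneme substitution,
    insertion, or deletion."""
    if a == b:
        return False
    if len(a) == len(b):  # substitution
        return sum(x != y for x, y in zip(a, b)) == 1
    short, long_ = sorted((a, b), key=len)
    if len(long_) - len(short) != 1:
        return False
    # insertion/deletion: dropping one phoneme from the longer must match
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


neighbors = [w for w, ph in LEXICON.items() if is_neighbor(ph, TARGET)]
print(neighbors)  # ['way', 'who', 'say'] for this toy lexicon
```

Enumerating such neighbors for each word of a wake phrase, then combining them, is what produces the thousands of challenging test utterances described in the results below.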
Hey Google – Results
The homophene and auditory neighbor method yielded 23,000 possible word combinations for our Google Home evaluation. Due to the volume of possible combinations, testing is ongoing, but early results show a 3% false positive rate. Examples of the false positives can be found in the video below. Some of the more entertaining false positives are “He Ghoul Oil” and “Hey Coup Girl.” The word “ghoul” was one of the most common triggers for a false positive with Google; perhaps Google Home users will see more false positives with Halloween near. One of the more troubling results was “Hey Goo Guile.” The Google Home interpreted it as “Hey Google, dial” and asked for a contact to dial, somewhat reminiscent of the Alexa incident described earlier.
Alexa – Results
The set of homophenes and auditory neighbors for Alexa was much smaller, with a total of 2,662 combinations. However, we noted that A-lex-a doesn’t require the leading “A” and can be reliably triggered with just “Lexa” (go ahead and try it). Using this method, the result is a 3.7% false positive rate, very similar to the Google Home. Examples of the “Lexa” false positives can be viewed in the video below. It is also very interesting to note that the potential trigger words for Alexa include some of the most common words used in conversation: “the” and “uh.” In fact, “the” is the most common word in the English language.
Connecting the Dots
Imagine a phone that randomly dialed a contact once a week. Think about a webcam that lit up and recorded several times a day without your knowledge. Products like those would quickly end up unplugged and returned. So why is it acceptable for a smart speaker to accidentally wake up and record our home without our knowledge or permission? Our guess is that most people do not realize they are actually being recorded, and further, that the recording is being sent to the cloud.
At Vocalize.ai we are firm believers in the voice-first future enabled by AI-powered virtual assistants, but transparency is fundamental to building trust. Consumer data shows that false positives are a real-world phenomenon, with over half of users experiencing them on at least a weekly basis. Vocalize.ai demonstrated a homophene and auditory neighbor method that readily generates a 3% false positive rate for both the Google and Alexa virtual assistants. This is meant to be the first step in creating a method to benchmark wake word performance across virtual assistants.
The end goal is a tech industry commitment to a published, consumer-friendly rating system for customer-affecting issues like wake word detection and false positive performance. We also want to start a conversation about a standard set of controls and alarms so users are more aware when a false positive occurs. The aim is to empower consumers to make an informed decision before bringing an “always listening” AI assistant into their home or business. Google and Amazon are in a race to become the dominant virtual assistant in your life. New products are released that sound better, look better, and do more. However, the real winner may be the company that solves the Goldilocks principle for wake word detection… this one is just right.