Charles Rojas, “How Voice-Activated AI Assistants and Humans Share Similar Patterns of Word Recognition”
Mentor: Jae Yung Song, Linguistics
Poster #160
Our ability to recognize speech is affected by several factors, such as the speed at which a speaker talks and how difficult a word is to understand. The goal of the current study was to examine whether these same factors affect voice-activated artificial intelligence (AI) assistants, such as Amazon’s Alexa. We looked at two factors: a speaker-related factor (normal vs fast speaking rate) and a word-related factor (easy vs difficult words). Easy words were defined as words with higher frequency (they occur more often in speech) and lower phonological neighborhood density (they have fewer similar-sounding “neighbors” for which they can be mistaken). In contrast, difficult words were defined as words with lower frequency (they occur less often in speech) and higher phonological neighborhood density (they have many similar-sounding “neighbors” for which they can be mistaken). In the current study, 21 native English speakers asked Alexa to spell 150 words using the sentence “Alexa, I want you to spell ____”. Participants asked Alexa to spell each word twice: once at a normal speaking rate and once at a faster speaking rate. The whole procedure was audio-recorded, and the recordings were subsequently analyzed to determine whether Alexa spelled each word accurately. The preliminary results from four participants showed that Alexa recognized the target words more accurately when participants spoke at a normal rate than when they spoke more quickly (85% vs 70%). Alexa also recognized easy words more accurately than difficult words (89% vs 64%). Overall, the findings suggest that Alexa’s word recognition is influenced by the same factors that affect human listeners. We believe that this study offers a compelling linguistic example of how AI and human speech processing are similar, corroborating the idea that deep-learning models integrated into AI devices mimic human learning.