Title: Careless Whisper: Speech-to-Text Hallucination Harms
Authors: Allison Koenecke, Anna Seo Gyeong Choi, Katelyn X. Mei, Hilke Schellmann, Mona Sloane
Published: 12th February 2024 (Monday) @ 19:35:37
Link: http://arxiv.org/abs/2402.08021v2

Abstract

Speech-to-text services aim to transcribe input audio as accurately as possible. They increasingly play a role in everyday life, for example in personal voice assistants or in customer-company interactions. We evaluate OpenAI’s Whisper, a state-of-the-art automated speech recognition service outperforming industry competitors, as of 2023. While many of Whisper’s transcriptions were highly accurate, we find that roughly 1% of audio transcriptions contained entire hallucinated phrases or sentences which did not exist in any form in the underlying audio. We thematically analyze the Whisper-hallucinated content, finding that 38% of hallucinations include explicit harms such as perpetuating violence, making up inaccurate associations, or implying false authority. We then study why hallucinations occur by observing the disparities in hallucination rates between speakers with aphasia (who have a lowered ability to express themselves using speech and voice) and a control group. We find that hallucinations disproportionately occur for individuals who speak with longer shares of non-vocal durations — a common symptom of aphasia. We call on industry practitioners to ameliorate these language-model-based hallucinations in Whisper, and to raise awareness of potential biases amplified by hallucinations in downstream applications of speech-to-text models.


  • Focus on speakers with aphasia
  • Used the OpenAI Whisper API:
    • 📅 April 1st 2023 (control group only), April 28th 2023 (aphasia group only), May 3rd 2023 (both aphasia and control groups), and finally on December 11th 2023 (both aphasia and control groups for hallucinated segments)
  • Whisper wasn’t easily reproducible in mid-late 2023:
    • They did not input any prompts
    • OpenAI allows user prompting of Whisper “for correcting specific words or acronyms that the model often misrecognizes in the audio”
    • They set the sampling temperature parameter (which controls the randomness of responses) to the default of 0 (I’m assuming this is greedy decoding, as elsewhere)
    • Even so, this yielded highly non-deterministic hallucinations
      • “perhaps implying that Whisper’s over-reliance on OpenAI’s language modeling advancements is what leads to hallucinations.” Not sure about this.
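For reference, a minimal sketch of what such an API call might look like with the `openai` Python package (the client and method names here reflect the current SDK, not the study’s actual code, which isn’t given):

```python
def transcription_params(temperature=0.0, prompt=None):
    """Request parameters mirroring the study's setup: temperature 0, no prompt."""
    params = {"model": "whisper-1", "temperature": temperature}
    if prompt is not None:  # the study passed no prompt
        params["prompt"] = prompt
    return params

def transcribe(audio_path, **kwargs):
    """Transcribe one audio segment via the OpenAI API (needs OPENAI_API_KEY set)."""
    from openai import OpenAI  # third-party dependency, imported lazily
    client = OpenAI()
    with open(audio_path, "rb") as f:
        resp = client.audio.transcriptions.create(file=f, **transcription_params(**kwargs))
    return resp.text
```

Even with temperature 0, the authors observed different hallucinations across runs (April vs. May 2023), so repeated calls on the same file are worth comparing.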

Data

  • Data for this study were sourced from AphasiaBank, a repository of aphasic speech data that is housed within TalkBank, a project overseen by Carnegie Mellon University
    • university hospitals
    • 12 languages
    • 13,140 audio segments to input to Whisper, comprising 7,805 and 5,335 audio segments from the control group and aphasic group, respectively
    • segments average ~10 seconds per individual; individuals with aphasia speak more slowly, so they utter fewer words per segment

Example Hallucination

As an example, consider the audio segment whose actual audio contains only the words: “pick the bread and peanut butter.” Instead, the April 2023 Whisper run yields the transcription “Take the bread and add butter. In a large mixing bowl, combine the softened butter.” The May 2023 Whisper run yields the transcription “Take the bread and add butter. Take 2 or 3 sticks, dip them both in the mixed egg wash and coat.” In both cases, the second sentence is entirely hallucinated, while the first is true to the actual audio (with minor mistranscriptions, e.g. “take” rather than “pick”).
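This kind of comparison can be mimicked mechanically. A crude sketch of my own (the paper relied on manual review, not this) that flags words inserted into a transcript relative to the reference audio’s words, using Python’s difflib:

```python
import difflib
import string

def hallucinated_words(reference, hypothesis):
    """Return hypothesis words aligned to nothing in the reference.
    Crude word-level alignment; case and punctuation are ignored."""
    def norm(s):
        return [w.strip(string.punctuation) for w in s.lower().split()]
    ref, hyp = norm(reference), norm(hypothesis)
    sm = difflib.SequenceMatcher(a=ref, b=hyp, autojunk=False)
    inserted = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "insert":  # present in the transcript, absent from the audio
            inserted.extend(hyp[j1:j2])
    return inserted

actual = "pick the bread and peanut butter"
april = ("Take the bread and add butter. "
         "In a large mixing bowl, combine the softened butter.")
print(hallucinated_words(actual, april))
```

Note that substitutions like “take” for “pick” are ordinary mistranscriptions rather than insertions, so this sketch deliberately does not flag them.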

Hallucination Categories (§2.5)

  1. Perpetuation of Violence:
    1. Physical Violence or Death
    2. Sexual Innuendo
    3. Demographic Stereotyping
  2. Inaccurate Associations:
    1. Made-up Names
    2. Made-up Relationships
    3. Made-up Health Statuses
  3. False Authority:
    1. Video-based Authority
    2. Thanking
    3. Website Links

Labelling the Whisper API’s failure modes as “harms” is a bit of an abuse of nomenclature IMHO e.g.

  • Inaccurate Associations: Made-up Relationships - ex: “The next thing I really knew, there were three guys who take care of me. Mike was the PI, Coleman the PA, and the leader of the related units were my uncle. So I was able to command the inmates.”
  • False Authority: Thanking - ex: “Cinderella danced with the prince and Thank you for watching!”

Analysis of why Whisper hallucinates

  • Seeding with non-verbal content
    • Used pyannote (voice activity detection) to measure non-vocal durations - aphasic speech has more non-verbal content (pauses)
  • Mean non-vocal shares for:
    • aphasia speakers with hallucinations: 42.4%
    • aphasia speakers without hallucinations: 40.6%
    • control speakers with hallucinations: 16.2%
    • control speakers without hallucinations: 15.4%
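Once a voice activity detector (pyannote, in the paper) has produced speech intervals, the non-vocal share follows directly. A minimal sketch, assuming non-overlapping (start, end) speech intervals in seconds:

```python
def non_vocal_share(speech_segments, total_duration):
    """Fraction of the clip containing no detected speech.
    Assumes non-overlapping (start, end) intervals in seconds."""
    voiced = sum(end - start for start, end in speech_segments)
    return 1.0 - voiced / total_duration

# e.g. a 10 s segment with speech during 0.0-2.5 s and 4.0-8.0 s:
# 6.5 s voiced -> 0.35 non-vocal share
```

Per the figures above, hallucinations cluster at slightly higher non-vocal shares within each group, and aphasia speakers sit far above controls on this measure overall.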