Study Finds OpenAI’s Popular Transcription Tool ‘Whisper’ Hallucinates Text
A recent multi-institutional study carried out by U.S. researchers has found that Whisper, a widely used transcription tool developed by OpenAI (CA, U.S.), may be putting patients at risk by transcribing medical information that was never said.
According to the study, published in the ACM Digital Library, Whisper "hallucinated" (invented words and sentences that were never spoken) in around 1% of its transcriptions. The tool occasionally produced harmful content, including racially offensive and violent statements, and named non-existent medications.
Whisper: The Go-To Transcription Tool
Whisper is an automatic speech recognition system built by ChatGPT developer OpenAI, designed to convert spoken language into written text across multiple languages.
Incorporated into specific versions of ChatGPT and embedded within the cloud computing platforms of Oracle and Microsoft, Whisper is one of the most widely used voice recognition models today. According to the public AI platform HuggingFace (NY, U.S.), the system surpassed 4 million downloads last month.
Whisper is also widely used in medical contexts. AI tool developer Nabla (Paris, France) has incorporated Whisper into its medical transcription tools. Despite OpenAI's warnings against using Whisper in "high-risk domains", Nabla reports that more than 30,000 clinicians and 40 medical organizations rely on the technology to transcribe patient visits.
The Study’s Concerning Findings
The U.S. multi-institutional study found that roughly 1% of the transcriptions Whisper generated contained "hallucinations": passages of invented text that do not correspond to anything in the audio.
Researchers noted that Whisper "made up" sections of text or even entire sentences. In one example, a speaker said, "He, the boy, was going to, I’m not sure exactly, take the umbrella.", and Whisper added, "He took a big piece of a cross, a teeny, small piece … I’m sure he didn’t have a terror knife so he killed a number of people."
In another case, when a speaker described "two girls and one lady", the system produced "two other girls and one lady, um, which were Black." In a third example, Whisper invented a fictional medication called "hyperactivated antibiotics". The researchers determined that 38% of the hallucinations included explicit harm.
The Mankato Clinic in Minnesota and Children’s Hospital Los Angeles are among the 40 health systems that use a Whisper-based AI copilot service from medical technology company Nabla.
Nabla’s chief technology officer, Martin Raison, reportedly said that it is impossible to compare Nabla’s AI-generated transcripts to the original recordings because Nabla’s tool erases the original audio for data safety reasons. This compounds the problem: doctors cannot verify a transcript’s accuracy against the original material.
Moreover, according to an AP report, a researcher from the University of Michigan discovered hallucinations in 8 out of 10 Whisper audio transcriptions inspected, another machine learning engineer found them in about half of the more than 100 hours of transcriptions reviewed, and a third developer detected them in nearly all of the 26,000 transcripts examined.
Another study found that other transcription tools also appear prone to generating inaccurate text, and the problem is not unique to speech recognition: Google’s AI Overviews came under scrutiny after the system suggested using glue to keep cheese from slipping off pizza.
Why Do AI Tools Hallucinate?
It all goes back to the training. Hallucinations are most commonly associated with large language models (LLMs) and usually stem from limitations of the training data: either there is too little of it, or it contains biases.
Models like Whisper are built on what’s known as a transformer architecture, which processes tokens and predicts the next token in a sequence. For Whisper, the input is tokenized audio data, and the output is a prediction of what is most likely, not necessarily what is most accurate. If Whisper isn’t given enough signal to make an accurate decision, it falls back on what it has learned from its training data, and that might not be correct at all.
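To make this concrete, below is a minimal sketch of how the open-source `openai-whisper` Python package is typically used to transcribe an audio file. The "base" model size and the file name "clinic_visit.wav" are illustrative assumptions, not details from the study; the key point is that the returned text is the model’s most probable decoding of the audio, with no built-in guarantee that every sentence was actually spoken.

```python
# Minimal sketch: transcribing audio with the open-source Whisper model.
# Assumes the `openai-whisper` package and ffmpeg are installed, and that
# an audio file named "clinic_visit.wav" exists (a hypothetical example).
import whisper

# Load a pretrained checkpoint; smaller models ("tiny", "base") are faster
# but less accurate than larger ones ("medium", "large").
model = whisper.load_model("base")

# Decode the audio into text. The output is the token sequence the model
# judges most probable, not a verified record of what was said.
result = model.transcribe("clinic_visit.wav")

print(result["text"])
```

Because the output reads as fluent prose regardless of how confident the model actually was, a hallucinated sentence looks no different from a correctly transcribed one, which is part of why the inability to check transcripts against the original audio, as in Nabla’s case above, is such a concern.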
What to Expect
According to AP, although OpenAI has stated that Whisper “approaches human level robustness and accuracy”, a company spokesperson said it is continually working to minimize hallucinations and values the researchers’ findings, noting that OpenAI incorporates such feedback into its model updates. Nabla has likewise acknowledged the hallucination concerns and is reportedly taking steps to address them.
Despite recent advice to use Whisper with caution, continued reliance on these technologies to speed up medical transcription could endanger patients’ health.
This situation reflects a broader sense of uncertainty in the application of AI in healthcare. While AI has shown potential to advance many areas of healthcare, including breast cancer detection and drug design, significant concerns about the lack of regulation, device reliability, inherent biases, and patient data breaches are preventing it from becoming a trusted tool for modern patient care. Further efforts must therefore focus on addressing these issues so that these technologies can reach their full potential and enhance human-driven medicine while safeguarding patients’ needs.