AI Scribes: Beyond the Hype & Why You Should Be Sceptical

AI scribes, powered by a combination of large language models (LLMs) and speech-to-text dictation, are promising to slash administrative time by automating medical documentation. For clinicians, this means reclaiming hours that could be better spent on direct patient care. It sounds revolutionary, but is it reality?
Beneath the optimism lies a complex reality: concerns over hallucinations, biases, regulatory challenges, and long-term sustainability. As a frontline clinician in the NHS, I am exploring the real impact of AI scribes.
Introduction to the Writer
I am Dr. Chloe Jacklin, a frontline clinician within the NHS, currently working in Respiratory and Acute Medicine at a district general hospital in North Central London. Before this, I held a role as an academic doctor at Oxford University Hospitals, a trust renowned for its early adoption of digital healthcare solutions. Over the past five years in the NHS, I have developed a strong foundation in the integration of digital innovations into healthcare delivery.
So, What is an AI Scribe?
AI scribes go by several alternative names: virtual assistants, ambient voice technology, digital scribes, and AI documentation assistants. There are over 110 products on the market, with rapid expansion over the last 12 months, although the first AI scribe, Dragon, has been around for over 5 years.

Market Map of AI Ambient Scribes, Source: Terence Tan.
AI scribes work by augmenting speech-to-text dictation with an LLM. Essentially, the system listens to the patient consultation, transcribes the speech, and the LLM then summarises the transcript as clinical notes.
One example of speech-to-text software is Whisper from OpenAI. Whisper is trained on 680,000 hours of multilingual and multitask supervised data; however, the source of this data is not disclosed. It has been suggested that it is likely captioned YouTube videos, as Whisper tends to produce output such as ‘like and subscribe’ or ‘drop a comment below’ when faced with gaps in audio. Because it was not trained on clinical data, using this tool in a clinical setting carries risk, as hallucinations could be more frequent.
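For readers who want to see the mechanics, here is a minimal sketch of that two-step pipeline, using the open-source whisper package and a generic LLM API. The audio file name, model choices, and prompt are illustrative assumptions, not any vendor's actual implementation.

```python
# Minimal illustrative sketch of an AI scribe pipeline (not a vendor implementation):
# 1) transcribe the consultation audio, 2) summarise the transcript into a draft note.
# Assumes the open-source `openai-whisper` package and the OpenAI Python SDK;
# the audio file, model names, and prompt are hypothetical.
import whisper
from openai import OpenAI

# Step 1: speech-to-text with Whisper
stt_model = whisper.load_model("base")
transcript = stt_model.transcribe("consultation.wav")["text"]

# Step 2: an LLM turns the raw transcript into a structured draft note
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system",
         "content": "Summarise this clinician-patient consultation as a structured clinical note."},
        {"role": "user", "content": transcript},
    ],
)
draft_note = response.choices[0].message.content
print(draft_note)  # the clinician must still verify and edit this draft
```

Everything hinges on the quality of the transcript in step one: any misheard words or hallucinated filler flow straight into the draft note that the clinician must then check.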
LLMs do not provide consistent responses, which makes them difficult to regulate. Currently, no LLMs are approved as medical devices, and many consider them inadequate for autonomous decision-making in clinical settings. No AI scribe has been approved by the U.S. Food and Drug Administration (FDA) or certified under the UK Conformity Assessed (UKCA) scheme.
However, because AI scribes have a clinician in the loop to verify the output, and because they do not directly impact patient care, they are deemed relatively low risk. It has therefore been determined that FDA approval or UKCA marking is not required for scribes, and many hospitals and GP practices have started to adopt them despite this lack of formal approval.
Why the Hype?
As a frontline clinician, I wholly appreciate the critical role of high-quality documentation in patient safety and medico-legal protection. My own experience also bears out studies showing that doctors spend several hours a day on documentation.
Furthermore, the prioritization of time-sensitive clinical demands means documentation often happens after a shift was meant to end. A tool that automates this process and saves precious clinical time would therefore be highly valuable.
Not only could AI scribes save time, but they could also:
- Alleviate the cognitive burden of translating shorthand notes into full prose.
- Allow clinicians to remain fully engaged with the patient rather than note-taking.
- Help rapidly produce patient-friendly summaries alongside the more typical jargon-heavy document.
- Translate documentation into the patient’s first language.
If AI scribes eliminate much of this workload, what happens next? Will this newfound time be given back to clinicians to spend as they wish, whether on their wellbeing, such as going home on time, or on other clinical priorities like training? Or will it be absorbed into an expectation to see more patients within a given time frame? If the latter, the heightened cognitive burden from additional patient assessments could outweigh any benefits gained.
How are AI Scribes Being Used Currently?
Several AI scribe solutions have already been deployed in hospital and healthcare settings for real-world patient care:
- Stanford Hospital is using DAX Copilot, developed by Nuance Communications, a Microsoft company.
- Epic, a widely used electronic patient record provider, has integrated Abridge into its software package, with Johns Hopkins Hospital and the Mayo Clinic among the early adopters.
- Great Ormond Street Hospital NHS Foundation Trust has piloted TORTUS.
These AI tools appear to be enhancing their competitive edge by building brand recognition through association with established institutions.
AI scribe products also differ in their capabilities. For instance, Heidi Health offers a feature to generate patient-friendly summaries; Ambience Healthcare integrates real-time clinical coding, ensuring that medical records are accurately categorized for insurance reimbursement; and Nabla can operate in over 30 languages. They also vary in template sophistication and in their ability to incorporate new data, for example when a clinician pauses a recording and later adds further information. A commonality across all AI scribe tools, however, is the inclusion of hold-harmless clauses in subscriber contracts and the assertion that no recordings are stored, to ensure confidentiality.
Beneath the Optimism
Despite their potential, AI scribes present significant drawbacks that must be addressed before widespread adoption. As a clinician, I need assurance that this tool genuinely enhances my current practice, and I need to fully understand its limitations and my legal responsibility if anything goes awry. If a flawed system is imposed on clinicians—making them liable for errors that originate from engineering issues—clinicians will be reluctant to adopt it, potentially missing out on the significant benefits AI scribes could offer.
Flaws in the Making – Hallucinations, Omissions, and Biases
“Whisper is an OpenAI pre-trained speech recognition model with potential applications for ASR solutions for developers. However, due to weak supervision and large-scale noisy data, it should be used with caution in high-risk domains.”
Statement by the OpenAI team.
OpenAI has explicitly warned against deploying Whisper in high-risk domains such as healthcare, citing concerns over accuracy, reliability, and the potential for generating misleading information. Despite this warning, Whisper-based tools are already in use in practices and hospitals around the world.
Generative AI, like Whisper, has the capacity to hallucinate, often in a highly convincing manner. The more frequently these errors occur, the more time clinicians must spend verifying and editing the scribed content, increasing the risk of overlooked mistakes and potential litigation. Hallucination rates are therefore a critical metric for clinicians assessing AI scribes’ reliability, yet they are not shared openly because they are deemed commercially sensitive.
AI scribes have also been known to filter out relevant information and miscategorize data, such as conflating historical and current symptoms. They can be particularly error-prone with numerical values: in my experience, a heart rate of 60 beats per minute has been recorded as 16 beats per minute, a dangerous discrepancy.
Another critical concern is bias in AI-generated documentation. The Careless Whisper study exposed how AI can perpetuate harmful biases, even producing hateful or violent content that may reflect the data used to train Whisper. The research found that hallucinations were more likely for certain patient groups, such as those with speech impediments, strong accents, or frequent use of slang.
Transparency is essential; just as we have established reporting standards for other AI applications and regulated fields like pharmaceuticals, AI in clinical healthcare must also be held to similarly rigorous standards.
Are We Sure We Want Clinicians to Think Less?
While automation promises efficiency, it also raises concerns about the erosion of critical thinking in medical decision-making. If AI-generated summaries replace a clinician’s process of synthesizing information, could this ultimately weaken diagnostic acumen?
AI scribes can generate – or even hallucinate – automated diagnoses, which are frequently incorrect. Even when a clinician identifies these errors, the mere presence of an AI-suggested diagnosis could subtly influence clinical reasoning, increasing the risk of cognitive anchoring or confirmation bias.
Similarly, automation bias poses a significant risk. Clinicians may place undue trust in AI-generated documentation, particularly when presented in polished, full-prose text, making errors less obvious and potentially leading to undiscerning acceptance of inaccurate information.
How Will Patients Respond to AI Scribes?
AI scribes could enhance patient satisfaction by freeing clinicians to maintain eye contact rather than typing. On the other hand, some patients may feel uneasy knowing their conversations are being recorded, making them less willing to share sensitive information. Patients may also worry about how their data is handled.
Consent for use of AI scribes is clearly essential and should be requested at each consultation. However, the best approach to effective consent remains an open question when clinicians themselves are not fully informed of the risks and pitfalls.
We also need to see more real-world data showing how AI scribes perform in patients with different accents and first languages, and in different clinical settings such as the noisy emergency department.
So, Can We Overcome these Challenges?
To overcome these challenges, a multi-faceted approach is needed.
Firstly, transparency is critical for gaining trust from patients and clinicians. We need to see published data on error rates, preferably from real-world settings, to understand the contexts in which AI scribes are prone to make mistakes. Clinicians and commissioners must scrutinize this data, given that hold-harmless clauses in contracts push liability away from the supplier.
Secondly, if widespread adoption is truly taking place, training clinicians to use AI scribes should be mandatory, with training content informed by real-world data. In my opinion, a hybrid workflow, where clinicians continue to make some shorthand notes, should be encouraged. These notes would focus on critical symptoms, such as ‘red flag’ indicators, and on numerical data. This not only serves as a safeguard against transcription errors but also provides a backup should the AI malfunction or lose the documentation.
Thirdly, a ‘clinician in the loop’ model for training the AI could improve accuracy and reliability, as well as bolstering clinicians’ and patients’ confidence. One example is the proprietary CREOLA platform, which is used to train the TORTUS AI scribe.
Furthermore, an established route for clinicians to provide real-world feedback on errors to vendors would support continuous refinement.
Finally, independent auditing and benchmarking of AI scribe performance would maintain accountability and trust in the technology.
To Conclude
AI scribes hold great promise; however, if concerns surrounding accuracy and reliability remain unresolved, I worry they could become mired in controversy, ultimately hindering their successful adoption. To address these concerns, transparent real-world performance data is vital, especially in healthcare, where the stakes are exceptionally high.