AI Chatbots Misdiagnose In Over 80% Of Early Medical Cases


From “technocracy.news”

Don’t trust your favorite consumer-grade LLM chatbot with your health decisions. If you can’t provide complete, correct information in the first place, don’t expect an accurate diagnosis. Despite warnings on every major LLM to consult your doctor or another medical professional, many people put false hope in the chatbot anyway. ⁃ Patrick Wood, Editor.

Consumer AI chatbots falter when used to make medical diagnoses, particularly when faced with incomplete information, according to new research highlighting the risks of relying on them as digital doctors.

The study finds that leading large language models struggle to suggest a range of possible diagnoses when patient data is limited, frequently narrowing too quickly to a single answer.

The results point to a broader limitation in AI: while chatbots can identify likely conditions once a case is fully specified, they are less reliable at the earlier, more uncertain stages of clinical reasoning.

The findings highlight the dangers of relying on the technology alone to pinpoint health problems, particularly in cases where the data users input may be vague or patchy.

“These models are great at naming a final diagnosis once the data is complete, but they struggle at the open-ended start of a case, when there isn’t much information,” said Arya Rao, the study’s lead author and a researcher at the Massachusetts-based Mass General Brigham healthcare system.

The study, published in JAMA Network Open on Monday, tested AI models using 29 clinical vignettes based on a standard medical reference text.

The experiment involved step-by-step disclosure of data, including the history of present illness, physical examination findings and laboratory results. The researchers posed diagnostic queries to the LLMs at each stage of disclosure and measured their failure rates, defined as the proportion of questions not answered fully correctly.
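To make the metric concrete: a failure rate here is simply the share of diagnostic questions a model did not answer fully correctly at a given disclosure stage. The minimal sketch below illustrates that calculation; the stage names and pass/fail grades are hypothetical examples, not the study’s actual data or scoring code.

```python
# Hypothetical grading results for one model across two disclosure stages:
# True = question answered fully correctly, False = not fully correct.
results = {
    "differential_after_history_only": [False, False, True, False, False],
    "final_after_exam_and_labs": [True, True, True, False, True],
}

def failure_rate(grades):
    """Proportion of questions not answered fully correctly."""
    return sum(1 for ok in grades if not ok) / len(grades)

for stage, grades in results.items():
    print(f"{stage}: {failure_rate(grades):.0%} failure rate")

# Output:
# differential_after_history_only: 80% failure rate
# final_after_exam_and_labs: 20% failure rate
```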

The researchers evaluated 21 LLMs, including leading models by OpenAI, Anthropic, Google, xAI and DeepSeek.

The study found that failure rates exceeded 80 per cent for all models when performing so-called differential diagnosis, the open-ended stage at which full patient information is still lacking.

The failure rates fell to less than 40 per cent for final diagnoses with more complete data, with the best performers exceeding 90 per cent accuracy.

Claude is trained to direct people who ask medical questions to professionals, Anthropic said. Gemini is designed to do the same and has reminders built into its app to prompt users to double-check information, Google said.

OpenAI’s usage policy says its services should not be used to provide medical advice requiring a licence without appropriate professional involvement.
xAI did not respond to a request for comment. DeepSeek could not be reached for comment.

Companies have been developing more specialised medical LLMs such as Google’s Articulate Medical Intelligence Explorer (AMIE) and MedFound.

Early results from evaluations of models such as AMIE were promising, said Sanjay Kinra, a clinical epidemiologist at the London School of Hygiene & Tropical Medicine. But they were unlikely to match doctors’ clinical assessments, which “rely heavily on the look and feel of the patient”, he added.

“Nevertheless, they may have a role to play, particularly in situations or geographies in which access to doctors is limited,” Kinra said. “So we urgently need research studies with actual patients from those settings.”

Read full story here…
