ChatGPT Defeated Doctors at Diagnosing Illness - The New York Times
A small study found ChatGPT outdid human physicians when assessing medical case histories, even when those doctors were using a chatbot.
In an experiment, doctors who were given ChatGPT to diagnose illness did only slightly better than doctors who did not. But the chatbot alone outperformed all the doctors. Credit: Michelle Gustafson for The New York Times
Nov. 17, 2024
Dr. Adam Rodman, an expert in internal medicine at Beth Israel Deaconess Medical Center in Boston, confidently expected that chatbots built to use artificial intelligence would help doctors diagnose illnesses.
He was wrong.
Instead, in a study Dr. Rodman helped design, doctors who were given ChatGPT-4 along with conventional resources did only slightly better than doctors who did not have access to the bot. And, to the researchers’ surprise, ChatGPT alone outperformed the doctors.
“I was shocked,” Dr. Rodman said.
The chatbot, from the company OpenAI, scored an average of 90 percent when diagnosing a medical condition from a case report and explaining its reasoning. Doctors randomly assigned to use the chatbot got an average score of 76 percent. Those randomly assigned not to use it had an average score of 74 percent.
The study showed more than just the chatbot’s superior performance.
It unveiled doctors’ sometimes unwavering belief in a diagnosis they had made, even when a chatbot suggested a potentially better one.
And the study illustrated that while doctors are being exposed to the tools of artificial intelligence for their work, few know how to exploit the abilities of chatbots. As a result, they failed to take advantage of A.I. systems’ ability to solve complex diagnostic problems and offer explanations for their diagnoses.
A.I. systems should be “doctor extenders,” Dr. Rodman said, offering valuable second opinions on diagnoses.
But it looks as if there is a way to go before that potential is realized.
Case History, Case Future
The experiment involved 50 doctors, a mix of residents and attending physicians recruited through a few large American hospital systems, and was published last month in the journal JAMA Network Open.
The test subjects were given six case histories and were graded on their ability to suggest diagnoses and explain why they favored or ruled them out. Their grades also included getting the final diagnosis right.
The graders were medical experts who saw only the participants’ answers, without knowing whether they were from a doctor with ChatGPT, a doctor without it or from ChatGPT by itself.
The case histories used in the study were based on real patients and are part of a set of 105 cases that has been used by researchers since the 1990s. The cases intentionally have never been published so that medical students and others could be tested on them without any foreknowledge. That also meant that ChatGPT could not have been trained on them.
But, to illustrate what the study involved, the investigators published one of the six cases the doctors were tested on, along with answers to the test questions on that case from a doctor who scored high and from one whose score was low.
That test case involved a 76-year-old patient with severe pain in his low back, buttocks and calves when he walked. The pain started a few days after he had been treated with balloon angioplasty to widen a coronary artery. He had been treated with the blood thinner heparin for 48 hours after the procedure.
The man complained that he felt feverish and tired. His cardiologist had done lab studies that indicated a new onset of anemia and a buildup of nitrogen and other kidney waste products in his blood. The man had had bypass surgery for heart disease a decade earlier.
The case vignette continued to include details of the man’s physical exam, and then provided his lab test results.
The correct diagnosis was cholesterol embolism, a condition in which shards of cholesterol break off from plaque in arteries and block blood vessels.
Participants were asked for three possible diagnoses, with supporting evidence for each. They also were asked to provide, for each possible diagnosis, findings that do not support it or that were expected but not present.
The participants also were asked to provide a final diagnosis. Then they were to name up to three additional steps they would take in their diagnostic process.
Like the diagnosis for the published case, the diagnoses for the other five cases in the study were not easy to figure out. But neither were they so rare as to be almost unheard-of. Yet the doctors on average did worse than the chatbot.
What, the researchers asked, was going on?
The answer seems to hinge on questions of how doctors settle on a diagnosis, and how they use a tool like artificial intelligence.
The Physician in the Machine
How, then, do doctors diagnose patients?
The problem, said Dr. Andrew Lea, a historian of medicine at Brigham and Women’s Hospital who was not involved with the study, is that “we really don’t know how doctors think.”
In describing how they came up with a diagnosis, doctors would say, “intuition,” or, “based on my experience,” Dr. Lea said.
That sort of vagueness has challenged researchers for decades as they tried to make computer programs that can think like a doctor.
The quest began almost 70 years ago.
“Ever since there were computers, there were people trying to use them to make diagnoses,” Dr. Lea said.
One of the most ambitious attempts began in the 1970s at the University of Pittsburgh. Computer scientists there recruited Dr. Jack Myers, chairman of the medical school’s department of internal medicine, who was known as a master diagnostician. He had a photographic memory and spent 20 hours a week in the medical library, trying to learn everything that was known in medicine.
Dr. Myers was given medical details of cases and explained his reasoning as he pondered diagnoses. Computer scientists converted his logic chains into code. The resulting program, called INTERNIST-1, included over 500 diseases and about 3,500 symptoms of disease.
To test it, researchers gave it cases from the New England Journal of Medicine. “The computer did really well,” Dr. Rodman said. Its performance “was probably better than a human could do,” he added.
But INTERNIST-1 never took off. It was difficult to use, requiring more than an hour to give it the information needed to make a diagnosis. And, its creators noted, “the present form of the program is not sufficiently reliable for clinical applications.”
Research continued. By the mid-1990s there were about a half dozen computer programs that tried to make medical diagnoses. None came into widespread use.
“It’s not just that it has to be user friendly, but doctors had to trust it,” Dr. Rodman said.
And with the uncertainty about how doctors think, experts began to ask whether they should care. How important is it to try to design computer programs to make diagnoses the same way humans do?
“There were arguments over how much a computer program should mimic human reasoning,” Dr. Lea said. “Why don’t we play to the strength of the computer?”
The computer may not be able to give a clear explanation of its decision pathway, but does that matter if it gets the diagnosis right?
The conversation changed with the advent of large language models like ChatGPT. They make no explicit attempt to replicate a doctor’s thinking; their diagnostic abilities come from their ability to predict language.
“The chat interface is the killer app,” said Dr. Jonathan H. Chen, a physician and computer scientist at Stanford who was an author of the new study.
“We can pop a whole case into the computer,” he said. “Before a couple of years ago, computers did not understand language.”
But many doctors may not be exploiting its potential.
Operator Error
After his initial shock at the results of the new study, Dr. Rodman decided to probe a little deeper into the data and look at the actual logs of messages between the doctors and ChatGPT. The doctors must have seen the chatbot’s diagnoses and reasoning, so why didn’t those using the chatbot do better?
It turns out that the doctors often were not persuaded by the chatbot when it pointed out something that was at odds with their diagnoses. Instead, they tended to be wedded to their own idea of the correct diagnosis.
“They didn’t listen to A.I. when A.I. told them things they didn’t agree with,” Dr. Rodman said.
That makes sense, said Laura Zwaan, who studies clinical reasoning and diagnostic error at Erasmus Medical Center in Rotterdam and was not involved in the study.
“People generally are overconfident when they think they are right,” she said.
But there was another issue: Many of the doctors did not know how to use a chatbot to its fullest extent.
Dr. Chen said he noticed that when he peered into the doctors’ chat logs, “they were treating it like a search engine for directed questions: ‘Is cirrhosis a risk factor for cancer? What are possible diagnoses for eye pain?’”
“It was only a fraction of the doctors who realized they could literally copy-paste in the entire case history into the chatbot and just ask it to give a comprehensive answer to the entire question,” Dr. Chen added.
“Only a fraction of doctors actually saw the surprisingly smart and comprehensive answers the chatbot was capable of producing.”
A version of this article appears in print on Nov. 19, 2024, Section D, Page 4 of the New York edition with the headline: A.I. Chatbots Defeated Doctors at Diagnosing Illness.