GastroGPT Outperforms General Models in GI Clinical Tasks

COPENHAGEN — GastroGPT, a novel specialty-specific, clinically oriented artificial intelligence model, demonstrated superior overall utility and performance on key clinical tasks in gastroenterology compared with leading general-purpose large language models (LLMs), findings from a proof-of-concept study show.

In the first systematic head-to-head comparison in gastroenterology, the researchers found that GastroGPT significantly outperformed the general-purpose AI models (OpenAI's GPT-4, Google's Bard, and Anthropic's Claude) overall, scoring higher on all 10 simulated test cases.

“This is an exciting first step showing that an AI tool built for gastroenterology from the ground up can achieve higher performance and utility than a one-size-fits-all model,” said lead investigator Cem Simsek, MD, from Hacettepe University, Division of Gastroenterology, Hepatology and Endoscopy, Ankara, Turkey.

General-purpose language models like GPT-4, which can chat about any topic, are widely known, but to date their capabilities narrow significantly when applied to highly technical fields like medicine, said Simsek, who presented the results at United European Gastroenterology (UEG) Week 2023.

“A great conversationalist doesn’t necessarily make a great clinician,” he explained. “Medical AI is different. You need models that deeply understand patient care, have the latest specialty knowledge, and integrate into clinical workflows.

“GastroGPT pioneers an approach where an LLM successfully performs routine clinical and administrative tasks” specific to gastroenterology, he said.

It may save time by automating clinical tasks that usually require experts, “and bring expertise to settings where specialist doctors are hard to access,” he told Medscape Medical News.

An important advantage of AI systems like GastroGPT is the potential to provide quality GI care to underserved patient populations, especially in low- and middle-income countries, he added. “There is an enormous shortage of gastroenterology specialists in many parts of the world. Tools like GastroGPT could help democratize access to expert-level GI care globally.”

Proof-of-Concept AI for Gastroenterology

Simsek and his team hypothesized that GastroGPT — a first-of-its-kind, proof-of-concept clinical LLM — would perform better than general-purpose LLMs on realistic clinical tasks, including patient assessments, diagnostic recommendations, patient counseling, and treatment plans.

In designing their study, Simsek and his team drew on expert reviewers from across the EU representing various subspecialties, including hepatology, pancreatology, inflammatory bowel disease, endoscopy, gastrointestinal oncology, upper GI, and lower GI. The panel assessed the responses of GastroGPT in comparison with those of the three general-purpose LLMs.

For the evaluation, the experts helped to generate 10 simulated patient cases that were closely representative of reality. These cases varied in complexity (simple, medium, complex), frequency (common, medium, rare), subspecialty (endoscopy, hepatology, pancreas, oncology, nutrition, surgery), and setting (outpatient, inpatient, emergency, counseling, consultation).

The primary outcome was overall performance across tasks, while the secondary outcomes comprised performance on individual tasks and consistency of scores across evaluation criteria relating to case frequency, complexity, and coherence.

GastroGPT and the other AI models were challenged with seven clinical tasks for each case: assessment, additional history gathering, diagnostic test recommendation, management, multidisciplinary care and referral, follow-up plan, and patient counseling/education. “For each task, the panel was asked to evaluate the outputs with respect to accuracy, relevance, alignment, usability, and practicality,” said Simsek.

A total of 480 evaluations were completed, and GastroGPT outperformed all three general-purpose models in most domains, he reported.

Mean overall scores across tasks and cases were 8.30 (± 1.28 SD) for GastroGPT compared with 5.58 (± 2.02 SD), 6.23 (± 2.16 SD), and 7.78 (± 1.42 SD) for GPT-4, Bard, and Claude, respectively; all differences were statistically significant (P < .001).

“Only in follow-up planning did the Anthropic Claude model outperform GastroGPT, at 7.82 (± 1.88) vs 7.45 (± 2.04),” reported Simsek.

“We were very impressed by GastroGPT’s clinical aptitude,” he added. “It displayed a nuanced understanding of gastroenterology and pragmatism that the general models clearly lacked.” 

Looking ahead, Simsek said GastroGPT might have value in other applications including screening patient cases and flagging high-risk situations, providing second opinions on complex cases, catching potential errors or inconsistencies, automating components of care plans and referrals, being available 24/7 for patient questions and triage, and supporting research and education.

GastroGPT vs Real People?

Co-moderator Monika Ferlitsch, MD, a gastroenterologist from the Department of Internal Medicine III, Medical University of Vienna, Vienna, Austria, asked Simsek, “Where do you see the application for this? Is it for a young gastroenterologist before they treat the patient, or perhaps for remote clinics that might not have access to a gastroenterologist? Also, will it ever outperform experts?”

“This is version one of our results,” replied Simsek. “My first motivation for this project was to provide gastroenterology expertise to people around the world who cannot reach quality care. With these models, they will be able to access expert-level care instantly and at minimal cost.”

Also commenting from the audience was Laurence Lovat, MD, a consultant gastroenterologist at UCL Hospitals NHS Foundation Trust, London, UK, and chair of the AI Task Force for the British Society of Gastroenterology. “We’re very excited about what you’re doing. We see your model works very well next to the more general large language models, but they don’t work very well at all. Do you have any feel for how well it works next to real people yet?”

“This was just a proof-of-concept model and we definitely want to compare it to human output,” replied Simsek. “In my experience, it is definitely not inferior, and if the comparison is a non-expert physician for an expertise-requiring question, then I believe it is superior.”

“In the future, I do not think that we should ever let AI do a task with a patient on its own,” Simsek added. “It should always be under the supervision of a healthcare provider or expert physician, until we obtain enough data to show it is valid.”

Simsek and Ferlitsch report no relevant financial relationships. Lovat is chair of the AI Task Force for the British Society of Gastroenterology; receives research grants from Medtronic and Pentax Medical; and receives scientific advisory board fees from Odin Vision.

United European Gastroenterology (UEG) Week 2023: Abstract LB16. Presented October 17, 2023.
