MANHASSET, N.Y.--(BUSINESS WIRE)--While the popular artificial intelligence (AI) chatbot ChatGPT is seen as a potential educational tool, it won’t be getting its medical specialty certification anytime soon. To test its abilities and accuracy, investigators at The Feinstein Institutes for Medical Research asked the consumer-facing ChatGPT (Chat Generative Pre-trained Transformer, OpenAI) to take the 2021 and 2022 multiple-choice self-assessment tests of the American College of Gastroenterology (ACG). ChatGPT failed to make the grade, scoring 65.1 percent and 62.4 percent, short of the 70 percent required to pass the exams. Full details of the study were published today in the American Journal of Gastroenterology.
ChatGPT is a 175-billion-parameter natural language processing model that generates human-like text in response to user prompts. The tool is a large language model (LLM) trained to predict word sequences based on the preceding context. ChatGPT has been tested before, even passing the United States Medical Licensing Exam. In this study, the Feinstein Institutes’ researchers wanted to see whether ChatGPT (versions 3 and 4) could pass the ACG self-assessment, which is designed to gauge how a test-taker would fare on the actual American Board of Internal Medicine (ABIM) Gastroenterology board examination.
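To make the idea of next-word prediction concrete, the minimal sketch below uses a small open model (GPT-2, via the Hugging Face transformers library) purely as an illustrative stand-in; ChatGPT itself is far larger, is available only through OpenAI’s interface, and this snippet is not part of the study’s methodology.

```python
# Illustrative sketch of what an LLM does: continue a prompt by predicting
# the most likely next words. GPT-2 is used here only because it is openly
# downloadable; it is not the model evaluated in the study.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = "A common cause of upper gastrointestinal bleeding is"
result = generator(prompt, max_new_tokens=12, do_sample=False)
print(result[0]["generated_text"])
```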
“Recently, there has been a lot of attention on ChatGPT and the use of AI across various industries. When it comes to medical education, there is a lack of research around this potentially ground-breaking tool,” said Arvind Trindade, MD, associate professor at the Feinstein Institutes’ Institute of Health System Science and senior author on the paper. “Based on our research, ChatGPT should not be used for medical education in gastroenterology at this time and has a ways to go before it should be implemented into the health care field.”
Each ACG test consists of 300 multiple-choice questions with real-time feedback. Each question and its answer choices were copied and pasted directly into ChatGPT versions 3 and 4. Of the 600 questions across the two exams, 145 were excluded because they required image interpretation, leaving 455 that ChatGPT answered. ChatGPT-3 answered 296 of the 455 questions correctly (65.1 percent) across the two exams, and ChatGPT-4 answered 284 of the 455 correctly (62.4 percent).
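As a quick arithmetic check, the published percentages follow directly from the raw counts reported above; the minimal Python sketch below uses only those counts and the 70 percent pass threshold cited earlier.

```python
# Recompute the reported scores from the raw counts in the press release.
answered = 455                        # questions attempted across both exams (600 minus 145 image-based)
correct = {"ChatGPT-3": 296, "ChatGPT-4": 284}
pass_threshold = 0.70                 # minimum score required to pass the ACG self-assessment

for model, n_correct in correct.items():
    score = n_correct / answered
    verdict = "pass" if score >= pass_threshold else "fail"
    print(f"{model}: {score:.1%} ({verdict})")
# ChatGPT-3: 65.1% (fail)
# ChatGPT-4: 62.4% (fail)
```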
“ChatGPT has sparked enthusiasm, but with that enthusiasm comes skepticism around the accuracy and validity of AI’s current role in health care and education,” said Andrew C. Yacht, MD, senior vice president of academic affairs and chief academic officer at Northwell Health. “Dr. Trindade’s fascinating study is a reminder that, at least for now, nothing beats hitting time-tested resources like books, journals and traditional studying to pass those all-important medical exams.”
ChatGPT does not have any intrinsic understanding of a topic or issue. Potential explanations for its failing grade include a lack of access to paid-subscription medical journals and its reliance on questionable, outdated or non-medical sources; more research is needed before it can be used reliably.
About the Feinstein Institutes
The Feinstein Institutes for Medical Research is the home of the research institutes of Northwell Health, the largest health care provider and private employer in New York State. Encompassing 50 research labs, 3,000 clinical research studies and 5,000 researchers and staff, the Feinstein Institutes raises the standard of medical innovation through its five institutes of behavioral science, bioelectronic medicine, cancer, health system science, and molecular medicine. We make breakthroughs in genetics, oncology, brain research, mental health and autoimmunity, and we are the global scientific leader in bioelectronic medicine – a new field of science that has the potential to revolutionize medicine. For more information about how we produce knowledge to cure disease, visit http://feinstein.northwell.edu and follow us on LinkedIn.