Fırat University Medical Journal of Health Sciences
2026, Volume 40, Issue 1, Page(s) 098-104
Comparative Evaluation of Three Artificial Intelligence Chatbots in Providing Information on Pediatric Celiac Disease
Ecem İpek ALTINOK1, Özlem SÜMER COŞAR2, Volkan ALTINOK3
1Ordu University, Faculty of Medicine, Department of Child Health and Diseases, Ordu, TÜRKİYE
2Gazi University, Faculty of Medicine, Department of Pediatric Gastroenterology, Ankara, TÜRKİYE
3Ordu University, Faculty of Medicine, Department of Pediatric Surgery, Ordu, TÜRKİYE
Keywords: Celiac disease, pediatrics, artificial intelligence, chatbot

Objective: This study aimed to evaluate and compare how accurately and reliably three widely used chatbots (ChatGPT, Gemini, and Copilot) answer frequently asked questions (FAQs) about pediatric celiac disease (CD).

Materials and Methods: A 40-item FAQ set was developed from international guidelines and recent review articles, covering definitions, diagnosis, clinical features, laboratory tests, complications, treatment, and follow-up. In August 2025, each question was posed independently in Turkish to ChatGPT, Gemini, and Copilot, using a new session each time to minimize contextual bias. Responses were evaluated in a blinded manner by a pediatric gastroenterologist, a pediatrician, and a pediatric surgeon with experience in celiac disease. Answers were classified as: (1) comprehensive/accurate, (2) incomplete/partially accurate, (3) mixed/misleading, or (4) incorrect/irrelevant. Inter-model agreement was assessed using Cohen's kappa, and comparative statistical analyses were performed to evaluate differences in response accuracy.
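
As an illustration of the agreement measure named above, the following is a minimal sketch of Cohen's kappa for two sets of categorical ratings on the 1-4 scale. The function name `cohen_kappa` and the ten example ratings are hypothetical and are not taken from the study; the abstract does not report item-level ratings or the software used.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two equally long lists of categorical ratings."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed agreement: share of items given the same category by both.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of the marginal proportions, summed per category.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings (1 = comprehensive/accurate ... 4 = incorrect/irrelevant).
model_x = [1, 1, 1, 2, 1, 3, 1, 2, 1, 1]
model_y = [1, 2, 1, 2, 1, 1, 1, 3, 1, 1]
kappa_example = cohen_kappa(model_x, model_y)
print(round(kappa_example, 3))  # 0.348
```

In this made-up example the two lists agree on 7 of 10 items (0.70) against a chance agreement of 0.54, which gives a kappa of about 0.35.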

Results: ChatGPT provided the highest proportion of comprehensive/accurate responses (35/40; 87.5%), followed by Gemini and Copilot (28/40; 70% each). ChatGPT demonstrated significantly higher accuracy compared with the other chatbots (χ² test, p<0.05). Copilot generated the highest rate of misleading responses (6/40; 15%). In subgroup analyses, ChatGPT performed best in treatment and follow-up questions (16/17; 94.1%), while Gemini showed relatively better performance in basic knowledge and clinical features (5/8; 62.5%) without producing misleading answers. Inter-model agreement was limited (ChatGPT–Copilot κ=0.32; Gemini–Copilot κ=0.35; ChatGPT–Gemini κ=0.11).
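
The percentages quoted above follow directly from the reported counts. The short sketch below recomputes them; the dictionary name `reported_counts` and its labels are illustrative, while the counts and denominators are those given in the Results.

```python
# Counts and denominators as reported in the Results; labels are illustrative.
reported_counts = {
    "ChatGPT, comprehensive/accurate": (35, 40),
    "Gemini, comprehensive/accurate": (28, 40),
    "Copilot, comprehensive/accurate": (28, 40),
    "Copilot, misleading": (6, 40),
    "ChatGPT, treatment and follow-up": (16, 17),
    "Gemini, basic knowledge and clinical features": (5, 8),
}


def percentage(count, total):
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100.0 * count / total, 1)


percentages = {label: percentage(c, t) for label, (c, t) in reported_counts.items()}
for label, value in percentages.items():
    print(f"{label}: {value}%")
```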

Conclusion: ChatGPT demonstrated the most guideline-concordant performance, whereas Copilot carried a higher risk of misleading outputs. These findings highlight both the potential and limitations of AI chatbots as first-contact tools for patient and family education, emphasizing the need for expert oversight, awareness of possible hallucinations, and guideline-based frameworks.

