Performance of Large Language Models on Official Periodontology Questions: A 13-Year Analysis of the Turkish Dental Specialization Examination
Abstract
Background: This study systematically evaluated the performance of large language models (LLMs) on official periodontology questions from the Turkish Dental Specialization Examination (DUS).
Methods: A total of 180 text-based questions, comprising 159 standard multiple-choice questions (MCQs) and 21 combination-type MCQs (C-MCQs), were drawn from 13 examination years (2012–2024) and categorized into nine domains. In April 2025, eight LLMs were tested: ChatGPT-4o, ChatGPT-4o mini (OpenAI), Gemini 1.5 Flash, Gemini 1.5 Pro, Gemini 2.0 Flash (Google DeepMind), Copilot (Microsoft), DeepSeek-V3 (DeepSeek), and Qwen 2.5-Max (Alibaba Cloud). Each question was submitted to each model independently via its official interface. Accuracy rates were compared across models, domains, years, and question types using Pearson's chi-square test, with Cramér's V and phi coefficients reported as effect sizes.
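The analysis code itself is not published with the abstract; the sketch below is only an illustration of how such a comparison could be run, building a factor-by-correctness contingency table and reporting Pearson's chi-square statistic with Cramér's V. The DataFrame layout and the column names "domain" and "correct" are assumptions, not the authors' implementation.

```python
# Illustrative sketch only; not the study's actual analysis pipeline.
# Assumes each model response is coded correct (1) / incorrect (0) and stored
# with its domain label in a pandas DataFrame (hypothetical column names).
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(table: np.ndarray) -> float:
    """Cramér's V effect size for an r x c contingency table."""
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))

def accuracy_test(df: pd.DataFrame, factor: str):
    """Pearson chi-square test of accuracy across the levels of `factor`
    (e.g. domain, year, model, question type), with Cramér's V as effect size."""
    table = pd.crosstab(df[factor], df["correct"]).to_numpy()
    chi2, p, dof, _ = chi2_contingency(table, correction=False)
    return chi2, p, dof, cramers_v(table)
```

For a 2 × 2 comparison such as MCQ versus C-MCQ, Cramér's V computed this way reduces to the phi coefficient, which matches the effect sizes named in the methods.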
Results: Accuracy differed significantly by domain (χ²(8, N = 1440) = 38.20, p < .001, Cramér’s V = .163). Gemini 2.5 Pro achieved the highest performance, scoring 100% in six domains and ≥87.5% in the remaining ones. ChatGPT-4o mini and Qwen 2.5-Max underperformed, particularly in the Periodontium and Periodontal Treatment domains. Year-based analysis showed stable performance across 2012–2024 (χ²(12, N = 1440) = 14.51, p = .269), and no significant difference emerged between MCQs and C-MCQs (χ²(1, N = 1440) = 1.42, p = .233).
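As a consistency check (assuming a 9 × 2 domain-by-correctness table, so min(r − 1, c − 1) = 1), the reported effect size follows directly from the chi-square statistic and sample size:

$$
V = \sqrt{\frac{\chi^2}{N\,\min(r-1,\,c-1)}} = \sqrt{\frac{38.20}{1440 \times 1}} \approx .163
$$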
Conclusion: LLM accuracy in periodontology is domain- and model-dependent. Advanced systems such as Gemini 2.5 Pro show potential as supportive tools for education and clinical decision-making, yet persistent weaknesses in reasoning- and calculation-intensive areas underscore the need for expert oversight.
Keywords
Supporting Institution
No specific grant or financial support was received from any institution for this research.
Ethical Statement
Ethical approval was not required for this study as it involved analysis of publicly available data/literature and did not include human or animal subjects.
Acknowledgments
Not applicable
Details
Primary Language
English
Subjects
Clinical Sciences (Other)
Journal Section
Research Article
Publication Date
February 20, 2026
Submission Date
November 3, 2025
Acceptance Date
December 11, 2025
Published in Issue
Year 2026, Volume 17, Issue: January–March 2026