Chinese Generative AI Models Challenge Western AI in Clinical Chemistry MCQs: A Benchmarking Follow-up Study on AI Use in Health Education

Malik Sallam
Kholoud Al-Mahzoum
Huda Eid
Khaled Al-Salahat
Mohammed Sallam
Guma Ali
Maad M. Mijwil

Abstract

Background: The emergence of Chinese generative AI (genAI) models, such as DeepSeek and Qwen, has introduced strong competition to Western genAI models. These advances hold significant potential for healthcare education; however, benchmarking genAI performance in specialized medical disciplines is crucial to delineate their strengths and limitations. This study builds on prior research that evaluated ChatGPT (GPT-3.5 and GPT-4), Bing, and Bard against human postgraduate students in Medical Laboratory Sciences, now incorporating DeepSeek and Qwen to assess their performance on Clinical Chemistry Multiple-Choice Questions (MCQs).


Methods: This study followed the METRICS framework for genAI-based healthcare evaluations, assessing six models using 60 Clinical Chemistry MCQs previously administered to 20 MSc students. The facility index and Bloom’s taxonomy classification were used to benchmark performance. GenAI models included DeepSeek-V3, Qwen 2.5-Max, ChatGPT-4, ChatGPT-3.5, Microsoft Bing, and Google Bard, evaluated in a controlled, non-interactive environment using standardized prompts.


Results: The evaluated genAI models showed varying accuracy across Bloom’s taxonomy levels. DeepSeek-V3 (0.92) and ChatGPT-4 (1.00) outperformed humans (0.74) in the Remember category, while Qwen 2.5-Max (0.94) and ChatGPT-4 (0.94) surpassed human performance (0.61) in the Understand category. ChatGPT-4 (+23.25%, p < 0.001), DeepSeek-V3 (+18.25%, p = 0.001), and Qwen 2.5-Max (+18.25%, p = 0.001) significantly outperformed human students. Decision tree analysis identified cognitive category as the strongest predictor of genAI accuracy (p < 0.001), with Chinese AI models performing comparably to ChatGPT-4 in lower-order tasks but exhibiting lower accuracy in higher-order domains.


Conclusions: The findings highlight the growing capabilities of Chinese genAI models in healthcare education, demonstrating that DeepSeek and Qwen can compete with, and in some areas outperform, Western genAI models. However, their relative weakness in higher-order reasoning raises concerns about their capacity to substitute for human cognitive processes in clinical decision-making. As genAI becomes increasingly integrated into health education, concerns regarding academic integrity, overdependence on genAI, and the validity of MCQ-based assessments must be addressed. The study underscores the need to re-evaluate medical assessment strategies to ensure that students develop critical thinking skills rather than relying on genAI for knowledge retrieval.

How to Cite

Sallam, M., Al-Mahzoum, K., Eid, H., Al-Salahat, K., Sallam, M., Ali, G., & Mijwil, M. M. (2025). Chinese Generative AI Models Challenge Western AI in Clinical Chemistry MCQs: A Benchmarking Follow-up Study on AI Use in Health Education. Babylonian Journal of Artificial Intelligence, 2025, 1-14. https://doi.org/10.58496/BJAI/2025/001