Chinese Generative AI Models Challenge Western AI in Clinical Chemistry MCQs: A Benchmarking Follow-up Study on AI Use in Health Education

Malik Sallam
Kholoud Al-Mahzoum
Huda Eid
Khaled Al-Salahat
Mohammed Sallam
Guma Ali
Maad M. Mijwil

Abstract

Background: The emergence of Chinese generative AI (genAI) models, such as DeepSeek and Qwen, has introduced strong competition to Western genAI models. These advances hold significant potential for healthcare education; however, benchmarking genAI performance in specialized medical disciplines is crucial to delineate their strengths and limitations. This study builds on prior research that evaluated ChatGPT (GPT-3.5 and GPT-4), Bing, and Bard against human postgraduate students in Medical Laboratory Sciences, now incorporating DeepSeek and Qwen to assess their performance on Clinical Chemistry Multiple-Choice Questions (MCQs).


Methods: This study followed the METRICS framework for genAI-based healthcare evaluations, assessing six models using 60 Clinical Chemistry MCQs previously administered to 20 MSc students. The facility index and Bloom’s taxonomy classification were used to benchmark performance. GenAI models included DeepSeek-V3, Qwen 2.5-Max, ChatGPT-4, ChatGPT-3.5, Microsoft Bing, and Google Bard, evaluated in a controlled, non-interactive environment using standardized prompts.


Results: The evaluated genAI models showed varying accuracy across Bloom’s taxonomy levels. DeepSeek-V3 (0.92) and ChatGPT-4 (1.00) outperformed humans (0.74) in the Remember category, while Qwen 2.5-Max (0.94) and ChatGPT-4 (0.94) surpassed human performance (0.61) in the Understand category. ChatGPT-4 (+23.25%, p < 0.001), DeepSeek-V3 (+18.25%, p = 0.001), and Qwen 2.5-Max (+18.25%, p = 0.001) significantly outperformed human students. Decision tree analysis identified cognitive category as the strongest predictor of genAI accuracy (p < 0.001), with Chinese AI models performing comparably to ChatGPT-4 in lower-order tasks but exhibiting lower accuracy in higher-order domains.


Conclusions: The findings highlight the growing capabilities of Chinese genAI models in healthcare education, demonstrating that DeepSeek and Qwen can compete with, and in some areas outperform, Western genAI models. However, their relative weakness in higher-order reasoning raises concerns about their capacity to substitute for human cognitive processes in clinical decision-making. As genAI becomes increasingly integrated into health education, concerns regarding academic integrity, overdependence on genAI, and the validity of MCQ-based assessments must be addressed. The study underscores the need to re-evaluate medical assessment strategies to ensure that students develop critical thinking skills rather than relying on genAI for knowledge retrieval.

How to Cite

Sallam, M., Al-Mahzoum, K., Eid, H., Al-Salahat, K., Sallam, M., Ali, G., & Mijwil, M. M. (2025). Chinese Generative AI Models Challenge Western AI in Clinical Chemistry MCQs: A Benchmarking Follow-up Study on AI Use in Health Education. Babylonian Journal of Artificial Intelligence, 2025, 1-14. https://doi.org/10.58496/BJAI/2025/001