Evaluating ChatGPT performance in Arabic dialects: A comparative study showing defects in responding to Jordanian and Tunisian general health prompts

Malik Sallam
Dhia Mousa

Abstract

Background: Artificial intelligence (AI) is increasingly recognized as a means to enhance digital health literacy. This is of particular importance given the widespread availability and popularity of AI chatbots such as ChatGPT and their possible impact on health literacy. It entails the need to understand AI models’ performance across different languages, dialects, and cultural contexts. This study aimed to evaluate ChatGPT performance in response to prompts in two different Arabic dialects, namely Tunisian and Jordanian.


Methods: This descriptive study followed the METRICS checklist for the design and reporting of AI-based studies in healthcare. Ten general health queries were translated into the Tunisian and Jordanian dialects of Arabic by bilingual native speakers. The performance of two AI models, ChatGPT-3.5 and ChatGPT-4, in response to Tunisian, Jordanian, and English prompts was evaluated using the CLEAR tool, which is tailored for the assessment of health information generated by AI models.


Results: ChatGPT-3.5 performance was categorized as average in Tunisian Arabic, with an overall CLEAR score of 2.83, compared to an above-average score of 3.40 in Jordanian Arabic. ChatGPT-4 showed a similar pattern with marginally better outcomes: a CLEAR score of 3.20 in Tunisian Arabic, rated as average, and a CLEAR score of 3.53 in Jordanian Arabic, rated as above average. For both models, the CLEAR components were consistently higher in the Jordanian dialect, although the differences were not statistically significant. Using English content as a reference, the responses to both Tunisian and Jordanian dialects were significantly inferior (P<.001).


Conclusion: The findings highlight a critical dialectal performance gap in ChatGPT, underlining the need to enhance linguistic and cultural diversity in the development of AI models, particularly for health-related content. Collaborative efforts among AI developers, linguists, and healthcare professionals are needed to improve the performance of AI models across different languages, dialects, and cultural contexts. Future studies should broaden the scope to a wider range of languages and dialects, which would help achieve equitable access to health information across communities.

How to Cite
Sallam, M., & Mousa, D. (2024). Evaluating ChatGPT performance in Arabic dialects: A comparative study showing defects in responding to Jordanian and Tunisian general health prompts. Mesopotamian Journal of Artificial Intelligence in Healthcare, 2024, 1–7. https://doi.org/10.58496/MJAIH/2024/001
Section
Articles