Evaluating ChatGPT performance in Arabic dialects: A comparative study showing defects in responding to Jordanian and Tunisian general health prompts

Malik Sallam; Dhia Mousa

doi:10.58496/MJAIH/2024/001


Published: 2024-01-10

DOI: https://doi.org/10.58496/MJAIH/2024/001

Keywords:

AI chatbots, Health literacy, ChatGPT, Health information, Digital health

Malik Sallam

Department of Pathology, Microbiology and Forensic Medicine, School of Medicine, The University of Jordan, Amman, Jordan.

https://orcid.org/0000-0002-0165-9670

Dhia Mousa

Scientific Approaches to Fight Epidemics of Infectious Diseases (SAFE-ID) Research Group, The University of Jordan, Amman, Jordan.

https://orcid.org/0009-0005-0440-2431

Abstract

Background: The role of artificial intelligence (AI) in enhancing digital health literacy is increasingly recognized. This is of particular importance given the widespread availability and popularity of AI chatbots such as ChatGPT and their possible impact on health literacy. It calls for an understanding of AI models’ performance across different languages, dialects, and cultural contexts. This study aimed to evaluate ChatGPT performance in response to prompting in two different Arabic dialects, namely Tunisian and Jordanian.

Methods: This descriptive study followed the METRICS checklist for the design and reporting of AI-based studies in healthcare. Ten general health queries were translated into the Tunisian and Jordanian dialects of Arabic by bilingual native speakers. The performance of two AI models, ChatGPT-3.5 and ChatGPT-4, in response to Tunisian, Jordanian, and English prompts was evaluated using the CLEAR tool, which is tailored for the assessment of health information generated by AI models.

Results: ChatGPT-3.5 performance was categorized as average in Tunisian Arabic, with an overall CLEAR score of 2.83, compared to an above-average score of 3.40 in Jordanian Arabic. ChatGPT-4 showed a similar pattern with marginally better outcomes: a CLEAR score of 3.20 in Tunisian, rated as average, and a CLEAR score of 3.53 in Jordanian, rated as above average. The CLEAR components consistently showed superior performance in the Jordanian dialect for both models, although the differences were not statistically significant. Using English content as a reference, the responses to both Tunisian and Jordanian dialects were significantly inferior (P<.001).
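
The mapping from an overall CLEAR score to a performance label can be illustrated with a minimal Python sketch. The band boundaries are an assumption: the abstract does not state them, so the sketch uses five equal-width bands on a 1 to 5 scale, which is consistent with the four scores and labels reported above. The names `BANDS`, `clear_category`, and `REPORTED` are hypothetical and introduced only for illustration.

```python
# Assumed upper bounds (exclusive) for each label: five equal-width bands
# over a 1-5 scale. These cut-offs are NOT given in the abstract.
BANDS = [
    (1.80, "poor"),
    (2.60, "below average"),
    (3.40, "average"),
    (4.20, "above average"),
]


def clear_category(score: float) -> str:
    """Return the assumed performance label for an overall CLEAR score."""
    if not 1.0 <= score <= 5.0:
        raise ValueError(f"CLEAR score out of range: {score}")
    for upper_bound, label in BANDS:
        if score < upper_bound:
            return label
    return "excellent"


# Overall CLEAR scores as reported in the Results paragraph.
REPORTED = {
    ("ChatGPT-3.5", "Tunisian"): 2.83,
    ("ChatGPT-3.5", "Jordanian"): 3.40,
    ("ChatGPT-4", "Tunisian"): 3.20,
    ("ChatGPT-4", "Jordanian"): 3.53,
}

if __name__ == "__main__":
    for (model, dialect), score in REPORTED.items():
        print(f"{model:<12} {dialect:<10} {score:.2f}  {clear_category(score)}")
```

Under these assumed cut-offs, both Tunisian scores fall in the average band and both Jordanian scores fall in the above-average band, matching the labels in the text.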

Conclusion: The findings highlight a critical dialectal performance gap in ChatGPT, underlining the need to enhance linguistic and cultural diversity in the development of AI models, particularly for health-related content. Collaborative efforts among AI developers, linguists, and healthcare professionals are needed to improve the performance of AI models across different languages, dialects, and cultural contexts. Future studies are recommended to broaden the scope to a wider range of languages and dialects, which would help in achieving equitable access to health information across various communities.

Issue

Vol. 2024 (2024)

Section

Articles

This work is licensed under a Creative Commons Attribution 4.0 International License.

How to Cite

Sallam, M., & Mousa, D. (2024). Evaluating ChatGPT performance in Arabic dialects: A comparative study showing defects in responding to Jordanian and Tunisian general health prompts. Mesopotamian Journal of Artificial Intelligence in Healthcare, 2024, 1-7. https://doi.org/10.58496/MJAIH/2024/001
