Benchmarking Generative AI: A Call for Establishing a Comprehensive Framework and a Generative AIQ Test
Abstract
The introduction and rapid evolution of generative artificial intelligence (genAI) models necessitate a refined understanding of the concept of “intelligence”. GenAI tools are known for their capability to produce complex, creative, and contextually relevant output. Nevertheless, the deployment of genAI models in healthcare should be accompanied by appropriate and rigorous performance evaluation tools. In this rapid communication, we emphasize the urgent need to develop a “Generative AIQ Test” as a novel tool tailored for comprehensive benchmarking of genAI models against multiple attributes of human-like intelligence. We propose a preliminary framework that incorporates several performance metrics, including accuracy, diversity, novelty, and consistency. These metrics are critical for evaluating genAI models that might be used to generate diagnostic recommendations, treatment plans, and patient interaction suggestions. This communication also highlights the importance of orchestrated collaboration in constructing robust, well-annotated benchmarking datasets that capture the complexity of diverse medical scenarios and patient demographics. The proposed approach aims to ensure that genAI models are effective, equitable, and transparent. To maximize the potential of genAI models in healthcare, it is essential to establish rigorous, dynamic standards for their benchmarking. Such standards can improve clinical decision-making, enhance patient care, and strengthen the reliability of genAI applications in healthcare.
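To make the four proposed metrics concrete, the sketch below shows one plausible way they could be operationalized in Python. It is an illustrative assumption, not the authors' published implementation: accuracy is taken as exact agreement with expert references, diversity as a distinct-n n-gram ratio, novelty as the fraction of outputs not copied verbatim from a reference corpus, and consistency as pairwise agreement across repeated runs on the same prompt. All function names (e.g., generative_aiq) and the unweighted averaging are hypothetical choices.

```python
from itertools import combinations

# Illustrative sketch of the proposed "Generative AIQ" metrics.
# Definitions and weighting below are assumptions, not the paper's method.

def accuracy(outputs, references):
    """Fraction of outputs exactly matching expert-annotated references."""
    return sum(o == r for o, r in zip(outputs, references)) / len(outputs)

def diversity(outputs, n=2):
    """Distinct-n: unique n-grams divided by total n-grams across outputs."""
    ngrams = []
    for text in outputs:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

def novelty(outputs, reference_corpus):
    """Fraction of outputs not appearing verbatim in a reference corpus."""
    seen = set(reference_corpus)
    return sum(o not in seen for o in outputs) / len(outputs)

def consistency(repeated_runs):
    """Mean pairwise exact agreement across repeated runs on one prompt."""
    pairs = list(combinations(repeated_runs, 2))
    return sum(a == b for a, b in pairs) / len(pairs) if pairs else 1.0

def generative_aiq(outputs, references, reference_corpus, repeated_runs):
    """Combine the four metrics into one composite score (unweighted mean)."""
    scores = {
        "accuracy": accuracy(outputs, references),
        "diversity": diversity(outputs),
        "novelty": novelty(outputs, reference_corpus),
        "consistency": consistency(repeated_runs),
    }
    scores["aiq"] = sum(scores.values()) / len(scores)
    return scores
```

In practice, clinical benchmarking would replace exact string matching with domain-appropriate scoring (for example, expert ratings of diagnostic recommendations) and would weight the metrics according to the clinical task; the equal weighting here is purely a placeholder.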
Article Details
This work is licensed under a Creative Commons Attribution 4.0 International License.