Breakthrough AI System Med-PaLM 2 Surpasses Physician-Level Performance in Medical Question Answering

Recent advances in artificial intelligence (AI) have tackled formidable challenges, from mastering games like Go to predicting protein structures. Among these, the ability to absorb medical knowledge, reason over it, and answer medical questions as competently as physicians has stood as a significant frontier.

Large language models (LLMs) have been instrumental in advancing medical question answering. The pioneering Med-PaLM model achieved a noteworthy milestone as the first to exceed the "passing" score on US Medical Licensing Examination (USMLE)-style questions, scoring 67.2% on the MedQA dataset. Nevertheless, that work and previous efforts left substantial room for improvement, particularly when model answers were compared with clinicians' answers. Thus, we introduce Med-PaLM 2, a comprehensive upgrade that addresses these gaps through an improved base LLM (PaLM 2), fine-tuning in the medical domain, and new prompting strategies, including an ensemble refinement technique.
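
For intuition, here is a minimal Python sketch of an ensemble-refinement-style prompting loop: sample several chain-of-thought drafts stochastically, then condition the model on those drafts to produce refined answers and take a plurality vote. The `model.generate(prompt, temperature)` interface, the `extract_final_answer` helper, the prompt wording, and the sample counts are all illustrative assumptions, not the paper's exact setup.

```python
import collections

def extract_final_answer(text: str) -> str:
    # Hypothetical helper: treat the last non-empty line as the answer.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[-1] if lines else ""

def ensemble_refinement(model, question: str,
                        n_drafts: int = 11, n_refinements: int = 33,
                        temperature: float = 0.7) -> str:
    """Two-stage sketch of ensemble refinement (counts and prompts
    are placeholders, not values from the paper)."""
    # Stage 1: stochastically sample several diverse reasoning paths.
    cot_prompt = f"Q: {question}\nThink step by step, then state a final answer."
    drafts = [model.generate(cot_prompt, temperature=temperature)
              for _ in range(n_drafts)]

    # Stage 2: condition on all drafts to produce refined answers,
    # repeat several times, and take a plurality vote.
    context = "\n\n".join(f"Draft {i + 1}:\n{d}" for i, d in enumerate(drafts))
    refine_prompt = (f"Q: {question}\n\nCandidate reasoning:\n{context}\n\n"
                     "Weighing the drafts above, give your best final answer.")
    refined = [model.generate(refine_prompt, temperature=temperature)
               for _ in range(n_refinements)]
    votes = collections.Counter(extract_final_answer(r) for r in refined)
    return votes.most_common(1)[0][0]
```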

Med-PaLM 2 excelled, achieving scores of up to 86.5% on the MedQA dataset, an improvement of more than 19 percentage points over its predecessor and a new state of the art. We also observed comparable or superior performance across other datasets, including MedMCQA, PubMedQA, and the MMLU clinical topics.

In-depth human evaluations were conducted on long-form medical questions along multiple axes crucial for clinical applications. In a pairwise ranking of 1066 consumer medical questions, physicians preferred the answers generated by Med-PaLM 2 over those written by other physicians on eight of nine axes relevant to clinical utility (p < 0.001). Med-PaLM 2 also showed significant improvements over Med-PaLM on every evaluation axis (p < 0.001), including on a newly introduced dataset of 240 difficult "adversarial" questions designed to probe LLM limitations.
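
To make the reported significance concrete, the snippet below runs a two-sided binomial (sign) test on pairwise preference counts. This is one standard way to test such preferences; the win/loss counts shown are invented for illustration, and the paper's actual statistical procedure may differ.

```python
from scipy.stats import binomtest

# Illustrative (invented) counts for a single evaluation axis:
# among non-tied pairwise comparisons, Med-PaLM 2's answer was
# preferred `wins` times and the physician's answer `losses` times.
wins, losses = 700, 300

# Sign test against the null hypothesis of no preference (p = 0.5).
result = binomtest(wins, n=wins + losses, p=0.5)
print(f"two-sided p-value = {result.pvalue:.3e}")
```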

While further study is needed to validate the real-world efficacy of these models, our findings underscore the rapid progress toward physician-level performance in medical question answering.

Paper link: https://arxiv.org/pdf/2305.09617.pdf
