OpenAI’s ChatGPT Health is rapidly gaining traction as a consumer-facing health tool. However, a recent rigorous evaluation reveals significant limitations in its ability to accurately assess patient urgency, particularly in high-risk situations. The study, published this week in Nature, raises concerns about potential delays in care and inconsistent responses to critical health conditions.
Researchers conducted a “stress test” of ChatGPT Health, analyzing 960 responses generated from 60 clinician-authored patient scenarios spanning 21 clinical areas, with each scenario run under 16 different conditions (60 × 16 = 960 responses). The findings demonstrate an “inverted U-shaped” pattern of performance, with the most substantial errors occurring at both ends of the urgency spectrum: cases presenting as non-urgent and those requiring immediate emergency attention.
The study found that the system under-triaged 52% of simulated emergency cases: in over half of these scenarios, ChatGPT Health recommended a lower level of care than was medically appropriate. Patients presenting with conditions such as diabetic ketoacidosis and impending respiratory failure were advised to seek evaluation within 24–48 hours rather than being directed to the emergency department. This misclassification could have serious consequences for patients experiencing these life-threatening conditions.
Conversely, the AI correctly identified and prioritized classical emergencies such as stroke and anaphylaxis, demonstrating its ability to accurately assess certain critical conditions. This discrepancy highlights the variability in the AI’s assessment of critical conditions, suggesting that its performance is not uniformly reliable across all medical emergencies.
The influence of external factors on ChatGPT Health’s recommendations was also examined. The study found that when family or friends minimized a patient’s symptoms – a phenomenon known as anchoring bias – triage recommendations shifted significantly in edge cases: the odds of a less-urgent recommendation increased by a factor of 11.7 (95% confidence interval, 3.7–36.6). This suggests that the AI can be unduly influenced by information provided by individuals accompanying the patient, potentially leading to underestimation of the severity of the condition.
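For readers unfamiliar with the statistic, the odds ratio reported above can be sketched from a 2×2 contingency table. The counts below are purely hypothetical (the article reports only the ratio of 11.7 and its confidence interval, not the underlying table), and the Wald-interval formula is a standard approximation, not necessarily the method the study's authors used:

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and approximate 95% Wald CI for a 2x2 table:
        a = less-urgent recommendation, symptom-minimizer present
        b = appropriate recommendation,  symptom-minimizer present
        c = less-urgent recommendation, no minimizer
        d = appropriate recommendation,  no minimizer
    """
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Hypothetical counts for illustration only.
estimate, lower, upper = odds_ratio_ci(21, 9, 4, 26)
print(f"OR = {estimate:.1f} (95% CI {lower:.1f}-{upper:.1f})")
```

An odds ratio well above 1 with a confidence interval that excludes 1, as in the study's 11.7 (3.7–36.6), indicates the effect is unlikely to be due to chance, though the wide interval reflects considerable uncertainty about its exact size.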
Perhaps more concerning, the activation of crisis intervention messages in cases of suicidal ideation was found to be unpredictable. The system sometimes triggered these messages when patients described no specific method for self-harm, while failing to do so when a specific method was mentioned. This inconsistency raises serious safety concerns, as timely intervention is crucial in preventing suicide.
Interestingly, the study found no significant effects related to patient race, gender, or barriers to care, although the confidence intervals did not entirely rule out clinically meaningful differences. This suggests that, at least in this simulated environment, the AI did not exhibit bias based on these demographic factors. However, the authors caution that further research is needed to confirm these findings and to assess potential biases in real-world settings.
These findings – missed high-risk emergencies and inconsistent activation of crisis safeguards – raise safety concerns that warrant prospective validation before consumer-scale deployment of AI triage systems. The study underscores the importance of rigorous testing before relying on such tools for medical triage.
The researchers emphasize that while AI has the potential to improve access to healthcare and streamline triage processes, it is not yet ready to replace the judgment of trained medical professionals. The current limitations of ChatGPT Health highlight the need for caution and careful oversight in the implementation of AI in healthcare settings. Further research is needed to address the identified shortcomings and to ensure the safety and reliability of these systems before they are widely adopted.
The study’s findings are particularly relevant given the increasing popularity of consumer health tools and the growing reliance on AI in healthcare. As more patients turn to these tools for preliminary medical guidance, it is crucial to ensure that they are accurate, reliable, and safe. The authors recommend that developers of AI triage systems prioritize safety and accuracy, and that these systems undergo rigorous testing and validation before being deployed on a large scale.
