AI in Medicine: A Breakthrough or an Overhyped Experiment?

Reviewing the Superhuman Performance of a Large Language Model in Physician Reasoning

The intersection of artificial intelligence (AI) and medicine has long been a subject of fascination, but a recent study, Superhuman Performance of a Large Language Model on the Reasoning Tasks of a Physician by Brodeur et al. (2024), takes this discussion to a new level. Released as a preprint on arXiv, this study claims that OpenAI’s o1-preview large language model (LLM) surpasses human physicians in complex diagnostic reasoning tasks.

But is this truly a game-changer for healthcare? Or does it merely demonstrate AI’s ability to excel in structured test conditions while struggling with real-world complexity? This blog delves into this significant study’s findings, strengths, limitations, and future implications.

The Study at a Glance

Brodeur et al. (2024) evaluated o1-preview’s performance on five key clinical reasoning tasks:

  • Differential Diagnosis Generation – Identifying a list of possible diagnoses for given patient cases.

  • Diagnostic Reasoning Explanation – Providing a structured justification for its differential diagnosis.

  • Triage Decision-Making – Assessing which conditions require urgent medical attention.

  • Probabilistic Reasoning – Estimating the likelihood of diseases based on test results.

  • Management Reasoning – Recommending appropriate treatments and interventions.
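Probabilistic reasoning of the kind tested here is commonly formalized with pre-test probabilities and test likelihood ratios. As a minimal sketch (with illustrative, made-up numbers, not figures from the study), here is how a post-test disease probability is computed via Bayes’ rule in odds form:

```python
def post_test_probability(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Update a disease probability after a test result using Bayes' rule
    in odds form: post-test odds = pre-test odds * likelihood ratio."""
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

# Illustrative numbers only: a disease with 10% pre-test probability
# and a positive test whose likelihood ratio is 9.
p = post_test_probability(0.10, 9.0)
print(round(p, 2))  # 0.5
```

This is the kind of calculation the probabilistic-reasoning task probes: a test with a likelihood ratio of 9 lifts a 10% suspicion to even odds, and a ratio of 1 (an uninformative test) leaves the probability unchanged.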

The model’s outputs were compared against previous AI models (including GPT-4) and hundreds of human clinicians, including medical students, residents, and attending physicians.

Key Findings

Key findings from the study include:

  • o1-preview outperformed human doctors in diagnostic reasoning and differential diagnosis generation.

  • The model excelled in structured case-based evaluations, accurately diagnosing cases from the New England Journal of Medicine’s Clinicopathological Conferences (NEJM CPCs).

  • The model performed as well as or better than GPT-4 in most tasks, particularly in management decision-making.

  • However, the model struggled with probabilistic reasoning and triage decisions, where uncertainty plays a major role in clinical practice.

The authors argue that o1-preview’s superior performance demonstrates AI’s potential to enhance clinical decision-making. However, they also caution against over-reliance on AI without real-world trials and regulatory oversight.

Visualizing AI’s Progress in Diagnostic Accuracy

The study by Brodeur et al. (2024) compares o1-preview’s performance to that of other diagnostic tools and clinicians using NEJM Clinicopathologic Conferences (CPCs) as a benchmark. The figure below highlights the percentage of correct diagnoses included in the differential diagnosis for the various tools and models tested between 2012 and 2024.

[Figure: Physician versus AI (percentage of correct diagnoses included in the differential, 2012–2024)]

The above visual underscores three significant trends:

  1. Steady Improvement Over Time: Diagnostic tools, particularly LLMs, have consistently improved in accuracy since 2012.
  2. Superhuman Performance: o1-preview surpasses human clinicians and earlier AI models, including GPT-4 and Google’s 2023 model, in identifying correct diagnoses.
  3. The Decline of Traditional Tools: Legacy diagnostic platforms (e.g., ISABEL and PEPID) lag far behind newer AI models, reinforcing the shift toward LLM-based solutions.

Comparing AI and Human Performance

The study also evaluated o1-preview’s performance against GPT-4 and physicians on two critical tasks:

  1. Management Reasoning – Using Grey Matters Management Cases, o1-preview significantly outperformed GPT-4 and human physicians (with or without supplementary resources). The graph below shows its dominant scores across all metrics.

  2. Diagnostic Reasoning – For Landmark Diagnostic Cases, o1-preview matched GPT-4 and physician performance, with no statistically significant differences observed. This suggests parity between current LLMs and trained professionals in structured diagnostic tasks.

A comparison of AI and physician performance is depicted in the plot below.

The above visual underscores:

  • o1-preview has a clear advantage in management reasoning, likely due to its ability to synthesize guidelines and prioritize treatment decisions.
  • Performance is consistent across diagnostic reasoning tasks, indicating that AI and physicians may complement one another in such scenarios.


The Pros: Why This Study Matters

  • 1. AI Can Boost Diagnostic Accuracy

    One of the biggest challenges in medicine is misdiagnosis, which contributes to poor patient outcomes and increased healthcare costs. The study found that o1-preview correctly included the final diagnosis in its differential diagnosis list in 78.3% of cases, compared to 72.9% for GPT-4.

    While the study does not explicitly quantify human clinician performance, an earlier study by McDuff et al. (2023) found that unassisted human physicians achieved a top-10 accuracy of just 33.6% in similar differential diagnosis tasks. This suggests that AI-powered clinical decision support tools could significantly reduce diagnostic errors, particularly in complex or rare cases where human bias and cognitive overload are major concerns.

  • 2. AI as a Knowledge Augmentation Tool

    The study highlights that AI models like o1-preview can provide structured and well-reasoned explanations for medical decisions. Unlike traditional decision-support tools that rely on rigid algorithms, LLMs can synthesize vast amounts of medical literature, guidelines, and case histories to offer nuanced reasoning.

    For medical students and early-career doctors, this could serve as an interactive learning tool, improving the way clinical reasoning is taught and practiced.

  • 3. Potential for Automating Routine Medical Tasks

    While AI replacing doctors is unlikely in the near future, o1-preview demonstrates promise in streamlining certain tasks, such as:

    ✔ Assisting with differential diagnosis generation in time-sensitive situations.

    ✔ Providing structured summaries of complex cases for physician review.

    ✔ Recommending next steps in patient management, reducing cognitive load for clinicians.

If integrated into electronic health record (EHR) systems, such AI models could help physicians focus on critical decision-making and patient interactions rather than administrative burdens.
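The headline accuracy figures above (78.3% for o1-preview versus 72.9% for GPT-4) measure whether the confirmed final diagnosis appears anywhere in the model’s differential list, i.e., a top-k inclusion metric. A minimal sketch of that metric, using hypothetical toy cases rather than data from the study:

```python
def topk_inclusion_accuracy(cases, k=10):
    """Fraction of cases whose confirmed final diagnosis appears among
    the first k entries of the generated differential diagnosis list.
    `cases` is a list of (differential_list, final_diagnosis) pairs."""
    hits = sum(1 for differential, final in cases
               if final in differential[:k])
    return hits / len(cases)

# Hypothetical toy cases (not from the study): one hit, one miss.
cases = [
    (["pulmonary embolism", "pneumonia", "heart failure"], "pneumonia"),
    (["migraine", "tension headache"], "subarachnoid hemorrhage"),
]
print(topk_inclusion_accuracy(cases, k=10))  # 0.5
```

Note that this metric rewards merely listing the right answer; it says nothing about ranking it first or acting on it, which is one reason structured benchmarks can overstate real-world readiness.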


The Cons: Where AI Still Falls Short

  • 1. AI is Not Ready for Real-World Clinical Practice

    The study evaluates o1-preview in isolated, structured test conditions where all necessary patient information is provided upfront. But real-world clinical practice involves:

    🔹 Unstructured and incomplete patient data

    🔹 Ambiguous symptoms and conflicting test results

    🔹 Dynamic decision-making based on evolving patient conditions

    Current AI models struggle when faced with uncertainty, missing data, or conflicting medical histories. Without real-world validation, o1-preview remains an experimental tool rather than a clinical solution.

  • 2. Lack of Human-AI Collaboration Studies

    The study compares AI vs. humans directly, rather than evaluating how AI enhances human decision-making. Prior research suggests that AI-physician collaboration often yields the best results, rather than AI or human doctors working alone.

    💡 Key question: How does o1-preview improve the decisions of trained physicians? We don’t know yet, because the study didn’t test AI-assisted workflows.

  • 3. Ethical & Regulatory Uncertainty

    If AI surpasses human doctors in diagnostic accuracy, who is responsible for errors? The study does not address key ethical and regulatory challenges, such as:

    🔹 Accountability – If AI suggests a misdiagnosis, is the physician, the hospital, or the AI developer liable?

    🔹 Bias and Fairness – Could AI models amplify existing healthcare disparities by being trained on biased datasets?

    🔹 Transparency – Should AI disclose how it arrived at its decision? Should patients be informed when AI is influencing their diagnosis or treatment plan?

    🔹 Regulatory Approval – Will AI in medicine require oversight from agencies like the FDA (US), MHRA (UK), or TGA (Australia) before widespread deployment? 

However, the question of how such a regulatory system would be monitored or enforced remains open, particularly as AI tools gain widespread, organic adoption among clinicians and the public without formal oversight. While Brodeur et al. (2024) focus on evaluating AI’s diagnostic reasoning capabilities, the study does not extend to discussions on the ethical or regulatory implications of AI-driven medical decision-making. As AI integration into healthcare progresses, questions around liability, fairness, and governance will need to be addressed by policymakers and regulatory bodies.


Future Implications: What’s Next for AI in Medicine?

  • 1. AI-Human Collaboration Will Be the Future

    Rather than replacing doctors, AI will likely function as an advanced decision-support tool, enhancing clinical reasoning without overriding physician expertise.

  • 2. AI Needs Rigorous Clinical Trials

    Future research should:

    ✔ Test AI models in live hospital settings with real patient interactions.

    ✔ Compare AI-assisted vs. non-AI-assisted physician decision-making.

    ✔ Assess AI’s impact on treatment outcomes, patient safety, and healthcare costs.

  • 3. Ethical and Regulatory Frameworks Must Be Established

    Governments and medical institutions must proactively establish ethical AI deployment strategies to prevent unforeseen risks.


Final Verdict: A Promising Step, But Not a Replacement for Doctors

The study by Brodeur et al. (2024) is a milestone in AI-driven medical reasoning, but it is not a definitive breakthrough for clinical practice—yet!

Bottom Line:

  • AI is not here to replace doctors but can augment clinical decision-making.

  • Ethical, regulatory, and real-world testing must catch up before clinical deployment.

  • The future of medicine will likely be AI-human collaboration, not AI dominance.

The shift toward AI-human collaboration in healthcare has already begun, but its evolutionary path and final destination remain uncertain. As AI models like o1-preview continue to improve, the question is no longer if AI will be integrated into medical practice—but how.

What Do You Think?

  • Should AI models like o1-preview be used in hospitals today, or is more testing needed?

  • How do you see AI transforming the role of doctors in the next decade?

Please email your thoughts to [email protected] or scroll down and leave us a comment below.

References

Brodeur, P.G., Buckley, T.A., Kanjee, Z., Goh, E. and Rodman, A., 2024. Superhuman performance of a large language model on the reasoning tasks of a physician. arXiv preprint arXiv:2412.10849. Available at: https://doi.org/10.48550/arXiv.2412.10849 (Accessed: 2 February 2025).

McDuff, D., Raghavan, P., Liang, P., Ong, E., D’Amour, A., Ramaswamy, S., Kelly, C.J. and Kornblith, S., 2023. Towards accurate differential diagnosis with large language models. arXiv preprint arXiv:2312.00164. Available at: https://arxiv.org/abs/2312.00164 (Accessed: 3 February 2025).
