Benchmarking Multimodal Large Language Models for Forensic Science and Medicine: A Comprehensive Dataset and Evaluation Framework

Ashmaan Sohail*^1, Om M. Patel*^2, Jihwan Choi^3, Jack C. S. Venditti^3 and Addison J. Wu^4

    1Department of Electrical and Computer Engineering, Queen’s University, Kingston, Ontario, Canada

    2Faculty of Health Sciences, McMaster University, Hamilton, Ontario, Canada

    3Loyalist Collegiate and Vocational Institute, Kingston, Ontario, Canada

    4Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America

    ^correspondence directed to Addison J. Wu at addisonwu{at}princeton.edu

    medRxiv preprint DOI: https://doi.org/10.1101/2025.07.06.25330972

      Posted: July 07, 2025, Version 1

      * denotes equal contribution

      Abstract

      Background Multimodal large language models (MLLMs) have demonstrated substantial progress in medical and legal domains in recent years; however, their capabilities through the lens of forensic science, a field at the intersection of complex medical reasoning and legal interpretation whose conclusions are subject to judicial scrutiny, remain largely unexplored. Forensic medicine uniquely depends on the accurate integration of often ambiguous textual and visual information, yet systematic evaluations of MLLMs in this setting are lacking.

      Methods We conducted a comprehensive benchmarking study of eleven state-of-the-art MLLMs, including proprietary (GPT-4o, Claude 4 Sonnet, Gemini 2.5 Flash) and open-source (Llama 4, Qwen 2.5-VL) models. Models were evaluated on 847 examination-style forensic questions drawn from academic literature, case studies, and clinical assessments, covering nine forensic subdomains. Both text-only and image-based questions were included. Model performance was assessed using direct and chain-of-thought prompting, with automated scoring verified through manual review.

      Results Performance improved consistently with newer model generations. Chain-of-thought prompting improved accuracy on text-based and choice-based tasks for most models, though this trend did not hold for image-based and open-ended questions. Visual reasoning and complex inference tasks revealed persistent limitations, with models underperforming in image interpretation and nuanced forensic scenarios. Model performance remained stable across forensic subdomains, suggesting topic type alone did not drive variability.

      Conclusions MLLMs show emerging potential for forensic education and structured assessments, particularly for reinforcing factual knowledge. However, their limitations in visual reasoning, open-ended interpretation, and forensic judgment preclude independent application in live forensic practice. Future efforts should prioritize the development of multimodal forensic datasets, domain-targeted fine-tuning, and task-aware prompting to improve reliability and generalizability. These findings provide the first systematic baseline for MLLM performance in forensic science and inform pathways for their cautious integration into medico-legal workflows.

      Introduction

      Large language models (LLMs) are artificial intelligence systems capable of generating human-like textual responses to prompts by processing vast amounts of text data. Recent studies have shown that LLMs can nearly match specialized physicians in the precision of diagnosis and treatment selection. These models, including GPT-4o, Claude 4 Sonnet, Meta Llama 4 Maverick, and other proprietary and open-source variants, have become more capable at a variety of tasks over time 1–3. Notably, this advancement of LLMs has now been showcased across a range of societally embedded, knowledge-intensive fields, including medicine, education, and law, often evaluated through standardized examinations 4–9.

      Despite this progress, little research has explored how LLMs can be integrated into forensic science, particularly forensic medicine, a unique field at the intersection of human medicine and law. Existing studies focus heavily on the analysis of legal documents, clinical decision support, and other medical examinations, with each task investigated in isolation from the others 4,7,9. Forensic science presents unique challenges for artificial intelligence applications 10,11. As a cornerstone of public safety and justice systems, forensic knowledge is invaluable in informing investigations that can lead to societally impactful decisions. Unlike other medical fields, where diagnostic protocols and evidence-based treatments are navigated through comprehensive patient histories, forensic science often relies on the interpretation of ambiguous data within complex legal and judicial frameworks, where conclusions are subject to judicial scrutiny 12–16.

      A differentiating aspect of forensic medicine is its especially heavy reliance on both visual and textual information. Tasks such as injury pattern recognition, assessment of postmortem changes (lividity, decomposition stages, etc.), or trace evidence evaluation (fibres, ballistic markings, etc.) depend heavily on imaging 17–20. Vision-language models (VLMs) extend LLMs to process both text and image inputs, making them essential to the advancement of AI applications in this field 21–24. Notably, systematic evaluations of VLMs in this field are nearly non-existent.

      Moreover, existing evaluations of foundation models in medicine have primarily relied on structured, single-source, text-based examinations, such as medical licensing tests or board certifications 25–33. Forensic assessments, however, extend beyond factual recall or clinical decision-making and are highly variable in format, often incorporating case-based scenarios, applied knowledge, and visual interpretation 12,15,16. There is a clear need to understand how VLMs perform when exposed to complex, multimodal forensic investigations in order to determine their role in education, assessment, and practice.

      In this study, we evaluated the performance of eleven state-of-the-art open-source and closed-source multimodal LLMs (MLLMs) on a collection of 847 examination-style questions. Rather than relying on a single standardized test, we aggregated questions from publicly available academic resources and case studies to reflect the real-world diversity of forensic assessments. We compared model accuracy and analyzed the ability of MLLMs to reason over medical findings, legal and clinical guidelines, and other multimodal scenarios. To our knowledge, this is the first benchmarking study evaluating MLLM performance across a comprehensive, multimodal forensic science dataset. Specifically, our objective is to address the following questions:

      1. How do different MLLMs perform when answering different types of questions about forensic science?
      2. What are the potential applications and current limitations of using MLLMs to support forensic science education, assessment, and real-world integration?

      Methodology

      Section I: Dataset

      To benchmark the capabilities of various VLMs in forensic medicine, we developed a diverse question bank of 847 total questions covering a variety of topics.

      We relied upon publicly available educational sources to construct this dataset. These included leading undergraduate and graduate-level forensic science textbooks, such as Forensic Pathology Review by Wayne, Schandl, and Presnell, and Forensic Science: From the Crime Scene to the Crime Lab by Saferstein 16,34. We also obtained copies of nationally certified objective structured clinical examinations in forensic science from the University of Jordan Faculty of Medicine 35.

      Our dataset spans nine representative topics in forensic science: death investigation and autopsy; toxicology and substance usage; trace and scene evidence; injury analysis; asphyxia and special death mechanisms; firearms, toolmarks, and ballistics; clinical forensics; anthropology and skeletal analysis; and miscellaneous/other. As shown in Figure 1, the most represented topics in the dataset were death investigation and autopsy (n = 204), toxicology and substance usage (n = 141), trace and scene evidence (n = 133), and injury analysis (n = 124). Topics such as asphyxia and special death mechanisms (n = 70); firearms, toolmarks, and ballistics (n = 60); clinical forensics (n = 49); and anthropology and skeletal analysis (n = 38) had moderate representation.

      Figure 1.

      Distribution of topics of the curated dataset questions, segmented into all questions, image-based questions, and text-only questions.

      Our dataset includes both text-only and image-based questions. Of the 847 total questions, 225 (26.6%) included an image, while the remaining 622 (73.4%) were text-only. Image-based questions were most concentrated in the death investigation and autopsy category, where questions often require visual assessment of wounds, lividity, decomposition stages, and similar features. Text-based questions were most concentrated in the trace and toxicology categories, which frequently involve interpreting chemical reports, identifying substances, and reasoning through forensic procedures based solely on written descriptions.

      Several question formats were also included. Of the 847 total questions, 781 (92.2%) were choice-based: multiple-choice questions with four or five answer options, or true/false questions with two. The remaining 66 questions (7.8%) were non-choice-based, open-response questions that often mimicked real-life forensic scenarios, requiring the LLMs to work through case narratives, interpret findings, and articulate conclusions in a manner reflective of professional forensic reporting or medico-legal consultation.
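
      To make these formats concrete, the sketch below shows one possible way a single benchmark item could be represented in code. The field names and example content are illustrative assumptions for exposition, not the actual schema used to store the dataset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ForensicQuestion:
    """Illustrative record for one benchmark item (hypothetical schema)."""
    question_id: str
    topic: str                            # e.g., "toxicology and substance usage"
    text: str                             # question stem or case narrative
    options: Optional[list[str]] = None   # 2, 4, or 5 options; None for open-response items
    answer: str = ""                      # option letter or reference answer text
    image_path: Optional[str] = None      # set for the 225 image-based items
    parts: int = 1                        # >1 for multi-part case questions


# Example of a text-only, choice-based item (content invented for illustration):
q = ForensicQuestion(
    question_id="tox-0042",
    topic="toxicology and substance usage",
    text="Which postmortem specimen is generally preferred for quantifying ethanol?",
    options=["A. Vitreous humour", "B. Gastric contents", "C. Bile", "D. Hair"],
    answer="A",
)
```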

      Section II: Models and Evaluation

      We evaluated a variety of frontier open-source and proprietary models and their respective predecessors. The proprietary vision-language models evaluated in our study were GPT-4o, Claude 4 Sonnet, Claude 3.5 Sonnet, Gemini 2.5 Flash, Gemini 2.0 Flash, and Gemini 1.5 Flash. The open-source vision-language models were Llama 4 Maverick 17B-128E Instruct, Llama 4 Scout 17B-16E Instruct, Llama 3.2 90B, Llama 3.2 11B, and Qwen2.5-VL 72B Instruct, all of which were hosted on Together AI.

      Both direct and chain-of-thought prompting were used in evaluating model performance on the question bank. In direct prompting, we steer the model to immediately provide its final answer without any intermediate reasoning. In contrast, in chain-of-thought prompting, we steer the model to reason through its thought process before providing its answer 36. For all models, we used their respective default temperatures throughout all experiments to reflect a representative distribution of responses and capabilities. For the proprietary models, those in the OpenAI, Claude, and Gemini families, this is equivalent to a temperature of 1.0. For the open-source models hosted on Together AI, those in the Llama and Qwen families, this is equivalent to a temperature of 0.7.
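
      As an illustration of the two prompting strategies, a minimal sketch is given below. The instruction wording, the model name, and the use of an OpenAI-style chat client are assumptions for exposition rather than the exact prompts or infrastructure used in the study.

```python
from openai import OpenAI  # OpenAI-style chat client; API keys/endpoints omitted

client = OpenAI()

DIRECT_INSTRUCTION = (
    "Answer the following forensic science question. "
    "Respond with the final answer only, without any explanation."
)
COT_INSTRUCTION = (
    "Answer the following forensic science question. "
    "Reason through the problem step by step, then state your final answer."
)


def ask(question_text: str, use_cot: bool, model: str = "gpt-4o",
        temperature: float = 1.0) -> str:
    """Send one question under either direct or chain-of-thought prompting."""
    instruction = COT_INSTRUCTION if use_cot else DIRECT_INSTRUCTION
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,  # 1.0 default for proprietary APIs, 0.7 on Together AI
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": question_text},
        ],
    )
    return response.choices[0].message.content
```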

      We score each question on a scale from 0 to 1, where 0 indicates a completely incorrect answer and 1 indicates a completely correct answer. For single-part questions, the score can only be 0 or 1. For multi-part questions, we weight each part equally, with the final score being the proportion of parts answered correctly; no partial credit is awarded within a part, which is marked either correct or incorrect. Automatic numerical evaluation was conducted by employing GPT-4o as an LLM-as-a-judge for all model responses. To validate the accuracy of GPT-4o as an evaluator, we manually evaluated 30 randomly sampled responses from each model, comparing the GPT-4o-generated score against a human annotation for each sampled response. We found perfect agreement between GPT-4o and human judgments across all samples, confirming its reliability as an automated evaluator in our setting 37.
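
      A minimal sketch of this scoring rule follows. Only the equal-weighting arithmetic is taken directly from the description above; the judge-prompt wording is a hypothetical example of how an LLM-as-a-judge grader might be phrased.

```python
def score_question(part_verdicts: list[bool]) -> float:
    """Equal-weight scoring: each part is either fully correct (1) or incorrect (0);
    the question score is the proportion of parts answered correctly."""
    if not part_verdicts:
        raise ValueError("A question must have at least one part.")
    return sum(part_verdicts) / len(part_verdicts)


# Single-part question: score is 0 or 1.
assert score_question([True]) == 1.0
# Three-part case question with two parts correct: score = 2/3.
assert abs(score_question([True, True, False]) - 2 / 3) < 1e-9

# Hypothetical judge prompt for the GPT-4o LLM-as-a-judge step (wording assumed):
JUDGE_PROMPT = (
    "You are grading a forensic science exam response.\n"
    "Reference answer: {reference}\n"
    "Model response: {response}\n"
    "For each part of the question, output 1 if that part is answered correctly "
    "and 0 otherwise, as a comma-separated list."
)
```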

      Results

      For direct prompting, the accuracy observed for each language model varied considerably, from as low as 45.11% ± 3.27% for Llama 3.2 11B Vision Instruct Turbo to as high as 74.32% ± 2.90% for Gemini 2.5 Flash. When models were steered with chain-of-thought prompting, the accuracy range was higher overall, from an increased minimum of 51.0% ± 3.32% for Qwen 2.5-VL 72B to an increased maximum of 79.0% ± 2.72% for Gemini 2.5 Flash. Every model but one improved in accuracy with chain-of-thought prompting compared to direct prompting (p < 0.005); the single exception was Qwen 2.5-VL 72B, which demonstrated an 11.7% reduction in total accuracy when using chain-of-thought prompting (p < 0.001) (Fig. 2).
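
      The reported margins are consistent with 95% normal-approximation (Wald) confidence intervals for a proportion over the 847 questions, although the exact interval method is not stated in the text. The sketch below, under that assumption, approximately reproduces the margin reported for Gemini 2.5 Flash under direct prompting.

```python
import math


def wald_margin(accuracy: float, n: int, z: float = 1.96) -> float:
    """95% normal-approximation half-width for a proportion (assumed method)."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)


# 74.32% accuracy over 847 questions gives roughly +/- 2.9 percentage points.
print(round(wald_margin(0.7432, 847), 4))  # ~0.0294
```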

      Figure 2.

      Aggregate accuracy for each model on the curated dataset, for both direct and CoT prompting.

      Accuracy improved within model families when comparing older and newer versions. For the Gemini family, we observed a 10.63% increase in accuracy (p < 0.001) for direct prompting and 8.72% (p < 0.001) for chain-of-thought prompting between Gemini 1.5 and 2.0, and an increase of 8.28% (p = 0.0001) for direct prompting and 8.07% (p < 0.001) for chain-of-thought prompting between Gemini 2.0 and 2.5 (Fig. 2).

      The Llama family showed a 15.58% increase (p < 0.001) between Llama 3.2 11B and Llama 3.2 90B, and a 6.37% increase (p = 0.0019) between Llama 3.2 90B and Llama 4 17B. The Claude family showed a 6.78% increase (p = 0.0025) between Claude 3.5 and Claude 4 for direct prompting, and a 4.86% increase (p = 0.019) for chain-of-thought prompting. These results indicate that newer model generations were consistently linked with higher performance under both direct and chain-of-thought prompting. Furthermore, variation in question topic did not significantly affect accuracy, which remained broadly stable across forensic science topics (Fig. 3). To quantify topic-driven performance differences, we computed eta-squared (η²) values from one-way ANOVAs per model, measuring the proportion of accuracy variance explained by topic. Across all models, η² values ranged from 0.012 to 0.063, with most models well below the conventional threshold for a medium effect (η² > 0.06). For example, GPT-4o yielded η² = 0.046, Claude 3.5 Sonnet η² = 0.063, and Qwen2.5-VL η² = 0.012. These results indicate that topic explained only a small fraction of accuracy variance, suggesting that model performance was largely consistent across the forensic subdomains included in the dataset and that topic was not the primary driver of performance differences.
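
      The eta-squared computation described above can be reproduced as in the sketch below, where η² = SS_between / SS_total from a one-way ANOVA of per-question scores grouped by topic. The grouped scores are randomly generated placeholders, and the use of SciPy's f_oneway is an assumption about the implementation.

```python
import numpy as np
from scipy import stats


def eta_squared(groups: list[np.ndarray]) -> float:
    """Proportion of score variance explained by topic:
    eta^2 = SS_between / SS_total from a one-way ANOVA."""
    all_scores = np.concatenate(groups)
    grand_mean = all_scores.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_total = ((all_scores - grand_mean) ** 2).sum()
    return ss_between / ss_total


# Illustrative per-question scores (0/1) for three topics for one model:
rng = np.random.default_rng(0)
groups = [rng.integers(0, 2, size=n).astype(float) for n in (204, 141, 133)]
f_stat, p_value = stats.f_oneway(*groups)  # standard one-way ANOVA
print(f"eta^2 = {eta_squared(groups):.3f}, F = {f_stat:.2f}, p = {p_value:.3f}")
```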

      Figure 3.

      Performance for each model segmented into per-topic accuracy.

      For text-based questions, chain-of-thought prompting improved accuracy for nearly all models (p < 0.001; the effect for GPT-4o was not statistically significant, p = 0.103), except for Qwen 2.5-VL 72B, which recorded a 15.1% decrease (p < 0.001) when chain-of-thought prompting was used instead of direct prompting (Fig. 4).

      Figure 4.

      Performance of each model on text-only and image-based questions, for both direct and CoT prompting.

      For image-based questions, we observed a different pattern in the effectiveness of the prompting strategies. Compared with text-based questions, fewer models showed statistically significant gains in accuracy from chain-of-thought prompting over direct prompting. The only models that benefited from chain-of-thought prompting were Claude Sonnet 4 (p = 0.002), Gemini 1.5 Flash (p < 0.05), Gemini 2.5 Flash (p < 0.05), and GPT-4o (p < 0.01). Three models showed decreased accuracy with chain-of-thought prompting on image-based questions, although none of these decreases was statistically significant: Llama 3.2 90B (2% decrease), Llama 4 Scout 17B-16E (0.2% decrease), and Qwen 2.5-VL 72B (2.3% decrease) (Fig. 4).

      For choice-based questions, chain-of-thought prompting yielded consistent improvement for nearly all models tested, with an average increase of 6.1% relative to direct prompting; the exception was Qwen 2.5-VL 72B, which demonstrated a 12.7% decrease. For non-choice-based questions, the pattern was weaker. Most models recorded improvements when chain-of-thought prompting was applied; however, Llama 3.2 90B showed a minor decrease, with accuracy dropping by approximately 3.3% when chain-of-thought prompting was used instead of direct prompting for non-choice-based items. The average increase for non-choice-based questions across models was 4.99%, lower than the average improvement for choice-based questions (Fig. 5).

      Figure 5.

      Performance of each model on choice-based and non-choice (open-response) questions, for both direct and CoT prompting.

      We also noted that models consistently attained higher accuracy on text-based questions than on image-based questions. The largest difference was observed with Claude 3.5 Sonnet, with text-based and image-based accuracies of 77.42% and 43.13%, respectively (p < 0.001). Furthermore, choice-based questions consistently attained higher accuracy than non-choice-based questions for every model. The largest difference was observed with Claude Sonnet 4, with a choice-based accuracy of 81.82% and a non-choice-based accuracy of 50.56% (p < 0.001) (Fig. 5).

      The results across all categories – direct versus chain-of-thought prompting, text-based versus image-based questions, and choice-based versus non-choice-based questions – consistently demonstrated measurable and interpretable differences. In general, accuracy improvements were greater for text-based and choice-based questions than for image-based and non-choice-based formats. Qwen 2.5-VL 72B consistently performed with reduced accuracy across multiple categories when coupled with chain-of-thought prompting, whereas the other models showed consistent accuracy improvements on average. Within the same model families, newer models consistently performed better than older ones. Finally, accuracy did not fluctuate significantly with question topic, remaining relatively stable across the forensic science subjects included in the experiment.

      Discussion

      In this study, to the best of our knowledge, we conducted the first systematic evaluation of state-of-the-art proprietary and open-source MLLMs on a novel forensic science dataset incorporating a variety of topics and modes of question delivery. Our results revealed clear performance trends and avenues for improvement: accuracy generally improved with newer model generations and with chain-of-thought prompting, but this benefit was largely limited to text-based and choice-based tasks. Performance on image-based and open-ended questions remained inconsistent, highlighting persistent limitations in applying current MLLMs to real-world forensic scenarios.

      Our study has several limitations. While the dataset we derived is diverse and drawn from credible academic and clinical literature, the distribution of topics was not balanced. Questions from several subfields, such as forensic entomology, forensic odontology, and forensic toxicokinetics, were not represented in our experiment, while others, like death investigation and toxicology, were overrepresented. The dataset therefore needs to be expanded to fully mirror the range and unpredictability of real forensic cases. Future research may also incorporate additional modalities into testing datasets, such as audio and video, to better benchmark the role of MLLMs in real-world forensic cases, which often involve evidence like gunshot recordings and CCTV camera footage 38–40.

      Even with these constraints, this work offers novelty and value. We acknowledge the need to create better evaluations that reflect how foundation models would perform in the real world by including not just multiple-choice questions, as in past medical benchmarks, but also extended multi-part case study questions that require thorough reasoning in addition to immediate factual recall 31–33,41. By providing a forensic science benchmark whose modes of delivery are motivated by realistic usage, we provide a foundation to spur further meaningful development within this specialized field. The potential advantages of using AI in forensic practice are immense, contributing to medical and scientific expertise and augmenting legal processes and social impact 10–12,15,42. This study lays critical groundwork for additional research to redesign forensic operations, improve decision-making, and support justice and public security.

      Our evaluation revealed interpretable failure modes that point to clear directions for improving MLLMs so that they can meaningfully contribute to forensic tasks. Most models struggled with visual reasoning, specifically the nuanced interpretation of forensic photographs. Others stumbled over open-ended queries, producing boilerplate responses or misinterpreting surface-level details. These issues hint at vulnerabilities in visual encoding, context retention, and multi-step inference, skills critical for application in forensics. Identifying these vulnerabilities directs future efforts toward model grounding, cross-modal correspondence, and case-specific inference improvement 43–46.

      Better performance on multiple-choice items suggests that models may be overfitting to question patterns common in their training data, such as those found on standardized tests 31–33. Generalizability to unstructured or novel forensic scenarios is therefore more difficult. Future benchmarking evaluations will need to include more open-ended tasks with imperfect information, better emulating the actual challenges faced in forensic medicine.

      Existing MLLMs can potentially be useful instructional tools for forensic science. Their strengths in systematic multiple-choice and text-based tasks can assist in concept reinforcement, exam preparation, and guided review 5,47,48. Their weaknesses, however, currently preclude independent, autonomous use in real forensic medicine workflows. Any use in live procedures must be strictly in combination with human experts. They may assist in hypothesis testing, documentation, or case comparison, but they cannot serve as fully authoritative tools. The current limitations of foundation models in exercising judgment, nuance, and ethical accountability preclude meaningfully realistic forensic decision-making 49.

      Overall, state-of-the-art MLLMs are constrained in forensic reasoning but hold clear promise for improvement. Overcoming challenges in visual understanding and open-ended reasoning will require the development of more extensive multimodal forensic datasets and domain-targeted pre- and post-training methods such as fine-tuning 44. Uncertainty quantification, structured explanation, and task-aware prompting should also remain near-term priorities in advancing the reliability and transparency of model outputs 50,51. Through continued collaboration between forensic and machine learning experts, these models could assist in education, preliminary case screening, and administrative tasks. While not yet ready for standalone use, their combination with regulated human oversight could eventually increase the accuracy, uniformity, and availability of forensic operations.

      Data Availability

      All data produced will be available upon publication or upon reasonable request to the authors.

      This pre-print is available under a Creative Commons License (Attribution 4.0 International), CC BY 4.0, as described at http://creativecommons.org/licenses/by/4.0/

      References

      1.OpenAI, Hurst A, Lerer A, et al. GPT-4o System Card. Published online October 25, 2024. doi:10.48550/arXiv.2410.21276

      2.Claude Sonnet 4. Accessed July 5, 2025. https://www.anthropic.com/claude/sonnet

      3.The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation. Meta AI. April 5, 2025. Accessed July 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

      4.Siino M, Falco M, Croce D, Rosso P. Exploring LLMs Applications in Law: A Literature Review on Current Legal NLP Approaches. IEEE Access. 2025;13:18253–18276. doi:10.1109/ACCESS.2025.3533217

      5.Mishra V, Lurie Y, Mark S. Accuracy of LLMs in medical education: evidence from a concordance test with medical teacher. BMC Med Educ. 2025;25(1):443. doi:10.1186/s12909-025-07009-w

      6.Katz U, Cohen E, Shachar E, et al. GPT versus Resident Physicians — A Benchmark Based on Official Board Scores. NEJM AI. 2024;1(5):AIdbp2300192. doi:10.1056/AIdbp2300192

      7.Guha N, Nyarko J, Ho DE, et al. LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models. In: Advances in Neural Information Processing Systems; 2023. Accessed July 5, 2025. https://openreview.net/forum?id=WqSPQFxFRC

      8.Hendrycks D, Burns C, Basart S, et al. Measuring Massive Multitask Language Understanding. In: International Conference on Learning Representations; 2020. Accessed July 5, 2025. https://openreview.net/forum?id=d7KBjmI3GmQ

      9.Zuo Y, Qu S, Li Y, et al. MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding. In: Forty-second International Conference on Machine Learning; 2025. Accessed July 5, 2025. https://openreview.net/forum?id=IyVcxU0RKI

      10.Morán-Torres R, Feld K, Hesser J, Taalab YM, Yen K. Artificial intelligence and computer vision in forensic sciences. Rechtsmedizin. Published online June 25, 2025. doi:10.1007/s00194-025-00775-3

      11.Alafer F. Emerging Imaging Technologies in Forensic Medicine: A Systematic Review of Innovations, Ethical Challenges, and Future Directions. Diagnostics. 2025;15(11):1410. doi:10.3390/diagnostics15111410

      12.Deadman WJ. FORENSIC MEDICINE: AN AID TO CRIMINAL INVESTIGATION. Can Med Assoc J. 1965;92(13):666–670.

      13.Evett I. The logical foundations of forensic science: towards reliable knowledge. Philos Trans R Soc B Biol Sci. 2015;370(1674):20140263. doi:10.1098/rstb.2014.0263

      14.Beran RG. What is legal medicine – Are legal and forensic medicine the same? J Forensic Leg Med. 2010;17(3):137–139. doi:10.1016/j.jflm.2009.09.011

      15.Legal Medicine and Forensic Science — An Exercise in Interdisciplinary Understanding. N Engl J Med. 1963;268(6):327–328. doi:10.1056/NEJM196302072680616

      16.Saferstein R. Forensic Science: From the Crime Scene to the Crime Lab. Fourth Edition. Pearson; 2019.

      17.Ubelaker DH, Khosrowshahi H. Estimation of age in forensic anthropology: historical perspective and recent methodological advances. Forensic Sci Res. 2019;4(1):1–9. doi:10.1080/20961790.2018.1549711

      18.Chung H, Choe S. Overview of Forensic Toxicology, Yesterday, Today and in the Future. Curr Pharm Des. 2017;23(36):5429–5436. doi:10.2174/1381612823666170622101633

      19.West MH, Hayne S, Barsley RE. Wound patterns: detection, documentation and analysis. J Clin Forensic Med. 1996;3(1):21–27. doi:10.1016/S1353-1131(96)90041-3

      20.Euteneuer J, Courts C. Ten years of molecular ballistics—a review and a field guide. Int J Legal Med. 2021;135(4):1121–1136. doi:10.1007/s00414-021-02523-0

      21.Yin S, Fu C, Zhao S, et al. A survey on multimodal large language models. Natl Sci Rev. 2024;11(12):wae403. doi:10.1093/nsr/nwae403

      22.Hartsock I, Rasool G. Vision-language models for medical report generation and visual question answering: a review. Front Artif Intell. 2024;7. doi:10.3389/frai.2024.1430984

      23.Liang PP, Goindani A, Chafekar T, et al. HEMM: Holistic Evaluation of Multimodal Foundation Models. In: Advances in Neural Information Processing Systems; 2024. Accessed July 5, 2025. https://openreview.net/forum?id=9tVn4f8aJO#discussion

      24.Agbareia R, Omar M, Zloto O, Glicksberg BS, Nadkarni GN, Klang E. Multimodal LLMs for retinal disease diagnosis via OCT: few-shot versus single-shot learning. Ther Adv Ophthalmol. 2025;17:25158414251340569. doi:10.1177/25158414251340569

      25.Hou Y, Patel J, Dai L, et al. Benchmarking of Large Language Models for the Dental Admission Test. Health Data Sci. 2025;5:0250. doi:10.34133/hds.0250

      26.Wu S, Koo M, Blum L, et al. Benchmarking Open-Source Large Language Models, GPT-4 and Claude 2 on Multiple-Choice Questions in Nephrology. NEJM AI. 2024;1(2):AIdbp2300092. doi:10.1056/AIdbp2300092

      27.Beam K, Sharma P, Kumar B, et al. Performance of a Large Language Model on Practice Questions for the Neonatal Board Examination. JAMA Pediatr. 2023;177(9):977–979. doi:10.1001/jamapediatrics.2023.2373

      28.Schubert MC, Wick W, Venkataramani V. Performance of Large Language Models on a Neurology Board–Style Examination. JAMA Netw Open. 2023;6(12):e2346721. doi:10.1001/jamanetworkopen.2023.46721

      29.Omar M, Hijazi K, Omar M, Nadkarni GN, Klang E. Performance of large language models on family medicine licensing exams. Fam Pract. 2025;42(4):cmaf035. doi:10.1093/fampra/cmaf035

      30.Longwell JB, Hirsch I, Binder F, et al. Performance of Large Language Models on Medical Oncology Examination Questions. JAMA Netw Open. 2024;7(6):e2417641. doi:10.1001/jamanetworkopen.2024.17641

      31.Alaa A, Hartvigsen T, Golchini N, et al. Position: Medical Large Language Model Benchmarks Should Prioritize Construct Validity. In: Forty-second International Conference on Machine Learning; 2025. Accessed July 5, 2025. https://openreview.net/forum?id=YuMEUNNpeb

      32.Raji ID, Daneshjou R, Alsentzer E. It’s Time to Bench the Medical Exam Benchmark. NEJM AI. 2025;2(2):AIe2401235. doi:10.1056/AIe2401235

      33.Palta S, Balepur N, Rankel P, Wiegreffe S, Carpuat M, Rudinger R. Plausibly Problematic Questions in Multiple-Choice Benchmarks for Commonsense Reasoning. In: Al-Onaizan Y, Bansal M, Chen YN, eds. Findings of the Association for Computational Linguistics: EMNLP 2024. Association for Computational Linguistics; 2024:3451–3473. doi:10.18653/v1/2024.findings-emnlp.198

      34.Wayne JM, Schandl CA, Presnell SE. Forensic Pathology Review. CRC Press/Taylor & Francis Group; 2018.

      35.lojan0202. Toxicology & forensic – Doctor 2020. Doctor 2020. July 7, 2024. Accessed July 5, 2025. https://doctor2020.jumedicine.com/2024/07/08/toxicology-forensic/

      36.Wei J, Wang X, Schuurmans D, et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In: Advances in Neural Information Processing Systems; 2022. Accessed July 5, 2025. https://openreview.net/forum?id=_VjQlMeSB_J

      37.Zheng L, Chiang WL, Sheng Y, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In: Advances in Neural Information Processing Systems; 2023. Accessed July 6, 2025. https://openreview.net/forum?id=uccHPGDlao

      38.Lu J, Clark C, Lee S, et al. Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action. In: Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR); 2024:26439–26455. Accessed July 6, 2025. https://openaccess.thecvf.com/content/CVPR2024/html/Lu_Unified-IO_2_Scaling_Autoregressive_Multimodal_Models_with_Vision_Language_Audio_CVPR_2024_paper.html

      39.Sujkowski M, Kozuba J, Uchroński P, Banas A, Pulit P, Gryzewska L. Artificial Intelligence Systems for Supporting Video Surveillance Operators at International Airport. Transp Res Procedia. 2023;74:1284–1291. doi:10.1016/j.trpro.2023.11.273

      40.Teng Y, Zhang K, Lv X, et al. Gunshots detection, identification, and classification: Applications to forensic science. Sci Justice. 2024;64(6):625–636. doi:10.1016/j.scijus.2024.09.007

      41.Hager P, Jungmann F, Holland R, et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat Med. 2024;30(9):2613–2622. doi:10.1038/s41591-024-03097-1

      42.Artificial Intelligence and Criminal Justice, Final Report, December 3, 2024.

      43.Hessel J, Lee L. Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think! In: Webber B, Cohn T, He Y, Liu Y, eds. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics; 2020:861–877. doi:10.18653/v1/2020.emnlp-main.62

      44.Lu W, Luu RK, Buehler MJ. Fine-tuning large language models for domain adaptation: exploration of training strategies, scaling, model merging and synergistic capabilities. Npj Comput Mater. 2025;11(1):84. doi:10.1038/s41524-025-01564-y

      45.Shahmohammadi H, Heitmeier M, Shafaei-Bajestan E, Lensch HPA, Baayen RH. Language with vision: A study on grounded word and sentence embeddings. Behav Res Methods. 2024;56(6):5622–5646. doi:10.3758/s13428-023-02294-z

      46.Kang S, Kim J, Kim J, Hwang SJ. Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding. In: Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR); 2025:9339–9350. Accessed July 6, 2025. https://openaccess.thecvf.com/content/CVPR2025/html/Kang_Your_Large_Vision-Language_Model_Only_Needs_A_Few_Attention_Heads_CVPR_2025_paper.html

      47.Córdova-Esparza DM. AI-Powered Educational Agents: Opportunities, Innovations, and Ethical Challenges. Information. 2025;16(6):469. doi:10.3390/info16060469

      48.Tomova M, Roselló Atanet I, Sehy V, Sieg M, März M, Mäder P. Leveraging large language models to construct feedback from medical multiple-choice Questions. Sci Rep. 2024;14(1):27910. doi:10.1038/s41598-024-79245-x

      49.Yan L, Sha L, Zhao L, et al. Practical and ethical challenges of large language models in education: A systematic scoping review. Br J Educ Technol. 2024;55(1):90–112. doi:10.1111/bjet.13370

      50.Xiong M, Santilli A, Kirchhof M, Golinski A, Williamson S. Efficient and Effective Uncertainty Quantification for LLMs. In: Advances in Neural Information Processing Systems; 2024. Accessed July 6, 2025. https://openreview.net/forum?id=QKRLH57ATT

      51.Yuan X, Shen C, Yan S, et al. Instance-adaptive Zero-shot Chain-of-Thought Prompting. In: Advances in Neural Information Processing Systems; 2024. Accessed July 6, 2025. https://openreview.net/forum?id=31xWlIdxTm