Paper Summary
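Research area: Medical imaging
Author: Perry E. Radau
Published: 2026-05-06
arXiv: 2605.05175

Abstract (translated from the Chinese summary)

Background: Existing MRI benchmarks for LLMs rely mainly on multiple-choice questions from review books, on which top proprietary models already score highly, which limits their power to discriminate between models. No systematic benchmark has yet evaluated the vendor-specific scanner operations knowledge that is central to research MRI practice.

Purpose: The authors developed MRI-Eval, a tiered benchmark for relative model comparison on MRI physics and GE scanner operations knowledge. It uses multiple-choice questions (MCQ) as the primary condition, with stem-only and primed diagnostic conditions as complementary analyses.

Methods: MRI-Eval contains 1,365 scored items spanning nine categories and three difficulty tiers, drawn from textbooks, GE scanner manuals, programming course materials, and expert-generated questions. Five model families were evaluated (GPT-5.4, Claude Opus 4.6, Claude Sonnet 4.6, Gemini 2.5 Pro, Llama 3.3 70B). MCQ was the primary condition; the stem-only condition removed the answer options and used an independent LLM judge; the primed stem-only condition tested responses to incorrect user claims.

Results: Overall MCQ accuracy ranged from 93.2% to 97.1%. GE scanner operations was the lowest-scoring category for every model (88.2% to 94.6%). In the stem-only condition, frontier-model accuracy fell to 58.4% to 61.1%, and Llama 3.3 70B fell to 37.1%; stem-only accuracy on GE scanner operations was 13.8% to 29.8%.

Conclusion: High MCQ performance can mask weak free-text recall, especially for vendor-specific operational knowledge. MRI-Eval is most informative as a relative comparison benchmark rather than an absolute measure of competency, and it supports caution when using raw LLM output for GE-specific protocol guidance.

Original abstract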
Background: Existing MRI LLM benchmarks rely mainly on review-book multiple-choice questions, where top proprietary models already score highly, limiting discrimination. No systematic benchmark has evaluated vendor-specific scanner operational knowledge central to research MRI practice. Purpose: We developed MRI-Eval, a tiered benchmark for relative model comparison on MRI physics and GE scanner operations knowledge using primary multiple-choice questions (MCQ), with stem-only and primed diagnostic conditions as complementary analyses. Methods: MRI-Eval includes 1365 scored items across nine categories and three difficulty tiers from textbooks, GE scanner manuals, programming course materials, and expert-generated questions. Five model families were evaluated (GPT-5.4, Claude Opus 4.6, Claude Sonnet 4.6, Gemini 2.5 Pro, Llama 3.3 70B). MCQ was primary; stem-only removed options and used an independent LLM judge; primed stem-only tested responses to incorrect user claims. Results: Overall MCQ accuracy was 93.2% to 97.1%. GE scanner operations was the lowest category for every model (88.2% to 94.6%). In stem-only, frontier-model accuracy fell to 58.4% to 61.1%, and Llama 3.3 70B fell to 37.1%; GE scanner operations stem-only accuracy was 13.8% to 29.8%. Conclusion: High MCQ performance can mask weak free-text recall, especially for vendor-specific operational knowledge. MRI-Eval is most informative as a relative comparison benchmark rather than an absolute competency measure and supports caution in using raw LLM outputs for GE-specific protocol guidance.

---
*Automatically collected on 2026-05-08*
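The abstract names three evaluation conditions (MCQ, stem-only, primed stem-only) and reports accuracy per category. The following is a minimal sketch of how one item could be turned into the three prompt variants and how per-category accuracy could be tallied. Everything in it is an assumption made for illustration: the `Item` schema, the function names, the prompt wording, and the example question are hypothetical and are not taken from the paper. The independent LLM judge used for stem-only scoring is not modeled; the tally simply takes already-judged correct/incorrect flags.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One benchmark item (hypothetical schema, not the paper's format)."""
    stem: str
    options: dict[str, str]  # option letter -> option text
    answer: str              # letter of the correct option
    category: str


def mcq_prompt(item: Item) -> str:
    """MCQ condition: the stem followed by the lettered options."""
    lines = [item.stem]
    lines += [f"{letter}. {text}" for letter, text in sorted(item.options.items())]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def stem_only_prompt(item: Item) -> str:
    """Stem-only condition: options removed, free-text answer expected."""
    return f"{item.stem}\nAnswer in one or two sentences."


def primed_prompt(item: Item, wrong_letter: str) -> str:
    """Primed stem-only condition: the user first asserts an incorrect answer.

    How the incorrect claim is phrased here is an assumption.
    """
    if wrong_letter == item.answer:
        raise ValueError("the primed claim must be an incorrect option")
    claim = item.options[wrong_letter]
    return f"I believe the answer is: {claim}.\n{stem_only_prompt(item)}"


def accuracy_by_category(results):
    """Map (category, is_correct) pairs to {category: fraction correct}."""
    correct, total = defaultdict(int), defaultdict(int)
    for category, is_correct in results:
        total[category] += 1
        correct[category] += int(is_correct)
    return {category: correct[category] / total[category] for category in total}


# Made-up illustrative item; it does not come from MRI-Eval.
example = Item(
    stem="What is the approximate Larmor frequency of hydrogen protons at 3 T?",
    options={"A": "64 MHz", "B": "128 MHz", "C": "300 MHz"},
    answer="B",
    category="physics",
)

if __name__ == "__main__":
    print(mcq_prompt(example))
    print(stem_only_prompt(example))
    print(primed_prompt(example, "A"))
```

Keeping the three prompt builders separate means the same item and answer key feed every condition, so a gap between MCQ and stem-only accuracy reflects the change in format rather than a change in question content.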
#paper #arXiv #medical-imaging #小凯