[Paper] MRI-Eval: A Tiered Benchmark for Evaluating LLM Performance on MRI Physics and GE Scanner Operations Knowledge

Paper Overview

Research area: Medical Imaging
Author: Perry E. Radau
Published: 2026-05-06
arXiv: 2605.05175

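Abstract (translated from the Chinese)

Background: Existing MRI benchmarks for LLMs rely mainly on multiple-choice questions from review books, on which top proprietary models already score highly, so they discriminate poorly between models. No systematic benchmark has yet evaluated the vendor-specific scanner operations knowledge that is central to research MRI practice.

Purpose: The authors developed MRI-Eval, a tiered benchmark for relative model comparison on MRI physics and GE scanner operations knowledge. Multiple-choice questions (MCQ) are the primary condition, with stem-only and primed diagnostic conditions as complementary analyses.

Methods: MRI-Eval contains 1,365 scored items spanning nine categories and three difficulty tiers, drawn from textbooks, GE scanner manuals, programming course materials, and expert-generated questions. Five model families were evaluated (GPT-5.4, Claude Opus 4.6, Claude Sonnet 4.6, Gemini 2.5 Pro, Llama 3.3 70B). MCQ was the primary condition; the stem-only condition removed the answer options and was graded by an independent LLM judge; the primed stem-only condition tested how models respond to incorrect user claims.

Results: Overall MCQ accuracy ranged from 93.2% to 97.1%. GE scanner operations was the lowest-scoring category for every model (88.2% to 94.6%). In the stem-only condition, frontier-model accuracy fell to 58.4% to 61.1% and Llama 3.3 70B fell to 37.1%; stem-only accuracy on GE scanner operations was 13.8% to 29.8%.

Conclusion: High MCQ performance can mask weak free-text recall, especially for vendor-specific operational knowledge. MRI-Eval is most informative as a relative comparison benchmark rather than an absolute measure of competency, and it supports caution when using raw LLM outputs for GE-specific protocol guidance.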

Original Abstract

Background: Existing MRI LLM benchmarks rely mainly on review-book multiple-choice questions, where top proprietary models already score highly, limiting discrimination. No systematic benchmark has evaluated vendor-specific scanner operational knowledge central to research MRI practice. Purpose: We developed MRI-Eval, a tiered benchmark for relative model comparison on MRI physics and GE scanner operations knowledge using primary multiple-choice questions (MCQ), with stem-only and primed diagnostic conditions as complementary analyses. Methods: MRI-Eval includes 1365 scored items across nine categories and three difficulty tiers from textbooks, GE scanner manuals, programming course materials, and expert-generated questions. Five model families were evaluated (GPT-5.4, Claude Opus 4.6, Claude Sonnet 4.6, Gemini 2.5 Pro, Llama 3.3 70B). MCQ was primary; stem-only removed options and used an independent LLM judge; primed stem-only tested responses to incorrect user claims. Results: Overall MCQ accuracy was 93.2% to 97.1%. GE scanner operations was the lowest category for every model (88.2% to 94.6%). In stem-only, frontier-model accuracy fell to 58.4% to 61.1%, and Llama 3.3 70B fell to 37.1%; GE scanner operations stem-only accuracy was 13.8% to 29.8%. Conclusion: High MCQ performance can mask weak free-text recall, especially for vendor-specific operational knowledge. MRI-Eval is most informative as a relative comparison benchmark rather than an absolute competency measure and supports caution in using raw LLM outputs for GE-specific protocol guidance.
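
The per-category and per-condition accuracies quoted in the abstract can be tallied along the following lines. This is only a minimal sketch, not the authors' evaluation code: the record fields (`condition`, `category`, `correct`), the function name `accuracy_by_category`, and the toy records are all assumptions made for illustration, and the `correct` flag is assumed to come from answer-key matching (MCQ) or from an LLM judge (stem-only).

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return {(condition, category): fraction correct} for graded records."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        key = (r["condition"], r["category"])
        totals[key] += 1
        correct[key] += int(r["correct"])
    return {key: correct[key] / totals[key] for key in totals}


# Toy, made-up records purely for illustration (not data from the paper).
toy_records = [
    {"condition": "mcq", "category": "ge_scanner_ops", "correct": True},
    {"condition": "mcq", "category": "ge_scanner_ops", "correct": True},
    {"condition": "stem_only", "category": "ge_scanner_ops", "correct": False},
    {"condition": "stem_only", "category": "ge_scanner_ops", "correct": True},
    {"condition": "mcq", "category": "physics", "correct": True},
    {"condition": "stem_only", "category": "physics", "correct": True},
]

scores = accuracy_by_category(toy_records)
for (condition, category), acc in sorted(scores.items()):
    print(f"{condition:10s} {category:15s} {acc:.1%}")
```

Keying the tally on the (condition, category) pair keeps the MCQ and stem-only results for the same category side by side, which is the comparison the abstract draws for GE scanner operations.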

--- *Automatically collected on 2026-05-08*

#Paper #arXiv #MedicalImaging #小凯
