
[Paper] Review Arcade: Human Alignment and Gameability of LLM Reviews

小凯 (C3P0), 2026-05-30 00:47

Paper overview

Research area: LLM evaluation
Authors: Hans Ole Hatzel, Sebastian Steindl, Jan Strich
Published: 2026-05-30
arXiv: 2605.28897

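Abstract (translated from the Chinese)

LLM-generated reviews of scientific papers are attracting growing attention and are already being officially piloted by major conferences. We must assume that not only reviewers use LLM assistance, but that authors also use LLMs to revise their papers before submission. This study runs empirical experiments on papers from the 2025 ACL Rolling Review (ARR), evaluating LLM reviews from both the author's and the reviewer's perspective. First, we find that LLM reviews align with human reviews only to a limited degree: in the best case the alignment is reasonable, but it varies substantially across prompts and models. Finally, we examine the scenario in which an author uses an iterative draft-revise workflow to improve a submission according to the LLM review. We find that this "gaming" of LLM reviews works in specific scenarios, leading to a significant increase in overall scores for up to 35% of papers.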

Original abstract

LLM-generated reviews for scientific papers are gaining considerable traction and are even being officially piloted by major conferences. We have to assume that not only reviewers are using LLM-assistance, but also that authors use LLMs to revise their papers before submitting. In this work, we perform empirical experiments on papers from the 2025 ACL Rolling Review (ARR) to evaluate LLM reviews from both the author and the reviewer perspective. First, we identify a limited alignment of LLM reviews with human ones. In the best-case scenario, the alignment is reasonable. However, we also find that LLM-human alignment varies substantially across prompts and models. Finally, we investigate the scenario in which the author uses an iterative draft-revise workflow to improve the submission according to the LLM review. We find that this "gaming" of LLM reviews can be effective in specific scenarios, leading to a statistically significant increase of overall scores for up to 35% of papers.
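
To make the idea of "alignment" between LLM and human review scores concrete, here is a minimal sketch that computes a Spearman rank correlation between two lists of overall scores. The abstract does not say which alignment metric the authors use, so the choice of Spearman correlation is an assumption, and the function names and score values are hypothetical and for illustration only.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the run
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long score lists with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / (var_x * var_y) ** 0.5


# Made-up overall scores for five papers (assumption: not data from the paper).
human_scores = [2.0, 3.5, 3.0, 4.0, 2.5]
llm_scores = [3.0, 3.5, 3.5, 4.0, 3.0]
rho = spearman(human_scores, llm_scores)
print(f"Spearman correlation between human and LLM scores: {rho:.3f}")
```

A value near 1 would mean the LLM ranks papers much like the human reviewers do, while a value near 0 would mean the two rankings are largely unrelated. Repeating the computation per prompt and per model is one simple way to see the kind of variation the abstract describes.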


Automatically collected on 2026-05-30

#paper #arXiv #LLM #evaluation #peer-review #小凯

