## Paper Overview
**Research Area**: NLP
**Authors**: Zhiqiu Xu, Shibo Jin, Shreya Arya
**Published**: 2026-04-23
**arXiv**: [2604.21937](https://arxiv.org/abs/2604.21937)
## Abstract
As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast models solely as solvers of fixed problem sets. We introduce MathDuels, a self-play benchmark in which models occupy dual roles: each authors math problems under adversarial prompting and solves problems authored by every other participant. Problems are produced through a three-stage generation pipeline (meta-prompting, problem generation, and difficulty amplification), and validated by an independent verifier that excludes ill-posed questions. A Rasch model jointly estimates solver abilities and problem difficulties; author quality is derived from the difficulties of each model's authored problems. Experiments on 19 frontier models show that authoring and solving abilities are partially decoupled, and that dual-role evaluation reveals capability gaps invisible to single-role benchmarks. As new models enter the arena, the problems they produce defeat previously dominant solvers, so the benchmark's difficulty co-evolves with participant strength rather than saturating at a fixed ceiling. We host a public leaderboard, updated as new models are released.
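The Rasch model mentioned in the abstract jointly estimates solver abilities and problem difficulties from the binary solve/fail outcomes. A minimal sketch of such a joint fit is below, using synthetic data and plain gradient ascent on the Rasch log-likelihood; all names, data, and hyperparameters here are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

# Hypothetical response matrix: rows = solver models, cols = problems;
# 1 = solved, 0 = failed. Synthetic data for illustration only.
rng = np.random.default_rng(0)
true_theta = rng.normal(0, 1, size=5)   # latent solver abilities
true_b = rng.normal(0, 1, size=8)       # latent problem difficulties
p_true = 1 / (1 + np.exp(-(true_theta[:, None] - true_b[None, :])))
responses = (rng.random(p_true.shape) < p_true).astype(float)

def fit_rasch(X, lr=0.1, steps=2000):
    """Jointly estimate abilities (theta) and difficulties (b) by
    gradient ascent on the Rasch likelihood
    P(X_ij = 1) = sigmoid(theta_i - b_j)."""
    n, m = X.shape
    theta = np.zeros(n)
    b = np.zeros(m)
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
        resid = X - p                  # gradient of the log-likelihood
        theta += lr * resid.sum(axis=1) / m
        b -= lr * resid.sum(axis=0) / n
        theta -= theta.mean()          # fix the scale: mean ability = 0
    return theta, b

theta_hat, b_hat = fit_rasch(responses)
```

Under this framing, a model's author quality would then be some summary (e.g. the mean) of the estimated difficulties `b_hat` of the problems it wrote, which is how a benchmark can separate authoring skill from solving skill.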
---
*Automatically collected on 2026-04-25*
#Paper #arXiv #NLP