## Paper Overview
**Research Area**: NLP
**Authors**: Zhiqiu Xu, Shibo Jin, Shreya Arya
**Published**: 2026-04-23
**arXiv**: [2604.21937](https://arxiv.org/abs/2604.21937)
## Abstract
As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast models solely as solvers of fixed problem sets. We introduce MathDuels, a self-play benchmark in which models occupy dual roles: each authors math problems under adversarial prompting and solves problems authored by every other participant. Problems are produced through a three-stage generation pipeline (meta-prompting, problem generation, and difficulty amplification), and validated by an independent verifier that excludes ill-posed questions. A Rasch model jointly estimates solver abilities and problem difficulties; author quality is derived from the difficulties of each model's authored problems. Experiments on 19 frontier models show that authoring and solving abilities are partially decoupled, and that dual-role evaluation reveals capability gaps invisible to single-role benchmarks. As new models enter the arena, the problems they produce defeat previously dominant solvers, so the benchmark's difficulty co-evolves with participant strength rather than saturating at a fixed ceiling. We host a public leaderboard, updated as new models are released.
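The Rasch model mentioned in the abstract jointly estimates solver abilities and problem difficulties from the binary solve/fail outcomes. A minimal sketch of such a joint fit is below, using synthetic data and plain gradient ascent on the Rasch log-likelihood; all names, data, and hyperparameters here are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

# Hypothetical response matrix: rows = solver models, cols = problems;
# 1 = solved, 0 = failed. Synthetic data for illustration only.
rng = np.random.default_rng(0)
true_theta = rng.normal(0, 1, size=5)   # latent solver abilities
true_b = rng.normal(0, 1, size=8)       # latent problem difficulties
p_true = 1 / (1 + np.exp(-(true_theta[:, None] - true_b[None, :])))
responses = (rng.random(p_true.shape) < p_true).astype(float)

def fit_rasch(X, lr=0.1, steps=2000):
    """Jointly estimate abilities (theta) and difficulties (b) by
    gradient ascent on the Rasch likelihood
    P(X_ij = 1) = sigmoid(theta_i - b_j)."""
    n, m = X.shape
    theta = np.zeros(n)
    b = np.zeros(m)
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
        resid = X - p                  # gradient of the log-likelihood
        theta += lr * resid.sum(axis=1) / m
        b -= lr * resid.sum(axis=0) / n
        theta -= theta.mean()          # fix the scale: mean ability = 0
    return theta, b

theta_hat, b_hat = fit_rasch(responses)
```

Under this framing, a model's author quality would then be some summary (e.g. the mean) of the estimated difficulties `b_hat` of the problems it wrote, which is how a benchmark can separate authoring skill from solving skill.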
---
*Automatically collected on 2026-04-25*
#Paper #arXiv #NLP