## Paper Summary
**Research Area**: NLP
**Authors**: Yeheng Chen, Chaoxiang Xie, Yuling Shi
**Published**: 2025-04-30
**arXiv**: [2504.20837](https://arxiv.org/abs/2504.20837)
## Abstract (Translated)
LLMs have achieved strong results on both function-level code synthesis and repository-level code modification, yet a capability that falls between these two extremes, compositional code creation, i.e., building a complete, internally structured class from a specification, remains underserved. Current evaluations are either confined to isolated functions or rely on manually curated class-level tasks that are difficult to scale and increasingly susceptible to data contamination. We introduce ClassEval-Pro, a benchmark of 300 class-level tasks spanning 11 domains, constructed through an automated three-stage pipeline: complexity enhancement, cross-domain class composition, and integration of real-world GitHub code contributed after January 2025. Every task is validated by an LLM jury and must pass a test suite with over 90% coverage. We evaluate five frontier LLMs under five generation strategies. The best model reaches only 45.6% class-level Pass@1, with a 17.7-point gap between the strongest and weakest models, confirming the benchmark's discriminative power. Strategy choice interacts strongly with model capability: structured approaches such as bottom-up lift weaker models by up to 9.4 percentage points, while compositional generation collapses to as low as 1.3%. An error analysis of 500 manually annotated failures shows that logic errors (56.2%) and dependency errors (38.0%) dominate, identifying cross-method coordination as the core bottleneck.
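The abstract reports class-level Pass@1 without defining the estimator. Assuming it follows the standard unbiased pass@k formulation of Chen et al. (2021), which is the common convention for code benchmarks, a minimal sketch (the function name and the example values are illustrative, not from the paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: expected probability that at least
    one of k samples drawn from n generations (c of which pass all
    tests) is correct. With n == k == 1 this reduces to a 0/1
    indicator of whether the single sample solved the task."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Benchmark-level Pass@1 averages the per-task values; the reported
# 45.6% would mean the best model solved roughly 137 of the 300 tasks.
tasks = [pass_at_k(n=1, c=1, k=1), pass_at_k(n=1, c=0, k=1)]
print(sum(tasks) / len(tasks))  # 0.5
```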
## Original Abstract
LLMs have achieved strong results on both function-level code synthesis and repository-level code modification, yet a capability that falls between these two extremes -- compositional code creation, i.e., building a complete, internally structured class from a specification -- remains underserved. Current evaluations are either confined to isolated functions or rely on manually curated class-level tasks that are expensive to scale and increasingly susceptible to data contamination. We introduce ClassEval-Pro, a benchmark of 300 class-level tasks spanning 11 domains, constructed through an automated three-stage pipeline that combines complexity enhancement, cross-domain class composition, and integration of real-world GitHub code contributed after January 2025. Every task is validated by an LLM jury and must pass a test suite with over 90% coverage. We evaluate five frontier LLMs under five generation strategies. The best model reaches only 45.6% class-level Pass@1, with a 17.7-point gap between the strongest and weakest models, confirming the benchmark's discriminative power. Strategy choice interacts strongly with model capability: structured approaches such as bottom-up lift weaker models by up to 9.4 percentage points, while compositional generation collapses to as low as 1.3%. An error analysis of 500 manually annotated failures shows that logic errors (56.2%) and dependency errors (38.0%) dominate, identifying cross-method coordination as the core bottleneck.
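The abstract requires each task's test suite to exceed 90% coverage but does not name the tooling. A minimal sketch of such a gate using coverage.py with pytest (the function name, paths, threshold semantics, and the use of line coverage are all assumptions):

```python
import coverage
import pytest

def passes_coverage_gate(test_dir: str, source_module: str,
                         threshold: float = 90.0) -> bool:
    """Run a task's test suite under coverage.py and reject the task
    if the tests fail or line coverage does not exceed the threshold.
    Note: source_module must not be imported before cov.start(),
    or its lines will not be measured."""
    cov = coverage.Coverage(source=[source_module])
    cov.start()
    exit_code = pytest.main([test_dir, "-q"])  # run tests in-process
    cov.stop()
    cov.save()
    percent = cov.report()  # total coverage as a float percentage
    return exit_code == 0 and percent > threshold
```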
---
*Collected automatically on 2026-05-01*
#Paper #arXiv #NLP #小凯