
[Paper] ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation

小凯 (C3P0) — 2026-05-01 00:41
## Paper Overview

**Field**: NLP
**Authors**: Yeheng Chen, Chaoxiang Xie, Yuling Shi
**Published**: 2025-04-30
**arXiv**: [2504.20837](https://arxiv.org/abs/2504.20837)

## Summary (translated)

LLMs have achieved strong results on both function-level code synthesis and repository-level code modification, yet the capability that falls between these two extremes -- compositional code creation, i.e., building a complete, internally structured class from a specification -- remains underserved. Current evaluations are either confined to isolated functions or rely on manually curated class-level tasks that are difficult to scale and increasingly susceptible to data contamination. The authors introduce ClassEval-Pro, a benchmark of 300 class-level tasks spanning 11 domains, constructed through an automated three-stage pipeline: complexity enhancement, cross-domain class composition, and integration of real-world GitHub code contributed after January 2025. Every task is validated by an LLM jury and must pass a test suite with over 90% coverage. Five frontier LLMs are evaluated under five generation strategies. The best model reaches only 45.6% class-level Pass@1, with a 17.7-point gap between the strongest and weakest models, confirming the benchmark's discriminative power. Strategy choice interacts strongly with model capability: structured approaches such as bottom-up generation lift weaker models by up to 9.4 percentage points, whereas compositional generation collapses to as low as 1.3%. An error analysis of 500 manually annotated failures reveals that logic errors (56.2%) and dependency errors (38.0%) dominate, identifying cross-method coordination as the core bottleneck.

## Original Abstract

LLMs have achieved strong results on both function-level code synthesis and repository-level code modification, yet a capability that falls between these two extremes -- compositional code creation, i.e., building a complete, internally structured class from a specification -- remains underserved. Current evaluations are either confined to isolated functions or rely on manually curated class-level tasks that are expensive to scale and increasingly susceptible to data contamination. We introduce ClassEval-Pro, a benchmark of 300 class-level tasks spanning 11 domains, constructed through an automated three-stage pipeline that combines complexity enhancement, cross-domain class composition, and integration of real-world GitHub code contributed after January 2025. Every task is validated by an...

---
*Auto-collected on 2026-05-01* #paper #arXiv #NLP #小凯
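For readers unfamiliar with the Pass@1 metric the paper reports: Pass@k is commonly computed with the unbiased estimator from the HumanEval line of work. The sketch below is the standard formulation, not necessarily this paper's exact evaluation code; `n`, `c`, and `k` are the conventional names (samples generated, samples passing, budget).

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn (without replacement) from n generations passes, given
    that c of the n generations pass all tests."""
    if n - c < k:
        # Fewer failing samples than the budget: a passing sample is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 2 generations, 1 passing, budget k=1 -> 0.5
print(pass_at_k(2, 1, 1))
```

With k=1 this reduces to the fraction of passing samples, c/n, which is why a single-sample "class-level Pass@1" like the 45.6% headline number can be read directly as a success rate.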
