## Paper Summary
**Research Area**: NLP
**Authors**: Yeheng Chen, Chaoxiang Xie, Yuling Shi
**Published**: 2025-04-30
**arXiv**: [2504.20837](https://arxiv.org/abs/2504.20837)
## Abstract (Translated)
LLMs have achieved strong results on both function-level code synthesis and repository-level code modification, yet a capability that falls between these two extremes, compositional code creation, i.e., building a complete, internally structured class from a specification, remains underserved. Current evaluations are either confined to isolated functions or rely on manually curated class-level tasks that are difficult to scale and increasingly susceptible to data contamination. We introduce ClassEval-Pro, a benchmark of 300 class-level tasks spanning 11 domains, constructed through an automated three-stage pipeline: complexity enhancement, cross-domain class composition, and integration of real-world GitHub code contributed after January 2025. Every task is validated by an LLM jury and must pass a test suite with over 90% coverage. We evaluate five frontier LLMs under five generation strategies. The best model reaches only 45.6% class-level Pass@1, with a 17.7-point gap between the strongest and weakest models, confirming the benchmark's discriminative power. Strategy choice interacts strongly with model capability: structured approaches such as bottom-up lift weaker models by up to 9.4 percentage points, while compositional generation collapses to as low as 1.3%. An error analysis of 500 manually annotated failures shows that logic errors (56.2%) and dependency errors (38.0%) dominate, identifying cross-method coordination as the core bottleneck.
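The abstract reports class-level Pass@1 without defining the estimator. Assuming it follows the standard unbiased pass@k formulation of Chen et al. (2021), which is the common convention for code benchmarks, a minimal sketch (the function name and the example values are illustrative, not from the paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: expected probability that at least
    one of k samples drawn from n generations (c of which pass all
    tests) is correct. With n == k == 1 this reduces to a 0/1
    indicator of whether the single sample solved the task."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Benchmark-level Pass@1 averages the per-task values; the reported
# 45.6% would mean the best model solved roughly 137 of the 300 tasks.
tasks = [pass_at_k(n=1, c=1, k=1), pass_at_k(n=1, c=0, k=1)]
print(sum(tasks) / len(tasks))  # 0.5
```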
## Original Abstract
LLMs have achieved strong results on both function-level code synthesis and repository-level code modification, yet a capability that falls between these two extremes -- compositional code creation, i.e., building a complete, internally structured class from a specification -- remains underserved. Current evaluations are either confined to isolated functions or rely on manually curated class-level tasks that are expensive to scale and increasingly susceptible to data contamination. We introduce ClassEval-Pro, a benchmark of 300 class-level tasks spanning 11 domains, constructed through an automated three-stage pipeline that combines complexity enhancement, cross-domain class composition, and integration of real-world GitHub code contributed after January 2025. Every task is validated by an LLM jury and must pass a test suite with over 90% coverage. We evaluate five frontier LLMs under five generation strategies. The best model reaches only 45.6% class-level Pass@1, with a 17.7-point gap between the strongest and weakest models, confirming the benchmark's discriminative power. Strategy choice interacts strongly with model capability: structured approaches such as bottom-up lift weaker models by up to 9.4 percentage points, while compositional generation collapses to as low as 1.3%. An error analysis of 500 manually annotated failures shows that logic errors (56.2%) and dependency errors (38.0%) dominate, identifying cross-method coordination as the core bottleneck.
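The abstract requires each task's test suite to exceed 90% coverage but does not name the tooling. A minimal sketch of such a gate using coverage.py with pytest (the function name, paths, threshold semantics, and the use of line coverage are all assumptions):

```python
import coverage
import pytest

def passes_coverage_gate(test_dir: str, source_module: str,
                         threshold: float = 90.0) -> bool:
    """Run a task's test suite under coverage.py and reject the task
    if the tests fail or line coverage does not exceed the threshold.
    Note: source_module must not be imported before cov.start(),
    or its lines will not be measured."""
    cov = coverage.Coverage(source=[source_module])
    cov.start()
    exit_code = pytest.main([test_dir, "-q"])  # run tests in-process
    cov.stop()
    cov.save()
    percent = cov.report()  # total coverage as a float percentage
    return exit_code == 0 and percent > threshold
```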
---
*Collected automatically on 2026-05-01*
#Paper #arXiv #NLP #小凯