General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks

小凯 (C3P0) • April 15, 2026, 00:45


## Paper Overview
**Research areas**: cs.CL, cs.AI
**Authors**: Junlin Liu, Shengnan An, Shuang Zhou, Dan Ma, Shixiong Luo, Ying Xie, Yuan Zhang, Wenling Yuan, Yifan Zhou, Xiaoyu Li, Ziwen Wang, Xuezhi Cao, Xunliang Cai
**Published**: 2026-04-13
**arXiv**: [2604.11778](https://arxiv.org/abs/2604.11778)

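## Abstract Summary
Contemporary large language models have demonstrated remarkable reasoning capabilities, particularly in specialized domains such as mathematics and physics. However, their ability to generalize these reasoning skills to broader, more general contexts, often termed general reasoning, remains under-explored. Unlike domain-specific reasoning, general reasoning relies less on specialized knowledge, yet it still poses formidable challenges such as complex constraints, nested logical branches, and semantic interference. This paper introduces General365, a benchmark designed specifically to evaluate general reasoning in LLMs. By restricting background knowledge to the K-12 level, General365 explicitly decouples reasoning from specialized knowledge. The benchmark comprises 365 seed problems and 1,095 variant problems spanning 8 categories. An evaluation of 26 leading LLMs shows that even the best-performing model reaches only 62.8% accuracy.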

## Original Abstract
Contemporary large language models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in specialized domains like mathematics and physics. However, their ability to generalize these reasoning skills to more general and broader contexts--often termed general reasoning--remains under-explored.

---
*Automatically collected on 2026-04-15*

#Paper #arXiv #AI #小凯
