The Infinite Echo of Recursive Language Models: When AI Learns to "Flip Through the Book" Instead of "Memorizing by Rote"

Method / Model	S-NIAH (%)	BrowseComp+ (%)	OOLONG (%)	OOLONG-Pairs (F1)	Avg. Cost ($)
Base GPT-5	Fails (>262K)	0.00	12.50	0.00	N/A
Summary Agent (GPT-5)	85.00	45.67	34.00	28.50	8.98
CodeAct + BM25 (GPT-5)	78.00	52.33	41.00	35.20	5.12
RLM (GPT-5)	92.00	91.33	56.50	58.00	0.99
RLM without sub-calls (GPT-5)	88.00	78.00	45.00	17.34	0.75
Base Qwen3-Coder-480B	Fails	0.00	10.00	0.00	N/A
RLM (Qwen3-Coder-480B)	89.00	85.67	52.00	54.50	1.15

The Paradigm Shift

Traditional LLMs struggle with context rot: performance degrades on very long inputs, and the base model fails outright once the input exceeds its context window (e.g., GPT-5's 262K tokens).

Recursive Language Models (RLMs) propose a new inference paradigm. Instead of cramming the prompt into the context window, an RLM treats the input as part of a programmable environment (a Python REPL). The LLM then interacts with massive inputs symbolically via code, handling up to 10 million tokens while improving accuracy and reducing costs.
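A minimal sketch of what interacting with an input "symbolically via code" could look like. The synthetic `context` string and the `needle` pattern are hypothetical stand-ins, not taken from the RLM repository; the point is only that a program can measure and search the input without ever placing all of it in a prompt.

```python
import re

# Hypothetical stand-in for a very large input: filler lines with one buried fact.
filler = "The sky was grey that day.\n" * 50_000
context = filler + "The magic number is 7481.\n" + filler

# The input is inspected as data: its size is measured, not read token by token.
print(f"context has {len(context):,} characters")

# A regular expression narrows the input to the few characters that matter.
needle = re.compile(r"magic number is (\d+)")
match = needle.search(context)

# Only a short window around the hit would ever need to reach a model.
start = max(0, match.start() - 40)
window = context[start:match.end() + 40]
answer = match.group(1)
print(answer)
```

Here plain string operations do all the work; in an RLM, code of this kind would be written by the model itself rather than by hand.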

How RLMs Work

The REPL-based recursive workflow:

Store Context: Full input loaded as a variable (e.g., context) in Python REPL.

Generate Code: Root LLM writes code to slice, filter, or search the context.

Recursive Calls: Sub-LLMs invoked on relevant subsets (llm_query).

Aggregate: Results collected and final answer output via FINAL().
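The four steps above can be sketched as follows. This is an illustration under stated assumptions, not the implementation from alexzhang13/rlm: `llm_query` is a stub that stands in for a real model call, `chunk` and `run_rlm` are hypothetical helpers, and the filtering code that a root LLM would generate is hard-coded here.

```python
from typing import Callable, List


def llm_query(prompt: str) -> str:
    """Stub sub-LLM (assumption): a real system would call a model API here.

    It 'answers' with the first line that mentions ERROR, or 'none'.
    """
    for line in prompt.splitlines():
        if "ERROR" in line:
            return line.strip()
    return "none"


def FINAL(answer: str) -> str:
    """Marks the final answer; in this sketch it simply returns it."""
    return answer


def chunk(text: str, size: int) -> List[str]:
    """Split text into pieces of at most `size` lines."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]


def run_rlm(context: str, query: Callable[[str], str] = llm_query) -> str:
    # Step 1 (Store Context): the full input lives in a variable, not a prompt.
    # Step 2 (Generate Code): in a real RLM the root LLM writes this filter.
    candidates = [c for c in chunk(context, size=100) if "ERROR" in c]
    # Step 3 (Recursive Calls): sub-LLMs see only the relevant chunks.
    findings = [query(c) for c in candidates]
    # Step 4 (Aggregate): combine the findings and emit the final answer.
    found = [f for f in findings if f != "none"]
    return FINAL("; ".join(found) if found else "no errors found")


# Usage: a 2,001-line log with a single error buried in the middle.
log = "\n".join(f"INFO step {i} ok" for i in range(1000))
log += "\nERROR disk full on node 7\n"
log += "\n".join(f"INFO step {i} ok" for i in range(1000, 2000))
result = run_rlm(log)
print(result)
```

Of the 21 chunks in this example, only the one containing the error is passed to `llm_query`, which is the mechanism by which sub-calls stay small even when the stored input is large.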

Key Benefits

📏 Scalability

Handles inputs 100x larger than standard limits (10M+ tokens).

🎯 Accuracy

Double-digit improvements on long-context benchmarks.

💰 Cost Efficiency

Median costs equal to or lower than those of the base models.

✅ Reduced Hallucination

Self-verification via code execution and iterative refinement.

Performance Benchmarks

Comparison across long-context tasks with GPT-5 as the underlying model. RLM scores highest on all three benchmarks and has the lowest average cost of the methods listed.

Method / Model	S-NIAH (%)	BrowseComp+ (%)	OOLONG (%)	Avg. Cost ($)
Base GPT-5	Fails (>262K)	0.00	12.50	N/A
Summary Agent	85.00	45.67	34.00	8.98
CodeAct + BM25	78.00	52.33	41.00	5.12
RLM (GPT-5)	92.00	91.33	56.50	0.99
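To make the size of the gap concrete, the short script below works through the arithmetic for BrowseComp+ and average cost. The figures are copied from the table; the variable names are illustrative only.

```python
# Figures copied from the comparison table: (BrowseComp+ %, average cost in $).
results = {
    "Summary Agent": (45.67, 8.98),
    "CodeAct + BM25": (52.33, 5.12),
    "RLM (GPT-5)": (91.33, 0.99),
}

rlm_score, rlm_cost = results["RLM (GPT-5)"]

comparison = {}
for name, (score, cost) in results.items():
    if name == "RLM (GPT-5)":
        continue
    gain = rlm_score - score      # absolute gain in percentage points
    cost_ratio = cost / rlm_cost  # how many times more the baseline costs
    comparison[name] = (gain, cost_ratio)
    print(f"{name}: RLM is +{gain:.2f} points on BrowseComp+, "
          f"baseline costs {cost_ratio:.1f}x as much")
```

With the table's figures, the gains come to 45.66 and 39.00 percentage points, and the two baselines cost roughly 9.1x and 5.2x as much as the RLM per query.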

Applications & Future Outlook

🚀 Long-Horizon Agents

Enables autonomous agents to operate over massive document sets and codebases without losing earlier context.

📚 Document Analysis

Well suited to semantic aggregation, multi-hop QA, and deep search over legal, medical, and financial texts.

🔓 Open Source Ecosystem

Available on GitHub (alexzhang13/rlm). Integrated with Prime Intellect for parallelization and RL training.

Recursive Language Models

Scaling AI Beyond Context Windows
