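# GEPA_AIME Tutorial: A Complete Guide to Optimizing Math Problem Solving with GEPA
## Table of Contents
1. [Tutorial Overview](#tutorial-overview)
2. [Environment Setup](#environment-setup)
3. [Dataset Preparation](#dataset-preparation)
4. [Base Program Design](#base-program-design)
5. [Evaluation Metrics](#evaluation-metrics)
6. [GEPA Optimization in Detail](#gepa-optimization-in-detail)
7. [Analyzing the Optimization Results](#analyzing-the-optimization-results)
8. [Core Technical Principles](#core-technical-principles)
9. [Practical Guide and Tips](#practical-guide-and-tips)
10. [Complete Code Example](#complete-code-example)
## Tutorial Overview
This tutorial shows how to use GEPA (Genetic-Pareto), a reflective prompt optimizer in the DSPy framework, to improve a large language model's accuracy on competition math problems from the AIME.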
### Key Results
- **Model**: GPT-4.1 Mini
- **Task**: solving AIME math problems
- **Accuracy before optimization**: 46.7%
- **Accuracy after optimization**: 56.7%
- **Improvement**: +10 percentage points
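The gain above is an absolute difference in percentage points, which is not the same as the relative gain. A quick check, assuming the 150-item test set used later in this tutorial; the counts 70 and 85 are inferred from the reported percentages and are not stated in the source:

```python
def accuracy_gain(correct_before: int, correct_after: int, total: int):
    """Return accuracy before/after (percent), the absolute gain in
    percentage points, and the relative gain in percent."""
    before = 100 * correct_before / total
    after = 100 * correct_after / total
    absolute = after - before
    relative = 100 * absolute / before
    return before, after, absolute, relative

# Assumed counts: 70/150 and 85/150 reproduce the reported accuracies.
before, after, absolute, relative = accuracy_gain(70, 85, 150)
print(f"{before:.1f}% -> {after:.1f}%: +{absolute:.1f} points, +{relative:.1f}% relative")
```

Reporting both figures avoids ambiguity: "+10%" could otherwise be read as either one.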
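### Technical Highlights
- **Adaptive prompt optimization**: GEPA generates and refines prompts automatically
- **Feedback-driven**: it uses error analysis and reference-solution feedback
- **Structured reasoning**: the optimized prompt contains detailed problem-solving strategies and patterns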
## Environment Setup
### Basic Configuration
```python
# Import the required libraries
import dspy
from datasets import load_dataset
import random
# Configure the OpenAI API
api_key = input("Enter your OpenAI API key: ")
lm = dspy.LM("openai/gpt-4.1-mini", temperature=1, api_key=api_key, max_tokens=32000)
dspy.configure(lm=lm)
```
### MLflow Integration (Optional)
```python
import mlflow
# Point MLflow at the tracking server
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("DSPy")
# Enable automatic logging
mlflow.dspy.autolog(
    log_compiles=True,  # log the optimization process
    log_evals=True,  # log evaluation results
    log_traces=True  # log module execution traces
)
```
## Dataset Preparation
### Loading and Splitting the Data
```python
def init_dataset():
    # Load the AIME training data (2022-2024)
    train_split = load_dataset("AI-MO/aimo-validation-aime")['train']
    train_split = [
        dspy.Example({
            "problem": x['problem'],
            'solution': x['solution'],
            'answer': x['answer'],
        }).with_inputs("problem")
        for x in train_split
    ]
    # Shuffle with a fixed seed so the split is reproducible
    random.Random(0).shuffle(train_split)
    tot_num = len(train_split)
    # Load the test data (AIME 2025)
    test_split = load_dataset("MathArena/aime_2025")['train']
    test_split = [
        dspy.Example({
            "problem": x['problem'],
            'answer': x['answer'],
        }).with_inputs("problem")
        for x in test_split
    ]
    # Split the data
    train_set = train_split[:int(0.5 * tot_num)]  # 45 examples
    val_set = train_split[int(0.5 * tot_num):]  # 45 examples
    test_set = test_split * 5  # 150 examples (each problem repeated 5 times for more stable estimates)
    return train_set, val_set, test_set

# Load the data
train_set, val_set, test_set = init_dataset()
print(f"Train: {len(train_set)}, validation: {len(val_set)}, test: {len(test_set)}")
```
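The split logic can be checked without downloading anything. A minimal sketch using placeholder strings in place of real problems; `shuffle_and_split` is a hypothetical helper, and the sizes 90 and 30 are inferred from the 45/45/150 figures in the comments above:

```python
import random

def shuffle_and_split(items, train_fraction=0.5, seed=0):
    """Shuffle a copy of items with a fixed seed, then cut it into two parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

problems = [f"problem-{i}" for i in range(90)]  # stand-in for the training problems
train, val = shuffle_and_split(problems)
test = [f"test-{i}" for i in range(30)] * 5  # stand-in for the test problems, repeated 5 times
print(len(train), len(val), len(test))
```

Because the seed is fixed, every run produces the same train/validation partition, which keeps baseline and optimized scores comparable across runs.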
### Sample Data
```python
# Look at a typical AIME problem
example = train_set[0]
print("Problem:")
print(example['problem'])
print("\nSolution:")
print(example['solution'])
print("\nAnswer:")
print(example['answer'])
```
Example output (solution omitted):
```
Problem:
In isosceles trapezoid $ABCD$, parallel bases $\overline{AB}$ and $\overline{CD}$ have lengths $500$ and $650$, respectively, and $AD=BC=333$. The angle bisectors of $\angle{A}$ and $\angle{D}$ meet at $P$, and the angle bisectors of $\angle{B}$ and $\angle{C}$ meet at $Q$. Find $PQ$.
Answer:
242
```
## Base Program Design
### Defining the Signature
```python
class GenerateResponse(dspy.Signature):
    """Solve the problem and provide the answer in the correct format."""
    problem = dspy.InputField()
    answer = dspy.OutputField()

# Create a basic chain-of-thought program
program = dspy.ChainOfThought(GenerateResponse)
```
### Basic Evaluation Metric
```python
def metric(example, prediction, trace=None, pred_name=None, pred_trace=None):
    """Simple exact-match metric."""
    correct_answer = int(example['answer'])
    try:
        llm_answer = int(prediction.answer)
    except (ValueError, TypeError):
        # The answer is missing or is not a bare integer
        return 0
    return int(correct_answer == llm_answer)
```
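Because `int()` is strict, an answer such as `\boxed{242}` scores 0 even when the reasoning is right. The tutorial's metric keeps this strictness on purpose so that format errors surface as feedback. For diagnostics only, a lenient parser can help separate format failures from reasoning failures. The sketch below is an assumption-laden illustration: `parse_aime_answer` is a hypothetical helper, not part of DSPy or the original tutorial.

```python
import re

def parse_aime_answer(raw):
    """Return the answer as an int, or None if no single integer can be recovered."""
    text = str(raw).strip()
    try:
        return int(text)  # strict path: the same check the metric uses
    except ValueError:
        pass
    # Lenient path: accept the text only if it contains exactly one integer.
    numbers = re.findall(r"-?\d+", text.replace(",", ""))
    if len(numbers) == 1:
        return int(numbers[0])
    return None

for raw in ["242", " 073 ", r"\boxed{242}", "The answer is 242.", "either 12 or 13"]:
    print(repr(raw), "->", parse_aime_answer(raw))
```

A prediction that the strict metric rejects but this parser recovers points to a formatting problem rather than a mathematical one.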
### Baseline Evaluation
```python
evaluate = dspy.Evaluate(
    devset=test_set,
    metric=metric,
    num_threads=32,
    display_table=True,
    display_progress=True
)
# Evaluate the unoptimized program
baseline_result = evaluate(program)
print(f"Baseline accuracy: {baseline_result.score:.1f}%")
```
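## Evaluation Metrics
### Feedback-Enriched Metric
A key strength of GEPA is that it can use rich textual feedback, not just a scalar score:
```python
def metric_with_feedback(example, prediction, trace=None, pred_name=None, pred_trace=None):
    """Metric with detailed feedback that gives GEPA a learning signal."""
    correct_answer = int(example['answer'])
    written_solution = example.get('solution', '')
    try:
        llm_answer = int(prediction.answer)
    except (ValueError, TypeError):
        # The answer could not be parsed as an integer
        feedback_text = f"The final answer must be a valid integer and nothing else. You responded with '{prediction.answer}', which couldn't be parsed as a Python integer. Please ensure your answer is a valid integer without any additional text or formatting."
        feedback_text += f" The correct answer is '{correct_answer}'."
        if written_solution:
            feedback_text += f" Here's the full step-by-step solution:\n{written_solution}\n\nThink about what takeaways you can learn from this solution to improve your future answers and approach to similar problems, and ensure your final answer is a valid integer."
        return dspy.Prediction(score=0, feedback=feedback_text)

    score = int(correct_answer == llm_answer)
    # Build the feedback text
    if score == 1:
        feedback_text = f"Your answer is correct. The correct answer is '{correct_answer}'."
    else:
        feedback_text = f"Your answer is incorrect. The correct answer is '{correct_answer}'."
    if written_solution:
        feedback_text += f" Here's the full step-by-step solution:\n{written_solution}\n\nThink about what takeaways you can learn from this solution to improve your future answers and approach to similar problems."
    return dspy.Prediction(score=score, feedback=feedback_text)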
```
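### Why the Feedback Helps
1. **Error analysis**: the feedback says whether the answer failed to parse or was simply wrong
2. **Learning from solutions**: it includes the full reference solution when one is available
3. **Strategy improvement**: it prompts the reflection model to extract reusable problem-solving patterns
4. **Format compliance**: it stresses that the final answer must be a bare integer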
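The three outcomes the metric distinguishes (unparseable, wrong, correct) can be exercised without calling a model. A minimal sketch, where `build_feedback` is a hypothetical stdlib-only stand-in for the scoring logic that returns a plain tuple rather than a `dspy.Prediction`, with shortened feedback strings:

```python
def build_feedback(correct_answer: int, raw_answer, solution: str = ""):
    """Return (score, feedback) for one prediction; a stand-in for the DSPy metric."""
    try:
        llm_answer = int(raw_answer)
    except (ValueError, TypeError):
        score = 0
        feedback = (
            f"The final answer must be a valid integer. You responded with "
            f"'{raw_answer}'. The correct answer is '{correct_answer}'."
        )
    else:
        score = int(llm_answer == correct_answer)
        verdict = "correct" if score else "incorrect"
        feedback = f"Your answer is {verdict}. The correct answer is '{correct_answer}'."
    if solution:
        # The reference solution is appended whenever one is available.
        feedback += f" Here's the full step-by-step solution:\n{solution}"
    return score, feedback

for raw in ["242", "240", "about 242"]:
    score, feedback = build_feedback(242, raw, solution="(reference solution text)")
    print(score, "|", feedback.splitlines()[0])
```

Only the score feeds the accuracy number; the text is what the reflection model reads when proposing new instructions.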
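## GEPA Optimization in Detail
### GEPA Configuration
```python
from dspy import GEPA

optimizer = GEPA(
    metric=metric_with_feedback,  # feedback-enriched metric
    auto="light",  # preset light optimization budget
    num_threads=32,  # number of parallel threads
    track_stats=True,  # keep detailed statistics about the run
    reflection_minibatch_size=3,  # examples per reflection step
    reflection_lm=dspy.LM(  # dedicated reflection model
        model="gpt-5",
        temperature=1.0,
        max_tokens=32000,
        api_key=api_key
    )
)
```
### Key Parameters
- **auto="light"**: a preset light budget, suited to quick validation
- **reflection_minibatch_size=3**: each reflection step analyzes 3 examples
- **reflection_lm**: a stronger model used for reflection and for proposing new instructions
- **track_stats**: records detailed statistics about the optimization run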
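### Running the Optimization
```python
# Run GEPA
optimized_program = optimizer.compile(
    program,
    trainset=train_set,
    valset=val_set,
)
```
### The Optimization Process
GEPA runs for multiple iterations:
1. **Initial evaluation**: score the baseline program on the validation set
2. **Strategy generation**: reflect on failed examples and their feedback to propose improved instructions
3. **Candidate testing**: test the new candidate on a small minibatch
4. **Validation**: evaluate promising candidates on the full validation set
5. **Iteration**: repeat these steps until the optimization budget is exhausted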
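The loop above can be sketched on a toy problem. This is an illustration under strong assumptions, not GEPA's implementation: candidates are plain numbers instead of prompts, `propose_fn` stands in for the reflection model, the parent is chosen greedily rather than by Pareto sampling, and all function names are hypothetical.

```python
import random

def toy_reflective_search(score_fn, propose_fn, seed_candidate, train, val,
                          minibatch_size=3, budget=20, seed=0):
    """Toy accept-if-better loop over candidates; not GEPA's real implementation."""
    rng = random.Random(seed)

    def mean_score(candidate, examples):
        return sum(score_fn(candidate, ex) for ex in examples) / len(examples)

    # Step 1: score the seed candidate on the validation set.
    pool = [(seed_candidate, mean_score(seed_candidate, val))]
    for _ in range(budget):
        parent, _parent_score = max(pool, key=lambda entry: entry[1])
        batch = rng.sample(train, minibatch_size)
        child = propose_fn(parent, batch)  # step 2: propose an improved candidate
        # Step 3: keep the child only if it beats its parent on the minibatch.
        if mean_score(child, batch) > mean_score(parent, batch):
            # Step 4: score the accepted child on the full validation set.
            pool.append((child, mean_score(child, val)))
    return max(pool, key=lambda entry: entry[1])

# Toy task: a candidate is a number and scores higher the closer it is to each example.
def score_fn(candidate, example):
    return -abs(candidate - example)

def propose_fn(parent, batch):
    # Stand-in for LM reflection: move halfway toward the minibatch mean.
    return parent + 0.5 * (sum(batch) / len(batch) - parent)

train = [6, 7, 8, 7, 6, 8, 7, 7]
val = [7, 6, 8, 7]
best, best_score = toy_reflective_search(score_fn, propose_fn, 0.0, train, val)
print(f"best candidate {best:.2f} with validation score {best_score:.2f}")
```

The minibatch check is cheap and filters out unpromising proposals before the more expensive full-validation pass.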
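## Analyzing the Optimization Results
### The Generated Prompt
```python
# Inspect the prompt GEPA produced
print(optimized_program.predict.signature.instructions)
```
The optimized prompt contains:
1. **Format requirements**: an explicit specification of the output format
2. **Solution strategies**: concrete methods for different types of math problems
3. **Domain knowledge**: common techniques and pitfalls in competition math
4. **Verification**: steps for checking and validating the answer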
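### Performance Comparison
```python
# Evaluate the optimized program
optimized_result = evaluate(optimized_program)
print("Performance comparison:")
print(f"Baseline accuracy: {baseline_result.score:.1f}%")
print(f"Optimized accuracy: {optimized_result.score:.1f}%")
print(f"Improvement: {optimized_result.score - baseline_result.score:+.1f} percentage points")
```
Expected output:
```
Performance comparison:
Baseline accuracy: 46.7%
Optimized accuracy: 56.7%
Improvement: +10.0 percentage points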
```
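## Core Technical Principles
### How GEPA Works
1. **Reflective optimization**: analyzes failed cases to identify recurring patterns and problems
2. **Strategy evolution**: improves the instructions step by step based on feedback
3. **Pareto frontier**: maintains multiple candidates, keeping those that do best on some validation examples rather than only the single best on average
4. **Adaptive adjustment**: tailors strategies to the problem types it encounters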
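The Pareto idea can be illustrated with a small score matrix. A simplified sketch, assuming binary per-example scores; the function names are hypothetical, and the real implementation differs in details such as how dominated candidates are pruned:

```python
import random
from collections import Counter

def pareto_wins(scores):
    """scores[c][i] is candidate c's score on validation example i.

    Returns a Counter mapping each candidate index to the number of
    examples on which it achieves (or ties for) the best score.
    """
    wins = Counter()
    for i in range(len(scores[0])):
        best = max(row[i] for row in scores)
        for c, row in enumerate(scores):
            if row[i] == best:
                wins[c] += 1
    return wins

def sample_parent(scores, rng):
    """Pick a candidate with probability proportional to its win count."""
    wins = pareto_wins(scores)
    candidates = sorted(wins)
    return rng.choices(candidates, weights=[wins[c] for c in candidates], k=1)[0]

scores = [
    [1, 1, 0, 0],  # candidate 0: mean 0.50, the only one to solve example 0
    [0, 1, 1, 1],  # candidate 1: mean 0.75, best on average
    [0, 0, 0, 0],  # candidate 2: never best on any example
]
wins = pareto_wins(scores)
print(dict(wins))
print(sample_parent(scores, random.Random(0)))
```

Greedy selection would always pick candidate 1. Sampling by win count still gives candidate 0 a chance, so the insight that solved example 0 is not lost.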
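### Key Innovations
#### 1. Structured Feedback Learning
```python
# Conceptual illustration of how GEPA uses feedback (not GEPA's actual API).
# The two helper callables are hypothetical and must be supplied by the caller.
def analyze_failure_patterns(failed_examples, extract_error_patterns, generate_improvement_strategies):
    """Analyze failure patterns and turn them into improvement strategies."""
    patterns = extract_error_patterns(failed_examples)
    strategies = generate_improvement_strategies(patterns)
    return strategies
```
#### 2. Combining Multiple Strategies
```python
# Example strategies specialized by problem type
strategies = {
    "base_conversion": "Use modular arithmetic and positional notation",
    "geometry": "Reach first for the power of a point theorem and similar triangles",
    "combinatorics": "Distinguish ordered from unordered counts and avoid double counting",
    "number_theory": "Use congruences and prime factorization",
}
```
#### 3. Adaptive Prompt Generation
The prompts GEPA generates include:
- Guidance for identifying the problem type
- Domain-specific solution techniques
- Safeguards against common mistakes
- Answer verification steps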
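### A Closer Technical Look
#### Reflection Mechanism
```python
# Conceptual sketch of a reflection loop; this class is illustrative and not part of DSPy.
from collections import Counter

class ReflectionEngine:
    def __init__(self):
        self.error_patterns = Counter()
        self.success_patterns = Counter()

    def analyze_performance(self, examples, predictions):
        """Analyze performance and tally a pattern for each case."""
        for example, prediction in zip(examples, predictions):
            if prediction.score == 0:
                self.extract_error_pattern(example, prediction)
            else:
                self.extract_success_pattern(example, prediction)

    def extract_error_pattern(self, example, prediction):
        # Placeholder: group failures by a hypothetical 'topic' field.
        self.error_patterns[getattr(example, "topic", "unknown")] += 1

    def extract_success_pattern(self, example, prediction):
        self.success_patterns[getattr(example, "topic", "unknown")] += 1

    def pattern_to_strategy(self, pattern):
        # Placeholder: in GEPA this step is performed by the reflection LM.
        return f"Add guidance for problems of type '{pattern}'."

    def generate_improvements(self):
        """Generate improvement suggestions, most frequent error pattern first."""
        return [self.pattern_to_strategy(p) for p, _ in self.error_patterns.most_common()]
```
#### Strategy Evolution
GEPA borrows ideas from evolutionary algorithms:
1. **Mutation**: modify an existing candidate based on feedback
2. **Selection**: keep the best-performing candidates
3. **Crossover**: combine the strengths of different candidates
4. **Adaptation**: adjust strategies for new problems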
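Crossover is easiest to picture for a program with several prompted components. A toy sketch, assuming each candidate is a dict from component name to instruction text and that both parents descend from a common ancestor; `merge_candidates` and the component names are hypothetical. The single-predictor AIME program in this tutorial has only one component, so a merge has little to combine there.

```python
def merge_candidates(ancestor, parent_a, parent_b, prefer="a"):
    """Combine two descendants of a common ancestor, component by component."""
    merged = {}
    for name, original in ancestor.items():
        a_changed = parent_a[name] != original
        b_changed = parent_b[name] != original
        if a_changed and not b_changed:
            merged[name] = parent_a[name]  # only A evolved this component
        elif b_changed and not a_changed:
            merged[name] = parent_b[name]  # only B evolved this component
        elif a_changed and b_changed:
            # Both evolved it: fall back to the preferred (e.g. higher-scoring) parent.
            merged[name] = parent_a[name] if prefer == "a" else parent_b[name]
        else:
            merged[name] = original
    return merged

ancestor = {"plan": "Plan the solution.", "solve": "Solve the problem."}
parent_a = {"plan": "Identify the problem type, then plan.", "solve": "Solve the problem."}
parent_b = {"plan": "Plan the solution.", "solve": "Solve it and verify the integer answer."}
print(merge_candidates(ancestor, parent_a, parent_b))
```

The merged candidate inherits the improved `plan` from one parent and the improved `solve` from the other.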
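## Practical Guide and Tips
### 1. Data Preparation Best Practices
```python
# Check data quality
def validate_dataset(dataset):
    """Validate that every example has the required fields."""
    for example in dataset:
        assert 'problem' in example
        assert 'answer' in example
        assert isinstance(example['answer'], (int, str))
    print(f"Dataset validation passed: {len(dataset)} examples")

# Data augmentation
def augment_training_data(train_set, create_problem_variant, augmentation_factor=2):
    """Augment the training data with slight variations of each problem.

    `create_problem_variant` is a caller-supplied function (hypothetical here)
    that returns a modified copy of an example.
    """
    augmented = []
    for example in train_set:
        augmented.append(example)
        # Add variant versions of the problem
        for _ in range(augmentation_factor - 1):
            augmented.append(create_problem_variant(example))
    return augmented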
```
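### 2. Hyperparameter Tuning Suggestions
```python
# Suggested configurations for different scenarios.
# GEPA accepts exactly one budget setting (auto, max_full_evals, or max_metric_calls)
# and has no max_iterations argument, so these presets set the budget through auto.
configs = {
    "quick_test": {
        "auto": "light",
        "reflection_minibatch_size": 3
    },
    "production": {
        "auto": "medium",
        "reflection_minibatch_size": 5
    },
    "research": {
        "auto": "heavy",
        "reflection_minibatch_size": 10
    }
}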
```
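`dspy.GEPA` expects exactly one budget option: `auto`, `max_full_evals`, or `max_metric_calls`. A small guard can catch a misconfigured preset before the optimizer is constructed. This is a sketch; `check_gepa_budget` is a hypothetical helper, not a DSPy function.

```python
def check_gepa_budget(config: dict) -> str:
    """Return the name of the single budget option set in config, or raise ValueError."""
    budget_keys = ("auto", "max_full_evals", "max_metric_calls")
    chosen = [key for key in budget_keys if config.get(key) is not None]
    if len(chosen) != 1:
        raise ValueError(f"expected exactly one of {budget_keys}, got {chosen or 'none'}")
    if chosen[0] == "auto" and config["auto"] not in ("light", "medium", "heavy"):
        raise ValueError(f"unknown auto preset: {config['auto']!r}")
    return chosen[0]

print(check_gepa_budget({"auto": "light", "reflection_minibatch_size": 3}))
print(check_gepa_budget({"max_metric_calls": 500}))
```

Use `max_metric_calls` or `max_full_evals` instead of `auto` when the budget needs to be controlled explicitly.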
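### 3. Performance Monitoring
```python
def monitor_optimization_progress(scores):
    """Report each new best score and flag a sustained drop.

    `scores` is any sequence of validation scores in percent, one per candidate.
    With track_stats=True, GEPA attaches a `detailed_results` object to the
    optimized program; its `val_aggregate_scores` list is one possible source
    (scale to percent first if the values are fractions).
    """
    best_score = 0
    for iteration, score in enumerate(scores, start=1):
        if score > best_score:
            best_score = score
            print(f"Iteration {iteration}: new best score {score:.1f}%")
        # Early-stopping heuristic
        if iteration > 10 and score < best_score * 0.95:
            print("Performance dropped; consider stopping early")
            break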
```
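### 4. Error Diagnosis
```python
def diagnose_failures(failed_examples, classify_error):
    """Count failed cases by error type.

    `classify_error` is a caller-supplied function (hypothetical here) that
    maps an example to one of the keys below.
    """
    error_types = {
        "parsing_error": 0,
        "logic_error": 0,
        "calculation_error": 0,
        "format_error": 0
    }
    for example in failed_examples:
        error_type = classify_error(example)
        error_types[error_type] += 1
    print("Error type distribution:")
    for error_type, count in error_types.items():
        print(f"  {error_type}: {count}")
    return error_types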
```
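### 5. Validating the Results
```python
import math

def validate_optimization_results(baseline_score, optimized_score, test_set):
    """Check whether the improvement is statistically significant.

    Scores are accuracies in percent. Uses a normal-approximation 95% confidence
    interval for the difference of two proportions and treats test items as
    independent, which is optimistic when problems are repeated in the test set.
    """
    n = len(test_set)
    p_base = baseline_score / 100
    p_opt = optimized_score / 100
    improvement = optimized_score - baseline_score
    # Standard error of the difference, converted back to percentage points
    std_error = 100 * math.sqrt(p_base * (1 - p_base) / n + p_opt * (1 - p_opt) / n)
    ci_lower = improvement - 1.96 * std_error
    ci_upper = improvement + 1.96 * std_error
    print(f"Improvement: {improvement:.1f} percentage points")
    print(f"95% confidence interval: [{ci_lower:.1f}, {ci_upper:.1f}]")
    if ci_lower > 0:
        print("The improvement is statistically significant")
    else:
        print("The improvement may not be statistically significant")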
```
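As a worked example, the same kind of interval can be computed for this tutorial's reported accuracies. The sketch assumes 70 and 85 correct answers out of 150 (counts inferred from 46.7% and 56.7%), uses a normal approximation for the difference of two proportions, and treats all 150 items as independent even though each problem appears five times; `two_proportion_ci` is a hypothetical helper.

```python
import math

def two_proportion_ci(correct_a: int, correct_b: int, n: int, z: float = 1.96):
    """Normal-approximation CI, in percentage points, for accuracy_b - accuracy_a."""
    p_a, p_b = correct_a / n, correct_b / n
    std_error = math.sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)
    diff = p_b - p_a
    return 100 * (diff - z * std_error), 100 * (diff + z * std_error)

lower, upper = two_proportion_ci(70, 85, 150)
print(f"95% CI for the gain: [{lower:.1f}, {upper:.1f}] percentage points")
```

Under these assumptions the interval spans zero, so repeated runs or a larger test set would help confirm that a 10-point gain is not sampling noise.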
## Complete Code Example
### End-to-End Workflow
```python
#!/usr/bin/env python3
"""
Complete GEPA AIME optimization example.
Uses GEPA to improve math problem-solving accuracy.
"""
import random

import dspy
from datasets import load_dataset


class AIMEOptimizer:
    """Optimizer for AIME problem solving."""

    def __init__(self, api_key: str):
        self.api_key = api_key
        self.setup_models()
        self.train_set = None
        self.val_set = None
        self.test_set = None

    def setup_models(self):
        """Configure the task model."""
        self.lm = dspy.LM(
            "openai/gpt-4.1-mini",
            temperature=1,
            api_key=self.api_key,
            max_tokens=32000
        )
        dspy.configure(lm=self.lm)

    def load_datasets(self):
        """Load and prepare the datasets."""
        print("Loading datasets...")
        # Load the training data
        train_split = load_dataset("AI-MO/aimo-validation-aime")['train']
        train_split = [
            dspy.Example({
                "problem": x['problem'],
                'solution': x['solution'],
                'answer': x['answer'],
            }).with_inputs("problem")
            for x in train_split
        ]
        # Shuffle with a fixed seed
        random.Random(0).shuffle(train_split)
        tot_num = len(train_split)
        # Load the test data
        test_split = load_dataset("MathArena/aime_2025")['train']
        test_split = [
            dspy.Example({
                "problem": x['problem'],
                'answer': x['answer'],
            }).with_inputs("problem")
            for x in test_split
        ]
        # Split the data
        self.train_set = train_split[:int(0.5 * tot_num)]
        self.val_set = train_split[int(0.5 * tot_num):]
        self.test_set = test_split * 5
        print(f"Data loaded: train {len(self.train_set)}, validation {len(self.val_set)}, test {len(self.test_set)}")

    def create_program(self):
        """Create the base program."""
        class GenerateResponse(dspy.Signature):
            """Solve the problem and provide the answer in the correct format."""
            problem = dspy.InputField()
            answer = dspy.OutputField()
        return dspy.ChainOfThought(GenerateResponse)

    def create_metrics(self):
        """Create the evaluation metrics."""
        def basic_metric(example, prediction, trace=None, pred_name=None, pred_trace=None):
            correct_answer = int(example['answer'])
            try:
                llm_answer = int(prediction.answer)
            except (ValueError, TypeError):
                return 0
            return int(correct_answer == llm_answer)

        def feedback_metric(example, prediction, trace=None, pred_name=None, pred_trace=None):
            correct_answer = int(example['answer'])
            written_solution = example.get('solution', '')
            try:
                llm_answer = int(prediction.answer)
            except (ValueError, TypeError):
                feedback_text = f"The final answer must be a valid integer. You responded with '{prediction.answer}', which could not be parsed. The correct answer is '{correct_answer}'."
                if written_solution:
                    feedback_text += f" Reference solution:\n{written_solution}\n\nLearn the solution pattern and make sure the answer format is correct."
                return dspy.Prediction(score=0, feedback=feedback_text)
            score = int(correct_answer == llm_answer)
            if score == 1:
                feedback_text = f"Correct answer: '{correct_answer}'."
            else:
                feedback_text = f"Incorrect answer. The correct answer is '{correct_answer}'."
            if written_solution:
                feedback_text += f" Reference solution:\n{written_solution}\n\nAnalyze the solution method to improve future answers."
            return dspy.Prediction(score=score, feedback=feedback_text)

        return basic_metric, feedback_metric

    def evaluate_program(self, program, metric, dataset=None):
        """Evaluate a program; defaults to the test set."""
        if dataset is None:
            dataset = self.test_set
        evaluator = dspy.Evaluate(
            devset=dataset,
            metric=metric,
            num_threads=32,
            display_table=False,
            display_progress=True
        )
        return evaluator(program)

    def optimize_with_gepa(self, program, feedback_metric):
        """Optimize the program with GEPA."""
        print("Starting GEPA optimization...")
        optimizer = dspy.GEPA(
            metric=feedback_metric,
            auto="light",
            num_threads=32,
            track_stats=True,
            reflection_minibatch_size=3,
            reflection_lm=dspy.LM(
                model="gpt-5",
                temperature=1.0,
                max_tokens=32000,
                api_key=self.api_key
            )
        )
        optimized_program = optimizer.compile(
            program,
            trainset=self.train_set,
            valset=self.val_set,
        )
        print("GEPA optimization finished!")
        return optimized_program

    def run_complete_experiment(self):
        """Run the full experiment."""
        # 1. Load the data
        self.load_datasets()
        # 2. Create the program and metrics
        program = self.create_program()
        basic_metric, feedback_metric = self.create_metrics()
        # 3. Evaluate the baseline
        print("\n=== Baseline evaluation ===")
        baseline_result = self.evaluate_program(program, basic_metric)
        baseline_score = baseline_result.score
        print(f"Baseline accuracy: {baseline_score:.1f}%")
        # 4. Run GEPA
        print("\n=== GEPA optimization ===")
        optimized_program = self.optimize_with_gepa(program, feedback_metric)
        # 5. Evaluate the optimized program
        print("\n=== Optimized evaluation ===")
        optimized_result = self.evaluate_program(optimized_program, basic_metric)
        optimized_score = optimized_result.score
        print(f"Optimized accuracy: {optimized_score:.1f}%")
        # 6. Summarize the results
        improvement = optimized_score - baseline_score
        print("\n=== Summary ===")
        print(f"Baseline accuracy: {baseline_score:.1f}%")
        print(f"Optimized accuracy: {optimized_score:.1f}%")
        print(f"Improvement: {improvement:+.1f} percentage points")
        if baseline_score:  # avoid dividing by zero when the baseline solves nothing
            print(f"Relative improvement: {improvement / baseline_score * 100:.1f}%")
        # 7. Show the optimized instructions
        print("\n=== Generated instructions ===")
        instructions = optimized_program.predict.signature.instructions
        print("Length of the optimized prompt:", len(instructions))
        print("First 500 characters:")
        print(instructions[:500] + "...")
        return {
            'baseline_score': baseline_score,
            'optimized_score': optimized_score,
            'improvement': improvement,
            'optimized_program': optimized_program
        }


def main():
    """Entry point."""
    # Get the API key
    api_key = input("Enter your OpenAI API key: ")
    # Create the optimizer
    optimizer = AIMEOptimizer(api_key)
    # Run the full experiment
    results = optimizer.run_complete_experiment()
    print("\nExperiment finished!")
    return results


if __name__ == "__main__":
    results = main()
```
### Advanced Configuration Example
```python
# Custom GEPA configurations for different scenarios.
# Only arguments that dspy.GEPA accepts are included; the budget is set through auto.
# The reflection LMs below read OPENAI_API_KEY from the environment.
def create_custom_gepa_config(scenario: str):
    """Create a GEPA configuration for the given scenario."""
    configs = {
        "research": {
            "metric": metric_with_feedback,
            "auto": "heavy",
            "num_threads": 64,
            "track_stats": True,
            "reflection_minibatch_size": 10,
            # OpenAI reasoning models such as gpt-5 expect temperature=1.0
            "reflection_lm": dspy.LM(model="gpt-5", temperature=1.0, max_tokens=32000)
        },
        "production": {
            "metric": metric_with_feedback,
            "auto": "medium",
            "num_threads": 32,
            "track_stats": True,
            "reflection_minibatch_size": 5,
            # gpt-4.1 is an assumed substitute: the original gpt-4 cannot emit 16000 output tokens
            "reflection_lm": dspy.LM(model="openai/gpt-4.1", temperature=1.0, max_tokens=16000)
        },
        "quick_test": {
            "metric": metric_with_feedback,
            "auto": "light",
            "num_threads": 16,
            "track_stats": False,
            "reflection_minibatch_size": 3,
            "reflection_lm": dspy.LM(model="openai/gpt-4.1", temperature=0.8, max_tokens=8000)
        }
    }
    return configs.get(scenario, configs["production"])

# Use a custom configuration
config = create_custom_gepa_config("research")
optimizer = dspy.GEPA(**config)
```
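## Summary
This tutorial walked through using GEPA to improve math problem-solving accuracy. The main points are:
### Key Success Factors
1. **High-quality feedback**: provide detailed error information and reference solutions
2. **Sensible data splits**: keep separate training, validation, and test sets
3. **Iterative optimization**: give GEPA enough budget for several rounds of improvement
4. **Dedicated reflection model**: use a stronger model to propose new instructions
### Technical Advantages
- **Automated**: no manual prompt tuning required
- **Adaptive**: adjusts strategies to the problem types at hand
- **Interpretable**: the generated instructions are plain text with explicit reasoning
- **General**: applicable to many types of math problems
### Practical Recommendations
1. **Start with the light preset**: use `auto="light"` for initial validation
2. **Monitor the optimization**: watch how scores change from iteration to iteration
3. **Check statistical significance**: make sure the improvement is not random fluctuation
4. **Save the best program**: persist the best-performing version promptly
GEPA is an important step forward in prompt optimization. Through automated reflection and evolution, it raised accuracy on this task from 46.7% to 56.7%, and the methods and code in this tutorial can be applied to similar optimization tasks.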