智柴论坛

A Cookbook for Building Self-Evolving Agents

✨步子哥 @steper · 2025-11-15 10:42 · 41 views

A Framework for Continuous Improvement in Production

AI Systems · Continuous Learning

What You'll Learn

  • Diagnose why autonomous agents fall short of production readiness
  • Compare three prompt-optimization strategies
  • Assemble a self-healing workflow with human review and LLM evals

This cookbook provides a practical framework for building self-evolving agents that can learn from their mistakes and improve their performance over time. By combining human feedback, automated evaluation using an "LLM-as-a-judge," and iterative prompt optimization, you can move beyond brittle proof-of-concept demos to create robust, production-ready systems.

ML/AI Engineers

Move beyond toy demos with executable artifacts for production pipelines

Product Teams

Adapt internal tooling with accuracy, auditability, and rapid iteration

Solution Architects

Design systems that learn and improve autonomously in production

1. The Self-Evolving Agent Framework

1.1 The Core Challenge: Overcoming the Post-Proof-of-Concept Plateau

Agentic systems often plateau in performance and reliability after an initial proof-of-concept. Early demonstrations can show that Large Language Models (LLMs) are able to automate complex tasks, but these systems frequently fall short of production readiness.

The Critical Gap

The core issue lies in their inability to autonomously diagnose and correct failures, particularly the edge cases that emerge when exposed to the full complexity and variability of real-world data.

Because every diagnosis and correction then depends on human intervention, the result is a bottleneck that limits scalability and long-term viability. The self-evolving loop addresses this gap with a repeatable, structured retraining cycle that captures failures, learns from feedback, and promotes improvements back into the production workflow.

1.2 The Self-Evolving Loop: An Iterative Cycle of Feedback and Refinement

The Self-Evolving Loop Architecture

graph TD
    A["Baseline Agent"] --> B["Generate Output"]
    B --> C["Human Feedback"]
    B --> D["LLM-as-Judge"]
    C --> E["Evals & Aggregated Score"]
    D --> E
    E --> F{"Score > Threshold?"}
    F -->|"No"| G["Prompt Optimization"]
    F -->|"Yes"| H["Update Baseline Agent"]
    G --> I["Generate New Prompt"]
    I --> A
    H --> A

    style A fill:#fefefe,stroke:#0d9488,stroke-width:3px,color:#1a1a1a
    style B fill:#f0f9ff,stroke:#0369a1,stroke-width:2px,color:#1a1a1a
    style C fill:#f0fdf4,stroke:#16a34a,stroke-width:2px,color:#1a1a1a
    style D fill:#f0fdf4,stroke:#16a34a,stroke-width:2px,color:#1a1a1a
    style E fill:#fffbeb,stroke:#d97706,stroke-width:2px,color:#1a1a1a
    style F fill:#fef3c7,stroke:#d97706,stroke-width:3px,color:#1a1a1a
    style G fill:#fdf2f8,stroke:#be185d,stroke-width:2px,color:#1a1a1a
    style H fill:#ecfdf5,stroke:#059669,stroke-width:3px,color:#1a1a1a
    style I fill:#f0f9ff,stroke:#0369a1,stroke-width:2px,color:#1a1a1a

The central innovation of this cookbook is the "self-evolving loop," a systematic and iterative process designed to enable continuous, autonomous improvement of an AI agent. This loop is engineered to move agentic systems beyond static, pre-programmed behaviors and into a state of dynamic learning and adaptation.

1. Baseline Agent

Establish the initial benchmark with a deliberately simple agent

2. Feedback Collection

Gather structured feedback from humans and LLM-as-a-judge

3. Evaluation & Scoring

Measure performance using specialized graders

4. Prompt Optimization

Generate improved instructions based on feedback

5. Updated Agent

Promote the best-performing version to production
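The five steps above can be sketched as a single control loop. This is a minimal illustration, not the cookbook's implementation: `run_agent`, `score_output`, and `optimize_prompt` are hypothetical callables standing in for the agent, the evaluation suite, and the prompt optimizer, and the toy versions below exist only so the loop can run end to end. Treating a score equal to the threshold as a pass is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoopResult:
    prompt: str
    score: float
    attempts: int


def self_evolving_loop(
    baseline_prompt: str,
    section: str,
    run_agent: Callable[[str, str], str],         # hypothetical: (prompt, section) -> output
    score_output: Callable[[str, str], float],    # hypothetical: (section, output) -> score in [0, 1]
    optimize_prompt: Callable[[str, float], str], # hypothetical: (prompt, score) -> new prompt
    threshold: float = 0.85,
    max_attempts: int = 3,
) -> LoopResult:
    """Run generate -> score -> optimize until the score passes or attempts run out."""
    prompt = baseline_prompt
    best_prompt, best_score = prompt, float("-inf")
    for attempt in range(1, max_attempts + 1):
        output = run_agent(prompt, section)
        score = score_output(section, output)
        if score > best_score:
            best_prompt, best_score = prompt, score
        if score >= threshold:  # assumption: meeting the threshold counts as a pass
            return LoopResult(prompt, score, attempt)
        prompt = optimize_prompt(prompt, score)
    # No attempt passed: fall back to the best prompt seen so far.
    return LoopResult(best_prompt, best_score, max_attempts)


# Toy stand-ins so the sketch runs without any model calls.
def toy_agent(prompt: str, section: str) -> str:
    words = section.split()
    keep = min(len(words), 2 * len(prompt.split()))  # longer prompt keeps more text
    return " ".join(words[:keep])


def toy_score(section: str, output: str) -> float:
    return len(output.split()) / len(section.split())


def toy_optimizer(prompt: str, score: float) -> str:
    return prompt + " Preserve every chemical name."


result = self_evolving_loop(
    "Summarize.",
    "Pyruvic acid is mixed with AH111501 sodium salt before dissolution in buffer",
    toy_agent,
    toy_score,
    toy_optimizer,
)
print(result.attempts, result.score)
```

Keeping the three roles as injected callables means the same loop can drive a manual review step, an LLM judge, or a different optimizer without changes to the control flow.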

1.3 Use Case: Healthcare Regulatory Documentation

To ground the abstract concepts in a concrete, real-world scenario, this cookbook focuses on a challenging and high-stakes use case: the drafting of regulatory documents for the pharmaceutical industry. This domain demands an exceptionally high degree of accuracy, precision, and compliance.

Baseline Agent Architecture

  • Summarizer: Creates scientific and concise summaries
  • Compliance Checker: Evaluates against FDA 21 CFR Part 11

Dataset

  • Source: Sample CMC Section for Hyperpolarized Pyruvate (13C) Injection
  • Size: ~70 sections of technical documentation

2. Manual Prompt Optimization with OpenAI Evals

2.1 Workflow Overview

The OpenAI Evals platform provides a powerful and intuitive web-based interface for the manual optimization and evaluation of prompts. This approach is particularly well-suited for rapid prototyping and close collaboration with subject matter experts.

[Figure: OpenAI Evals platform user interface]

Key Features

  • Dataset upload and exploration
  • Prompt configuration with variables
  • Batch output generation

Optimization Tools

  • Structured feedback collection
  • Automated prompt optimization
  • Performance comparison across versions

2.2 Step-by-Step Process

| Step | Action | Description |
| --- | --- | --- |
| 1 | Upload Dataset | Upload CSV containing inputs for the agent |
| 2 | Explore Data | Verify data is properly formatted and complete |
| 3 | Configure Prompt | Define system prompt, user template, and model settings |
| 4 | Generate Outputs | Run prompt against dataset to create baseline |
| 5 | Review & Evaluate | Provide structured feedback with ratings and comments |
| 6 | Optimize Prompt | Use automated optimization based on feedback |
| 7 | Iterate & Compare | Repeat cycle until performance is satisfactory |
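Steps 1 and 2 assume a well-formed CSV of agent inputs. The sketch below shows one way to produce such a file with Python's standard `csv` module. The single column name `section` is an assumption made for illustration; the platform's required schema is not stated here.

```python
import csv
import io
from typing import Iterable


def sections_to_csv(sections: Iterable[str], column: str = "section") -> str:
    """Serialize document sections to CSV text, one section per row.

    The column name is a hypothetical choice for this sketch.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[column])
    writer.writeheader()
    for text in sections:
        cleaned = " ".join(text.split())  # collapse newlines and repeated spaces
        if cleaned:                       # skip empty sections
            writer.writerow({column: cleaned})
    return buffer.getvalue()


csv_text = sections_to_csv(["Drug  substance\noverview", "   ", "Buffer, pH 7.6"])
print(csv_text)
```

Reading the result back with `csv.DictReader` before uploading is a cheap way to confirm that quoting and row counts are what you expect.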

Pro Tip

Start with a very simple prompt such as "summarize" so that the gains from each optimization pass are easy to see. The appendix shows the contrast: the baseline prompt is two sentences long, while each optimized prompt adds detailed formatting and terminology rules.

3. Automated Self-Healing Loop

3.1 System Architecture

This section introduces a fully automated, programmatic approach to the self-evolving loop, eliminating the need for any user interface. This API-driven workflow is designed for scalability and is well-suited for integration into production pipelines and CI/CD environments.

Summarization Agent

Primary agent performing the document summarization task

Metaprompt Agent

Separate agent responsible for prompt optimization

Evaluation Suite

Collection of specialized graders for quality assessment

Orchestration Logic

Python functions managing the feedback loop workflow
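To make the metaprompt agent's role concrete, the sketch below assembles the text such an agent might receive: the current prompt plus the grader feedback it should address. The function name, the wording of the template, and the feedback format are all hypothetical; no model call is made.

```python
from typing import Mapping


def build_metaprompt(current_prompt: str, failures: Mapping[str, str]) -> str:
    """Turn grader failures into an instruction for a prompt-rewriting agent.

    `failures` maps a grader name to a short reason (hypothetical format).
    """
    if not failures:
        raise ValueError("nothing to fix: no grader reported a failure")
    feedback = "\n".join(f"- {grader}: {reason}" for grader, reason in failures.items())
    return (
        "You improve system prompts for a summarization agent.\n"
        "Current prompt:\n"
        f"{current_prompt}\n\n"
        "Graders reported these problems:\n"
        f"{feedback}\n\n"
        "Return only the revised prompt."
    )


metaprompt = build_metaprompt(
    "You are a summarization assistant.",
    {"length": "summary had 180 words, target is 100"},
)
print(metaprompt)
```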

3.2 Building the Evaluation Suite

| Grader | Type | Pass Threshold | What It Checks |
| --- | --- | --- | --- |
| Chemical Name Preservation | Python | 0.8 | Ensures all chemical names appear in summary |
| Summary Length Adherence | Python | 0.85 | Measures deviation from 100-word target |
| Semantic Similarity | Cosine Similarity | 0.85 | Calculates semantic overlap with source |
| Holistic Quality Assessment | LLM-as-a-Judge | 0.85 | Rubric-driven score from evaluator model |
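The three deterministic graders in the table can be approximated in a few lines of Python. This is a sketch under stated assumptions: the scoring formulas are illustrative guesses, not the cookbook's, and the similarity grader uses bag-of-words cosine similarity as a stand-in for whatever text representation the original grader compares. The LLM-as-a-judge grader is omitted because it requires a model call.

```python
import math
import re
from collections import Counter
from typing import Iterable


def chemical_name_score(chemical_names: Iterable[str], summary: str) -> float:
    """Fraction of expected chemical names found in the summary (case-insensitive)."""
    names = list(chemical_names)
    if not names:
        return 1.0  # nothing to preserve
    lowered = summary.lower()
    found = sum(1 for name in names if name.lower() in lowered)
    return found / len(names)


def length_score(summary: str, target_words: int = 100) -> float:
    """1.0 at the target length, falling linearly with relative deviation (assumed formula)."""
    deviation = abs(len(summary.split()) - target_words) / target_words
    return max(0.0, 1.0 - deviation)


def cosine_similarity_score(source: str, summary: str) -> float:
    """Cosine similarity of word-count vectors; a simple substitute for embeddings."""
    a = Counter(re.findall(r"[a-z0-9]+", source.lower()))
    b = Counter(re.findall(r"[a-z0-9]+", summary.lower()))
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


names_found = chemical_name_score(
    ["[1-13C]pyruvic acid", "AH111501"],
    "The drug product contains [1-13C]pyruvic acid.",
)
print(names_found)
```

Each grader returns a value in the range 0 to 1, so its result can be compared directly against the pass threshold listed for it.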

Evaluation Process Flow

graph LR
    A["Agent Output"] --> B["Chemical Grader"]
    A --> C["Length Grader"]
    A --> D["Similarity Grader"]
    A --> E["LLM Judge"]

    B --> F["Chemical Score: 0.8"]
    C --> G["Length Score: 0.85"]
    D --> H["Similarity Score: 0.9"]
    E --> I["Quality Score: 0.85"]

    F --> J["Aggregate Score: 0.85"]
    G --> J
    H --> J
    I --> J

    style A fill:#fefefe,stroke:#0d9488,stroke-width:3px,color:#1a1a1a
    style J fill:#f0f9ff,stroke:#0369a1,stroke-width:3px,color:#1a1a1a
    style B fill:#f0fdf4,stroke:#16a34a,stroke-width:2px,color:#1a1a1a
    style C fill:#f0fdf4,stroke:#16a34a,stroke-width:2px,color:#1a1a1a
    style D fill:#f0fdf4,stroke:#16a34a,stroke-width:2px,color:#1a1a1a
    style E fill:#f0fdf4,stroke:#16a34a,stroke-width:2px,color:#1a1a1a
    style F fill:#ecfdf5,stroke:#059669,stroke-width:2px,color:#1a1a1a
    style G fill:#ecfdf5,stroke:#059669,stroke-width:2px,color:#1a1a1a
    style H fill:#ecfdf5,stroke:#059669,stroke-width:2px,color:#1a1a1a
    style I fill:#ecfdf5,stroke:#059669,stroke-width:2px,color:#1a1a1a
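The example scores in the flow above (0.8, 0.85, 0.9, 0.85) are consistent with an unweighted mean of 0.85, so the sketch below assumes that aggregation rule. It also reports which graders fell below their own pass thresholds. The dictionary keys are hypothetical names chosen for this example.

```python
from statistics import fmean
from typing import Dict, Mapping

# Pass thresholds from the grader table; the key names are hypothetical.
THRESHOLDS: Dict[str, float] = {
    "chemical_name": 0.8,
    "length": 0.85,
    "similarity": 0.85,
    "llm_judge": 0.85,
}


def aggregate(scores: Mapping[str, float], thresholds: Mapping[str, float] = THRESHOLDS) -> dict:
    """Combine grader scores with an unweighted mean (assumed) and list failing graders."""
    missing = set(thresholds) - set(scores)
    if missing:
        raise ValueError(f"missing grader scores: {sorted(missing)}")
    failed = sorted(name for name, limit in thresholds.items() if scores[name] < limit)
    return {
        "aggregate": fmean([scores[name] for name in thresholds]),
        "failed": failed,
        "passed": not failed,
    }


report = aggregate({"chemical_name": 0.8, "length": 0.85, "similarity": 0.9, "llm_judge": 0.85})
print(report)
```

Returning the list of failing graders alongside the mean gives the optimizer something specific to act on when the overall score looks acceptable but one check did not pass.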

3.3 Orchestration and Monitoring

The orchestration logic brings together all components and coordinates their actions to create a seamless, automated workflow. This includes agent versioning, feedback translation, and promotion decisions.
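Agent versioning and promotion decisions can be illustrated with a small in-memory registry. This is a sketch, not the cookbook's orchestration code: the class names, the rule "promote only on a strictly higher score," and the rollback behavior are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class PromptVersion:
    version: int
    prompt: str
    score: float


@dataclass
class PromptRegistry:
    """Keeps promoted prompt versions in order; the last entry is the live one."""
    history: List[PromptVersion] = field(default_factory=list)
    _next_version: int = 0

    def current(self) -> PromptVersion:
        if not self.history:
            raise LookupError("no prompt has been registered")
        return self.history[-1]

    def promote_if_better(self, prompt: str, score: float) -> bool:
        """Promote only when the candidate beats the live version (assumed rule)."""
        if self.history and score <= self.current().score:
            return False
        self.history.append(PromptVersion(self._next_version, prompt, score))
        self._next_version += 1  # version numbers are never reused
        return True

    def rollback(self) -> PromptVersion:
        """Drop the live version and return to the previous one."""
        if len(self.history) < 2:
            raise LookupError("no earlier version to roll back to")
        self.history.pop()
        return self.current()


registry = PromptRegistry()
registry.promote_if_better("You are a summarization assistant.", 0.40)
registry.promote_if_better("Candidate that scored worse.", 0.30)   # rejected
registry.promote_if_better("Preserve exact chemical names.", 0.90)
print(registry.current())
```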

[Figure: Monitoring dashboard with metrics and graphs]

Observability Features

  • Dashboard Tracing: Real-time workflow visualization
  • Version History: Complete prompt evolution tracking
  • Performance Metrics: Latency and throughput monitoring

Production Monitoring

  • Continuous Monitoring: Scheduled re-evaluation
  • Drift Detection: Performance degradation alerts
  • Auto-Recovery: Automatic rollback to stable versions
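Drift detection can be as simple as comparing a rolling mean of recent evaluation scores with the score recorded at promotion time. The sketch below assumes that approach; the window size and tolerance are hypothetical defaults, and a production system would likely trigger a rollback or alert where this example only returns a flag.

```python
from collections import deque
from statistics import fmean


class DriftDetector:
    """Flags drift when the rolling mean drops more than `tolerance` below baseline."""

    def __init__(self, baseline_score: float, window: int = 5, tolerance: float = 0.05):
        if window < 1:
            raise ValueError("window must be at least 1")
        self.baseline = baseline_score
        self.tolerance = tolerance
        self.recent = deque(maxlen=window)  # oldest scores fall out automatically

    def observe(self, score: float) -> bool:
        """Record a scheduled re-evaluation score; return True if drift is detected."""
        self.recent.append(score)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough data yet
        return fmean(self.recent) < self.baseline - self.tolerance


detector = DriftDetector(baseline_score=0.90, window=3, tolerance=0.05)
flags = [detector.observe(s) for s in (0.90, 0.80, 0.80, 1.00)]
print(flags)
```

Waiting for a full window before flagging avoids reacting to a single low score, at the cost of a slower response to sudden regressions.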

4. Advanced Optimization Strategies

4.1 Model Evaluation and Selection

The self-evolving loop can be extended beyond prompt optimization to include the evaluation and selection of different model candidates, automatically finding the optimal balance between performance and cost.

Model Comparison Workflow

graph TD
    A["Improved Prompt"] --> B["Evaluate with GPT-5"]
    A --> C["Evaluate with GPT-5-mini"]
    A --> D["Evaluate with GPT-5-nano"]

    B --> E["Score: 0.92"]
    C --> F["Score: 0.88"]
    D --> G["Score: 0.85"]

    E --> H{"Select Best Model"}
    F --> H
    G --> H

    H --> I["GPT-5 Selected"]
    H --> J["Cost Analysis: $0.12/query"]
    H --> K["Performance: +8% improvement"]

    style A fill:#fefefe,stroke:#0d9488,stroke-width:3px,color:#1a1a1a
    style I fill:#ecfdf5,stroke:#059669,stroke-width:3px,color:#1a1a1a
    style B fill:#f0f9ff,stroke:#0369a1,stroke-width:2px,color:#1a1a1a
    style C fill:#f0f9ff,stroke:#0369a1,stroke-width:2px,color:#1a1a1a
    style D fill:#f0f9ff,stroke:#0369a1,stroke-width:2px,color:#1a1a1a
    style E fill:#ecfdf5,stroke:#16a34a,stroke-width:2px,color:#1a1a1a
    style F fill:#fef3c7,stroke:#d97706,stroke-width:2px,color:#1a1a1a
    style G fill:#fee2e2,stroke:#dc2626,stroke-width:2px,color:#1a1a1a
    style H fill:#f0f9ff,stroke:#0369a1,stroke-width:3px,color:#1a1a1a
    style J fill:#f0f9ff,stroke:#0369a1,stroke-width:2px,color:#1a1a1a
    style K fill:#f0f9ff,stroke:#0369a1,stroke-width:2px,color:#1a1a1a
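The selection step in the workflow above can be sketched as a small function over candidate results. The scores and the $0.12 figure for GPT-5 come from the diagram; the per-query costs for the two smaller models are hypothetical placeholders, and the optional "cheapest model that clears a quality bar" rule is an assumption about how performance and cost might be balanced.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    model: str
    score: float
    cost_per_query: float  # USD


def select_model(candidates: Sequence[Candidate], min_score: Optional[float] = None) -> Candidate:
    """Pick the top scorer, or the cheapest candidate meeting `min_score` if one is given."""
    if not candidates:
        raise ValueError("no candidates to choose from")
    if min_score is not None:
        eligible = [c for c in candidates if c.score >= min_score]
        if eligible:
            return min(eligible, key=lambda c: (c.cost_per_query, -c.score))
    # Default: highest score, with lower cost breaking ties.
    return max(candidates, key=lambda c: (c.score, -c.cost_per_query))


candidates = [
    Candidate("gpt-5", 0.92, 0.12),
    Candidate("gpt-5-mini", 0.88, 0.03),  # hypothetical cost
    Candidate("gpt-5-nano", 0.85, 0.01),  # hypothetical cost
]
print(select_model(candidates).model)
print(select_model(candidates, min_score=0.88).model)
```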

4.2 Prompt Optimization with Genetic-Pareto (GEPA)

The Genetic-Pareto (GEPA) framework represents a more advanced approach to prompt optimization, employing an evolutionary process with reflective, language-based updates to find robust, generalized prompts.

GEPA Framework Benefits

  • Reflective Evolution: Analyzes performance and proposes intelligent improvements
  • Generalization: Uses training/validation sets to prevent overfitting
  • Evolutionary Approach: Samples trajectories and reflects on feedback
  • Empirical Evidence: Clear performance validation across datasets
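The "Pareto" part of the name can be illustrated with a generic non-dominated filter over per-example validation scores. This is not the GEPA implementation or its API; it is a plain Python sketch of Pareto-front selection, with hypothetical prompt names and made-up scores, showing why a candidate that excels on some examples survives even when its average is not the highest.

```python
from typing import List, Mapping, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: Mapping[str, Sequence[float]]) -> List[str]:
    """Return the candidates that no other candidate dominates."""
    return sorted(
        name
        for name, own in scores.items()
        if not any(dominates(other, own) for rival, other in scores.items() if rival != name)
    )


# Made-up per-example validation scores for four hypothetical prompt candidates.
validation_scores = {
    "prompt_0": [0.2, 0.3, 0.1],
    "prompt_1": [0.9, 0.4, 0.5],
    "prompt_2": [0.5, 0.8, 0.5],
    "prompt_3": [0.5, 0.4, 0.5],
}
front = pareto_front(validation_scores)
print(front)
```

Here `prompt_1` and `prompt_2` each win on a different example, so both stay in the pool for further mutation, while the two dominated candidates are dropped.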

5. Appendix

5.1 Example Prompts from Each Optimization Method

Initial Baseline Prompt

You are a summarization assistant.
Given a section of text, produce a summary.

OpenAI Platform Optimizer Output

You are a summarization assistant.
Task: Summarize the provided text concisely and accurately.
Output requirements:
  • Output only the summary. Do not add titles, labels (e.g., "Summary:"), prefaces, or commentary.
  • Preserve the document's structure. If multiple sections/subsections appear, summarize each one.
  • Use a numbered list for sections/subsections (use their numbers/titles when present).
  • Under each, use short dash bullets for key points.
  • If there is only a single short section, return a brief bullet list or 1-2 concise sentences.
  • Split any inline lists into separate bullets.
  • Use plain, simple language. Keep bullets tight (ideally one line each). Remove redundancy.
  • Include important quantitative details (values, units, conditions) and constraints. Do not invent information.
  • Keep formatting simple: plain text, "1." numbering and "-" bullets only. No tables or special markup.
  • Retain exact technical terms/notation from the source (e.g., chemical names, isotopic labels).
  • If a section is explicitly marked "Not applicable," include that status; otherwise do not add it.

Static Metaprompt Output

You are a technical summarization assistant for scientific and regulatory documentation. Your task is to generate a concise, comprehensive, and fully detailed summary of any scientific, technical, or regulatory text provided. Strictly adhere to the following instructions:
---
1. Complete and Exact Information Inclusion  
  • Capture *every* explicit fact, technical value, specification, quantity, measurement, regulatory reference, entity, process, site, and contextual detail verbatim from the source text.
  • Do not omit or generalize any explicit information, no matter how minor.
2. Precise Terminology and Named Entity Retention
  • Reproduce all names of chemicals, drugs, mixtures, buffer components, devices, companies, institutions, regulatory standards, section numbers, and procedural labels *exactly as stated*.
  • Report all quantities, measurements, concentrations, ratios, masses, volumes, compositions, pH values, and units precisely as given.
  • Do not paraphrase, rename, substitute, or simplify any term or value.
... [additional detailed instructions] ...

GEPA Optimizer Output

You are a domain-aware summarization assistant for technical pharmaceutical texts. Given a "section" of text, produce a concise, single-paragraph summary that preserves key technical facts and exact nomenclature.
Length and format
  • Write 1–3 sentences totaling about 45–70 words (target ~60; never exceed 90).
  • Use one paragraph; no bullets, headings, tables, or heavy formatting.
Exact names and notation
  • Include every chemical name that appears in the section at least once, using the exact original spelling, capitalization, punctuation, isotopic labels, brackets, hyphens, salts, buffer names, and parenthetical qualifiers...
... [highly detailed domain-specific instructions] ...
Self-check before finalizing
  • Does the paragraph contain every distinct chemical name exactly as written in the section?
  • Is the summary 45–70 words (≤90), in a single paragraph?
  • Are the most critical process/regulatory/testing details preserved?
