A significant and recurring challenge in the development of agentic systems is the plateau in performance and reliability that often follows an initial proof-of-concept. While early demonstrations can showcase the potential of Large Language Models (LLMs) to automate complex tasks, these systems frequently fall short of production readiness. The core issue lies in their inability to autonomously diagnose and correct failures, particularly the edge cases that emerge when the system is exposed to the full complexity and variability of real-world data. This dependency on human intervention for continuous diagnosis and correction creates a bottleneck, hindering scalability and long-term viability. The initial excitement of a successful demo gives way to the reality of a brittle system that requires constant manual oversight, preventing it from achieving true operational autonomy. This cookbook addresses this critical gap by introducing a repeatable, structured retraining loop designed to capture these failures, learn from the resulting feedback, and iteratively promote improvements back into the production workflow. The framework is designed to transform a static, human-dependent agent into a dynamic, self-evolving system that can progressively enhance its own performance over time.
The proposed solution moves beyond simple, one-time prompt engineering or fine-tuning. Instead, it establishes a continuous cycle of evaluation and refinement that mirrors the iterative nature of software development and quality assurance. By instrumenting the agent with measurable feedback signals, the system can objectively identify areas of weakness, whether they be factual inaccuracies, stylistic inconsistencies, or failures to adhere to specific domain constraints. This feedback can be sourced from human experts, who provide nuanced, qualitative assessments, or from automated "LLM-as-a-judge" systems that offer scalable, quantitative scoring. This dual-source feedback mechanism ensures that the learning process is both comprehensive and efficient. The ultimate goal is to create a system that not only performs its designated task but also learns from its mistakes, gradually shifting the burden of detailed correction from human operators to high-level strategic oversight. This evolution is crucial for deploying agentic systems in high-stakes environments where accuracy, auditability, and rapid iteration are not just desirable but essential for success.
The central innovation of this cookbook is the "self-evolving loop," a systematic and iterative process designed to enable continuous, autonomous improvement of an AI agent. This loop is engineered to move agentic systems beyond static, pre-programmed behaviors and into a state of dynamic learning and adaptation. The process is structured as a continuous cycle that integrates agent execution, multi-faceted evaluation, and automated prompt refinement. It begins with a baseline agent, which generates an initial output. This output is then subjected to a rigorous evaluation process that combines the nuanced judgment of human reviewers with the scalable, consistent scoring of an automated LLM-as-a-judge. The feedback gathered from this evaluation is then used to generate an improved prompt, which is tested and scored. If the new prompt achieves a performance threshold, it replaces the original, becoming the new baseline for the next iteration. This closed-loop system ensures that the agent is constantly learning from its performance, refining its behavior, and adapting to new data or requirements without requiring constant manual intervention from engineers or domain experts. The loop is designed to be robust, with built-in mechanisms for handling failures and ensuring that only demonstrably superior versions of the agent are promoted to production.
The self-evolving loop is composed of five distinct, sequential stages that together form a complete cycle of improvement. Each stage plays a critical role in transforming raw agent outputs into actionable insights and, ultimately, into a more effective agent. The process is designed to be modular, allowing for different components to be swapped or upgraded as needed. For instance, the evaluation suite can be expanded with new graders to address specific failure modes, or the prompt optimization strategy can be enhanced with more sophisticated techniques. The loop's architecture is also designed for observability, with detailed logging and tracing at each stage to provide a clear audit trail of the agent's evolution. This transparency is crucial for debugging, understanding the impact of changes, and ensuring the reliability of the system in production environments. The following subsections will detail each of the five stages of the self-evolving loop, providing a comprehensive overview of how this framework enables the creation of truly adaptive and self-improving agentic systems.
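The five stages described below can be summarized as a single control loop. The following is a minimal, hypothetical sketch of that flow, not the cookbook's actual implementation: `run_agent`, `evaluate`, and `optimize_prompt` are placeholder callables standing in for the generation, evaluation, and optimization stages, and the threshold values are illustrative.

```python
def self_evolving_loop(baseline_prompt, dataset, run_agent, evaluate,
                       optimize_prompt, target=0.9, max_iters=5):
    """Iterate: generate -> evaluate -> optimize -> promote if better."""
    best_prompt = baseline_prompt
    outputs = [run_agent(best_prompt, x) for x in dataset]
    best_score, feedback = evaluate(outputs)
    history = [(best_prompt, best_score)]            # audit trail of versions
    for _ in range(max_iters):
        if best_score >= target:                     # good enough: stop early
            break
        candidate = optimize_prompt(best_prompt, outputs, feedback)
        cand_outputs = [run_agent(candidate, x) for x in dataset]
        cand_score, cand_feedback = evaluate(cand_outputs)
        if cand_score > best_score:                  # promote only if better
            best_prompt, best_score = candidate, cand_score
            outputs, feedback = cand_outputs, cand_feedback
            history.append((candidate, cand_score))
    return best_prompt, history
```

Note that a candidate prompt is promoted only when its score strictly improves on the baseline, and every promoted version is appended to a history list, giving the audit trail described above.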
The first step in the self-evolving loop is the establishment of a baseline agent, which serves as the initial benchmark for all subsequent evaluation and refinement. This agent does not need to be perfect; in fact, it can be deliberately simple to effectively illustrate the power of the iterative improvement process. In the context of this cookbook, the baseline agent is a summarization assistant tasked with condensing sections of regulatory documents. Its initial prompt is intentionally generic, such as "You are a summarization assistant. Given a section of text, produce a summary." This simplicity allows the optimization loop to demonstrate its ability to evolve a system from a minimal starting point to a highly specialized and effective tool. The outputs generated by this baseline agent, while potentially flawed, provide the raw material for the evaluation stage. They represent the starting point of the agent's performance curve and are the first set of data points that will be used to identify areas for improvement. The baseline agent's role is to produce a consistent stream of outputs that can be systematically evaluated, scored, and used to drive the learning process forward.
The architecture of the baseline agent can vary depending on the complexity of the task and the production environment. In this cookbook, a simplified version of a regulatory authoring agent is used, focusing specifically on the summarization task. In a more complex, real-world scenario, the baseline agent could be a composite of multiple specialized sub-agents, each responsible for a different aspect of the workflow, such as data analysis, compliance checking, or citation generation. Regardless of its complexity, the baseline agent's primary function within the loop is to serve as the initial point of comparison. Its performance is measured against a set of predefined criteria, and its outputs are the subject of both human and automated evaluation. The key is that the baseline agent is a stable, reproducible starting point. The loop is designed to improve upon this foundation, and the initial prompt and its corresponding outputs are the first iteration in a long series of continuous enhancements. The simplicity of the initial agent also underscores a key principle of the framework: that significant performance gains can be achieved not just through complex initial engineering, but through a systematic and data-driven process of iterative refinement.
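A baseline summarizer of this kind can be sketched in a few lines. The `call_model` stub below stands in for a real LLM API request (simple truncation is used only so the example runs offline); the system prompt is the generic instruction quoted above.

```python
BASELINE_SYSTEM_PROMPT = (
    "You are a summarization assistant. "
    "Given a section of text, produce a summary."
)

def call_model(system_prompt, user_text):
    # Stand-in for a real LLM call; deterministic truncation for illustration.
    return user_text[:100]

def summarize_section(section_text, system_prompt=BASELINE_SYSTEM_PROMPT):
    """Run the baseline summarizer on one document section."""
    return call_model(system_prompt, section_text)
```

The point is not the quality of this agent but its stability and reproducibility: the same prompt applied to the same sections yields comparable outputs, which is what makes it a usable baseline for the loop.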
Once the baseline agent has generated its initial outputs, the next critical stage in the self-evolving loop is the collection of structured feedback. This feedback is the lifeblood of the entire system, providing the necessary signals to identify weaknesses and guide the optimization process. The framework employs a dual-pronged approach to feedback collection, leveraging both human expertise and the scalable power of automated evaluation. This hybrid model ensures a balance between nuanced, qualitative judgment and consistent, quantitative scoring. The choice between human review and an LLM-as-a-judge, or a combination of both, depends on the specific context of the evaluation. For instance, during the initial development and prototyping phase, or in production environments where subject matter experts (SMEs) are available, human feedback is invaluable for uncovering subtle edge cases and providing rich, contextual insights. The OpenAI Evals platform provides a user-friendly interface for this purpose, allowing reviewers to provide both binary (thumbs up/down) ratings and detailed textual feedback on the agent's outputs.
In parallel, the framework utilizes an "LLM-as-a-judge" to automate the evaluation process, which is particularly useful for rapid, iterative development and for monitoring model performance at scale. This approach involves using a separate, powerful LLM to act as an evaluator, scoring the agent's outputs against a predefined rubric. This automated judge can assess a wide range of criteria, from factual accuracy and stylistic adherence to the presence of specific keywords or the correct formatting of the output. The LLM-as-a-judge is not just a simple scorer; it can also provide a rationale for its evaluation, offering actionable feedback that can be fed directly into the prompt optimization stage. This automated approach enables fast feedback loops without requiring the constant attention of human experts, making it ideal for continuous integration and deployment pipelines. By combining the strengths of both human and automated evaluation, the self-evolving loop ensures that the feedback it receives is both comprehensive and scalable, providing a solid foundation for the subsequent stages of evaluation and optimization.
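An LLM-as-a-judge can be sketched as a function that sends the source and summary to a judge model along with a rubric and parses a structured verdict. The rubric wording below is illustrative, not the cookbook's verbatim rubric, and the heuristic fallback exists only so the sketch runs without a model.

```python
import json

JUDGE_RUBRIC = (
    "Score the summary from 0 to 1 on factual accuracy, brevity, and "
    "preservation of chemical names. Return JSON: "
    '{"score": <float>, "rationale": <string>}.'
)

def judge_summary(source, summary, call_model=None):
    """Ask a judge model to grade a summary; falls back to a heuristic stub."""
    if call_model is not None:
        raw = call_model(JUDGE_RUBRIC,
                         f"SOURCE:\n{source}\n\nSUMMARY:\n{summary}")
        return json.loads(raw)
    # Offline stand-in: reward non-empty summaries shorter than the source.
    score = 1.0 if 0 < len(summary) < len(source) else 0.0
    return {"score": score, "rationale": "heuristic length check"}
```

Because the judge returns a rationale alongside the score, its output can be fed directly into the prompt optimization stage rather than serving only as a pass/fail gate.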
The feedback collected from both human reviewers and the LLM-as-a-judge is then processed in the evaluation and scoring stage. This is where the qualitative and quantitative feedback is transformed into a structured, measurable assessment of the agent's performance. The core of this stage is a suite of "graders," which are specialized evaluation functions designed to assess the agent's output against specific, predefined criteria. Each grader is responsible for a different aspect of the output's quality, and together they form a comprehensive evaluation suite. For the regulatory document summarization use case, this cookbook defines four distinct graders, each with a specific pass threshold and a clear rationale for its inclusion. This multi-grader approach ensures that the evaluation is robust and multi-faceted, capturing a wide range of potential failure modes and quality signals. The scores from each grader are then aggregated into a single, composite score that represents the overall performance of the agent for a given input.
The evaluation process is not just about assigning a single number; it's about providing a detailed breakdown of performance across different dimensions. This granular feedback is crucial for the subsequent prompt optimization stage, as it allows the system to understand not just that the agent failed, but why it failed. For example, if the agent's summary is factually accurate but too verbose, the length grader will flag this issue, providing a specific signal for the metaprompt agent to address. Similarly, if the summary is concise but omits critical chemical names, the chemical name grader will provide a clear indication of what needs to be improved. This detailed, multi-faceted scoring system is what enables the self-evolving loop to make targeted, effective improvements to the agent's instructions. The aggregated score is then compared against a target threshold to determine whether the agent's performance is acceptable or if further optimization is required. This systematic, data-driven approach to evaluation is the key to transforming the agent from a static tool into a dynamic, learning system.
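A grader suite of this shape can be sketched as below. The cookbook's four graders are not reproduced verbatim here; these three heuristic graders (the chemical-name list and citation pattern are assumptions for illustration) show the structure, and a fourth accuracy grader would typically call a judge model.

```python
import re

def length_grader(source, summary):
    """Pass if the summary is at most half the length of the source."""
    return 1.0 if len(summary) <= 0.5 * len(source) else 0.0

def chemical_name_grader(source, summary, names=("pyruvate",)):
    """Fraction of critical chemical names preserved in the summary."""
    return sum(1 for n in names if n.lower() in summary.lower()) / len(names)

def citation_grader(source, summary):
    """Fraction of regulatory citations in the source that survive."""
    cites = set(re.findall(r"\d+ CFR Part \d+", source))
    if not cites:
        return 1.0
    return sum(1 for c in cites if c in summary) / len(cites)

GRADERS = {
    "length": (length_grader, 1.0),      # (grader, pass threshold)
    "chemical": (chemical_name_grader, 1.0),
    "citation": (citation_grader, 1.0),
}

def aggregate(source, summary):
    """Composite score plus a per-grader pass/fail breakdown."""
    scores = {name: fn(source, summary) for name, (fn, _) in GRADERS.items()}
    passed = {name: scores[name] >= thr for name, (_, thr) in GRADERS.items()}
    return sum(scores.values()) / len(scores), passed
```

The per-grader breakdown is what makes the feedback actionable: a low composite score alone says the agent failed, while the `passed` dictionary says which dimension failed.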
When the evaluation stage determines that the agent's performance is below the desired threshold, the prompt optimization stage is triggered. This is the heart of the self-evolving loop, where the system takes the feedback from the graders and uses it to generate a new, improved set of instructions for the agent. This process is not a simple, one-time fix; it is an iterative search for a better prompt. The cookbook explores three distinct strategies for prompt optimization, ranging from quick manual iteration to fully automated loops, each suited for different stages of development and production. The most basic approach involves using the OpenAI Evals platform's "Optimize" button, which uses the structured human feedback to generate a new prompt. This is ideal for rapid prototyping and for scenarios where human-in-the-loop oversight is preferred. The platform's visual interface makes it easy to see the impact of the changes and to compare the performance of the new prompt against the old one.
For a more automated and scalable approach, the cookbook introduces a "metaprompt agent." This is a separate LLM agent whose sole purpose is to act as a prompt optimizer. It takes the original prompt, the agent's output, the source text, and the consolidated feedback from the graders as input, and it generates a new, improved prompt as output. This metaprompt agent is guided by a detailed template that instructs it to produce a prompt that is more specific, more directive, and better aligned with the desired performance criteria. This automated approach enables the system to explore a wide range of prompt variations without requiring manual intervention, making it ideal for continuous integration and deployment. The most advanced strategy presented is the use of the Genetic-Pareto (GEPA) framework, which employs a more sophisticated, evolutionary approach to prompt optimization. GEPA uses a combination of quantitative scores and qualitative feedback to reflect on the agent's performance and propose revisions, leading to more robust and generalized prompts. Regardless of the specific strategy used, the goal of the prompt optimization stage is the same: to use the rich, structured feedback from the evaluation stage to systematically and iteratively improve the agent's instructions, driving its performance closer to the desired target.
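The metaprompt agent's single optimization step can be sketched as follows. The template wording is illustrative, not the cookbook's actual metaprompt, and `call_model` is a stand-in for a real LLM request.

```python
METAPROMPT_TEMPLATE = """You are a prompt engineer. Improve the prompt below.

CURRENT PROMPT:
{prompt}

SOURCE TEXT:
{source}

AGENT OUTPUT:
{output}

GRADER FEEDBACK:
{feedback}

Return only the revised prompt. Make it more specific and directive,
and address every failure noted in the feedback."""

def refine_prompt(prompt, source, output, feedback, call_model):
    """One optimization step: ask a metaprompt model for a revised prompt."""
    request = METAPROMPT_TEMPLATE.format(
        prompt=prompt, source=source, output=output, feedback=feedback)
    return call_model(request).strip()
```

Because the metaprompt agent sees the original prompt, the offending output, and the consolidated grader feedback together, its revisions are targeted at the observed failure modes rather than generic rewording.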
The final stage of the self-evolving loop is the promotion of the updated agent. Once a new, improved prompt has been generated and tested, its performance is compared against the baseline. If the new version achieves a higher aggregated score and meets the predefined pass thresholds, it is promoted to become the new baseline agent. This updated agent then becomes the foundation for the next iteration of the loop, creating a continuous cycle of learning and optimization. This process of promotion is not automatic; it is a deliberate decision based on empirical evidence of superior performance. The system maintains a history of all prompt versions, along with their associated performance metrics, allowing for a clear audit trail of the agent's evolution. This versioning system is crucial for traceability and for ensuring that the system can be rolled back to a previous, stable version if a new prompt introduces unexpected regressions.
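The promotion and versioning mechanics can be sketched with a hypothetical registry class: candidates are promoted only on a strictly higher score with all thresholds met, and the version history supports rollback.

```python
from dataclasses import dataclass, field

@dataclass
class PromptRegistry:
    """Version history for prompts; promote only on demonstrated improvement."""
    versions: list = field(default_factory=list)   # (prompt, score) tuples

    @property
    def current(self):
        return self.versions[-1][0]

    def promote(self, prompt, score, thresholds_met):
        """Promote iff the candidate beats the baseline and passes all graders."""
        if self.versions and (score <= self.versions[-1][1]
                              or not thresholds_met):
            return False
        self.versions.append((prompt, score))
        return True

    def rollback(self):
        """Drop the latest version if it caused a regression in production."""
        if len(self.versions) > 1:
            self.versions.pop()
        return self.current
```

Keeping every promoted version with its score gives the audit trail described above, and `rollback` provides the escape hatch when a new prompt regresses in ways the evaluation suite did not catch.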
The promotion of the updated agent is the culmination of the entire loop. It represents the successful application of feedback-driven learning, where the system has not only identified its own weaknesses but has also taken concrete steps to address them. This continuous cycle of evaluation, optimization, and promotion is what enables the agent to evolve over time, gradually improving its performance and adapting to new challenges. The loop is designed to run continuously, either on a schedule or in response to new data, ensuring that the agent remains accurate, compliant, and effective in the face of changing requirements. By closing the loop in this way, the framework transforms a static, brittle agent into a dynamic, resilient, and self-improving system, capable of achieving and maintaining a high level of performance in even the most demanding production environments. This final stage is not an end point, but rather a new beginning, as the updated agent is immediately subjected to the next round of evaluation and refinement, perpetuating the cycle of continuous improvement.
To ground the abstract concepts of the self-evolving loop in a concrete, real-world scenario, this cookbook focuses on a challenging and high-stakes use case: the drafting of regulatory documents for the pharmaceutical industry. This domain is an ideal testbed for the framework because it demands an exceptionally high degree of accuracy, precision, and compliance. The documents produced in this field, such as those submitted to the U.S. Food and Drug Administration (FDA), are subject to rigorous scrutiny, and any errors or omissions can have significant consequences, including delays in the approval of life-saving treatments. The process of creating these documents is traditionally labor-intensive, requiring deep expertise in science, medicine, and regulatory law. Agentic systems offer a compelling solution to this challenge by assisting with tasks such as research synthesis, content generation, and document structuring. However, the critical nature of these documents means that human experts must remain in the loop to ensure factual accuracy and regulatory compliance. The self-evolving loop is perfectly suited to this "human-in-the-loop" scenario, as it is designed to gradually shift the human effort from detailed, line-by-line correction to high-level strategic oversight, thereby improving efficiency without compromising on quality.
The use case is centered around a regulatory authoring agent that is tasked with summarizing sections of a Chemistry, Manufacturing, and Controls (CMC) document. This is a highly complex and iterative process that requires the agent to not only understand the scientific content but also to adhere to strict formatting and content guidelines. The agent must be able to accurately identify and preserve critical information, such as chemical names, molecular formulas, and regulatory citations, while also producing a concise and readable summary. The self-evolving loop is used to continuously improve the agent's ability to perform this task. By providing the agent with a steady stream of feedback from both human reviewers and automated graders, the system can iteratively refine its summarization instructions, leading to progressively better performance. The following subsections will provide a more detailed overview of the problem definition, the architecture of the baseline agent, and the dataset used for evaluation, illustrating how the self-evolving loop can be applied to solve a real-world problem in a highly regulated industry.
The core problem addressed in this use case is the immense challenge of producing accurate and timely regulatory submissions for new pharmaceutical products. Pharmaceutical companies are required to prepare and submit extensive documentation to regulatory authorities like the FDA to obtain approval for new drugs. The speed and accuracy of these submissions are of paramount importance, as they directly impact the timeline for getting new, potentially life-saving treatments to patients. The process of drafting these documents is notoriously complex, iterative, and precision-driven. It requires a deep understanding of scientific and medical principles, as well as a thorough knowledge of the intricate web of regulatory requirements. Despite the availability of advanced authoring tools, the process remains highly labor-intensive and is prone to human error. This creates a significant bottleneck in the drug development pipeline, consuming valuable time and resources that could be better spent on research and innovation.
Agentic systems, powered by LLMs, present a transformative opportunity to address this challenge. These systems can provide substantial leverage by automating many of the more tedious aspects of the document drafting process, such as synthesizing research findings, generating initial drafts of content, and structuring documents according to predefined templates. However, the critical nature of these documents means that they cannot be fully automated. Human experts, with their deep domain knowledge and understanding of the regulatory landscape, are still essential for ensuring the factual accuracy and compliance of the final submissions. The key challenge, therefore, is to design a system that can effectively combine the speed and scalability of an agentic system with the precision and expertise of a human reviewer. The self-evolving loop provides a powerful solution to this challenge by creating a feedback-driven system that can learn from the corrections and guidance of human experts, gradually improving its performance and reducing the burden of manual review. This allows the human experts to focus their attention on the most critical and complex aspects of the submission, while the agent handles the more routine tasks, ultimately leading to faster, more accurate, and more efficient regulatory submissions.
To demonstrate the self-evolving loop in a self-contained and easily reproducible manner, the cookbook defines a simplified version of a regulatory authoring agent. In a full-scale production environment, such an agent would likely be a complex system composed of multiple specialized sub-agents, each responsible for a different part of the workflow, such as drafting, data analysis, compliance checking, citation generation, and fact verification. However, for the purposes of this guide, the scope is narrowed to focus on the core self-healing aspect of the system. The baseline agent is therefore composed of two primary sub-agents: a summarizer and a compliance checker. The summarizer is responsible for the core task of reading a section of a regulatory document and producing a concise, accurate summary. The compliance checker, in turn, evaluates the generated summary to ensure that it adheres to key regulatory requirements, such as those outlined in the FDA's 21 CFR Part 11. This two-agent architecture, while simplified, captures the essential elements of a real-world regulatory authoring workflow and provides a clear demonstration of how the self-evolving loop can be applied to improve the performance of a task-specific agent.
The summarizer agent is the primary focus of the optimization loop. It is configured with a simple, initial prompt and is tasked with summarizing sections of the provided CMC document. The compliance checker agent serves as an additional layer of validation, providing a binary assessment of whether the summary meets a specific regulatory standard. While the compliance checker is not directly optimized in this example, it illustrates how the framework can be extended to include multiple, interdependent agents, each with its own set of evaluation criteria. The prompts and parameters for these agents are explicitly defined in the cookbook, allowing for easy reproduction of the baseline system. For example, the summarizer agent is configured to use the file search tool to access the CMC PDF, and its prompt is a simple instruction to summarize a given section. The compliance checker agent's prompt is more specific, instructing it to verify the summary against FDA 21 CFR Part 11 and return a simple "Compliant" or "This section needs to be manually summarized" response. This clear and transparent definition of the baseline agent's architecture provides a solid foundation for the subsequent stages of the self-evolving loop.
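The two-agent pipeline can be sketched as below. The prompts paraphrase the behavior described above rather than reproducing the cookbook's exact prompts, the file search tool is omitted, and `call_model` stands in for real LLM calls.

```python
SUMMARIZER_PROMPT = ("You are a summarization assistant. "
                     "Given a section of text, produce a summary.")
CHECKER_PROMPT = (
    "Verify the summary below against FDA 21 CFR Part 11. Respond with "
    "'Compliant' or 'This section needs to be manually summarized'.")

def run_pipeline(section_text, call_model):
    """Summarize a section, then gate the result through the compliance check."""
    summary = call_model(SUMMARIZER_PROMPT, section_text)
    verdict = call_model(CHECKER_PROMPT, summary)
    needs_review = verdict != "Compliant"   # route failures to a human
    return summary, verdict, needs_review
```

The checker's binary verdict is what routes non-compliant summaries to manual review, which is the human-in-the-loop escape hatch the architecture relies on while the summarizer is still improving.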
To provide a realistic and domain-specific testbed for the self-evolving agent, the cookbook utilizes a dataset comprising approximately 70 sections extracted from a publicly available Sample CMC Section for Hyperpolarized Pyruvate (13C) Injection. This dataset is particularly well-suited for the task because it contains the kind of dense, technical, and highly specific language that is characteristic of regulatory documents in the pharmaceutical industry. The content covers a range of topics, from the chemical properties and nomenclature of the drug substance to the details of the manufacturing process and the results of stability studies. This rich and varied content provides an excellent opportunity to test the agent's ability to not only understand complex scientific information but also to accurately identify and preserve critical details, such as chemical names, molecular formulas, and regulatory citations. The use of a real-world document, rather than a synthetic or simplified dataset, ensures that the evaluation is both rigorous and relevant to the challenges faced in a production environment.
The dataset is provided in a CSV format, making it easy to load and process within the notebook environment. Each row in the dataset represents a single section of the CMC document, and the content of each section is used as the input for the summarization agent. The dataset is also used to train and validate the different prompt optimization strategies, including the GEPA framework. By using a consistent and well-defined dataset, the cookbook is able to provide a clear and reproducible demonstration of the self-evolving loop in action. The performance of the agent is evaluated against this dataset, and the results are used to drive the iterative improvement process. The choice of this specific dataset is not arbitrary; it is a deliberate decision to ground the abstract concepts of the self-evolving loop in a concrete, challenging, and highly relevant real-world problem. This approach not only makes the concepts easier to understand but also provides a clear and compelling demonstration of the practical value of the framework in a high-stakes, regulated industry.
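Loading such a dataset requires only the standard library. The column name `content` below is an assumption; adjust it to match the actual header of the CSV file.

```python
import csv

def load_sections(csv_path_or_file, content_column="content"):
    """Load CMC document sections from the evaluation CSV.

    Accepts either a file path or an open file-like object; the
    'content' column name is an assumption about the CSV layout.
    """
    f = (open(csv_path_or_file, newline="", encoding="utf-8")
         if isinstance(csv_path_or_file, str) else csv_path_or_file)
    with f:
        return [row[content_column] for row in csv.DictReader(f)]
```

Each returned string is one document section and becomes one input to the summarization agent, so the list length (roughly 70 for this dataset) is also the number of evaluation samples per loop iteration.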
The OpenAI Evals platform provides a powerful and intuitive web-based interface for the manual optimization and evaluation of prompts. This section of the cookbook demonstrates a complete, end-to-end workflow for using the platform to iteratively improve a prompt based on structured human feedback. This approach is particularly well-suited for the early stages of development, where rapid prototyping and close collaboration with subject matter experts are essential. The platform's visual interface makes it easy to see the impact of changes and to understand the optimization process, providing an excellent foundation for the more automated approaches described later in the cookbook. The workflow begins with the upload of a dataset and proceeds through the configuration of an initial prompt, the generation of outputs, the provision of structured feedback, and the use of the platform's automated optimization feature. The platform's tabbed interface allows for easy comparison of performance across different iterations, making it simple to track the evolution of the prompt and to identify the most effective changes.
The core of the platform's value lies in its ability to facilitate a tight, human-in-the-loop feedback cycle. By providing a simple and intuitive way for reviewers to rate outputs and provide detailed comments, the platform captures the nuanced, qualitative feedback that is often missing from purely automated evaluation systems. This structured feedback is then used to power the platform's automated prompt optimization feature, which generates a new, improved prompt based on the collective input of the reviewers. This combination of human judgment and automated optimization creates a powerful synergy, allowing for the rapid development of high-quality prompts with minimal manual effort. The platform also provides a clear and transparent view of the entire process, from the initial dataset to the final optimized prompt, making it easy to understand and reproduce the results. This section of the cookbook provides a detailed, step-by-step guide to using the OpenAI Evals platform, illustrating how it can be used to quickly and effectively optimize a prompt for a specific task.
The process of manually optimizing a prompt using the OpenAI Evals platform is broken down into a series of clear, sequential steps. The process begins with the preparation and upload of a dataset, which serves as the input for the agent. This is followed by the configuration of the initial prompt, which defines the agent's task and behavior. Once the prompt is configured, the platform is used to generate outputs for the entire dataset, creating a baseline for evaluation. The core of the process is the review and evaluation stage, where human reviewers provide structured feedback on the generated outputs. This feedback powers the platform's automated optimization feature, which generates a new, improved prompt. The final step is to iterate: generate outputs with the new prompt and evaluate them to measure the improvement. This cycle can be repeated until the desired level of performance is achieved. The following subsections describe each of these steps in detail, offering a comprehensive guide to the manual prompt optimization workflow.
| Step | Action | Description |
|---|---|---|
| **1** | **Upload Dataset** | Upload a CSV file containing the inputs for the agent (e.g., document sections to be summarized). |
| **2** | **Explore Data** | Verify the uploaded data is correctly formatted and contains the expected content. |
| **3** | **Configure Initial Prompt** | Define the system prompt and user prompt template. Select the model and configure parameters like temperature. |
| **4** | **Generate Outputs** | Run the configured prompt against all samples in the dataset to create a baseline of outputs. |
| **5** | **Review and Evaluate** | Add evaluation columns (Rating, Feedback) and provide structured feedback on each generated output. |
| **6** | **Optimize Prompt** | Use the "Optimize" button to automatically generate a new, improved prompt based on the collected feedback. |
| **7** | **Iterate and Compare** | Generate outputs with the new prompt, evaluate them, and repeat the cycle until performance is satisfactory. |
Table 1: A summary of the step-by-step process for manual prompt optimization using the OpenAI Evals platform.
The first step in the manual prompt optimization workflow is to upload the dataset that will be used for evaluation. The OpenAI Evals platform provides a simple and intuitive interface for this task. The user begins by clicking the "+ Create" button, which initiates the process of creating a new evaluation run. The user is then prompted to define a name for the dataset and to upload a CSV file containing the data. The platform allows the user to select which columns from the CSV file should be included in the evaluation, providing flexibility in how the data is structured. The dataset should contain the inputs that will be processed by the agent; in the case of the regulatory document summarization task, each row of the dataset represents a section of the document that needs to be summarized. Once the dataset is uploaded, the user can explore the data to verify that it has been properly formatted and that it contains the expected content. This exploration step is important for ensuring that the evaluation is based on a clean and accurate dataset, which is essential for obtaining reliable results.
The platform's data exploration features allow the user to view the uploaded data in a tabular format, making it easy to scan for any potential issues. The user can review the content of each row and column to ensure that the data is complete and correctly structured. This step is particularly important when working with complex or messy datasets, as it allows for the identification and correction of any errors before proceeding with the evaluation. The ability to preview the data before running the evaluation is a key feature of the platform, as it helps to prevent wasted time and effort on flawed or incomplete datasets. Once the user is satisfied that the dataset is correct, they can proceed to the next step of the workflow, which is the configuration of the initial prompt. The clear and straightforward process for uploading and exploring the dataset makes it easy to get started with the evaluation and ensures that the subsequent steps are based on a solid foundation of high-quality data.
After the dataset has been uploaded and verified, the next step is to configure the initial prompt that guides the agent's behavior. This step is critical: the quality of the initial prompt strongly influences both the generated outputs and the effectiveness of the subsequent optimization. The platform lets the user define a system prompt and a user prompt template. The system prompt is a high-level instruction that sets the agent's role and overall task, while the user prompt template is a more specific instruction populated with data from the dataset for each individual run. Template variables in the user prompt are replaced with the actual row values at runtime, giving a high degree of flexibility in how the agent is instructed to process the data.
The platform also provides options for configuring the underlying model. The user can select from a range of available models, such as GPT-4.1 or GPT-5, and adjust parameters such as temperature, which controls the balance between creativity and determinism in the output. For this cookbook, the initial prompt is deliberately minimal ("summarize") to demonstrate how far the optimization process can evolve a prompt from a bare starting point; in a real-world scenario, the initial prompt would usually be more detailed and task-specific. The interface makes it easy to experiment with different prompts and model settings to find a good starting point. Once the prompt is configured, the user can generate the initial set of outputs for evaluation.
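The platform performs the template substitution itself; as a rough mental model of that behavior, the sketch below fills placeholders from a dataset row. The `{{item.column}}` syntax and the `section_text` column name are illustrative assumptions, not the platform's exact specification:

```python
import re

def render_template(template: str, row: dict) -> str:
    """Replace {{item.column}} placeholders with values from a dataset row."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in row:
            raise KeyError(f"dataset row is missing column: {key}")
        return str(row[key])

    return re.sub(r"\{\{\s*item\.(\w+)\s*\}\}", substitute, template)

row = {"section_text": "Article 12 requires annual disclosure of holdings."}
prompt = render_template(
    "Summarize the following section:\n{{item.section_text}}", row
)
```

Raising on a missing column, rather than silently leaving the placeholder in place, mirrors the kind of error a misconfigured template would otherwise surface only after a full generation run.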
With the initial prompt configured, the next step is to generate outputs for the entire dataset by clicking the "Generate Output" button. The platform runs the prompt against each row, substituting the template variables with the row's values and calling the model with the configured system prompt, then displays the results in a new column in the data table. These outputs form the baseline for the subsequent evaluation and optimization. Reviewing them directly in the platform gives a quick sense of the agent's initial performance and surfaces obvious issues, which helps inform the feedback provided in the next step.
The review interface shows input and output side by side in the tabular layout, so generated summaries can be scanned quickly against the source text to assess accuracy and completeness. This first pass is not a comprehensive evaluation; detailed, structured feedback comes in the next step of the workflow. It is, however, a valuable opportunity to get a feel for the data and to spot patterns or trends in the agent's outputs before annotating them.
The core of the manual optimization workflow is providing structured feedback on the agent's outputs. This is the human-in-the-loop step, and this feedback is what drives the subsequent optimization. The platform lets reviewers add evaluation columns to the data table, configured to capture different types of feedback: a binary rating (e.g., good/bad), a numeric score, or a free-text comment. This cookbook uses two evaluation columns: a "Rating" column for a binary assessment and a "Feedback" column for detailed textual comments. Structuring the feedback this way keeps the input to the optimization process consistent and actionable.
The reviewer assesses each generated output, assigning a rating and commenting on how the output could be improved. For example, a "Bad" rating might carry the comment "The information is good, but it should be presented as bullet points to improve readability." Specific, actionable feedback of this kind is exactly what the optimization process needs to generate a better prompt. Annotations are saved with the evaluation run, creating a permanent record of the reviewer's assessment and the foundation for the automated prompt optimization. The quality and detail of the feedback directly determine the quality of the optimized prompt, making this a critical step in the overall workflow.
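If the annotations are mirrored or exported outside the UI, a minimal local representation might look like the following sketch. The field names are assumptions matching the "Rating" and "Feedback" columns above:

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    """One reviewer judgment, mirroring the "Rating" and "Feedback" columns."""
    row_id: int
    rating: str    # "Good" or "Bad"
    feedback: str  # free-text comment explaining the rating

def actionable_comments(annotations: list[Annotation]) -> list[str]:
    """Collect comments attached to negative ratings; these are the
    signals the optimizer turns into new prompt instructions."""
    return [a.feedback for a in annotations if a.rating == "Bad"]

reviews = [
    Annotation(0, "Bad", "Use bullet points to improve readability."),
    Annotation(1, "Good", ""),
    Annotation(2, "Bad", "Summarize each sub-section individually."),
]
improvement_signals = actionable_comments(reviews)
```

Separating the binary rating from the free-text comment is what keeps the feedback both measurable (the rating can be aggregated into a score) and actionable (the comment says what to change).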
Once a sufficient amount of structured feedback has been collected, the platform's automated prompt optimization feature turns it into a new, improved prompt. The process is straightforward: the user clicks the "Optimize" button, the platform generates a new prompt version in a new tab, and "View Prompt" displays the improved version. This automation is powerful because it converts the reviewers' collective judgment into a prompt that is more specific, more directive, and better aligned with the desired performance criteria; the optimization interprets the feedback and translates it into clear, actionable instructions for the agent.
The optimized prompt is typically far more detailed and specific than the original. Where the initial prompt was simply "summarize," the optimized version might specify the desired format, tone, and content of the summary, for example "Use bullet points when answering to improve readability" or "Summarize each sub-section individually." That specificity traces directly back to the reviewers' feedback. Generating such a well-structured prompt automatically saves considerable manual effort, and the result is ready to be tested and evaluated in the next step, completing the first iteration of the optimization cycle.
The final step in the manual prompt optimization workflow is iteration. With the new, optimized prompt in hand, the user clicks "Generate Output" again to run it against the entire dataset, reviews the new results, and provides feedback on any remaining issues. If further improvement is needed, clicking "Optimize" again produces another prompt version. This cycle of generating outputs, providing feedback, and optimizing can be repeated until the desired level of performance is achieved. The platform's tabbed interface makes it easy to compare prompt versions and see how the outputs have evolved from the initial prompt.
This iterative approach is the key to significant and sustained improvement: each cycle contributes fresh outputs and feedback that are used to address remaining weaknesses. Tracking and comparing the performance of different versions lets the user see the impact of each change and choose a prompt version based on evidence rather than intuition. The cycle is designed to continue until a quality threshold is reached, such as when more than 80% of the outputs receive positive feedback, or until new iterations show diminishing returns. This systematic, data-driven approach is what enables the user to quickly and effectively develop a high-performing agent for their specific task.
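The stopping rule described above can be sketched as a small helper. The 80% threshold matches the figure in the text; the minimum-improvement margin used to detect diminishing returns is an illustrative value:

```python
def should_stop(positive_rates: list[float],
                threshold: float = 0.80,
                min_gain: float = 0.02) -> bool:
    """Decide whether to end the optimization loop.

    positive_rates holds the fraction of "Good" ratings per iteration,
    oldest first. Stop when the latest iteration clears the quality
    threshold, or when the last step improved by less than min_gain
    (diminishing returns). min_gain is an illustrative default.
    """
    if not positive_rates:
        return False
    if positive_rates[-1] >= threshold:
        return True
    if len(positive_rates) >= 2 and positive_rates[-1] - positive_rates[-2] < min_gain:
        return True
    return False

# Three iterations of feedback: 45% -> 62% -> 83% positive ratings.
done = should_stop([0.45, 0.62, 0.83])
```

Encoding the stopping rule explicitly, even for a manual workflow, removes the temptation to keep iterating on intuition alone once the ratings have plateaued.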
Manual prompt optimization with the OpenAI Evals platform is best suited to specific stages of the development lifecycle and specific use cases. Its primary strength is rapid prototyping in close collaboration with subject matter experts. Early in a project, when requirements are still being defined and the agent's desired behavior is not yet fully understood, the visual interface and tight feedback loop make it easy to explore different prompt strategies, gather stakeholder feedback, and build a shared understanding of the desired outcome. The approach is also ideal where human-in-the-loop oversight is a requirement, such as in highly regulated industries like healthcare and finance, where incorporating the nuanced judgment of human experts into the optimization process is essential for the accuracy, safety, and compliance of the agent's outputs.
However, the manual approach has limitations. Its reliance on human reviewers makes it less scalable than the fully automated approaches described later in the cookbook, and it is poorly suited to continuous retraining on large data volumes, to rapid iterative development at scale, or to production environments where the agent must operate autonomously without constant supervision; in those cases, the automated, API-driven approach is the better choice. In summary, manual prompt optimization is a powerful tool for early-stage development, rapid prototyping, and use cases requiring a high degree of human oversight. It builds a high-quality baseline agent and a working understanding of prompt optimization principles, which the more automated approaches in the subsequent sections can then refine and scale.