A significant and recurring challenge in the development of agentic systems is the plateau in performance and reliability that often follows an initial proof-of-concept. While early demonstrations can showcase the potential of Large Language Models (LLMs) to automate complex tasks, these systems frequently fall short of production readiness. The core issue lies in their inability to autonomously diagnose and correct failures, particularly the edge cases that emerge when the system is exposed to the full complexity and variability of real-world data. This dependency on human intervention for continuous diagnosis and correction creates a bottleneck, hindering scalability and long-term viability. The initial excitement of a successful demo gives way to the reality of a brittle system that requires constant manual oversight, preventing it from achieving true operational autonomy. This cookbook addresses this critical gap by introducing a repeatable, structured retraining loop designed to capture these failures, learn from the resulting feedback, and iteratively promote improvements back into the production workflow. The framework is designed to transform a static, human-dependent agent into a dynamic, self-evolving system that can progressively enhance its own performance over time.
The proposed solution moves beyond simple, one-time prompt engineering or fine-tuning. Instead, it establishes a continuous cycle of evaluation and refinement that mirrors the iterative nature of software development and quality assurance. By instrumenting the agent with measurable feedback signals, the system can objectively identify areas of weakness, whether they be factual inaccuracies, stylistic inconsistencies, or failures to adhere to specific domain constraints. This feedback can be sourced from human experts, who provide nuanced, qualitative assessments, or from automated "LLM-as-a-judge" systems that offer scalable, quantitative scoring. This dual-source feedback mechanism ensures that the learning process is both comprehensive and efficient. The ultimate goal is to create a system that not only performs its designated task but also learns from its mistakes, gradually shifting the burden of detailed correction from human operators to high-level strategic oversight. This evolution is crucial for deploying agentic systems in high-stakes environments where accuracy, auditability, and rapid iteration are not just desirable but essential for success.
The central innovation of this cookbook is the "self-evolving loop," a systematic and iterative process designed to enable continuous, autonomous improvement of an AI agent. This loop is engineered to move agentic systems beyond static, pre-programmed behaviors and into a state of dynamic learning and adaptation. The process is structured as a continuous cycle that integrates agent execution, multi-faceted evaluation, and automated prompt refinement. It begins with a baseline agent, which generates an initial output. This output is then subjected to a rigorous evaluation process that combines the nuanced judgment of human reviewers with the scalable, consistent scoring of an automated LLM-as-a-judge. The feedback gathered from this evaluation is then used to generate an improved prompt, which is tested and scored. If the new prompt achieves a performance threshold, it replaces the original, becoming the new baseline for the next iteration. This closed-loop system ensures that the agent is constantly learning from its performance, refining its behavior, and adapting to new data or requirements without requiring constant manual intervention from engineers or domain experts. The loop is designed to be robust, with built-in mechanisms for handling failures and ensuring that only demonstrably superior versions of the agent are promoted to production.
The self-evolving loop is composed of five distinct, sequential stages that together form a complete cycle of improvement. Each stage plays a critical role in transforming raw agent outputs into actionable insights and, ultimately, into a more effective agent. The process is designed to be modular, allowing for different components to be swapped or upgraded as needed. For instance, the evaluation suite can be expanded with new graders to address specific failure modes, or the prompt optimization strategy can be enhanced with more sophisticated techniques. The loop's architecture is also designed for observability, with detailed logging and tracing at each stage to provide a clear audit trail of the agent's evolution. This transparency is crucial for debugging, understanding the impact of changes, and ensuring the reliability of the system in production environments. The following subsections will detail each of the five stages of the self-evolving loop, providing a comprehensive overview of how this framework enables the creation of truly adaptive and self-improving agentic systems.
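The five stages described below can be summarized as a single control loop. The following is a minimal, hypothetical sketch of that flow, not the cookbook's actual implementation: `run_agent`, `evaluate`, and `optimize_prompt` are placeholder callables standing in for the generation, evaluation, and optimization stages, and the threshold values are illustrative.

```python
def self_evolving_loop(baseline_prompt, dataset, run_agent, evaluate,
                       optimize_prompt, target=0.9, max_iters=5):
    """Iterate: generate -> evaluate -> optimize -> promote if better."""
    best_prompt = baseline_prompt
    outputs = [run_agent(best_prompt, x) for x in dataset]
    best_score, feedback = evaluate(outputs)
    history = [(best_prompt, best_score)]            # audit trail of versions
    for _ in range(max_iters):
        if best_score >= target:                     # good enough: stop early
            break
        candidate = optimize_prompt(best_prompt, outputs, feedback)
        cand_outputs = [run_agent(candidate, x) for x in dataset]
        cand_score, cand_feedback = evaluate(cand_outputs)
        if cand_score > best_score:                  # promote only if better
            best_prompt, best_score = candidate, cand_score
            outputs, feedback = cand_outputs, cand_feedback
            history.append((candidate, cand_score))
    return best_prompt, history
```

Note that a candidate prompt is promoted only when its score strictly improves on the baseline, and every promoted version is appended to a history list, giving the audit trail described above.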
The first step in the self-evolving loop is the establishment of a baseline agent, which serves as the initial benchmark for all subsequent evaluation and refinement. This agent does not need to be perfect; in fact, it can be deliberately simple to effectively illustrate the power of the iterative improvement process. In the context of this cookbook, the baseline agent is a summarization assistant tasked with condensing sections of regulatory documents. Its initial prompt is intentionally generic, such as "You are a summarization assistant. Given a section of text, produce a summary." This simplicity allows the optimization loop to demonstrate its ability to evolve a system from a minimal starting point to a highly specialized and effective tool. The outputs generated by this baseline agent, while potentially flawed, provide the raw material for the evaluation stage. They represent the starting point of the agent's performance curve and are the first set of data points that will be used to identify areas for improvement. The baseline agent's role is to produce a consistent stream of outputs that can be systematically evaluated, scored, and used to drive the learning process forward.
The architecture of the baseline agent can vary depending on the complexity of the task and the production environment. In this cookbook, a simplified version of a regulatory authoring agent is used, focusing specifically on the summarization task. In a more complex, real-world scenario, the baseline agent could be a composite of multiple specialized sub-agents, each responsible for a different aspect of the workflow, such as data analysis, compliance checking, or citation generation. Regardless of its complexity, the baseline agent's primary function within the loop is to serve as the initial point of comparison. Its performance is measured against a set of predefined criteria, and its outputs are the subject of both human and automated evaluation. The key is that the baseline agent is a stable, reproducible starting point. The loop is designed to improve upon this foundation, and the initial prompt and its corresponding outputs are the first iteration in a long series of continuous enhancements. The simplicity of the initial agent also underscores a key principle of the framework: that significant performance gains can be achieved not just through complex initial engineering, but through a systematic and data-driven process of iterative refinement.
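A baseline summarizer of this kind can be sketched in a few lines. The `call_model` stub below stands in for a real LLM API request (simple truncation is used only so the example runs offline); the system prompt is the generic instruction quoted above.

```python
BASELINE_SYSTEM_PROMPT = (
    "You are a summarization assistant. "
    "Given a section of text, produce a summary."
)

def call_model(system_prompt, user_text):
    # Stand-in for a real LLM call; deterministic truncation for illustration.
    return user_text[:100]

def summarize_section(section_text, system_prompt=BASELINE_SYSTEM_PROMPT):
    """Run the baseline summarizer on one document section."""
    return call_model(system_prompt, section_text)
```

The point is not the quality of this agent but its stability and reproducibility: the same prompt applied to the same sections yields comparable outputs, which is what makes it a usable baseline for the loop.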
Once the baseline agent has generated its initial outputs, the next critical stage in the self-evolving loop is the collection of structured feedback. This feedback is the lifeblood of the entire system, providing the necessary signals to identify weaknesses and guide the optimization process. The framework employs a dual-pronged approach to feedback collection, leveraging both human expertise and the scalable power of automated evaluation. This hybrid model ensures a balance between nuanced, qualitative judgment and consistent, quantitative scoring. The choice between human review and an LLM-as-a-judge, or a combination of both, depends on the specific context of the evaluation. For instance, during the initial development and prototyping phase, or in production environments where subject matter experts (SMEs) are available, human feedback is invaluable for uncovering subtle edge cases and providing rich, contextual insights. The OpenAI Evals platform provides a user-friendly interface for this purpose, allowing reviewers to provide both binary (thumbs up/down) ratings and detailed textual feedback on the agent's outputs.
In parallel, the framework utilizes an "LLM-as-a-judge" to automate the evaluation process, which is particularly useful for rapid, iterative development and for monitoring model performance at scale. This approach involves using a separate, powerful LLM to act as an evaluator, scoring the agent's outputs against a predefined rubric. This automated judge can assess a wide range of criteria, from factual accuracy and stylistic adherence to the presence of specific keywords or the correct formatting of the output. The LLM-as-a-judge is not just a simple scorer; it can also provide a rationale for its evaluation, offering actionable feedback that can be fed directly into the prompt optimization stage. This automated approach enables fast feedback loops without requiring the constant attention of human experts, making it ideal for continuous integration and deployment pipelines. By combining the strengths of both human and automated evaluation, the self-evolving loop ensures that the feedback it receives is both comprehensive and scalable, providing a solid foundation for the subsequent stages of evaluation and optimization.
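An LLM-as-a-judge can be sketched as a function that sends the source and summary to a judge model along with a rubric and parses a structured verdict. The rubric wording below is illustrative, not the cookbook's verbatim rubric, and the heuristic fallback exists only so the sketch runs without a model.

```python
import json

JUDGE_RUBRIC = (
    "Score the summary from 0 to 1 on factual accuracy, brevity, and "
    "preservation of chemical names. Return JSON: "
    '{"score": <float>, "rationale": <string>}.'
)

def judge_summary(source, summary, call_model=None):
    """Ask a judge model to grade a summary; falls back to a heuristic stub."""
    if call_model is not None:
        raw = call_model(JUDGE_RUBRIC,
                         f"SOURCE:\n{source}\n\nSUMMARY:\n{summary}")
        return json.loads(raw)
    # Offline stand-in: reward non-empty summaries shorter than the source.
    score = 1.0 if 0 < len(summary) < len(source) else 0.0
    return {"score": score, "rationale": "heuristic length check"}
```

Because the judge returns a rationale alongside the score, its output can be fed directly into the prompt optimization stage rather than serving only as a pass/fail gate.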
The feedback collected from both human reviewers and the LLM-as-a-judge is then processed in the evaluation and scoring stage. This is where the qualitative and quantitative feedback is transformed into a structured, measurable assessment of the agent's performance. The core of this stage is a suite of "graders," which are specialized evaluation functions designed to assess the agent's output against specific, predefined criteria. Each grader is responsible for a different aspect of the output's quality, and together they form a comprehensive evaluation suite. For the regulatory document summarization use case, this cookbook defines four distinct graders, each with a specific pass threshold and a clear rationale for its inclusion. This multi-grader approach ensures that the evaluation is robust and multi-faceted, capturing a wide range of potential failure modes and quality signals. The scores from each grader are then aggregated into a single, composite score that represents the overall performance of the agent for a given input.
The evaluation process is not just about assigning a single number; it's about providing a detailed breakdown of performance across different dimensions. This granular feedback is crucial for the subsequent prompt optimization stage, as it allows the system to understand not just that the agent failed, but why it failed. For example, if the agent's summary is factually accurate but too verbose, the length grader will flag this issue, providing a specific signal for the metaprompt agent to address. Similarly, if the summary is concise but omits critical chemical names, the chemical name grader will provide a clear indication of what needs to be improved. This detailed, multi-faceted scoring system is what enables the self-evolving loop to make targeted, effective improvements to the agent's instructions. The aggregated score is then compared against a target threshold to determine whether the agent's performance is acceptable or if further optimization is required. This systematic, data-driven approach to evaluation is the key to transforming the agent from a static tool into a dynamic, learning system.
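A grader suite of this shape can be sketched as below. The cookbook's four graders are not reproduced verbatim here; these three heuristic graders (the chemical-name list and citation pattern are assumptions for illustration) show the structure, and a fourth accuracy grader would typically call a judge model.

```python
import re

def length_grader(source, summary):
    """Pass if the summary is at most half the length of the source."""
    return 1.0 if len(summary) <= 0.5 * len(source) else 0.0

def chemical_name_grader(source, summary, names=("pyruvate",)):
    """Fraction of critical chemical names preserved in the summary."""
    return sum(1 for n in names if n.lower() in summary.lower()) / len(names)

def citation_grader(source, summary):
    """Fraction of regulatory citations in the source that survive."""
    cites = set(re.findall(r"\d+ CFR Part \d+", source))
    if not cites:
        return 1.0
    return sum(1 for c in cites if c in summary) / len(cites)

GRADERS = {
    "length": (length_grader, 1.0),      # (grader, pass threshold)
    "chemical": (chemical_name_grader, 1.0),
    "citation": (citation_grader, 1.0),
}

def aggregate(source, summary):
    """Composite score plus a per-grader pass/fail breakdown."""
    scores = {name: fn(source, summary) for name, (fn, _) in GRADERS.items()}
    passed = {name: scores[name] >= thr for name, (_, thr) in GRADERS.items()}
    return sum(scores.values()) / len(scores), passed
```

The per-grader breakdown is what makes the feedback actionable: a low composite score alone says the agent failed, while the `passed` dictionary says which dimension failed.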
When the evaluation stage determines that the agent's performance is below the desired threshold, the prompt optimization stage is triggered. This is the heart of the self-evolving loop, where the system takes the feedback from the graders and uses it to generate a new, improved set of instructions for the agent. This process is not a simple, one-time fix; it is an iterative search for a better prompt. The cookbook explores three distinct strategies for prompt optimization, ranging from quick manual iteration to fully automated loops, each suited for different stages of development and production. The most basic approach involves using the OpenAI Evals platform's "Optimize" button, which uses the structured human feedback to generate a new prompt. This is ideal for rapid prototyping and for scenarios where human-in-the-loop oversight is preferred. The platform's visual interface makes it easy to see the impact of the changes and to compare the performance of the new prompt against the old one.
For a more automated and scalable approach, the cookbook introduces a "metaprompt agent." This is a separate LLM agent whose sole purpose is to act as a prompt optimizer. It takes the original prompt, the agent's output, the source text, and the consolidated feedback from the graders as input, and it generates a new, improved prompt as output. This metaprompt agent is guided by a detailed template that instructs it to produce a prompt that is more specific, more directive, and better aligned with the desired performance criteria. This automated approach enables the system to explore a wide range of prompt variations without requiring manual intervention, making it ideal for continuous integration and deployment. The most advanced strategy presented is the use of the Genetic-Pareto (GEPA) framework, which employs a more sophisticated, evolutionary approach to prompt optimization. GEPA uses a combination of quantitative scores and qualitative feedback to reflect on the agent's performance and propose revisions, leading to more robust and generalized prompts. Regardless of the specific strategy used, the goal of the prompt optimization stage is the same: to use the rich, structured feedback from the evaluation stage to systematically and iteratively improve the agent's instructions, driving its performance closer to the desired target.
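The metaprompt agent's single optimization step can be sketched as follows. The template wording is illustrative, not the cookbook's actual metaprompt, and `call_model` is a stand-in for a real LLM request.

```python
METAPROMPT_TEMPLATE = """You are a prompt engineer. Improve the prompt below.

CURRENT PROMPT:
{prompt}

SOURCE TEXT:
{source}

AGENT OUTPUT:
{output}

GRADER FEEDBACK:
{feedback}

Return only the revised prompt. Make it more specific and directive,
and address every failure noted in the feedback."""

def refine_prompt(prompt, source, output, feedback, call_model):
    """One optimization step: ask a metaprompt model for a revised prompt."""
    request = METAPROMPT_TEMPLATE.format(
        prompt=prompt, source=source, output=output, feedback=feedback)
    return call_model(request).strip()
```

Because the metaprompt agent sees the original prompt, the offending output, and the consolidated grader feedback together, its revisions are targeted at the observed failure modes rather than generic rewording.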
The final stage of the self-evolving loop is the promotion of the updated agent. Once a new, improved prompt has been generated and tested, its performance is compared against the baseline. If the new version achieves a higher aggregated score and meets the predefined pass thresholds, it is promoted to become the new baseline agent. This updated agent then becomes the foundation for the next iteration of the loop, creating a continuous cycle of learning and optimization. This process of promotion is not automatic; it is a deliberate decision based on empirical evidence of superior performance. The system maintains a history of all prompt versions, along with their associated performance metrics, allowing for a clear audit trail of the agent's evolution. This versioning system is crucial for traceability and for ensuring that the system can be rolled back to a previous, stable version if a new prompt introduces unexpected regressions.
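The promotion and versioning mechanics can be sketched with a hypothetical registry class: candidates are promoted only on a strictly higher score with all thresholds met, and the version history supports rollback.

```python
from dataclasses import dataclass, field

@dataclass
class PromptRegistry:
    """Version history for prompts; promote only on demonstrated improvement."""
    versions: list = field(default_factory=list)   # (prompt, score) tuples

    @property
    def current(self):
        return self.versions[-1][0]

    def promote(self, prompt, score, thresholds_met):
        """Promote iff the candidate beats the baseline and passes all graders."""
        if self.versions and (score <= self.versions[-1][1]
                              or not thresholds_met):
            return False
        self.versions.append((prompt, score))
        return True

    def rollback(self):
        """Drop the latest version if it caused a regression in production."""
        if len(self.versions) > 1:
            self.versions.pop()
        return self.current
```

Keeping every promoted version with its score gives the audit trail described above, and `rollback` provides the escape hatch when a new prompt regresses in ways the evaluation suite did not catch.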
The promotion of the updated agent is the culmination of the entire loop. It represents the successful application of feedback-driven learning, where the system has not only identified its own weaknesses but has also taken concrete steps to address them. This continuous cycle of evaluation, optimization, and promotion is what enables the agent to evolve over time, gradually improving its performance and adapting to new challenges. The loop is designed to run continuously, either on a schedule or in response to new data, ensuring that the agent remains accurate, compliant, and effective in the face of changing requirements. By closing the loop in this way, the framework transforms a static, brittle agent into a dynamic, resilient, and self-improving system, capable of achieving and maintaining a high level of performance in even the most demanding production environments. This final stage is not an end point, but rather a new beginning, as the updated agent is immediately subjected to the next round of evaluation and refinement, perpetuating the cycle of continuous improvement.
To ground the abstract concepts of the self-evolving loop in a concrete, real-world scenario, this cookbook focuses on a challenging and high-stakes use case: the drafting of regulatory documents for the pharmaceutical industry. This domain is an ideal testbed for the framework because it demands an exceptionally high degree of accuracy, precision, and compliance. The documents produced in this field, such as those submitted to the U.S. Food and Drug Administration (FDA), are subject to rigorous scrutiny, and any errors or omissions can have significant consequences, including delays in the approval of life-saving treatments. The process of creating these documents is traditionally labor-intensive, requiring deep expertise in science, medicine, and regulatory law. Agentic systems offer a compelling solution to this challenge by assisting with tasks such as research synthesis, content generation, and document structuring. However, the critical nature of these documents means that human experts must remain in the loop to ensure factual accuracy and regulatory compliance. The self-evolving loop is perfectly suited to this "human-in-the-loop" scenario, as it is designed to gradually shift the human effort from detailed, line-by-line correction to high-level strategic oversight, thereby improving efficiency without compromising on quality.
The use case is centered around a regulatory authoring agent that is tasked with summarizing sections of a Chemistry, Manufacturing, and Controls (CMC) document. This is a highly complex and iterative process that requires the agent to not only understand the scientific content but also to adhere to strict formatting and content guidelines. The agent must be able to accurately identify and preserve critical information, such as chemical names, molecular formulas, and regulatory citations, while also producing a concise and readable summary. The self-evolving loop is used to continuously improve the agent's ability to perform this task. By providing the agent with a steady stream of feedback from both human reviewers and automated graders, the system can iteratively refine its summarization instructions, leading to progressively better performance. The following subsections will provide a more detailed overview of the problem definition, the architecture of the baseline agent, and the dataset used for evaluation, illustrating how the self-evolving loop can be applied to solve a real-world problem in a highly regulated industry.
The core problem addressed in this use case is the immense challenge of producing accurate and timely regulatory submissions for new pharmaceutical products. Pharmaceutical companies are required to prepare and submit extensive documentation to regulatory authorities like the FDA to obtain approval for new drugs. The speed and accuracy of these submissions are of paramount importance, as they directly impact the timeline for getting new, potentially life-saving treatments to patients. The process of drafting these documents is notoriously complex, iterative, and precision-driven. It requires a deep understanding of scientific and medical principles, as well as a thorough knowledge of the intricate web of regulatory requirements. Despite the availability of advanced authoring tools, the process remains highly labor-intensive and is prone to human error. This creates a significant bottleneck in the drug development pipeline, consuming valuable time and resources that could be better spent on research and innovation.
Agentic systems, powered by LLMs, present a transformative opportunity to address this challenge. These systems can provide substantial leverage by automating many of the more tedious aspects of the document drafting process, such as synthesizing research findings, generating initial drafts of content, and structuring documents according to predefined templates. However, the critical nature of these documents means that they cannot be fully automated. Human experts, with their deep domain knowledge and understanding of the regulatory landscape, are still essential for ensuring the factual accuracy and compliance of the final submissions. The key challenge, therefore, is to design a system that can effectively combine the speed and scalability of an agentic system with the precision and expertise of a human reviewer. The self-evolving loop provides a powerful solution to this challenge by creating a feedback-driven system that can learn from the corrections and guidance of human experts, gradually improving its performance and reducing the burden of manual review. This allows the human experts to focus their attention on the most critical and complex aspects of the submission, while the agent handles the more routine tasks, ultimately leading to faster, more accurate, and more efficient regulatory submissions.
To demonstrate the self-evolving loop in a self-contained and easily reproducible manner, the cookbook defines a simplified version of a regulatory authoring agent. In a full-scale production environment, such an agent would likely be a complex system composed of multiple specialized sub-agents, each responsible for a different part of the workflow, such as drafting, data analysis, compliance checking, citation generation, and fact verification. However, for the purposes of this guide, the scope is narrowed to focus on the core self-healing aspect of the system. The baseline agent is therefore composed of two primary sub-agents: a summarizer and a compliance checker. The summarizer is responsible for the core task of reading a section of a regulatory document and producing a concise, accurate summary. The compliance checker, in turn, evaluates the generated summary to ensure that it adheres to key regulatory requirements, such as those outlined in the FDA's 21 CFR Part 11. This two-agent architecture, while simplified, captures the essential elements of a real-world regulatory authoring workflow and provides a clear demonstration of how the self-evolving loop can be applied to improve the performance of a task-specific agent.
The summarizer agent is the primary focus of the optimization loop. It is configured with a simple, initial prompt and is tasked with summarizing sections of the provided CMC document. The compliance checker agent serves as an additional layer of validation, providing a binary assessment of whether the summary meets a specific regulatory standard. While the compliance checker is not directly optimized in this example, it illustrates how the framework can be extended to include multiple, interdependent agents, each with its own set of evaluation criteria. The prompts and parameters for these agents are explicitly defined in the cookbook, allowing for easy reproduction of the baseline system. For example, the summarizer agent is configured to use the file search tool to access the CMC PDF, and its prompt is a simple instruction to summarize a given section. The compliance checker agent's prompt is more specific, instructing it to verify the summary against FDA 21 CFR Part 11 and return a simple "Compliant" or "This section needs to be manually summarized" response. This clear and transparent definition of the baseline agent's architecture provides a solid foundation for the subsequent stages of the self-evolving loop.
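The two-agent pipeline can be sketched as below. The prompts paraphrase the behavior described above rather than reproducing the cookbook's exact prompts, the file search tool is omitted, and `call_model` stands in for real LLM calls.

```python
SUMMARIZER_PROMPT = ("You are a summarization assistant. "
                     "Given a section of text, produce a summary.")
CHECKER_PROMPT = (
    "Verify the summary below against FDA 21 CFR Part 11. Respond with "
    "'Compliant' or 'This section needs to be manually summarized'.")

def run_pipeline(section_text, call_model):
    """Summarize a section, then gate the result through the compliance check."""
    summary = call_model(SUMMARIZER_PROMPT, section_text)
    verdict = call_model(CHECKER_PROMPT, summary)
    needs_review = verdict != "Compliant"   # route failures to a human
    return summary, verdict, needs_review
```

The checker's binary verdict is what routes non-compliant summaries to manual review, which is the human-in-the-loop escape hatch the architecture relies on while the summarizer is still improving.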
To provide a realistic and domain-specific testbed for the self-evolving agent, the cookbook utilizes a dataset comprising approximately 70 sections extracted from a publicly available Sample CMC Section for Hyperpolarized Pyruvate (13C) Injection. This dataset is particularly well-suited for the task because it contains the kind of dense, technical, and highly specific language that is characteristic of regulatory documents in the pharmaceutical industry. The content covers a range of topics, from the chemical properties and nomenclature of the drug substance to the details of the manufacturing process and the results of stability studies. This rich and varied content provides an excellent opportunity to test the agent's ability to not only understand complex scientific information but also to accurately identify and preserve critical details, such as chemical names, molecular formulas, and regulatory citations. The use of a real-world document, rather than a synthetic or simplified dataset, ensures that the evaluation is both rigorous and relevant to the challenges faced in a production environment.
The dataset is provided in a CSV format, making it easy to load and process within the notebook environment. Each row in the dataset represents a single section of the CMC document, and the content of each section is used as the input for the summarization agent. The dataset is also used to train and validate the different prompt optimization strategies, including the GEPA framework. By using a consistent and well-defined dataset, the cookbook is able to provide a clear and reproducible demonstration of the self-evolving loop in action. The performance of the agent is evaluated against this dataset, and the results are used to drive the iterative improvement process. The choice of this specific dataset is not arbitrary; it is a deliberate decision to ground the abstract concepts of the self-evolving loop in a concrete, challenging, and highly relevant real-world problem. This approach not only makes the concepts easier to understand but also provides a clear and compelling demonstration of the practical value of the framework in a high-stakes, regulated industry.
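Loading such a dataset requires only the standard library. The column name `content` below is an assumption; adjust it to match the actual header of the CSV file.

```python
import csv

def load_sections(csv_path_or_file, content_column="content"):
    """Load CMC document sections from the evaluation CSV.

    Accepts either a file path or an open file-like object; the
    'content' column name is an assumption about the CSV layout.
    """
    f = (open(csv_path_or_file, newline="", encoding="utf-8")
         if isinstance(csv_path_or_file, str) else csv_path_or_file)
    with f:
        return [row[content_column] for row in csv.DictReader(f)]
```

Each returned string is one document section and becomes one input to the summarization agent, so the list length (roughly 70 for this dataset) is also the number of evaluation samples per loop iteration.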
The OpenAI Evals platform provides a powerful and intuitive web-based interface for the manual optimization and evaluation of prompts. This section of the cookbook demonstrates a complete, end-to-end workflow for using the platform to iteratively improve a prompt based on structured human feedback. This approach is particularly well-suited for the early stages of development, where rapid prototyping and close collaboration with subject matter experts are essential. The platform's visual interface makes it easy to see the impact of changes and to understand the optimization process, providing an excellent foundation for the more automated approaches described later in the cookbook. The workflow begins with the upload of a dataset and proceeds through the configuration of an initial prompt, the generation of outputs, the provision of structured feedback, and the use of the platform's automated optimization feature. The platform's tabbed interface allows for easy comparison of performance across different iterations, making it simple to track the evolution of the prompt and to identify the most effective changes.
The core of the platform's value lies in its ability to facilitate a tight, human-in-the-loop feedback cycle. By providing a simple and intuitive way for reviewers to rate outputs and provide detailed comments, the platform captures the nuanced, qualitative feedback that is often missing from purely automated evaluation systems. This structured feedback is then used to power the platform's automated prompt optimization feature, which generates a new, improved prompt based on the collective input of the reviewers. This combination of human judgment and automated optimization creates a powerful synergy, allowing for the rapid development of high-quality prompts with minimal manual effort. The platform also provides a clear and transparent view of the entire process, from the initial dataset to the final optimized prompt, making it easy to understand and reproduce the results. This section of the cookbook provides a detailed, step-by-step guide to using the OpenAI Evals platform, illustrating how it can be used to quickly and effectively optimize a prompt for a specific task.
The process of manually optimizing a prompt using the OpenAI Evals platform is broken down into a series of clear, sequential steps. The process begins with the preparation and upload of a dataset, which serves as the input for the agent. This is followed by the configuration of the initial prompt, which defines the agent's task and behavior. Once the prompt is configured, the platform is used to generate outputs for the entire dataset, creating a baseline for evaluation. The core of the process is the review and evaluation stage, where human reviewers provide structured feedback on the generated outputs. This feedback powers the platform's automated optimization feature, which generates a new, improved prompt. The final step is to iterate: generate outputs with the new prompt and evaluate them to measure the improvement. This cycle can be repeated until the desired level of performance is achieved. The following subsections describe each of these steps in detail, offering a comprehensive guide to the manual prompt optimization workflow.
| Step | Action | Description |
|---|---|---|
| **1** | **Upload Dataset** | Upload a CSV file containing the inputs for the agent (e.g., document sections to be summarized). |
| **2** | **Explore Data** | Verify the uploaded data is correctly formatted and contains the expected content. |
| **3** | **Configure Initial Prompt** | Define the system prompt and user prompt template. Select the model and configure parameters like temperature. |
| **4** | **Generate Outputs** | Run the configured prompt against all samples in the dataset to create a baseline of outputs. |
| **5** | **Review and Evaluate** | Add evaluation columns (Rating, Feedback) and provide structured feedback on each generated output. |
| **6** | **Optimize Prompt** | Use the "Optimize" button to automatically generate a new, improved prompt based on the collected feedback. |
| **7** | **Iterate and Compare** | Generate outputs with the new prompt, evaluate them, and repeat the cycle until performance is satisfactory. |
Table 1: A summary of the step-by-step process for manual prompt optimization using the OpenAI Evals platform.
The first step in the manual prompt optimization workflow is to upload the dataset that will be used for evaluation. The OpenAI Evals platform provides a simple and intuitive interface for this task. The user begins by clicking the "+ Create" button, which initiates the process of creating a new evaluation run. The user is then prompted to define a name for the dataset and to upload a CSV file containing the data. The platform allows the user to select which columns from the CSV file should be included in the evaluation, providing flexibility in how the data is structured. The dataset should contain the inputs that will be processed by the agent; in the case of the regulatory document summarization task, each row of the dataset represents a section of the document that needs to be summarized. Once the dataset is uploaded, the user can explore the data to verify that it has been properly formatted and that it contains the expected content. This exploration step is important for ensuring that the evaluation is based on a clean and accurate dataset, which is essential for obtaining reliable results.
The platform's data exploration features allow the user to view the uploaded data in a tabular format, making it easy to scan for any potential issues. The user can review the content of each row and column to ensure that the data is complete and correctly structured. This step is particularly important when working with complex or messy datasets, as it allows for the identification and correction of any errors before proceeding with the evaluation. The ability to preview the data before running the evaluation is a key feature of the platform, as it helps to prevent wasted time and effort on flawed or incomplete datasets. Once the user is satisfied that the dataset is correct, they can proceed to the next step of the workflow, which is the configuration of the initial prompt. The clear and straightforward process for uploading and exploring the dataset makes it easy to get started with the evaluation and ensures that the subsequent steps are based on a solid foundation of high-quality data.
After the dataset has been uploaded and verified, the next step is to configure the initial prompt that guides the agent's behavior. This step is critical: the quality of the initial prompt strongly influences both the generated outputs and the effectiveness of the subsequent optimization. The platform lets the user define a system prompt and a user prompt template. The system prompt is a high-level instruction that sets the agent's role and overall task, while the user prompt template is a more specific instruction populated with data from the dataset for each individual run. Template variables in the user prompt are replaced with the actual row values at runtime, giving a high degree of flexibility in how the agent is instructed to process the data.
The platform also provides options for configuring the underlying model. The user can select from a range of available models, such as GPT-4.1 or GPT-5, and adjust parameters such as temperature, which controls the balance between creativity and determinism in the output. For this cookbook, the initial prompt is deliberately minimal ("summarize") to demonstrate how far the optimization process can evolve a prompt from a bare starting point; in a real-world scenario, the initial prompt would usually be more detailed and task-specific. The interface makes it easy to experiment with different prompts and model settings to find a good starting point. Once the prompt is configured, the user can generate the initial set of outputs for evaluation.
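The platform performs the template substitution itself; as a rough mental model of that behavior, the sketch below fills placeholders from a dataset row. The `{{item.column}}` syntax and the `section_text` column name are illustrative assumptions, not the platform's exact specification:

```python
import re

def render_template(template: str, row: dict) -> str:
    """Replace {{item.column}} placeholders with values from a dataset row."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in row:
            raise KeyError(f"dataset row is missing column: {key}")
        return str(row[key])

    return re.sub(r"\{\{\s*item\.(\w+)\s*\}\}", substitute, template)

row = {"section_text": "Article 12 requires annual disclosure of holdings."}
prompt = render_template(
    "Summarize the following section:\n{{item.section_text}}", row
)
```

Raising on a missing column, rather than silently leaving the placeholder in place, mirrors the kind of error a misconfigured template would otherwise surface only after a full generation run.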
With the initial prompt configured, the next step is to generate outputs for the entire dataset by clicking the "Generate Output" button. The platform runs the prompt against each row, substituting the template variables with the row's values and calling the model with the configured system prompt, then displays the results in a new column in the data table. These outputs form the baseline for the subsequent evaluation and optimization. Reviewing them directly in the platform gives a quick sense of the agent's initial performance and surfaces obvious issues, which helps inform the feedback provided in the next step.
The review interface shows input and output side by side in the tabular layout, so generated summaries can be scanned quickly against the source text to assess accuracy and completeness. This first pass is not a comprehensive evaluation; detailed, structured feedback comes in the next step of the workflow. It is, however, a valuable opportunity to get a feel for the data and to spot patterns or trends in the agent's outputs before annotating them.
The core of the manual optimization workflow is providing structured feedback on the agent's outputs. This is the human-in-the-loop step, and this feedback is what drives the subsequent optimization. The platform lets reviewers add evaluation columns to the data table, configured to capture different types of feedback: a binary rating (e.g., good/bad), a numeric score, or a free-text comment. This cookbook uses two evaluation columns: a "Rating" column for a binary assessment and a "Feedback" column for detailed textual comments. Structuring the feedback this way keeps the input to the optimization process consistent and actionable.
The reviewer assesses each generated output, assigning a rating and commenting on how the output could be improved. For example, a "Bad" rating might carry the comment "The information is good, but it should be presented as bullet points to improve readability." Specific, actionable feedback of this kind is exactly what the optimization process needs to generate a better prompt. Annotations are saved with the evaluation run, creating a permanent record of the reviewer's assessment and the foundation for the automated prompt optimization. The quality and detail of the feedback directly determine the quality of the optimized prompt, making this a critical step in the overall workflow.
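If the annotations are mirrored or exported outside the UI, a minimal local representation might look like the following sketch. The field names are assumptions matching the "Rating" and "Feedback" columns above:

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    """One reviewer judgment, mirroring the "Rating" and "Feedback" columns."""
    row_id: int
    rating: str    # "Good" or "Bad"
    feedback: str  # free-text comment explaining the rating

def actionable_comments(annotations: list[Annotation]) -> list[str]:
    """Collect comments attached to negative ratings; these are the
    signals the optimizer turns into new prompt instructions."""
    return [a.feedback for a in annotations if a.rating == "Bad"]

reviews = [
    Annotation(0, "Bad", "Use bullet points to improve readability."),
    Annotation(1, "Good", ""),
    Annotation(2, "Bad", "Summarize each sub-section individually."),
]
improvement_signals = actionable_comments(reviews)
```

Separating the binary rating from the free-text comment is what keeps the feedback both measurable (the rating can be aggregated into a score) and actionable (the comment says what to change).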
Once a sufficient amount of structured feedback has been collected, the platform's automated prompt optimization feature turns it into a new, improved prompt. The process is straightforward: the user clicks the "Optimize" button, the platform generates a new prompt version in a new tab, and "View Prompt" displays the improved version. This automation is powerful because it converts the reviewers' collective judgment into a prompt that is more specific, more directive, and better aligned with the desired performance criteria; the optimization interprets the feedback and translates it into clear, actionable instructions for the agent.
The optimized prompt is typically far more detailed and specific than the original. Where the initial prompt was simply "summarize," the optimized version might specify the desired format, tone, and content of the summary, for example "Use bullet points when answering to improve readability" or "Summarize each sub-section individually." That specificity traces directly back to the reviewers' feedback. Generating such a well-structured prompt automatically saves considerable manual effort, and the result is ready to be tested and evaluated in the next step, completing the first iteration of the optimization cycle.
The final step in the manual prompt optimization workflow is iteration. With the new, optimized prompt in hand, the user clicks "Generate Output" again to run it against the entire dataset, reviews the new results, and provides feedback on any remaining issues. If further improvement is needed, clicking "Optimize" again produces another prompt version. This cycle of generating outputs, providing feedback, and optimizing can be repeated until the desired level of performance is achieved. The platform's tabbed interface makes it easy to compare prompt versions and see how the outputs have evolved from the initial prompt.
This iterative approach is the key to significant and sustained improvement: each cycle contributes fresh outputs and feedback that are used to address remaining weaknesses. Tracking and comparing the performance of different versions lets the user see the impact of each change and choose a prompt version based on evidence rather than intuition. The cycle is designed to continue until a quality threshold is reached, such as when more than 80% of the outputs receive positive feedback, or until new iterations show diminishing returns. This systematic, data-driven approach is what enables the user to quickly and effectively develop a high-performing agent for their specific task.
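The stopping rule described above can be sketched as a small helper. The 80% threshold matches the figure in the text; the minimum-improvement margin used to detect diminishing returns is an illustrative value:

```python
def should_stop(positive_rates: list[float],
                threshold: float = 0.80,
                min_gain: float = 0.02) -> bool:
    """Decide whether to end the optimization loop.

    positive_rates holds the fraction of "Good" ratings per iteration,
    oldest first. Stop when the latest iteration clears the quality
    threshold, or when the last step improved by less than min_gain
    (diminishing returns). min_gain is an illustrative default.
    """
    if not positive_rates:
        return False
    if positive_rates[-1] >= threshold:
        return True
    if len(positive_rates) >= 2 and positive_rates[-1] - positive_rates[-2] < min_gain:
        return True
    return False

# Three iterations of feedback: 45% -> 62% -> 83% positive ratings.
done = should_stop([0.45, 0.62, 0.83])
```

Encoding the stopping rule explicitly, even for a manual workflow, removes the temptation to keep iterating on intuition alone once the ratings have plateaued.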
Manual prompt optimization with the OpenAI Evals platform is best suited to specific stages of the development lifecycle and specific use cases. Its primary strength is rapid prototyping in close collaboration with subject matter experts. Early in a project, when requirements are still being defined and the agent's desired behavior is not yet fully understood, the visual interface and tight feedback loop make it easy to explore different prompt strategies, gather stakeholder feedback, and build a shared understanding of the desired outcome. The approach is also ideal where human-in-the-loop oversight is a requirement, such as in highly regulated industries like healthcare and finance, where incorporating the nuanced judgment of human experts into the optimization process is essential for the accuracy, safety, and compliance of the agent's outputs.
However, the manual approach has limitations. Its reliance on human reviewers makes it less scalable than the fully automated approaches described later in the cookbook, and it is poorly suited to continuous retraining on large data volumes, to rapid iterative development at scale, or to production environments where the agent must operate autonomously without constant supervision; in those cases, the automated, API-driven approach is the better choice. In summary, manual prompt optimization is a powerful tool for early-stage development, rapid prototyping, and use cases requiring a high degree of human oversight. It builds a high-quality baseline agent and a working understanding of prompt optimization principles, which the more automated approaches in the subsequent sections can then refine and scale.