Description
This issue consolidates and refines goals from previous discussions (inspired by #106 & #107). It outlines a comprehensive Proof-of-Concept (POC) demonstrating the Evolving Agents Toolkit's (EAT) ability to manage a complete agent lifecycle. The demo will showcase agent evolution through two primary mechanisms:
- Capability-Driven Model Adaptation & Prompt/Description Enhancement: Orchestrated by `SystemAgent` to change the agent's underlying LLM (GPT-4.1 to GPT-4.1-nano) and to improve the agent's prompt and its tools' descriptions for better clarity, discovery, and LLM interaction, based on natural language requirements.
- Reinforcement Fine-Tuning (RFT): Leveraging OpenAI's RFT capabilities (conceptually applied to GPT-4.1-nano) to refine the agent's policy for specific criteria, such as output format adherence or content accuracy.
The demonstration will focus on an agent named "Product Description Manager."
Motivation / Problem:
EAT aims for autonomous and data-driven agent evolution. This requires capabilities to:
- Adapt agents to new LLMs or evolving functional requirements based on high-level goals.
- Refine agent prompts and tool descriptions to optimize LLM interactions and improve discoverability.
- Continuously improve agent performance and policy based on experience and targeted feedback signals, leveraging modern fine-tuning techniques like RFT.
This unified demo will showcase a practical, end-to-end self-improvement loop within the EAT framework.
Proposed Solution & Evolution Stages:
The demo will evolve a "Product Description Manager" agent through the following stages:
Stage 1: Initial Conversational Agent (GPT-4.1 based)
- Goal: Create a foundational agent for managing product descriptions.
- Capability: Basic CRUD-like operations (e.g., Create, Retrieve, Update) for product descriptions.
- Interaction Style: Conversational ReAct (requiring step-by-step guidance via prompts to `SystemAgent` or the agent directly).
- LLM: GPT-4.1.
- EAT Orchestration:
  - `SystemAgent` receives a natural language requirement (e.g., "Create a 'Product Description Manager' agent using GPT-4.1, capable of conversational interaction, to handle product descriptions.").
  - `SystemAgent` uses its internal tools (`CreateComponentTool`) to instantiate this agent and its necessary tools (e.g., `RetrieveProductTool`, `UpdateProductTool`, `SaveProductTool`).
- Entity Store: A simple `examples/evolution_workflow/entity_store.py` will simulate product data persistence (a minimal sketch follows this list).
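To make Stage 1 concrete, here is a minimal sketch of what the simulated entity store could look like. The module path matches the one referenced above, but the class and method names (`EntityStore`, `create`, `retrieve`, `update`) are illustrative assumptions, not an existing EAT API.

```python
# examples/evolution_workflow/entity_store.py (illustrative sketch, not the actual implementation)
import json
from pathlib import Path
from typing import Optional


class EntityStore:
    """Simulates product-description persistence with a simple JSON file."""

    def __init__(self, path: str = "product_store.json"):
        self._path = Path(path)
        self._records: dict[str, dict] = (
            json.loads(self._path.read_text()) if self._path.exists() else {}
        )

    def create(self, product_id: str, data: dict) -> dict:
        self._records[product_id] = data
        self._flush()
        return data

    def retrieve(self, product_id: str) -> Optional[dict]:
        return self._records.get(product_id)

    def update(self, product_id: str, changes: dict) -> Optional[dict]:
        record = self._records.get(product_id)
        if record is None:
            return None
        record.update(changes)
        self._flush()
        return record

    def _flush(self) -> None:
        self._path.write_text(json.dumps(self._records, indent=2))
```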
Stage 2: Model Optimization & Prompt/Description Enhancement (GPT-4.1-nano)
- Goal: Adapt the agent to a more efficient model (GPT-4.1-nano) and improve its self-description (meta-prompt) and its tools' descriptions for better usability, LLM guidance, and discoverability.
- EAT Orchestration:
  - `SystemAgent` receives a new goal (e.g., "Evolve the 'Product Description Manager' to use GPT-4.1-nano. Also, refine its main description and the descriptions of its 'RetrieveProductTool' and 'UpdateProductTool' to be more precise, focusing on JSON output and key product attributes.").
  - `SystemAgent` uses `EvolveComponentTool` to change the base model reference in the agent's configuration/code.
  - `SystemAgent` uses `EvolveComponentTool` (leveraging its LLM) to rewrite the agent's `description` in its `AgentMeta` and the descriptions of its associated tools based on the refinement instructions (an illustrative invocation sketch follows this stage's description).
- Interaction Style: Remains Conversational ReAct.
- LLM: GPT-4.1-nano.
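For illustration only, the Stage 2 hand-off could look roughly like the snippet below. The `system_agent.run()` entry point and the surrounding setup are assumptions made for the sake of the example; EAT's actual `SystemAgent` construction and invocation may differ.

```python
# Illustrative only: the exact SystemAgent entry point in EAT may differ.
import asyncio

STAGE_2_GOAL = (
    "Evolve the 'Product Description Manager' to use GPT-4.1-nano. "
    "Also, refine its main description and the descriptions of its "
    "'RetrieveProductTool' and 'UpdateProductTool' to be more precise, "
    "focusing on JSON output and key product attributes."
)


async def run_stage_2(system_agent) -> str:
    # SystemAgent is expected to select EvolveComponentTool on its own,
    # swap the base model reference, and rewrite the descriptions.
    return await system_agent.run(STAGE_2_GOAL)


# asyncio.run(run_stage_2(system_agent))  # with a previously initialized SystemAgent
```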
Stage 3: Reinforcement Fine-Tuning for Policy Improvement (GPT-4.1-nano RFT)
- Goal: Improve the "Product Description Manager" agent's policy for specific criteria, such as strict adherence to a JSON output schema for product descriptions, completeness of required fields, and conciseness.
- LLM: The target for RFT is GPT-4.1-nano.
- RFT Process (Orchestrated by Demo Script using OpenAI API):
- Define a Grader:
  - Implement a grader configuration (e.g., a multi-grader; a hedged configuration sketch follows this stage's description).
  - Example RFT Goal: The RFT-enhanced agent must output product information as a JSON object strictly adhering to the schema: `{"product_id": "string", "name": "string", "price": "float", "features": ["string"], "category": "string", "short_summary": "string (max 50 words)"}`. It must always include `product_id`, `name`, and `price`.
  - Example sub-graders:
    - A Python grader (or `json_schema` type, if directly supported by OpenAI RFT grader options) to check JSON schema validity and the presence/type of required fields (`product_id`, `name`, `price`).
    - A `string_check` grader to ensure `category` is one of the predefined valid categories.
    - A `score_model` grader (using another LLM like `gpt-4o-mini`) to evaluate the `short_summary` for conciseness (e.g., < 50 words) and relevance to the input features.
- Prepare Dataset (JSONL):
  - Create `training_set.jsonl` and `validation_set.jsonl` (small, illustrative for POC).
  - Each line contains:
    - `messages`: User prompts (e.g., "Create a description for a product with ID 'XYZ', name 'Quantum Widget', price 99.99, features ['self-calibrating', 'eco-friendly'], category 'Electronics'.").
    - Reference fields for grading (e.g., `expected_json_schema: <schema_dict>`, `expected_category_values: ["Electronics", "Software"]`, `max_summary_length: 50`).
- Upload Files: Upload datasets to OpenAI.
- Create Fine-Tune Job (RFT):
  - Use the OpenAI API to start an RFT job with the GPT-4.1-nano base model, the training/validation file IDs, the grader configuration, and the target JSON schema for structured outputs.
- Monitor & Evaluate: (Briefly) Monitor the job via API. For POC, we might not wait for full completion but show job submission.
- Deploy/Use Fine-Tuned Model: Obtain the fine-tuned model ID.
- EAT Integration:
  - The fine-tuned model ID from RFT will be stored in `SmartLibrary` as a new agent version (e.g., "Product Description Manager v1.2-RFT") or as metadata for the existing GPT-4.1-nano version.
  - `SystemAgent` (or direct calls in the demo) will then use this RFT-enhanced agent, demonstrating improved output quality.
- Interaction Style: The fine-tuned agent (GPT-4.1-nano based) will still operate conversationally (ReAct), but its policy (choices, generation quality, and adherence to the RFT goal like strict JSON output) will be demonstrably improved.
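The grader described above could be expressed along the following lines. This is a hedged sketch based on OpenAI's documented grader types (`multi`, `python`, `string_check`, `score_model`); the exact field names, template variables (e.g., `{{sample.output_text}}`, `{{item.expected_category}}`), and the structure passed to the Python grader should be verified against the RFT grader documentation current at implementation time.

```python
# Illustrative multi-grader configuration for the RFT stage.
# Field names follow OpenAI's documented grader types but should be
# double-checked against the fine-tuning grader docs before use.

PYTHON_GRADER_SOURCE = '''
import json

REQUIRED = {"product_id": str, "name": str, "price": float}

def grade(sample, item) -> float:
    """Return 1.0 if the model output is valid JSON with the required typed fields."""
    # Assumes the sample payload exposes the raw completion as "output_text".
    try:
        data = json.loads(sample["output_text"])
    except (ValueError, TypeError):
        return 0.0
    for field, expected_type in REQUIRED.items():
        if not isinstance(data.get(field), expected_type):
            return 0.0
    return 1.0
'''

GRADER_CONFIG = {
    "type": "multi",
    "graders": {
        "schema_check": {
            "type": "python",
            "name": "schema_check",
            "source": PYTHON_GRADER_SOURCE,
        },
        "category_check": {
            "type": "string_check",
            "name": "category_check",
            "operation": "eq",
            # Assumes each dataset item provides a single expected category;
            # membership in a list of valid categories would need the Python grader instead.
            "input": "{{sample.output_json.category}}",
            "reference": "{{item.expected_category}}",
        },
        "summary_quality": {
            "type": "score_model",
            "name": "summary_quality",
            "model": "gpt-4o-mini",
            "input": [
                {
                    "role": "user",
                    "content": (
                        "Rate from 0 to 1 how concise (<50 words) and relevant this "
                        "summary is to the product features.\n"
                        "Features: {{item.features}}\n"
                        "Summary: {{sample.output_json.short_summary}}"
                    ),
                }
            ],
        },
    },
    # Weighted combination of the sub-grader scores.
    "calculate_output": "0.5 * schema_check + 0.2 * category_check + 0.3 * summary_quality",
}
```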
Key EAT Components Involved:
- `SystemAgent`: Orchestrates agent creation (Stage 1) and LLM/description evolution (Stage 2). Uses `CreateComponentTool`, `EvolveComponentTool`, `SearchComponentTool`.
- `SmartLibrary`: Stores different versions of the "Product Description Manager" agent, metadata about LLMs, evolution history, and the RFT-tuned model ID.
- `EvolveComponentTool`: Used by `SystemAgent` to modify agent code/config for LLM changes and to regenerate descriptions via LLM.
- `LLMService`: Provides access to GPT-4.1 and GPT-4.1-nano for agent execution and description generation.
- `AgentBus`: (Conceptually) Logs interactions that inform the RFT dataset creation (the dataset is manually curated for the POC).
Acceptance Criteria:
- A new example script `examples/evolution_workflow/unified_agent_evolution_demo.py` runs to completion.
- Stage 1: `SystemAgent` successfully creates a conversational "Product Description Manager" (v1.0) using GPT-4.1. Basic product management tasks succeed. `SmartLibrary` reflects this.
- Stage 2: `SystemAgent` successfully evolves the agent to "Product Description Manager" (v1.1) using GPT-4.1-nano. The agent's self-description and its tools' descriptions are demonstrably refined (e.g., logged old vs. new). Tasks still succeed conversationally. `SmartLibrary` reflects this new version.
- Stage 3 (RFT):
  - The demo script defines and prints a suitable RFT grader configuration.
  - The script defines and prints sample RFT training/validation data.
  - The script successfully initiates an OpenAI RFT job (API call made) targeting a GPT-4.1-nano model. The job ID is logged.
  - (If feasible within demo runtime/cost, or using a pre-tuned ID) The fine-tuned model ID is retrieved and associated with an agent version (e.g., v1.1-RFT) in `SmartLibrary`.
  - The RFT-enhanced agent, when prompted for product descriptions, demonstrably adheres more closely to the RFT goal (e.g., consistent JSON output matching the schema) compared to its pre-RFT version (v1.1).
- `SmartLibrary` (e.g., `unified_evolution_demo_library.json`) contains the distinct agent versions with metadata reflecting their LLMs, refined descriptions, and RFT status.
- Logs (e.g., `unified_evolution_demo.log`) clearly show `SystemAgent`'s orchestration for Stages 1 & 2, and the demo script's actions for Stage 3 RFT.
- A simple entity store (`examples/evolution_workflow/entity_store.py`) and tools (`examples/evolution_workflow/tools/`) are implemented and used by the agents.
- The demo runs with standard environment setup (`.env` for `OPENAI_API_KEY`). `INTENT_REVIEW_ENABLED` is set to `false`.
Implementation Details & Considerations:
- RFT Focus: The RFT stage will primarily focus on demonstrating the process of setting up and initiating an RFT job for policy improvement, especially output structuring.
- RFT Dataset & Grader: For the POC, the RFT dataset will be small and illustrative. The grader will be designed to be implementable with OpenAI's grader types.
- RFT Job Management: The demo will show job submission. Full job completion and extensive evaluation might be out of scope for a quick POC run, but the path to using the fine-tuned model will be shown.
- Description Evolution: The "description enhancement" in Stage 2 will be explicitly prompted to `SystemAgent`, which will then use its LLM capabilities (likely via `EvolveComponentTool` or a dedicated tool) to rewrite the provided descriptions.
- Model Compatibility for RFT: The demo will aim to use GPT-4.1-nano for RFT conceptually. If the OpenAI API for RFT strictly requires `o4-mini` at the time of implementation, the script will use `o4-mini` for the actual `fine_tuning/jobs` API call, clearly logging this substitution (see the job-submission sketch below).
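As a reference point for the job-submission step, below is a hedged sketch of the file upload and RFT job creation using the OpenAI Python SDK's `client.files.create` and `client.fine_tuning.jobs.create`. The `method` block of type `reinforcement` follows OpenAI's RFT documentation, but its exact payload and the base-model requirement (`o4-mini` vs. GPT-4.1-nano) should be confirmed at implementation time, as noted above.

```python
# Illustrative RFT job submission; verify the `method` payload against the
# current OpenAI reinforcement fine-tuning docs before running.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment / .env

train_file = client.files.create(
    file=open("training_set.jsonl", "rb"), purpose="fine-tune"
)
valid_file = client.files.create(
    file=open("validation_set.jsonl", "rb"), purpose="fine-tune"
)

job = client.fine_tuning.jobs.create(
    model="gpt-4.1-nano",  # substitute "o4-mini" if RFT requires it, and log the substitution
    training_file=train_file.id,
    validation_file=valid_file.id,
    method={
        "type": "reinforcement",
        "reinforcement": {
            "grader": GRADER_CONFIG,  # the multi-grader sketched earlier
            "hyperparameters": {"n_epochs": 1},  # keep the POC run small
        },
    },
)
print(f"Submitted RFT job: {job.id}")

# Later (or with a pre-tuned ID), retrieve the fine-tuned model name:
# job = client.fine_tuning.jobs.retrieve(job.id)
# fine_tuned_model = job.fine_tuned_model  # e.g., to record in SmartLibrary as v1.1-RFT
```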
Open Questions/Challenges:
- RFT Job Duration/Cost: Actual RFT job completion can be time-consuming and incur costs. The POC should manage expectations, possibly by simulating the retrieval of a fine-tuned model ID after job submission.
- Measuring "Improved Descriptions": For Stage 2, "demonstrably improves/refines its description" can be shown by logging the "before" and "after" descriptions generated by the LLM.
References: