
APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay

Abstract

Training effective AI agents for multi-turn interactions requires high-quality data that captures realistic human-agent dynamics, yet such data is scarce and expensive to collect manually. We introduce APIGen-MT, a two-phase framework that generates verifiable and diverse multi-turn agent data. In the first phase, our agentic pipeline produces detailed task blueprints with groundtruth actions, leveraging a committee of LLM reviewers and iterative feedback loops. These blueprints are then transformed into complete interaction trajectories through simulated human-agent interplay. We train a family of models—the xLAM-2-fc-r—with sizes ranging from 1B to 70B parameters. Our models outperform frontier models such as GPT-4o and Claude 3.5 on τ-bench and BFCL benchmarks, with even the smaller models surpassing larger counterparts, particularly in multi-turn settings, while maintaining superior consistency across multiple trials. Comprehensive experiments demonstrate that our verified-blueprint-to-details approach yields high-quality training data, enabling the development of more reliable, efficient, and capable agents.

Comparative performance of larger xLAM-2-fc-r models (8B-70B, trained with APIGen-MT data) against state-of-the-art baselines on function-calling (BFCL v3, as of 04/02/2025) and agentic (τ-bench) capabilities.

APIGen-MT Framework

Multi-turn interactions between an AI assistant and a human user present unique challenges that go beyond single-turn exchanges. We formalize this interaction as a Partially Observable Markov Decision Process (POMDP) where an assistant must engage in a multi-turn conversation to understand the user's intent and solve it through appropriate interactions with the environment while adhering to policies.
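For concreteness, the standard POMDP components can be read in this setting as follows. The mapping below is our illustrative gloss on the formalization, not a verbatim reproduction of the paper's definitions:

  • S – the environment state together with the user's latent intent, which the agent never observes directly
  • A – the agent's actions: API/function calls and natural-language responses to the user
  • T – transition dynamics of the environment and of the simulated user
  • R – task reward, e.g. 1 if the final state and outputs match the task goal and 0 otherwise
  • Ω – observations available to the agent: user utterances and API execution results
  • O – the observation function mapping underlying states to what the agent actually sees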

Generating high-quality multi-turn data that captures the complexities of agent-human interactions presents significant challenges. Directly synthesizing multi-turn conversations in one shot is difficult for two key reasons: (1) a single error or hallucination in any intermediate step can lead to complete failure, and (2) the content of each turn depends on previous function calls and their outputs, creating complex dependencies that are difficult to maintain consistently.

Overview of the APIGen-MT framework

To address these challenges, we introduce APIGen-MT, a two-phase framework for generating verifiable and diverse multi-turn data. Our approach extends the APIGen framework by adding an agentic feedback loop and simulated human-agent interplay to generate realistic multi-turn conversations.

The core insight of our approach is to separate the task generation process into two distinct phases: first creating a detailed "blueprint" of the task (Phase 1), and then using this blueprint to guide the generation of realistic multi-turn interactions that fill in the conversational details (Phase 2). This separation allows us to ensure both the correctness of the underlying task structure and the naturalness of the resulting conversations.

Phase 1: Task Configuration and Groundtruth Generation

The initial phase of APIGen-MT focuses on systematically generating well-defined task configurations, each comprising a user instruction (q), a corresponding sequence of verifiable groundtruth actions (a_gt), and the expected final outputs (o_gt). This phase establishes a solid, verifiable foundation for each interaction scenario before the complexities of conversational dynamics are introduced. This is achieved through an agentic workflow incorporating multi-stage validation and refinement loops:

  1. Context Preparation: Relevant information such as available APIs, domain-specific rules or policies, and reference data is assembled.
  2. LLM-based Data Generator: An LLM utilizes the prepared context to propose initial task configurations.
  3. Format & Execution Checker: Proposed configurations undergo automated technical validation.
  4. Review Committee: Configurations passing rule-based checks proceed to semantic evaluation by a committee of multiple LLM reviewers.
  5. Feedback Generation and Refinement: If a task fails at either the validation or review stage, a Feedback Generator aggregates failure reasons and reviews, reflects upon them, and produces a summarized improvement plan.

This agentic design with feedback loops is crucial for generating high-quality tasks efficiently. By incorporating reflection and improvement based on validation results, the system can learn from failures and progressively generate better tasks.
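The control flow of this loop can be sketched in a few lines. This is a minimal illustration in Python: the helper callables (propose, check, review, reflect) stand in for the LLM generator, the rule-based validators, the review committee, and the feedback generator described above, and their names and signatures are our own, not the released implementation.

def generate_blueprint(context, propose, check, review, reflect, max_rounds=3):
    """Return a validated task blueprint (q, a_gt, o_gt), or None if refinement fails."""
    feedback = None
    for _ in range(max_rounds):
        task = propose(context, feedback)          # steps 1-2: LLM proposes (q, a_gt, o_gt)
        ok, errors = check(task)                   # step 3: format / execution / policy checks
        if ok:
            passed, reviews = review(task)         # step 4: multi-LLM committee evaluation
            if passed:
                return task                        # verified blueprint enters the task pool
            errors = reviews
        feedback = reflect(task, errors)           # step 5: summarized improvement plan
    return None                                    # discard after repeated failures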

Phase 2: Human-Agent-Environment Interaction Trajectory Collection

Building upon the validated task configurations from Phase 1, the second phase generates realistic multi-turn interaction data by simulating dynamic conversations between an LLM-based human user (H) and a test agent (A) operating within an executable environment. Guided by the task instruction q and often a specific persona, the simulated human naturally reveals information or sub-goals incrementally, while the agent interprets the evolving context, interacts with the environment via API calls when needed, and responds coherently.

The simulation produces complete interaction trajectories that capture dialogue turns, agent actions, and environment responses. Each trajectory is validated by comparing its outcome against the groundtruth actions (a_gt) and expected outputs (o_gt) from Phase 1. Only those trajectories that verifiably achieve the task—using both state-based and output-based checks—are accepted into the dataset, ensuring that interactions are both dynamically plausible and grounded in a correct solution.
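A minimal sketch of this acceptance filter is shown below, assuming an executable environment object that can replay actions and expose a comparable state snapshot; the attribute and method names are hypothetical.

def accept_trajectory(trajectory, a_gt, o_gt, make_env):
    """Accept a simulated trajectory only if it matches the Phase 1 groundtruth."""
    # State-based check: replay the groundtruth actions in a fresh environment
    # and compare the resulting state with the state reached by the simulation.
    ref_env = make_env()
    for action in a_gt:
        ref_env.execute(action)
    state_ok = ref_env.state() == trajectory.final_state

    # Output-based check: every expected output (e.g. an order ID or a refund
    # amount) must appear in the agent's messages to the user.
    agent_text = " ".join(m["content"] for m in trajectory.agent_messages)
    output_ok = all(str(o) in agent_text for o in o_gt)

    return state_ok and output_ok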

Realization of APIGen-MT framework for τ-bench

This two-phase design offers several benefits. First, it provides verifiability by grounding interaction data in pre-validated task configurations. Second, it enhances realism by focusing the simulation on natural turn-by-turn dynamics without the simultaneous burden of task solution generation. Lastly, the modular approach isolates issues in task design from those in conversational modeling, facilitating debugging and scalability across diverse interaction patterns.

A Case Study on τ-bench

We implemented the APIGen-MT framework using τ-bench as a testbed. For task generation and validation, we model the available APIs in each τ-bench domain as a directed graph, where nodes represent APIs and edges represent dependencies between them. We utilize specialized context samplers including API Sampler, Policy Sampler, Domain Data Sampler, Persona Sampler, and Example Sampler to ensure task diversity, realism, and grounding.
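As an illustration, such a dependency graph can be represented with a standard graph library and used to sample dependency-respecting action skeletons that seed task proposals. The API names and edges below are examples in the spirit of the retail domain, not τ-bench's exact schema.

import random
import networkx as nx

# Nodes are APIs; a directed edge u -> v means v typically requires information
# produced by u (e.g. you need the order details before you can cancel the order).
api_graph = nx.DiGraph()
api_graph.add_edge("get_user_details", "get_order_details")
api_graph.add_edge("get_order_details", "cancel_pending_order")
api_graph.add_edge("get_order_details", "modify_pending_order_items")

def sample_action_skeleton(graph, max_len=3):
    """Walk the dependency graph to obtain an executable ordering of APIs."""
    node = random.choice([n for n in graph.nodes if graph.out_degree(n) > 0])
    path = [node]
    while len(path) < max_len:
        successors = list(graph.successors(path[-1]))
        if not successors:
            break
        path.append(random.choice(successors))
    return path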

We implement a rigorous three-stage validation process:

  • Stage 1: Action Validation - Format Check, Execution Check, and Policy Compliance Check
  • Stage 2: Alignment Validation - Evaluating whether the groundtruth actions accurately fulfill the user's intent
  • Stage 3: Final Semantic Review & Refinement - Based on aggregated scores from the committee

We also introduce Reverse Task Recombination, a technique that leverages the principle of compositionality to build complex tasks from simpler, independently validated "building blocks." For Phase 2, we simulate multi-turn interaction trajectories between an agent and a human user modeled by an LLM. We employ rejection sampling to ensure that only trajectories achieving the task goal are retained.
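A compact way to picture Reverse Task Recombination is as concatenation of already-validated blueprints followed by re-validation. The sketch below uses hypothetical helpers: merge_instructions rewrites the individual requests into one coherent user goal, and validate reruns the Phase 1 checks on the composite task.

def recombine(blocks, merge_instructions, validate):
    """blocks: list of (q, a_gt, o_gt) triples that individually passed Phase 1."""
    q = merge_instructions([q for q, _, _ in blocks])           # one coherent composite request
    a_gt = [a for _, actions, _ in blocks for a in actions]     # concatenated groundtruth actions
    o_gt = [o for _, _, outputs in blocks for o in outputs]     # concatenated expected outputs
    composite = (q, a_gt, o_gt)
    return composite if validate(composite) else None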

Data Collection & Statistics

We source APIs implemented as Python functions from τ-bench. Among these, there are 15 'read' and 13 'write' APIs across both domains. We utilize GPT-4o, DeepSeek V3, and DeepSeek R1 in the task generation, validation, and agent-human interplay stages to collect training data.

Density distribution of assistant and user turns in collected trajectories.

Statistics for the dataset generated using APIGen-MT. Success rates (S.R.) are reported for the task configuration stage (with and without agentic feedback in Phase 1) and the trajectory simulation stage (Phase 2).

We collected a total of 3,820 validated successful trajectories. The collected trajectories are long: even a strong model like GPT-4o needs an average of 12 turns to complete a task generated by APIGen-MT. Our agentic pipeline, with its review committee and iterative refinement via reflection, provides a 2.5x boost to the task configuration success rate, bringing it to 70%.

Experimental Results

We evaluate the effectiveness of our APIGen-MT approach for generating multi-turn data through simulated agent-human interplay. We perform filtered Behavioral Cloning (BC) using the collected trajectories with Llama 3.1/3.2 Instruct models and Qwen 2.5 Instruct models.
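In practice, filtered BC amounts to keeping only the verified-successful trajectories and fine-tuning on the agent's turns with a standard supervised loss. A sketch of the data-side step, with hypothetical field names, is:

def to_sft_examples(trajectories):
    """Convert verified trajectories into chat-format SFT examples.
    Only the assistant turns (text replies and tool calls) are supervised."""
    examples = []
    for traj in trajectories:
        if not traj["success"]:                    # drop trajectories that failed verification
            continue
        messages = [{"role": "system", "content": traj["policy"]}]
        for turn in traj["turns"]:                 # roles: user / assistant / tool
            messages.append({"role": turn["role"], "content": turn["content"]})
        examples.append({"messages": messages})
    return examples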

We evaluate performance on two challenging benchmarks designed specifically for assessing agent capabilities:

  • BFCL v3 is a leading benchmark for tool-use evaluation, specifically designed to assess LLMs' function calling capabilities. It introduces comprehensive evaluation across single-turn, multi-turn, and multi-step function calling scenarios.
  • τ-bench is a comprehensive benchmark for evaluating AI agents in realistic scenarios. It measures an agent's ability to interact with simulated human users and programmatic APIs while following domain-specific policies.

Function-Calling Capabilities (BFCL v3)

On the BFCL v3 benchmark, our models demonstrate impressive results. xLAM-2-70b-fc-r achieves the top position on the leaderboard with an overall accuracy of 78.19%, surpassing all proprietary and open-source models. The strength of our approach is particularly evident in multi-turn scenarios, where xLAM-2-70b-fc-r achieves 75.12% accuracy, significantly outperforming models like GPT-4o (47.62%) and Claude models.

Performance comparison of different models on the Berkeley Function-Calling Leaderboard (BFCL). The rank is based on the overall accuracy, which is a weighted average of different evaluation categories. "FC" stands for function-calling mode, in contrast to using a customized "prompt" to extract the function calls.

Our smaller models also show remarkable performance on BFCL v3. xLAM-2-32b-fc-r ranks second with 75.83% overall accuracy, and even our 8B parameter model (xLAM-2-8b-fc-r) achieves 72.83% accuracy, outperforming GPT-4o (72.08%). This demonstrates that models trained on our synthetic data can achieve state-of-the-art performance with significantly fewer parameters than proprietary alternatives.

The performance gap is particularly pronounced in multi-turn scenarios, where our models consistently outperform baselines. For instance, xLAM-2-8b-fc-r achieves 69.25% accuracy on multi-turn tasks, compared to 47.62% for GPT-4o and 41% for GPT-4o in function-calling mode. This highlights the effectiveness of our APIGen-MT approach in generating high-quality multi-turn training data that captures the complexities of real-world agent-human interactions.

Multi-Turn Agent Capabilities (τ-bench)

Our xLAM-2-70b-fc-r model achieves an overall success rate of 56.2% on τ-bench, significantly outperforming the base Llama 3.1 70B Instruct model (38.2%) and other open-source models like DeepSeek v3 (40.6%). Notably, our model even outperforms proprietary models such as GPT-4o (52.9%) and approaches the performance of more recent models like Claude 3.5 Sonnet (new) (60.1%).

Success Rate (pass@1) on τ-bench benchmark (averaged across at least 5 trials)
Model τ-Retail τ-Airline Overall
Open-Source Models
xLAM-2-70b-fc-r 67.1 45.2 56.2
xLAM-2-32b-fc-r 64.3 45.0 54.6
xLAM-2-8b-fc-r 58.2 35.2 46.7
xLAM-2-3b-fc-r 44.4 32.0 38.2
xLAM-2-1b-fc-r 22.5 21.0 21.8
Qwen 2.5 32B Instruct 24.4 25.0 24.7
Llama 3.1 70B Instruct 50.4 26.0 38.2
DeepSeek v3 58.3 22.8 40.6
Proprietary Models
Gemini 1.5 pro 54.9 25.2 40.1
gpt-4o-2024-11-20 62.8 43.0 52.9
o1 73.5 54.2 63.9
Claude 3.5 Haiku 51.0 22.8 36.9
Claude 3.5 Sonnet 62.6 36.0 49.3
Claude 3.5 Sonnet (new) 71.5 48.8 60.1
Claude 3.7 Sonnet 78.3 41.2 59.8
Claude 3.7 Sonnet + optimized prompt 81.2 58.4 69.8

A particularly striking result is that our smaller models, such as xLAM-2-32b-fc-r and xLAM-2-8b-fc-r, achieve impressive performance (54.6% and 46.7% respectively), outperforming much larger baseline models. This suggests that our synthetic data generation approach enables efficient knowledge transfer, allowing smaller models to achieve competitive performance with significantly fewer parameters.

Model Consistency & Stability

To evaluate the consistency and reliability of our models, we examine their performance across multiple trials. We plot the pass^k curves on τ-bench, which measure the probability that all k independent trials succeed for a given task, averaged across all tasks.
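With n trials per task of which c succeed, pass^k can be estimated per task as C(c, k) / C(n, k) and averaged over tasks. The helper below is our own small implementation of that standard unbiased estimator, not code from the benchmark.

from math import comb

def pass_hat_k(per_task_results, k):
    """per_task_results: list of (n_trials, n_successes) pairs, one per task."""
    # comb(c, k) / comb(n, k) is the probability that k trials drawn without
    # replacement from the n recorded trials are all successes.
    return sum(comb(c, k) / comb(n, k) for n, c in per_task_results) / len(per_task_results)

# Example: a task solved in 3 of 5 trials contributes 3/5 to pass^1 but 0 to pass^5.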

Pass^k curves measuring the probability that all 5 independent trials succeed for a given task, averaged across all tasks, for the τ-retail (left) and τ-airline (right) domains. Higher values indicate better consistency of the models.

As k increases, the success rate of our models degrades more slowly than that of the baselines. Notably, on the more complex airline domain, xLAM-2-70b-fc-r attains a higher pass^5 score than Claude despite a slightly lower pass^1, suggesting higher reliability and consistency across multiple trials. This is a critical property for deployment in real-world applications, where consistent performance is essential.

These results demonstrate that our APIGen-MT approach for generating synthetic multi-turn data through simulated agent-human interplay is highly effective. Models trained on this data consistently outperform both proprietary and open-source baselines, with particularly strong performance in multi-turn scenarios. Importantly, our approach enables smaller models to achieve competitive or superior performance compared to much larger models, highlighting the efficiency and effectiveness of our data generation methodology.

Citation

If you use our model or dataset in your work, please cite our paper:

@article{prabhakar2025apigenmt,
  title={APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay},
  author={Prabhakar, Akshara and Liu, Zuxin and Zhu, Ming and Zhang, Jianguo and Awalgaonkar, Tulika and Wang, Shiyu and Liu, Zhiwei and Chen, Haolin and Hoang, Thai and Niebles, Juan Carlos and Heinecke, Shelby and Yao, Weiran and Wang, Huan and Savarese, Silvio and Xiong, Caiming},
  journal={arXiv preprint arXiv:2504.03601},
  year={2025}
}

This page is adapted from the template of Video Language Planning project website. We thank the authors for providing the template.
