
DynaSearcher: Dynamic Knowledge Graph Augmented Search Agent
via Multi-Reward Reinforcement Learning

Chuzhan Hao Wenfeng Feng Yuewei Zhang Hao Wang
Alibaba Cloud Computing
{haochuzhan.hcz,wenfeng.fwf,liyou.zyw}@alibaba-inc.com
cashenry@126.com
  Corresponding author
Abstract

Multi-step agentic retrieval systems based on large language models (LLMs) have demonstrated remarkable performance in complex information search tasks. However, in practical applications, these systems are limited by factually inconsistent intermediate queries and inefficient search trajectories, which can cause reasoning deviations and redundant computations. To address these issues, we propose DynaSearcher, an innovative search agent enhanced by dynamic knowledge graphs and multi-reward reinforcement learning (RL). Specifically, our system leverages knowledge graphs as external structured knowledge to guide the search process by explicitly modeling entity relationships, thereby ensuring factual consistency in intermediate queries and mitigating biases from irrelevant information. Furthermore, we employ a multi-reward RL framework for fine-grained control over training objectives such as retrieval accuracy, efficiency, and response quality. This framework promotes the generation of high-quality intermediate queries and comprehensive final answers, while discouraging unnecessary exploration and minimizing information omissions or redundancy. Experimental results demonstrate that our approach achieves state-of-the-art answer accuracy on six multi-hop question answering datasets and exhibits strong generalization and robustness across diverse retrieval environments and larger-scale models, highlighting its broad applicability.



1 Introduction

Large language models have demonstrated remarkable capabilities in task planning and agentic reasoning, with reinforcement learning significantly improving their performance on complex reasoning tasks Shao et al. (2024); Guo et al. (2025). However, reliance on static parametric knowledge presents notable limitations, often leading to hallucinations and inefficient reasoning. To tackle these challenges, it is crucial to explore how to efficiently access diverse external information so that LLMs can perform deliberate and well-substantiated reasoning. Consequently, a novel search paradigm termed Agentic Deep Research Systems has gradually emerged as an important research direction Li et al. (2025a); Jin et al. (2025); Chen et al. (2025); Zhang et al. (2025b).

Previous research has utilized Chain-of-Thought (CoT) Wei et al. (2022) prompting to decompose complex problems into sequential sub-tasks, subsequently leveraging external information dynamically to bridge knowledge gaps and tackle intricate reasoning tasks Trivedi et al. (2022a); Shao et al. (2023); Yue et al. (2024); Feng et al. (2025b). However, these approaches remain highly sensitive to prompt formulations. Feng et al. (2025b) employs Monte Carlo Tree Search (MCTS) to design a more sophisticated reasoning framework that deeply explores the solution space, but this process incurs significant inference overhead, limiting its practical applicability. Meanwhile, these prompt-based methods build only a single, fixed strategy for planning and iterative refinement into the pipeline, failing to fully exploit the agentic potential of LLMs. Furthermore, Li et al. (2025a) integrates an agentic search workflow into the reasoning process, enabling dynamic retrieval when dealing with uncertain or incomplete information.

Recently, reinforcement learning has achieved remarkable success in mathematical reasoning and decision-making scenarios Guo et al. (2025). Jin et al. (2025); Song et al. (2025) also utilize RL to significantly enhance the capability of small language models (SLMs) to address complex multi-hop reasoning tasks. These training-based approaches directly equip LLMs with the ability to autonomously use external retrieval tools during the end-to-end training process, thereby achieving dynamic interaction with the external environment throughout training Zheng et al. (2025); Chen et al. (2025). Owing to their superior agentic abilities, strong generalization, and efficient reasoning processes, RL-based agentic search approaches are increasingly emerging as a significant trend in deep research Zhang et al. (2025b). However, current RL-based search agents rely on a single search tool and coarse global rewards, lacking effective guidance for intermediate query generation and struggling to explore efficient reasoning trajectories. The presence of substantial noise in unstructured text information can easily lead to deviations from the correct reasoning path or introduce redundant computations. In addition, both coarse-grained global rewards and existing step-wise rewards are typically simple aggregations of distinct reward signals, failing to fully leverage the interdependencies among different rewards to enhance explicit guidance on query trajectories and intermediate query generation.

To address these challenges, we propose DynaSearcher, a dynamic knowledge graph augmented multi-reward reinforcement learning framework tailored for search agents. Specifically, we employ knowledge graphs (KGs) as structured external knowledge, explicitly modeling entity relationships during multi-step reasoning to guide the search process towards factually aligned intermediate queries and reduce deviation caused by noise or irrelevant information. Furthermore, we design a multi-reward mechanism that balances retrieval accuracy, efficiency, and the final response quality during the RL training process. This mechanism not only encourages high-quality intermediate query generation while penalizing excessive search steps to prevent unnecessary exploration or premature termination, but also incentivizes LLMs to produce comprehensive and accurate final responses, avoiding information omission or redundancy. Experimental results demonstrate that our method achieves state-of-the-art performance on various complex multi-hop question answering (QA) datasets, while maintaining comparable effectiveness under low-resource context length settings. We further validate the generalization and robustness of our approach across different retrieval environments and with larger-scale models, indicating its broad applicability. In summary, our main contributions are as follows:

  • We introduce knowledge graphs as an external structured knowledge source, guiding the search process through explicit modeling of entity relationships. This ensures the consistency of intermediate queries with factual information while mitigating deviations induced by noise or irrelevant information, thereby significantly enhancing the performance of the search agent.

  • We propose multi-reward reinforcement learning that integrates both gain and penalty rewards into the conventional outcome-based reward framework. By incorporating fine-grained feedback, the approach provides nuanced guidance, fostering robustness and reliability in handling complex tasks.

  • Extensive experiments demonstrate that DynaSearcher significantly outperforms existing search agents based on reinforcement learning. Our approach exhibits broad applicability across diverse retrieval environments and larger-scale models, while maintaining its performance even in low-resource settings.

2 Related Work

2.1 Retrieval-Augmented Generation

Early retrieval-augmented generation (RAG) approaches employ various strategies such as branching, iteration, and adaptive retrieval to solve complex tasks. These methods rely on manually crafted workflows to guide LLMs in interacting with external knowledge sources. IRCoT Trivedi et al. (2022a) leverages CoT to steer the retrieval process, refining CoT with the retrieved information. Press et al. (2022a); Asai et al. (2023); Yue et al. (2024) refine intermediate queries to acquire valuable knowledge through multi-turn iterations. AirRAG Feng et al. (2025b) applies MCTS to dynamically explore the reasoning paths. However, these approaches are limited to manually designed prompts and workflows, failing to fully unleash the inherent reasoning potential of LLMs.

2.2 Autonomous Search Agents

As the reasoning and decision-making capabilities of the foundation models continue to improve, Search-o1 Li et al. (2025a) significantly improves model performance in complex scenarios by designing an agentic search workflow, providing superior flexibility and generalization. DeepSeek-R1 Guo et al. (2025) also demonstrates that outcome-based RL can significantly enhance the autonomous reasoning and decision-making capabilities of models. Therefore, RL has been applied to various complex reasoning tasks and agent-based scenarios. Complex multi-hop question answering represents a typical integrated application scenario that heavily relies on model-driven planning and reasoning. Song et al. (2025); Chen et al. (2025); Jin et al. (2025) have successfully applied end-to-end reinforcement learning to complex agentic search scenarios, further advancing the development of agentic deep research systems. These methods autonomously select retrieval tools during the reasoning process to interact with external environments. DeepResearcher Zheng et al. (2025) scales RL in real-world environments by incorporating authentic web search interactions. s3 Jiang et al. (2025) decouples the searcher from the generator and trains the searcher with fewer samples. EvolveSearch Zhang et al. (2025a) further explores the self-evolution process of search agents. StepSearch Wang et al. (2025) introduces fine-grained reward signals to steer strategic query planning and improve retrieval quality in complex search environments.

In contrast to such methods, our proposed DynaSearcher incorporates structured knowledge graph information to dynamically model entity relationships during multi-step reasoning, effectively reducing reasoning path deviations and redundant computations caused by irrelevant information. Furthermore, we design multi-reward reinforcement learning that leverages the dependencies among rewards at different granularities, introducing gain and penalty rewards to provide precise guidance beyond the outcome-based reward. In our experiments, we thoroughly validate the performance gains and efficient reasoning paths achieved by integrating the structured knowledge graph with multi-reward reinforcement learning.

Figure 1: The framework of our proposed DynaSearcher.

3 Methodology

In this section, we introduce DynaSearcher, an advanced and efficient search agent that leverages dynamic knowledge graphs and multi-reward reinforcement learning to guide the generation of intermediate queries during multi-step reasoning. It effectively mitigates reasoning deviations induced by irrelevant information, enabling efficient retrieval processes and precise answer generation. The overview of DynaSearcher is illustrated in Figure 1.

3.1 Rollout with Search Tools

Inspired by Search-o1 Li et al. (2025a), current advanced RAG methods introduce an agentic search strategy, transforming the exploration process into an iterative interaction between the intrinsic reasoning of LLMs and the external environment, thus effectively activating their autonomous reasoning capabilities. During interactions with the external environment, these methods often rely on unstructured text retrieval systems to supplement information for intermediate reasoning steps. Irrelevant textual noise can easily result in inefficient intermediate queries. Therefore, we dynamically retrieve relevant knowledge graphs $\mathcal{G}$ during the reasoning process to guide the generation of intermediate queries.

Doc Search Tool. We employ two types of document search tools: a locally deployed vector-based retrieval service and the web search provided by Tavily (https://tavily.com). Both tools return text-based retrieval results. In the experiments, we fix a top-k value (e.g., 5) to control the number of retrieved documents.

KG Search Tool. We utilize Wikidata5M Wang et al. (2021) as the source for our structured knowledge graph, which provides a more precise representation of semantic relationships. During intermediate reasoning steps, we retrieve relevant single-hop knowledge subgraphs based on the parsed entities, supporting subsequent reasoning processes.

As illustrated in Figure 1, our approach integrates both unstructured document and structured KG retrieval tools, enabling queries across different knowledge sources. In each reasoning trajectory, DynaSearcher begins with reflective analysis and strategic planning based on the input question, then generates a JSON-formatted request specifying subqueries as well as extracted relevant entities and relationships. It then invokes both search tools to interact with the external environment and gather information. Specifically, we adopt an iterative reasoning-retrieval loop, similar to Chen et al. (2025) and detailed in Table 1, where reasoning and retrieval alternate. By collaboratively integrating these tools, DynaSearcher is able to explore more efficient and effective reasoning paths.

You are a helpful assistant that can solve the given question step by step with the help of the wikipedia search tool.  Given a question, you need to first think about the reasoning process in the mind and then provide the answer.  During thinking, you can invoke the wikipedia search tool to search for fact information about specific topics if needed.  The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags respectively,  and the search input and result are enclosed within <search> </search> and <result> </result> tags respectively.  Search input is json format like {“query”: “xxx”, “entity”: [“yyy”], “relation”: [“zzz”]} and applied to the search tools,  where query is used to search wikipedia articles, entity(s) and relation(s) are used to search wikidata, a knowledge base of entities and relations. For example, <think> This is the reasoning process. </think> <search> {“query”: “Who is the director of Avatar”, “entity”: [“Avatar”], “relation”: [“director”]} </search> <result> search result here </result>  <think> This is the reasoning process. </think> <answer> The final answer is \boxed{ answer here }</answer>.  In the last part of the answer, the final exact answer is enclosed within \boxed{} with latex format.
Table 1: System prompt for generating reasoning trajectories through interaction with the environments during training and inference stages.
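
To make the tool-calling interface concrete, below is a minimal sketch of how a JSON-formatted <search> request of the kind shown in Table 1 could be parsed and dispatched to both retrieval tools. The callables `doc_search` and `kg_search` are placeholders for the local document retriever and the Wikidata5M lookup described above, not a released API.

```python
import json
import re

def dispatch_search(rollout_text, doc_search, kg_search, top_k=5):
    """Parse the latest <search> block and query both retrieval tools.

    `doc_search` and `kg_search` are placeholder callables standing in for
    the local document retriever and the Wikidata5M KG lookup.
    """
    blocks = re.findall(r"<search>(.*?)</search>", rollout_text, re.DOTALL)
    if not blocks:
        return None
    request = json.loads(blocks[-1])  # {"query": "...", "entity": [...], "relation": [...]}

    # Unstructured document retrieval driven by the subquery.
    docs = doc_search(request.get("query", ""), top_k=top_k)

    # Structured KG retrieval driven by the parsed entities and relations.
    triples = kg_search(request.get("entity", []), request.get("relation", []))

    # Both sources are concatenated into the <result> block fed back to the LLM.
    return "<result>\n" + "\n".join(docs + triples) + "\n</result>"
```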

Methods HotpotQA 2Wiki Musique Bamboogle Average
        F1 CEM EM | F1 CEM EM | F1 CEM EM | F1 CEM EM | F1 CEM
Prompt Based (Qwen2.5-7B)
Vanilla RAG 29.0 22.4 20.5 | 32.5 27.9 27.0 | 11.2 5.1 3.4 | 17.6 12.8 10.4 | 22.6 17.1
Iter-RetGen 51.4 45.2 39.9 | 39.2 35.5 32.2 | 17.4 12.4 10.0 | 31.8 24.8 22.4 | 34.9 29.5
IRCoT 47.2 47.3 35.3 | 35.0 39.2 25.5 | 14.7 13.3 7.5 | 32.3 28.8 23.2 | 32.3 32.2
Search-o1* 36.9 32.3 27.4 | 41.2 41.3 33.7 | 16.8 13.0 10.5 | 40.0 34.4 30.4 | 33.7 30.3
  + Qwen2.5-32B 56.9 51.8 44.1 | 64.6 68.1 55.9 | 28.6 25.7 19.7 | 64.1 56.8 51.2 | 53.6 50.6
Frontier LLMs
DeepSeek-R1 62.5 54.0 48.0 | 65.7 65.0 54.0 | 39.9 33.0 27.5 | 63.0 52.8 52.0 | 57.8 51.2
Qwen3-235B-A22B 57.3 56.1 44.5 | 59.4 64.1 45.3 | 41.7 39.5 27.6 | 55.3 49.2 43.8 | 53.5 52.2
GPT-4.1 60.6 56.0 45.0 | 69.7 75.5 56.0 | 44.9 47.0 28.5 | 63.8 55.2 49.6 | 59.7 58.4
O4-mini 57.8 59.5 40.5 | 62.1 71.0 47.5 | 41.6 45.5 27.5 | 61.7 64.0 46.4 | 55.8 60.0
Gemini-2.5-Pro 55.6 60.5 39.5 | 71.8 83.0 60.5 | 37.0 47.0 24.5 | 59.7 69.6 52.0 | 56.0 65.0
Training Based (Qwen2.5-7B)
Search-R1-v0.3 61.8 53.6 49.8 | 60.7 58.7 52.3 | 30.9 24.7 21.5 | 59.4 48.0 47.2 | 53.2 46.3
ReSearch 63.2 55.8 50.4 | 67.1 65.4 60.3 | 28.0 34.1 24.0 | 53.1 45.6 41.6 | 54.4 48.7
R1-Searcher 57.8 59.7 45.6 | 64.0 67.8 56.2 | 28.4 27.9 19.5 | 49.8 46.4 36.0 | 50.0 50.5
DynaSearcher 66.1 62.8 52.0 | 72.0 76.3 61.9 | 38.7 38.6 26.5 | 57.9 51.2 41.6 | 58.7 57.2
Training Based (Qwen2.5-32B)
Search-R1-v0.3 66.5 55.8 53.5 | 73.4 71.7 68.1 | 36.2 30.6 28.5 | 65.1 55.2 54.4 | 60.3 53.5
ReSearch 69.4 61.0 56.3 | 78.1 76.7 72.3 | 39.3 33.8 30.5 | 63.1 52.0 50.4 | 62.5 55.9
DynaSearcher 71.3 63.1 57.4 | 81.9 82.7 75.1 | 47.4 41.6 35.2 | 70.3 58.6 55.2 | 67.7 61.5

Table 2: Overall evaluation results on the dev or test sets of four benchmarks. The best and second-best results are bold and underlined, respectively. All methods are evaluated in the same local retrieval environment. * indicates results reproduced by us. † / ‡ denote in-domain / out-of-domain datasets. + denotes replacement with a new base model.

3.2 Multi-Reward Reinforcement Learning

Agentic RL with tool integration is an effective training method that endows LLMs with stronger autonomous reasoning capabilities, and has been extensively validated in mathematical and coding scenarios Feng et al. (2025a); Li et al. (2025b). To further enhance the performance of LLMs in open-domain multi-hop question answering tasks, we conduct experiments based on the GRPO RL algorithm Shao et al. (2024), with the complete rollout process described in Table 1 and Figure 1. To enable the LLM to autonomously interact with the external environments, we introduce three special tokens to standardize the LLM's outputs, i.e., <think>, <search>, and <answer>. During inference, when the LLM generates a special token such as <search>, it triggers the corresponding search action to retrieve external knowledge. Through multiple iterations of the think → search → result process, the LLM ultimately generates a precise and comprehensive answer enclosed within the <answer> and </answer> tags once it determines that the question has been fully addressed.
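
As an illustration of this loop, the sketch below alternates generation and retrieval until an <answer> block appears. The `llm.generate` interface and the `search_fn` helper (which would wrap the two search tools from Section 3.1) are assumptions for readability, not the exact rollout implementation used in training.

```python
def agentic_rollout(llm, question, system_prompt, search_fn, max_turns=8):
    """Iterate think -> search -> result until the model emits <answer>.

    `llm.generate(text, stop=...)` is assumed to return the completion up to
    (and excluding) the first stop string; `search_fn(request_json)` returns a
    <result>...</result> string built from both search tools.
    """
    context = system_prompt + "\nQuestion: " + question
    for _ in range(max_turns):
        completion = llm.generate(context, stop=["</search>", "</answer>"])
        context += completion
        if "<answer>" in completion:                   # final answer produced
            context += "</answer>"
            break
        if "<search>" in completion:                   # a retrieval action was requested
            request = completion.split("<search>")[-1].strip()
            context += "</search>" + search_fn(request)  # result tokens are masked in the loss
    return context
```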

During GRPO training, the design of the reward function is a critical component. In Chen et al. (2025); Jin et al. (2025); Song et al. (2025), an outcome-based reward is employed to guide the LLM in learning to interleave reasoning with search engine calls throughout the RL process. However, this coarse global reward lacks effective guidance for intermediate query generation and struggles to explore efficient reasoning trajectories. Therefore, we design a multi-reward RL framework to encourage high-quality intermediate query generation, dynamically adapt to different retrieval strategies, and efficiently acquire external knowledge, yielding more stable convergence and higher accuracy on complex reasoning tasks. The reward mechanism mainly includes accuracy, information gain, and penalty rewards.

3.2.1 Accuracy Reward

The accuracy reward includes both format correctness and answer correctness. Following Shao et al. (2024), we employ an explicit format reward to ensure that the LLM adopts the predefined iterative "think → search" workflow, thereby guaranteeing that the entire agent executes correctly. The required output format is defined in the system prompt, as shown in Table 1. In addition, for answer correctness, typical mathematical scenarios use the concrete numerical output as the evaluation criterion. However, this method often fails to provide an accurate and comprehensive evaluation for open-domain QA tasks. We therefore introduce a comprehensive method $r_{\text{ans}}$ for evaluating answer accuracy. The complete accuracy reward $r_{\text{acc}}$ is defined as:

$$
r_{\text{acc}} =
\begin{cases}
\max(0.1,\, r_{\text{ans}}), & \text{if the format is correct,} \\
0, & \text{if the format is incorrect.}
\end{cases}
\tag{1}
$$
$$
r_{\text{ans}} =
\begin{cases}
\text{F}_1(a_{\text{pred}},\, a_{\text{gt}}), & \text{if } L_{\text{pred}} \geq n \cdot L_{\text{gt}}, \\
\text{CEM}(a_{\text{pred}},\, a_{\text{gt}}), & \text{if } L_{\text{pred}} < n \cdot L_{\text{gt}}.
\end{cases}
\tag{2}
$$

where F1 and CEM are the word-level F1 score and the cover exact match score between the output answer $a_{\text{pred}}$ and the ground truth $a_{\text{gt}}$, and $L_{\text{pred}}$ and $L_{\text{gt}}$ denote the number of words in $a_{\text{pred}}$ and $a_{\text{gt}}$, respectively. $n$ is the length multiplier, which we set to 3 by default. Under the reward $r_{\text{ans}}$, our approach strives to generate responses that are both comprehensive and accurate.
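
A minimal sketch of Eqs. (1)–(2), assuming whitespace tokenization and the standard word-level F1 and cover exact match definitions; it is not the exact evaluation code used in our experiments.

```python
def word_f1(pred, gt):
    """Word-level F1 between the predicted and ground-truth answers."""
    p, g = pred.lower().split(), gt.lower().split()
    common = sum(min(p.count(w), g.count(w)) for w in set(p) & set(g))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

def cover_em(pred, gt):
    """Cover exact match: 1 if the ground truth appears inside the prediction."""
    return float(gt.lower().strip() in pred.lower())

def answer_reward(pred, gt, n=3):
    """Eq. (2): use F1 for long predictions, CEM for short ones."""
    if len(pred.split()) >= n * len(gt.split()):
        return word_f1(pred, gt)
    return cover_em(pred, gt)

def accuracy_reward(pred, gt, format_correct):
    """Eq. (1): zero reward for a broken format, a floor of 0.1 otherwise."""
    return max(0.1, answer_reward(pred, gt)) if format_correct else 0.0
```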

3.2.2 Gain and Penalty Reward

Based on the aforementioned accuracy reward, we further introduce an information gain reward $r_{\text{gain}}$ to encourage the generation of high-quality intermediate queries, thereby enhancing document retrieval performance as measured by $r_{\text{recall}}$. Let TP denote the number of relevant documents retrieved and FN the number of relevant documents that were missed; then $r_{\text{recall}}$ is defined as:

$$
r_{\text{recall}} = \frac{\text{TP}}{\text{TP} + \text{FN}}
\tag{3}
$$

where we use the supporting document chunks provided by the dataset as ground truth and then evaluate the relevance of retrieved document chunks based on title matching or embedding similarity scores.
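
A sketch of Eq. (3) using the title-matching variant; the embedding-similarity variant mentioned above is analogous and omitted here.

```python
def recall_reward(retrieved_titles, gold_titles):
    """Eq. (3): fraction of gold supporting documents that were retrieved.

    `retrieved_titles` collects the titles of all chunks returned across a
    rollout; `gold_titles` are the supporting documents annotated in the dataset.
    """
    gold = {t.lower().strip() for t in gold_titles}
    hit = {t.lower().strip() for t in retrieved_titles} & gold   # TP
    return len(hit) / len(gold) if gold else 0.0                 # TP / (TP + FN)
```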

To prevent the LLM from engaging in reward hacking by solely pursuing $r_{\text{recall}}$, a retrieval penalty reward $r_{\text{penalty}}$ is incorporated into the information gain reward $r_{\text{gain}}$. The $r_{\text{gain}}$ and $r_{\text{penalty}}$ terms are calculated as:

$$
r_{\text{gain}} = \alpha \cdot (r_{\text{recall}} - r_{\text{penalty}})
\tag{4}
$$
$$
r_{\text{penalty}} = \max(\beta,\, 1 - \gamma^{\,t-i})
\tag{5}
$$

where $\alpha$ is the scale factor used to balance the accuracy reward and the gain reward, $\gamma$ is the decay factor for the penalty reward, and $\beta$ denotes the lower bound of the penalty reward. Here, $t$ denotes the number of retrieval actions executed by the agent in a rollout, and $i$ represents the ground-truth number of hops or sub-questions for the sample, as annotated in the dataset. When $t$ exceeds $i$, indicating redundant retrievals beyond what is necessary, a penalty is applied to $r_{\text{recall}}$; conversely, if $t$ is less than $i$, a slight reward is granted. This design encourages the LLM to make efficient use of retrieval actions for information acquisition, avoiding excessive and redundant searches. The overall reward is calculated as:

$$
r_{\text{overall}} = r_{\text{outcome}} + r_{\text{gain}}
\tag{6}
$$
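
A sketch of Eqs. (4)–(6), using the hyperparameter values reported in Section 4.1 ($\alpha=0.5$, $\gamma=0.9$, $\beta=-0.2$) as defaults.

```python
def penalty_reward(t, i, gamma=0.9, beta=-0.2):
    """Eq. (5): penalize retrieval calls beyond the annotated hop count i.

    For t > i the penalty grows towards 1; for t < i the term turns into a
    slight bonus, bounded below by beta.
    """
    return max(beta, 1.0 - gamma ** (t - i))

def gain_reward(r_recall, t, i, alpha=0.5, gamma=0.9, beta=-0.2):
    """Eq. (4): recall-based information gain offset by the retrieval penalty."""
    return alpha * (r_recall - penalty_reward(t, i, gamma, beta))

def overall_reward(r_outcome, r_recall, t, i):
    """Eq. (6): outcome (accuracy) reward plus the information gain reward."""
    return r_outcome + gain_reward(r_recall, t, i)
```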

Combining the overall reward with the training objective of GRPO, we propose a reinforcement learning objective that explicitly incorporates a search engine $\mathcal{R}$ during optimization for LLM search agent training Jin et al. (2025); Zheng et al. (2025). The objective is formalized as:

$$
\max_{\pi_{\theta}} \; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x;\, \mathcal{R})}\left[ r_{\phi}(x, y) \right]
- \beta\, \mathbb{D}_{\text{KL}}\left[ \pi_{\theta}(y \mid x;\, \mathcal{R}) \,\|\, \pi_{\text{ref}}(y \mid x;\, \mathcal{R}) \right]
\tag{7}
$$

where $\pi_{\theta}$ denotes the trainable policy model, $\pi_{\text{ref}}$ is a fixed reference model, $r_{\phi}$ represents the overall reward function, and $\mathbb{D}_{\text{KL}}$ denotes the KL divergence. Here, $x$ is sampled from the dataset $\mathcal{D}$, and $y$ denotes the output sequence that interleaves reasoning steps with search engine retrievals. Since the retrieved documents are not generated by the policy model, we mask the retrieval results during loss calculation to prevent the training policy from being biased.
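
As a concrete illustration of this retrieval masking, the sketch below zeroes out the loss on tokens inside <result> blocks before averaging; `token_loss`, `loss_mask`, and `result_spans` are illustrative placeholders rather than internals of any specific training framework.

```python
import torch

def mask_retrieved_tokens(token_loss, loss_mask, result_spans):
    """Zero out loss contributions from tokens inside <result> ... </result>.

    token_loss:  per-token policy-gradient loss, shape [seq_len].
    loss_mask:   1 for response tokens, 0 for prompt/padding, shape [seq_len].
    result_spans: list of (start, end) token indices covering retrieval results.
    """
    mask = loss_mask.clone()
    for start, end in result_spans:
        mask[start:end] = 0          # retrieved documents are not policy outputs
    # Average only over tokens the policy actually generated.
    return (token_loss * mask).sum() / mask.sum().clamp(min=1)
```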

4 Experiments

4.1 Experimental Settings

Datasets and Evaluation Metrics. We conduct extensive experiments on six multi-hop datasets, including HotpotQA Yang et al. (2018), 2WikiMultiHopQA (2Wiki) Ho et al. (2020), Musique Trivedi et al. (2022b), Bamboogle (Bam) Press et al. (2022b), MoreHopQA (MoreHQA) Schnitzler et al. (2024), and Frames Krishna et al. (2024). The first three datasets are in-domain datasets, with portions of their training sets used for training, while the latter three are out-of-domain datasets utilized to evaluate the model’s generalization performance. Our evaluation is conducted on the full dev or test sets corresponding to the above datasets. For evaluation metrics, we employ the standard word-level F1 score (F1), Cover Exact Match (CEM), and Exact Match (EM). For more complex open-domain QA tasks, we additionally utilize LLM-as-Judge (LasJ) to ensure a fair evaluation.

Methods In-Domain Out-of-Domain
F1 CEM EM F1 CEM EM
Training
default (doc search + orm) 54.2 52.3 40.9 34.4 30.3 25.0
(a) w/ KG system prompt 54.4 54.1 40.7 35.2 31.1 27.8
(b) w/ KG search tool 53.9 57.1 38.0 34.3 33.6 24.7
(c) w/ multi-reward 58.9 59.2 46.8 38.8 34.0 30.0
Inference
(a) w/ Doc search tool only 57.1 57.6 45.2 37.3 32.4 28.8
(b) w/ KG search tool 58.4 58.5 46.3 38.2 33.7 29.9
(c) w/ Doc+KG filter 58.9 59.2 46.8 38.8 34.0 30.0
Table 3: Ablation study on various multi-hop datasets (adding one component at a time). 'w/' denotes 'with'; orm denotes the combination of format and accuracy rewards.
Methods Bam Frames MoreHQA Avg.
F1 LasJ F1 LasJ F1 LasJ LasJ
Search-o1 58.6 64.6 24.1 31.7 22.1 34.3 43.5
ReSearch 71.9 73.8 38.7 48.5 30.9 46.5 56.3
R1-Searcher 67.2 71.3 33.4 42.6 23.5 37.9 50.6
Ours 75.0 77.8 40.2 49.2 33.7 51.8 59.6
Table 4: Generalization experiments on out-of-domain datasets using online search.

Search Tools. An efficient search tool is essential for our search agent. We build a local retrieval environment using a dense retriever with the multilingual-e5-base Wang et al. (2022) model, incorporating the 2018 Wikipedia corpus Ho et al. (2020). To obtain more up-to-date and comprehensive information, we further utilize Tavily as a web search tool. Additionally, we build a retrieval service for the knowledge graph based on Wikidata5M Wang et al. (2021) to obtain triples related to the corresponding entities. Specifically, we use the Wikidata5M dataset to construct our KG retrieval engine, which contains over 4 million entities and more than 800 relation types. We merge and deduplicate all triples from both the transductive and inductive splits to build a comprehensive entity-relation graph. For each entity name generated by the model during inference, we normalize the case and perform fuzzy matching to identify relevant entities and extract all associated single-hop knowledge subgraphs. These candidates are ranked by the string similarity between the queried entity and relation and those in the KG (i.e., by the number of shared words). Finally, we return the top 100 triples and up to 1024 tokens as the KG retrieval result.
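
A simplified sketch of the fuzzy matching and shared-word ranking described above; `kg_index`, a mapping from normalized entity names to their single-hop triples, is a hypothetical in-memory structure standing in for our retrieval service.

```python
def kg_search(entities, relations, kg_index, top_n=100):
    """Return the top-N single-hop triples for the queried entities and relations.

    `kg_index` maps a lower-cased entity name to a list of (head, relation, tail)
    string triples; candidates are ranked by the number of words they share with
    the queried entity and relation. Truncation to a token budget (e.g., 1024
    tokens) would be applied to the concatenated output afterwards.
    """
    query_words = set(" ".join(entities + relations).lower().split())
    candidates = []
    for ent in entities:
        for triple in kg_index.get(ent.lower().strip(), []):
            overlap = len(query_words & set(" ".join(triple).lower().split()))
            candidates.append((overlap, triple))
    candidates.sort(key=lambda x: x[0], reverse=True)
    return [t for _, t in candidates[:top_n]]
```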

In Wikidata5M, a single entity or relation ID often corresponds to multiple words or phrases, and directly concatenating all of them for input into the LLM can introduce substantial noise. To mitigate this, we introduce a KG filter module. All matched entity–relation pairs, together with the queried entity, relation, current subquery, and the original question, are fed into an LLM. The model then selects and returns up to five of the most relevant entity–relation pairs as the final KG search results.
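
A sketch of the KG filter call, assuming a generic `llm_call(prompt) -> str` helper; the wording of the filtering instruction here is illustrative, not the exact prompt we use.

```python
def kg_filter(llm_call, question, subquery, entity, relation, candidate_pairs, max_pairs=5):
    """Ask an LLM to keep at most five entity-relation pairs relevant to the query."""
    prompt = (
        f"Question: {question}\n"
        f"Current subquery: {subquery}\n"
        f"Queried entity: {entity}, relation: {relation}\n"
        "Candidate triples:\n" + "\n".join(candidate_pairs) + "\n"
        f"Select up to {max_pairs} triples most relevant to the subquery, one per line."
    )
    response = llm_call(prompt)
    return [line.strip() for line in response.splitlines() if line.strip()][:max_pairs]
```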

Baselines and Training Details. In our experiments, in addition to comparing with state-of-the-art LLMs such as DeepSeek-R1-0528, Qwen3-235B-A22B, GPT-4.1-0414, o4-mini-0416, and Gemini-2.5-Pro-0325 (as shown in Table 2), we also benchmark against advanced RAG methods Shao et al. (2023); Trivedi et al. (2022a); Li et al. (2025a) and RL-based agentic search models Jin et al. (2025); Chen et al. (2025); Song et al. (2025); Zheng et al. (2025); Wang et al. (2025). These experiments are primarily based on the Qwen2.5 models Qwen et al. (2025), where Qwen2.5-7B and Qwen2.5-32B refer to their respective Instruct models. All training-based models are derived from their corresponding open-source models. The training data of DynaSearcher consist of the stage-2 data from Song et al. (2025) and 8,000 randomly sampled instances from Musique. The RL training epoch is set to 1, with a train batch size of 16 and a learning rate of 1e-6. The KL divergence coefficient is set to 1e-3. Each data sample undergoes 8 rollouts during training. We utilize FSDP Zhao et al. (2023) and vLLM (https://github.com/vllm-project/vllm) within the VeRL (https://github.com/volcengine/verl) framework, with a sampling temperature of 1.0, top-p of 0.95, and a maximum response length of 8192. For the information gain and penalty reward, we set $\alpha$ to 0.5, $\gamma$ to 0.9, and $\beta$ to -0.2.
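
For reference, the training hyperparameters above gathered into one illustrative dictionary; the key names are ours and do not correspond to actual VeRL configuration fields.

```python
# Hyperparameters reported in Section 4.1; key names are illustrative only.
TRAIN_CONFIG = {
    "epochs": 1,
    "train_batch_size": 16,
    "learning_rate": 1e-6,
    "kl_coef": 1e-3,
    "rollouts_per_sample": 8,
    "sampling": {"temperature": 1.0, "top_p": 0.95, "max_response_length": 8192},
    "reward": {"alpha": 0.5, "gamma": 0.9, "beta": -0.2},  # Eqs. (4)-(5)
}
```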

4.2 Main Results

Table 2 comprehensively shows the results of DynaSearcher and other strong baselines on four multi-hop benchmarks.

Achieve significant performance improvements. Our approach achieves significant performance improvements across multiple benchmarks under all evaluation metrics. These results demonstrate that our approach can effectively guide the reasoning path and achieve a good balance between the comprehensiveness and accuracy of the final responses. Moreover, DynaSearcher still significantly outperforms other baselines under low-resource settings, as shown in Figure 2.

Achieve frontier LLM performance with small-scale models. The current state-of-the-art LLMs are evaluated on various multi-hop datasets. Interestingly, we find that these models struggle to effectively follow the Search-o1 series prompts to guide multi-step reasoning and retrieval. Therefore, we employ a standard RAG approach to achieve better performance. Our DynaSearcher-7B achieves performance on par with GPT-4.1, while surpassing DeepSeek-R1.

Exhibit strong generalization performance. We further evaluate our approach on larger-scale models and more complex web search environments, where it consistently achieves performance improvements and demonstrates superior performance on out-of-domain datasets. These results show that our approach learns an efficient reasoning paradigm and possesses strong generalization capabilities.

Figure 2: Performance comparison under different inference settings. The averaged CEM score is computed across HotpotQA, 2Wiki, Musique, and Bam. The x-axis denotes the maximum context length and the number of retrieved documents.

4.3 Further Analysis

4.3.1 Ablation Studies

To validate the effectiveness of our proposed DynaSearcher framework, we conduct comprehensive ablation studies on key components during both the training and inference stages. The performance of various methods is shown in Table 3. In the training stage, we design three key variant methods: (a) introduces our designed KG-augmented system prompt based on the default setting; (b) further incorporates our KG search environment on top of (a), enabling access to structured knowledge during training; and (c) further optimizes the original outcome reward on top of (b), providing more fine-grained control over the training objective. In the inference stage, to investigate the relationship between the generalization ability of our approach and the incorporation of additional structured knowledge, we adopt several inference modes: (a) retrieves relevant documents using only the intermediate generated subqueries; (b) further retrieves related subgraphs to guide reasoning process on top of (a); and (c) further filters the retrieved documents and subgraphs on top of (b) to reduce noise during the reasoning process. Experimental results demonstrate that our approach does not merely learn superficial think patterns, but rather acquires the ability to efficiently decompose questions and generate more precise subqueries. This leads to more effective planning strategies and reasoning trajectories.

4.3.2 Search Environment

The external information retrieval environment plays a crucial role in search agents. In our training process, we utilize a locally deployed retrieval environment, including both embedding retrieval and KG retrieval (see Table 3). To further simulate more realistic interactions, we also incorporate online search as an additional evaluation in Table 4. Experimental results show that web search brings significant performance improvements. Meanwhile, during training, the structured information introduced by KG search can effectively guide the model’s agentic search process. Even in an environment with only document search, DynaSearcher still demonstrates efficient reasoning capabilities.

4.3.3 Inference Settings

Most current search agent methods incorporate the retrieved information into the context, but this method is inevitably constrained by the context length. In Figure 2, we compare the performance of our approach and other baselines under low-resource settings. The results show that our DynaSearcher achieves comparable or better performance under the 4k/top1 setting than other methods do under the 16k/top5 setting. This further demonstrates the efficiency and accuracy of the reasoning trajectories enabled by our approach. Therefore, to minimize the context length of the retrieved information, we introduce a document and KG filtering module to eliminate irrelevant content (see Table 3), which further improves model performance.

4.4 Qualitative Analysis

Table 5 illustrates the iterative reasoning and retrieval process of DynaSearcher. For such complex questions, our model performs a reasonable analysis and decomposition, dynamically adjusting its current strategy based on the retrieved information. For example, in the case presented in Table 5, relevant temporal and entity information is obtained in the first query. This information is then utilized in the next step to further investigate related entities, thereby reducing unnecessary queries. In addition to full-text retrieval, we introduce search term queries, which leverage external tools such as knowledge graphs or web search engines to further improve retrieval effectiveness. By integrating these methods, DynaSearcher ultimately achieves both efficient reasoning and accurate answers.

Question: What is the name of the play written in May 2016 by a playwright who won the MacArthur Fellowship the same year as the poet who wrote “Postcolonial Love Poem”? Answer: Skeleton Crew
DynaSearcher: <think> To properly identify the name of the play written in May 2016 by a playwright who won the MacArthur Fellowship the same year as the poet who wrote “Postcolonial Love Poem”, I first need to determine who that poet is and then find out who the playwright who won the MacArthur Fellowship in the same year is and what play they wrote in May 2016. </think> <search> {“query”: “which poet wrote the book ‘Postcolonial Love Poem’?” , “entity”: [“Postcolonial Love Poem”], “relation”: [“author name”]} </search> <result> … Natalie Diaz is the author of Postcolonial Love Poem. In 2018, she received a MacArthur Fellowship … </result>
<think></think>
<search> {“query”: “who is the playwright that won the MacArthur Fellowship in 2018?”, “entity”: [“2018 MacArthur Fellowship”], “relation”: [“recipient name”]} </search>
<result></result>
<think></think>
<search> {“query”: “what is the name of the play written by Dominique Morisseau in May 2016?”, “entity”: [“Dominique Morisseau”], “relation”: [“name of play written in May 2016”]} </search>
<result></result>
<think></think>
<answer> The final answer is \boxed{Skeleton Crew} </answer>
Table 5: Qualitative analysis of the efficient reasoning process on the Frames dataset.

5 Conclusions

In this paper, we propose DynaSearcher, a dynamic knowledge graph augmented multi-reward reinforcement learning framework tailored for search agents, enabling flexible interaction between the reasoning process and the external environment. The incorporation of structured knowledge efficiently guides the search process by modeling entity relationships, while multi-reward RL enables nuanced control over training objectives. Extensive experiments on six complex benchmarks demonstrate the superiority and efficiency of our approach, achieving performance on par with current frontier LLMs using only a small-scale model and limited computational resources. Furthermore, we validate the generalization and robustness of our approach across diverse retrieval environments and larger-scale models, highlighting its substantial potential for broad applicability.

References

  • Asai et al. (2023) Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2023. Self-rag: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations.
  • Chen et al. (2025) Mingyang Chen, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Haofen Wang, Jeff Z Pan, Wen Zhang, Huajun Chen, Fan Yang, and 1 others. 2025. Learning to reason with search for llms via reinforcement learning. arXiv preprint arXiv:2503.19470.
  • Feng et al. (2025a) Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wanjun Zhong. 2025a. Retool: Reinforcement learning for strategic tool use in llms. arXiv preprint arXiv:2504.11536.
  • Feng et al. (2025b) Wenfeng Feng, Chuzhan Hao, Yuewei Zhang, Jingyi Song, and Hao Wang. 2025b. Airrag: Activating intrinsic reasoning for retrieval augmented generation via tree-based search. arXiv preprint arXiv:2501.10053.
  • Guo et al. (2025) Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, and 1 others. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.
  • Ho et al. (2020) Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. arXiv preprint arXiv:2011.01060.
  • Jiang et al. (2025) Pengcheng Jiang, Xueqiang Xu, Jiacheng Lin, Jinfeng Xiao, Zifeng Wang, Jimeng Sun, and Jiawei Han. 2025. s3: You don’t need that much data to train a search agent via rl. arXiv preprint arXiv:2505.14146.
  • Jin et al. (2025) Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. 2025. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516.
  • Krishna et al. (2024) Satyapriya Krishna, Kalpesh Krishna, Anhad Mohananey, Steven Schwarcz, Adam Stambler, Shyam Upadhyay, and Manaal Faruqui. 2024. Fact, fetch, and reason: A unified evaluation of retrieval-augmented generation. arXiv preprint arXiv:2409.12941.
  • Li et al. (2025a) Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. 2025a. Search-o1: Agentic search-enhanced large reasoning models. arXiv preprint arXiv:2501.05366.
  • Li et al. (2025b) Xuefeng Li, Haoyang Zou, and Pengfei Liu. 2025b. Torl: Scaling tool-integrated rl. arXiv preprint arXiv:2503.23383.
  • Press et al. (2022a) Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. 2022a. Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350.
  • Press et al. (2022b) Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. 2022b. Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350.
  • Qwen et al. (2025) Qwen Team: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, and 25 others. 2025. Qwen2.5 technical report. Preprint, arXiv:2412.15115.
  • Schnitzler et al. (2024) Julian Schnitzler, Xanh Ho, Jiahao Huang, Florian Boudin, Saku Sugawara, and Akiko Aizawa. 2024. Morehopqa: More than multi-hop reasoning. arXiv preprint arXiv:2406.13397.
  • Shao et al. (2023) Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. 2023. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. arXiv preprint arXiv:2305.15294.
  • Shao et al. (2024) Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, and 1 others. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
  • Song et al. (2025) Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. 2025. R1-searcher: Incentivizing the search capability in llms via reinforcement learning. arXiv preprint arXiv:2503.05592.
  • Trivedi et al. (2022a) Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022a. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. arXiv preprint arXiv:2212.10509.
  • Trivedi et al. (2022b) Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022b. Musique: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554.
  • Wang et al. (2022) Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533.
  • Wang et al. (2021) Xiaozhi Wang, Tianyu Gao, Zhaocheng Zhu, Zhengyan Zhang, Zhiyuan Liu, Juanzi Li, and Jian Tang. 2021. Kepler: A unified model for knowledge embedding and pre-trained language representation. Transactions of the Association for Computational Linguistics, 9:176–194.
  • Wang et al. (2025) Ziliang Wang, Xuhui Zheng, Kang An, Cijun Ouyang, Jialu Cai, Yuhang Wang, and Yichao Wu. 2025. Stepsearch: Igniting llms search ability via step-wise proximal policy optimization. arXiv preprint arXiv:2505.15107.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, and 1 others. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837.
  • Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600.
  • Yue et al. (2024) Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, and Michael Bendersky. 2024. Inference scaling for long-context retrieval augmented generation. arXiv preprint arXiv:2410.04343.
  • Zhang et al. (2025a) Dingchu Zhang, Yida Zhao, Jialong Wu, Baixuan Li, Wenbiao Yin, Liwen Zhang, Yong Jiang, Yufeng Li, Kewei Tu, Pengjun Xie, and 1 others. 2025a. Evolvesearch: An iterative self-evolving search agent. arXiv preprint arXiv:2505.22501.
  • Zhang et al. (2025b) Weizhi Zhang, Yangning Li, Yuanchen Bei, Junyu Luo, Guancheng Wan, Liangwei Yang, Chenxuan Xie, Yuyao Yang, Wei-Chieh Huang, Chunyu Miao, and 1 others. 2025b. From web search towards agentic deep research: Incentivizing search with reasoning agents. arXiv preprint arXiv:2506.18959.
  • Zhao et al. (2023) Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, and Shen Li. 2023. Pytorch fsdp: Experiences on scaling fully sharded data parallel. Preprint, arXiv:2304.11277.
  • Zheng et al. (2025) Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. 2025. Deepresearcher: Scaling deep research via reinforcement learning in real-world environments. arXiv preprint arXiv:2504.03160.