
DHEvo: Data-Algorithm Based Heuristic Evolution for Generalizable MILP Solving

Zhihao Zhang (Harbin Institute of Technology), Siyuan Li (Harbin Institute of Technology, Nanyang Technological University), Chenxi Li (Harbin Institute of Technology), Feifan Liu (Harbin Institute of Technology), Mengjing Chen (Huawei Noah’s Ark Lab), Kai Li (Huawei Noah’s Ark Lab), Tao Zhong (Huawei Noah’s Ark Lab), Bo An (Nanyang Technological University), Peng Liu (Harbin Institute of Technology)
Abstract

Primal heuristics play a critical role in improving the efficiency of mixed-integer linear programming (MILP) solvers. As large language models (LLMs) have demonstrated superior code generation abilities, recent MILP works have leveraged evolutionary computation approaches with LLMs to generate effective primal heuristics. Although the generated heuristics achieve better solving performance than hand-crafted ones, which offer little adaptability, the advantage of current LLM-based methods is limited to a few MILP instances in one problem class, as they fail to capture the instance characteristics of the problem class (the MILP instances generated from the same mathematical model are defined as a problem class). Since MILP instances often differ significantly in structure and feature distribution, neglecting their characteristics in the evolution process results in poor generalization within the same problem class. To overcome this challenge, we propose a data-algorithm co-evolution framework (DHEvo) that iteratively selects representative instances and evolves corresponding heuristics. Starting from the initial instance distribution, we develop an LLM-based multi-agent system to generate data-code pairs simultaneously. These data-code pairs are iteratively refined based on their fitness scores, leading to the identification of the most effective heuristic over the entire problem class. Extensive experiments across diverse MILP benchmarks demonstrate that our approach significantly outperforms both human-designed heuristics and existing LLM-based methods.

1 Introduction

Mixed-Integer Linear Programming (MILP) constitutes a fundamental modeling and solution framework in operations research. It has been widely applied to a broad range of real-world problems, including supply chain optimization (liu2008tsp; jeong2019biodiesel; jokinen2015milp), hardware design (ma2019accelerating; hafer1991constraint), production scheduling (chen2010integrated; caumond2009milp; superchi2024optimization), and energy management (chang2004practical; kassab2024optimal; zare2024efficient). In practical applications, a complex MILP problem is often defined by numerous parameters, such as cost coefficients, constraints, and bounds. These can all be mathematically represented as:

$$z^{\dagger} := \min_{x \in P^{\dagger}} c^{\top}x, \qquad P^{\dagger} = \left\{ x \in \mathbb{R}^{n} \mid Ax \leq b,\ \underline{\pi} \leq x \leq \overline{\pi},\ x_{j} \in \mathbb{Z}\ \forall j \in \mathcal{I} \right\},$$

where $M^{\dagger} := (c, P^{\dagger})$, $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^{m}$, $c, x \in \mathbb{R}^{n}$, $\underline{\pi}, \overline{\pi} \in \mathbb{R}_{\infty}^{n}$, and $\mathcal{I} \subseteq \{1, \dots, n\}$ indexes the integer-constrained variables.
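As a concrete illustration, the components of this formulation can be encoded directly. The following is a minimal sketch with hypothetical data (a two-variable toy instance, not taken from the paper) that tests membership in $P^{\dagger}$:

```python
# Toy instance of the MILP form above (hypothetical data):
# min c^T x  s.t.  Ax <= b,  lower <= x <= upper,  x_j integral for j in I.
c = [1.0, 2.0]
A = [[1.0, 1.0]]          # single constraint row: x0 + x1 <= 3
b = [3.0]
lower, upper = [0.0, 0.0], [5.0, 5.0]
int_idx = [0]             # I = {0}: x0 must take an integer value

def is_feasible(x, tol=1e-6):
    """Membership test for the feasible region P^dagger defined above."""
    rows_ok = all(sum(a_i * x_i for a_i, x_i in zip(row, x)) <= b_k + tol
                  for row, b_k in zip(A, b))
    bounds_ok = all(lo - tol <= x_i <= up + tol
                    for lo, x_i, up in zip(lower, x, upper))
    integral_ok = all(abs(x[j] - round(x[j])) <= tol for j in int_idx)
    return rows_ok and bounds_ok and integral_ok

def objective(x):
    """Objective value c^T x."""
    return sum(c_i * x_i for c_i, x_i in zip(c, x))
```

For instance, `[1.0, 1.5]` is feasible (x0 is integral, the constraint and bounds hold), while `[0.5, 0.5]` violates integrality of x0.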

In real-world settings, each instance corresponds to a unique and specific problem configuration, even if it originates from the same MILP model. Moreover, the structural and statistical characteristics of these instances can vary significantly, leading to large intra-class diversity. Therefore, well-designed primal heuristics (wong1984worst; balas2004pivot; berthold2006primal; wallace2010zi; witzig2021conflict), which aim to find feasible solutions quickly, are important not only for improving solver efficiency but also for achieving robust generalization across instances within the same problem class.

Current heuristic design approaches for MILP can be broadly categorized into two types: (i) manually designed and (ii) automatically generated in an adaptive manner. Manually designed heuristics, such as those implemented in solvers like SCIP (scip) and Gurobi (gurobi), rely heavily on expert knowledge and often struggle to adapt to novel problem instances. The alternative paradigm is the dynamic generation of heuristic algorithms, which mainly combines large language models (LLMs) with evolutionary computation (EC) to automate heuristic design (yang2023large; meyerson2024language; chen2023evoprompting; romera2024mathematical; liu2024systematic; liu2024evolution; zhou2024llm4solver). These approaches aim to leverage the reasoning capabilities of LLMs to generate more effective and adaptive heuristics. Nevertheless, current methods mostly perform the evolutionary computation process on only a few instances. Consequently, the generated algorithms fail to capture the shared instance characteristics of the problem class, leading to insufficient representational ability and poor generalization across instances within the same problem class, as observed in our ablation study. In summary, both manually crafted and automatically generated heuristics struggle to balance efficiency with generalization, which stems from their limited ability to model the instance distribution and capture shared structural patterns across different instances.

To address this, we propose a data-algorithm co-evolution framework (DHEvo) that generates generalizable algorithms by iteratively evolving both the MILP data and the algorithms. We start by randomly sampling instances from a domain-specific dataset and using an LLM-based multi-agent evolution system (MA-Evolution System) to create initial data-code pairs. Inspired by few-shot learning (fewshot1; fewshot2; fewshot3; fewshot4), we assume that data-code pairs with the highest fitness (relative primal gap) are more likely to generalize well across instances within a problem class. Therefore, the pairs with the highest fitness scores are selected to form the initial population for further evolution. This process is iterated over multiple generations, gradually narrowing down to the most representative algorithms and instances.

Figure 1: Illustration of the data-algorithm co-evolution framework (DHEvo).

In summary, our contributions are as follows:

  • We propose a unified evolutionary computation framework based on data-algorithm co-evolution. It enables better approximation of the instance distribution and enhances the representational capacity of the learned heuristics, leading to improved generalization.

  • We further introduce a co-evolutionary solution tailored to MILP tasks by incorporating a multi-agent evolution system (MA-Evolution System). It refines the evolutionary process, minimizing generation errors and optimizing the quality of the resulting heuristics.

  • Extensive experiments show that our method significantly improves the generalization of diving heuristics and delivers substantial performance gains across multiple MILP datasets.

2 Background and related works

2.1 Branch-and-bound and the diving heuristic

A common approach to solving MILP problems is Branch-and-Bound (B&B) (land2009automatic), which recursively builds a search tree by selecting variables and partitioning the problem into subproblems. These subproblems are created by adding constraints based on the fractional value of the variable in the LP relaxation solution. B&B also prunes branches using objective bounds to improve efficiency. However, B&B can be computationally expensive for large-scale problems, so primal heuristics like the diving heuristic are often used to accelerate the search. Diving performs a depth-first search by iteratively rounding variables and solving the linear program until a feasible solution is found or infeasibility is proven. While existing diving heuristics are effective, they often require manual tuning and expert knowledge to design. In contrast, our approach uses evolutionary computation to automatically generate problem-specific heuristics, offering more flexibility, adaptability, and reduced reliance on expert knowledge. Our experiments show that this approach significantly improves solver performance across multiple datasets.
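The diving loop described above can be sketched as follows. This is a simplified illustration, not the paper's implementation: `solve_lp` is a hypothetical stand-in for the solver's LP-relaxation call (a real implementation would use SCIP's LP interface and would also handle infeasibility), and the variable-selection rule shown is plain fractional diving.

```python
import math

def solve_lp(fixed):
    """Stub LP relaxation (hypothetical): returns a fractional solution,
    honoring any variables already fixed by the dive."""
    base = {"x1": 0.4, "x2": 0.7, "x3": 1.0}
    sol = dict(base)
    sol.update(fixed)
    return sol

def diving_heuristic(int_vars, max_depth=10):
    """Generic diving loop: repeatedly round the most fractional integer
    variable and re-solve the LP until the solution becomes integral."""
    fixed = {}
    for _ in range(max_depth):
        sol = solve_lp(fixed)
        # collect integer variables that are still fractional
        frac = {v: sol[v] for v in int_vars
                if v not in fixed and abs(sol[v] - round(sol[v])) > 1e-6}
        if not frac:                     # all integral -> feasible dive found
            return sol
        # fractional-diving rule: pick the variable closest to 0.5 fractionality
        v = max(frac, key=lambda k: min(frac[k] - math.floor(frac[k]),
                                        math.ceil(frac[k]) - frac[k]))
        fixed[v] = round(frac[v])        # round and descend one level
    return None                          # depth limit reached without success
```

With the stub data above, the dive fixes `x1` to 0, then `x2` to 1, and terminates with an integral solution.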

2.2 Performance measurement

To evaluate the performance of MILP solvers, we use several key performance metrics: Primal-Dual Gap, Primal-Dual Integral, and Primal Gap.

Primal-Dual Gap It is a widely used measure that quantifies the difference between the primal objective value and the dual objective value at any given time during the optimization process. It gives an indication of how close the current solution $\tilde{z}$ is to an optimal solution $\tilde{z}^{*}$. Mathematically, the Primal-Dual Gap is defined as:

$$\gamma_{pd}(\tilde{z}, \tilde{z}^{*}) = \begin{cases} \dfrac{|\tilde{z} - \tilde{z}^{*}|}{\max(|\tilde{z}|, |\tilde{z}^{*}|)} & \text{if } 0 < \tilde{z}, \tilde{z}^{*} < \infty, \\ 1 & \text{otherwise.} \end{cases}$$

Primal-Dual Integral While the primal-dual gap captures a snapshot at a particular time, the primal-dual integral evaluates the solver’s progress over the entire solving process by aggregating the primal-dual gap over time. It is given by:

$$\gamma_{pdi}(t) = \int_{0}^{t} \gamma_{pd}(\tilde{z}(\tau), \tilde{z}^{*}(\tau))\, d\tau,$$

where $\gamma_{pd}(\tilde{z}(\tau), \tilde{z}^{*}(\tau))$ represents the Primal-Dual Gap at time $\tau$.

Primal Gap It is used to evaluate the effectiveness of diving heuristics, which primarily aim to improve primal performance by guiding the search toward better feasible solutions. The relative primal gap is defined as the absolute difference between the current objective value $\tilde{z}$ and the optimal value $z^{\dagger}$, normalized by the objective value of the optimal solution. The formula for the primal gap is given by:

$$\gamma_{p}(\tilde{z}) = \frac{|\tilde{z} - z^{\dagger}|}{|z^{\dagger}|},$$

where $z^{\dagger}$ is the objective value of the optimal solution obtained after presolving. In the case where $|z^{\dagger}| = 0$, we use the following modified primal gap:

$$\gamma'_{p}(\tilde{z}) = |\tilde{z} - z^{\dagger}|.$$
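Under the definitions above, the three metrics can be computed as follows. This is a straightforward sketch: the integral is approximated from a sampled time series of incumbent and bound values (the discretization is an implementation choice, not from the paper).

```python
def primal_dual_gap(z, z_star):
    """Primal-dual gap; returns 1 unless both values are finite and positive."""
    if 0 < z < float("inf") and 0 < z_star < float("inf"):
        return abs(z - z_star) / max(abs(z), abs(z_star))
    return 1.0

def primal_gap(z, z_dagger):
    """Relative primal gap against the (presolved) optimal value z_dagger,
    falling back to the absolute difference when |z_dagger| = 0."""
    if abs(z_dagger) == 0:
        return abs(z - z_dagger)
    return abs(z - z_dagger) / abs(z_dagger)

def primal_dual_integral(times, primal_vals, dual_vals):
    """Left-endpoint Riemann approximation of the primal-dual integral
    over a sampled trajectory of (time, primal, dual) values."""
    total = 0.0
    for i in range(len(times) - 1):
        total += primal_dual_gap(primal_vals[i], dual_vals[i]) * (times[i + 1] - times[i])
    return total
```

For example, a primal value of 10 against a dual bound of 8 yields a primal-dual gap of 0.2, and a negative primal value falls through to the default of 1.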

2.3 LLM for evolutionary computation

Evolutionary computation is a widely used method for solving optimization problems inspired by natural evolution (back1997handbook). Given an optimization problem, evolutionary algorithms (zhou2019evolutionary; eiben2015evolutionary) treat each candidate solution as an individual. They iteratively apply operations such as parent selection, crossover, mutation, fitness evaluation, and survivor selection to evolve the population toward better solutions.

In recent years, the capabilities of large language models have advanced significantly (naveed2023comprehensive). Researchers have started exploring the integration of LLMs into evolutionary computation frameworks to automatically generate heuristic algorithms (liu2024systematic; zhang2024understanding; wu2024evolutionary). For example, FunSearch (romera2024mathematical) developed an LLM-based evolutionary computation framework that iteratively improves the quality of generated functions. EoH (liu2024evolution) further integrates thoughts with code to generate more effective algorithms and has achieved promising results on problems such as online bin packing. However, current methods typically train on a limited set of specific instances, which prevents large language models from learning the common characteristics of the problem class. As a result, the generated algorithms excel on similar instances but generalize poorly to other instances within the same class.

3 Method

In this section, we introduce the problem setting and our key insights in subsection 3.1, and then present the proposed data-algorithm co-evolution framework in subsection 3.2. Finally, we detail the implementation process, including the specific evolutionary operation and the design of the MA-Evolution System in subsection 3.3.

3.1 Problem formulation and insight

In real-world MILP tasks, instances within the same problem class can differ significantly in their distributions, constraints, and structures. However, they often share common features, such as similar types of constraints, variable ranges, or recurring patterns in the objective function. As shown in Figure 2, we visualize 17 typical features, including variable-constraint ratios, from four combinatorial optimization datasets (detailed results can be found in the appendix). How to extract these shared features within a problem class to improve algorithm performance has become a key focus of LLM-based evolutionary computation research.

Figure 2: Visualization of instance features via t-SNE. Panel a shows the overall distribution of instances across four datasets, while panels b, c, d, and e show the distributions for Cauctions, Setcover, Facilities, and Indset, respectively. The Setcover (orange) and Facilities (green) datasets appear as single points due to their fixed numbers and proportions of variables and constraints.

Current evolutionary computation methods typically evaluate heuristics by averaging their performance over a small number of randomly selected instances. Specifically, given a heuristic algorithm $h$ and a set of instances $\mathcal{I} = \{I_1, I_2, \dots, I_k\}$, the fitness of the heuristic is calculated as:

$$\mathcal{F}(h) = \frac{1}{k} \sum_{i=1}^{k} \mathrm{Perf}(h, I_{i}),$$

where $\mathrm{Perf}(h, I_{i})$ represents the performance of the heuristic $h$ on instance $I_{i}$.

This method assumes that all instances are equally representative, which is often not the case in MILP problems, where instance structure can vary significantly. To account for this, we introduce the variance of performance $\sigma^{2}(h)$ over a broader set of instances $\mathcal{D}$. A high $\sigma^{2}(h)$ indicates that the heuristic performs inconsistently across different instances, which suggests poor generalization. Inspired by few-shot learning (fewshot1; fewshot2; fewshot3; fewshot4), extensive research (wu2017towards; akbari2021does; jiang2019fantastic) has shown that starting with "simple" or "representative" samples in complex datasets often enhances both learning efficiency and generalization. In this context, to identify the most representative instances, we hypothesize that instances with higher fitness scores in the evolutionary computation process are likely to exhibit greater structural representativeness. The rationale behind this assumption is as follows:
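For concreteness, the mean fitness $\mathcal{F}(h)$ and the variance $\sigma^{2}(h)$ discussed above can be computed as follows (a minimal sketch; `perf` stands for any scalar performance measure, such as the negative relative primal gap):

```python
import statistics

def fitness_stats(heuristic, instances, perf):
    """Mean fitness F(h) and population variance sigma^2(h) of a heuristic
    over a set of instances; `perf(h, I)` returns a scalar score."""
    scores = [perf(heuristic, I) for I in instances]
    return statistics.mean(scores), statistics.pvariance(scores)
```

A heuristic with a high mean but a large variance performs well only on a subset of instances, which is exactly the inconsistency the variance term is meant to expose.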

Insight 1: High fitness scores indicate structural representativeness. Let $I$ be an MILP instance with LP relaxation $I_{LP}$, and let $\Delta(I) = |z^{*}_{LP} - z^{*}|$ denote the integrality gap of $I$, where $z^{*}_{LP}$ and $z^{*}$ are the optimal objective values of the LP relaxation and the integer program, respectively. If the integrality gap is small, i.e., $\Delta(I) \leq \epsilon$, then heuristics trained on $I$ will achieve lower variance $\sigma^{2}(h)$ on instances with similar LP relaxation tightness $\Delta(I)$. This suggests that higher fitness scores, which often correspond to tighter LP relaxations, indicate a more representative structure that enables better generalization across similar instances.

Insight 2: Regular feasible regions enhance heuristic stability. Let $\mathcal{R}(I)$ denote the feasible region of instance $I$, and let $\nu(I)$ be a measure of the irregularity of $\mathcal{R}(I)$, such as the number of disconnected components or redundant constraints. If $\nu(I)$ is small, then heuristics trained on $I$ will perform more consistently on similar instances with small $\nu(I)$. This suggests that instances with more regular feasible regions are more stable, and thus their heuristics are likely to generalize better.

3.2 Data-Algorithm based heuristic evolution framework

As illustrated in Figure 1, our framework adopts a structured evolutionary process that tightly couples instance selection with heuristic generation and optimization. This process is designed to progressively improve both the solution quality and the generalization capability of the generated heuristics through data-algorithm co-evolution.

Algorithm 1 DHEvo framework
1: Input: problem distribution $\mathcal{D}$, population size $m$, number of instances $n$, top-$k$, total iterations $T$
2: Output: final heuristic algorithm population $\mathcal{H}^{\text{final}}$
3: Initialization: sample initial instance set $\mathcal{I}_0 = \{I_1, \dots, I_n\} \sim \mathcal{D}$
4: Generate initial algorithm population $\mathcal{H}_0 = \{h_1^{(0)}, \dots, h_m^{(0)}\}$ via the MA-Evolution System (MA-E)
5: for each $h_j^{(0)} \in \mathcal{H}_0$ do
6:     Evaluate $f(h_j^{(0)}) = \mathrm{Perf}(h_j^{(0)}, I_j)$
7: end for
8: for each $I_i \in \mathcal{I}_0$ do
9:     Generate $\mathcal{H}_i^{(1)} = \text{MA-E}(I_i, \mathcal{H}_0)$
10:    Evaluate fitness for each $h \in \mathcal{H}_i^{(1)}$
11:    Select top performer $h_i^{*} = \arg\max_{h} \mathrm{Perf}(h, I_i)$
12: end for
13: Let $\mathcal{P}^{*} \leftarrow$ top-$k$ pairs in $\mathcal{P}^{(1)}$ ranked by $\mathrm{Perf}(I_i, h_i^{*})$
14: for iteration $t = 2$ to $T$ do
15:    Re-Initialization:
16:    for each $(I_j, h_j^{*}) \in \mathcal{P}^{*}$ do
17:        Generate new candidates via prompt: $\mathcal{H}_j^{(t)} = \text{MA-E}(I_j, \mathrm{Prompt}(h_j^{*}))$
18:        Evaluate each $h \in \mathcal{H}_j^{(t)}$ and select $h_j^{(t)} = \arg\max_{h} \mathrm{Perf}(h, I_j)$
19:    end for
20:    Update $\mathcal{P}^{(t)} = \{(I_j, h_j^{(t)})\}$ and sample top-$k$ pairs $\mathcal{P}^{*} \leftarrow \mathrm{Sample}(\mathrm{Perf}(h, I_i))$
21: end for
22: Final Selection:
23: Compute $f_{\text{avg}}(h_j^{(T)}) = \frac{1}{k} \sum_{i=1}^{k} \mathrm{Perf}(h_j^{(T)}, I_i)$
24: Select top heuristics $\mathcal{H}^{\text{final}} = \{h_j^{(T)}\}_{j=1}^{k}$
25: return $\mathcal{H}^{\text{final}}$

We assign fitness scores to each instance individually rather than averaging them across the dataset, since averaging allows complex or unrepresentative instances to dominate the optimization process and ultimately undermines the generalization ability of the learned heuristics. As shown in Algorithm 1, the framework begins by selecting a subset of high-quality instance-heuristic pairs $(\mathcal{I}^{*}, \mathcal{H}^{*})$, where each heuristic in $\mathcal{H}^{*}$ demonstrates strong performance on its corresponding instance in $\mathcal{I}^{*}$. These pairs are identified by evaluating all candidate heuristics against a sampled MILP dataset and ranking them based on an instance-wise fitness score. Instances with higher scores indicate better structural representativeness for the problem class.

To balance exploration and exploitation during selection, we employ a temperature-controlled strategy. The retention probability 𝒮𝒮\mathcal{S}caligraphic_S for each candidate is modeled by a sigmoid function with temperature parameter T𝑇Titalic_T:

$$\mathcal{S} = \frac{1}{1 + \exp\left(-\frac{1}{T}\right)}.$$

Higher temperatures encourage diversity by favoring a broader range of candidates, while lower temperatures promote exploitation of already promising heuristics. Once the top-performing pairs $(\mathcal{I}^{*}, \mathcal{H}^{*})$ are selected, the heuristics in $\mathcal{H}^{*}$ are used to extract prompts for generating new candidates via our MA-Evolution System, a multi-agent LLM-based generation and evaluation module. These new heuristics are then re-evaluated on the structurally representative instances from $\mathcal{I}^{*}$, ensuring alignment with the overall problem distribution.
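As a quick check of this behavior, the retention probability can be evaluated directly. Note that, as written, $\mathcal{S}$ depends only on $T$: it approaches 1 at low temperatures (strong exploitation) and 0.5 at high temperatures (near-random retention).

```python
import math

def retention_probability(T):
    """Sigmoid retention probability from the text: S = 1 / (1 + exp(-1/T))."""
    return 1.0 / (1.0 + math.exp(-1.0 / T))
```

For example, at T = 0.1 the probability is essentially 1, while at T = 100 it is close to 0.5.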

This cycle—comprising heuristic generation, evaluation, selection, and re-initialization—is repeated for a fixed number of iterations. Through this iterative process, the framework incrementally converges toward a set of heuristics that exhibit both strong instance-level performance and robust generalization across the problem class.

At the final stage, the top k𝑘kitalic_k instance-heuristic pairs are fixed, and their aggregated performance is assessed. The resulting heuristics constitute the final evolved algorithm portfolio for solving MILP problems.
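Under hypothetical interfaces for the multi-agent generator and the performance evaluator (neither signature is specified in the paper), the overall loop of Algorithm 1 can be sketched as:

```python
def dhevo(instances, init_population, ma_evolve, perf, k=3, iterations=3):
    """Sketch of the DHEvo co-evolution loop. `ma_evolve(instance, seeds)`
    returns candidate heuristics for one instance; `perf(h, inst)` scores a
    heuristic on one instance (higher is better)."""
    # round 1: evolve one champion heuristic per sampled instance
    pairs = []
    for inst in instances:
        cands = ma_evolve(inst, init_population)
        best = max(cands, key=lambda h: perf(h, inst))
        pairs.append((inst, best))
    # keep the top-k instance-heuristic pairs by instance-wise fitness
    pairs = sorted(pairs, key=lambda p: perf(p[1], p[0]), reverse=True)[:k]
    # subsequent rounds: re-initialize from the surviving champions
    for _ in range(iterations - 1):
        new_pairs = []
        for inst, h in pairs:
            cands = ma_evolve(inst, [h])
            new_pairs.append((inst, max(cands, key=lambda c: perf(c, inst))))
        pairs = sorted(new_pairs, key=lambda p: perf(p[1], p[0]), reverse=True)[:k]
    # final selection: rank survivors by average fitness on the kept instances
    kept = [p[0] for p in pairs]
    return sorted((h for _, h in pairs),
                  key=lambda h: sum(perf(h, i) for i in kept) / len(kept),
                  reverse=True)
```

With toy numeric "heuristics" and a quadratic score, the loop returns one well-matched heuristic per retained instance, mirroring the data-algorithm co-evolution described above.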

Figure 3: Illustration of MA-Evolution System.

3.3 Framework implementation

Evolution operation Our evolutionary framework comprises four main operations: initialization, crossover, mutation, and parent selection. As shown in Figure 3, we implement initialization, crossover, and mutation through tailored prompts and use them to generate candidate individuals. Unlike traditional LLM-based evolutionary approaches, we leverage the MA-Evolution System to perform both crossover and mutation, enabling more targeted and problem-aware generation of new individuals. Specific initialization prompts are employed only in the first generation to produce the initial population; in subsequent generations, the initialization step reuses high-quality algorithms obtained from the previous round. During crossover, parent heuristics are combined via dedicated prompts to create new candidate algorithms, and mutation then slightly alters these candidates to explore nearby solutions. To balance exploration and exploitation during parent selection, we adopt fitness-proportional selection (zhou2019evolutionary, ), assigning selection probabilities to individuals based on their fitness scores.

MA-Evolution System To generate high-quality heuristics, we propose a multi-agent evolution system inspired by multi-agent systems (mad1, ; mad2, ; mad3, ; mad4, ). As shown in Figure 3, the process includes three stages. In the first stage, the Designer agent receives the MILP task context, the existing code, and the specified evolutionary operation, and produces a high-level design plan and procedural outline for a new heuristic. In the second stage, the Coder agent implements the algorithm based on the Designer's plan; the Reviewer agent then checks the code by compiling it and performing logical analysis, providing feedback and suggestions, and the two agents iteratively improve the code over several rounds of interaction. In the final stage, if no consensus is reached, the Judge agent reviews the full interaction history and feedback and makes a final decision on the output code and its description.
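The three-stage Designer → Coder ⇄ Reviewer → Judge pipeline can be sketched as follows. The `llm(role, prompt)` interface and the "OK" consensus signal are hypothetical simplifications; the actual system exchanges richer structured messages:

```python
def ma_evolution(task_context, parent_code, operation, llm, max_rounds=3):
    """Sketch of the MA-Evolution pipeline (hypothetical llm interface).

    llm(role, prompt) -> str. Role names follow the paper; the prompt
    templates and the "OK" consensus convention are placeholders.
    """
    # Stage 1: the Designer drafts a high-level plan.
    plan = llm("Designer",
               f"Task: {task_context}\nOperation: {operation}\nCode: {parent_code}")
    # Stage 2: the Coder implements; the Reviewer critiques iteratively.
    code = llm("Coder", f"Implement this plan:\n{plan}")
    for _ in range(max_rounds):
        review = llm("Reviewer", f"Check this code:\n{code}")
        if "OK" in review:            # consensus reached
            return code
        code = llm("Coder", f"Revise per feedback:\n{review}\n{code}")
    # Stage 3: no consensus, so the Judge makes the final call.
    return llm("Judge", f"Pick the final version given the history:\n{code}")
```

The early return on Reviewer approval keeps LLM calls bounded, while the Judge fallback guarantees the pipeline always emits a candidate heuristic.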

Prompt engineering Our prompts are constructed from three essential elements: the designated role of the large language model within the MA-Evolution System, contextual information about the MILP problem, and evolution-specific operations intrinsic to evolutionary computation, such as mutation. The full prompt information is presented in the appendix.
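Combining the three elements into a prompt can be sketched as below; the template wording is illustrative, not the paper's actual prompt (which is given in the appendix):

```python
def build_prompt(role: str, milp_context: str, operation: str) -> str:
    """Assemble a prompt from the three elements described above.

    Hypothetical template: role in the MA-Evolution System, MILP background,
    and the evolutionary operation to perform.
    """
    return (
        f"You are the {role} in a multi-agent heuristic-design system.\n"
        f"MILP background: {milp_context}\n"
        f"Evolutionary operation to perform: {operation}\n"
    )
```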

4 Experiments

4.1 Experimental settings

To demonstrate the superiority of our method on the diving task, we conduct three sets of experiments across six MILP datasets. 1) The first set studies the diving performance of our method and compares it against existing diving heuristics. 2) The second set evaluates the solving efficiency of our method against default SCIP (achterberg2007constraint, ), tuned SCIP, and EoH (liu2024evolution, ). 3) The third set evaluates the effectiveness of our method on real-world datasets: we replace the default diving heuristics in the solver with our proposed method to improve overall performance on practical MILP tasks. The results of this benchmark experiment are reported in the Appendix.

4.2 Experiments for diving performance

Experimental setup In this first set of experiments, we evaluate a total of 12 publicly available diving methods on four combinatorial optimization benchmarks: cauctions, setcover, facilities, and indset. These heuristics include seven human-designed algorithms integrated into the open-source solver SCIP (i.e., coefficient, fractional, linesearch, pseudocost, distributional, vectorlength, and farkas (witzig2021conflict, )), a learning-based GNN method L2DIVE (paulus2023learning, ), and four heuristics generated via large language models using evolutionary computation: LLM4Solver (zhou2024llm4solver, ), FunSearch (romera2024mathematical, ), EoH (liu2024evolution, ), and HillClimb (HillClimb, ).

Table 1: The average relative primal gap (standard error in parentheses) of different diving heuristics. The results compare our method with other LLM-based evolutionary approaches, as well as the human-designed and learning-based baselines.
Category Method Cauctions Facilities Setcover Indset
LLM-based Evolution DHEvo(Ours) 1.92 (2.45) 0.70 (1.40) 9.74 (7.35) 1.12 (1.31)
LLM4Solver (zhou2024llm4solver, ) 2.50 (3.50) 0.85 (1.42) 18.33 (19.26) 1.13 (1.15)
Funsearch (romera2024mathematical, ) 3.04 (7.35) 1.18 (3.06) 77.99 (83.89) 1.61 (3.75)
HillClimb (HillClimb, ) 6.10 (60.30) 0.75 (1.40) 81.55 (343.17) 1.61 (3.75)
EoH (liu2024evolution, ) 3.15 (3.15) 0.80 (1.47) 20.39 (19.70) 0.92 (1.06)
Hand-crafted Heuristics Coefficient 23.67 (2.14) 3.20 (3.76) 68.58 (345.99) 4.23 (14.42)
Distributional 47.80 (71.56) 1.46 (2.12) 75.79 (325.90) 2.57 (10.59)
Farkas 23.32 (0.89) 1.04 (1.64) 8.13 (8.22) -
Pseudocost 22.51 (2.30) 1.06 (1.23) 23.56 (30.31) 3.31 (2.98)
Linesearch 22.95 (0.90) 13.80 (10.94) 68.59 (346.00) 3.31 (3.10)
Vectorlength 42.93 (83.57) 13.93 (10.61) 68.59 (346.01) 8.89 (7.61)
Learning-based L2DIVE (paulus2023learning, ) 2.60 0.71 3.58 1.37

To independently investigate the quality of heuristic algorithms, diving is applied solely at the root node of each instance; all other components of the solver, such as branching, cutting planes, and other primal heuristics, are disabled. For fitness evaluation during evolution, we generate 50 instances each for setcover, cauctions, and indset, and 25 instances for facilities, and we test the discovered diving heuristics on 100 instances each. For fairness, all LLM-based evolutionary algorithms are trained on the same dataset mentioned above and use the same LLM API interface. Moreover, we ensure that the prompts used for these methods are aligned in terms of task context, including MILP-specific background and diving-related objectives, to match the design of our proposed method.
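The relative primal gap used throughout the evaluation can be computed as below. This follows a common convention (e.g., gap = 100 when the objective signs differ, 0 when both are zero); the paper's exact normalization may differ slightly:

```python
def relative_primal_gap(obj: float, best_known: float) -> float:
    """Relative primal gap in percent, under a common convention.

    Returns 0 when both objectives are zero, 100 when the signs differ,
    and |obj - best| / max(|obj|, |best|) * 100 otherwise. This is one
    standard definition, not necessarily the paper's exact one.
    """
    if obj == best_known == 0.0:
        return 0.0
    if obj * best_known < 0:
        return 100.0
    return abs(obj - best_known) / max(abs(obj), abs(best_known)) * 100.0
```

Averaging this quantity over the test instances gives the "average relative primal gap" reported in Table 1.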

Experimental results We compare our method to various baselines in Table 1. Under the metric of average primal gap, our method consistently achieves strong results across all datasets. In particular, on the Indset dataset, our approach improves over the best manually designed heuristic by 56.04%. Compared to other LLM-based algorithm design methods, our approach also achieves state-of-the-art performance. For example, on the Setcover dataset, our method surpasses the best LLM-based baseline by 61.8%. More importantly, in terms of performance variance, our method achieves the lowest variance across all four datasets. Notably, on the setcover dataset, our approach reduces variance by 46.9% compared to the best-performing LLM-based algorithm design method.

Considering both average performance and variance, our method consistently outperforms all baselines across the four combinatorial optimization datasets. This demonstrates the strong effectiveness and robustness of our approach in generating high-quality heuristics that generalize well across diverse problem instances.

4.3 Experiments for solving efficiency

Experimental setup To compare the solving efficiency of our method, we use three baselines: default SCIP, tuned SCIP (with adjusted freq and freqofs), and EoH. We run experiments on the same four combinatorial optimization benchmark datasets. For each dataset, we randomly generate 1000 instances and select the 100 most challenging ones for testing. We set a time limit of $T_{\text{limit}} = 900$ seconds per instance and use the primal-dual integral to evaluate solving efficiency.
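The primal-dual integral accumulates the gap between the primal and dual bounds over solving time. A simplified sketch, assuming a piecewise-constant gap trace (SCIP's internal PDI additionally normalizes the gap, which we omit here):

```python
def primal_dual_integral(events, t_limit):
    """Integrate a piecewise-constant gap trace over time.

    `events` is a time-sorted list of (time, gap) pairs: each gap holds from
    its event time until the next event (or until t_limit). Simplified
    sketch of the metric, without SCIP's exact normalization.
    """
    total = 0.0
    for i, (t, gap) in enumerate(events):
        t_next = events[i + 1][0] if i + 1 < len(events) else t_limit
        total += gap * (t_next - t)
    return total
```

A smaller integral means the gap closed earlier, so lower PDI values in Table 2 indicate better anytime performance.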

Experimental results We compare our method to the default, tuned SCIP settings and EoH, as shown in Table 2. Results demonstrate that our approach not only improves solution quality but also leads to better solving efficiency. On the challenging facility dataset, our method outperforms the current state-of-the-art by 6.7% in solving time and 2.8% in primal-dual integral.

Table 2: Performance comparison of our method, EoH, default SCIP, and tuned SCIP. Each cell reports the solving time and the primal-dual integral (PDI) as time / PDI.
Method Cauctions Facility Setcover Indset
Default SCIP 4.08/55.87 301.20/506.71 2.43/117.65 21.07/230.33
Tuned SCIP 2.73/24.21 201.64/553.15 2.33/77.02 22.71/167.43
EoH (liu2024evolution, ) 2.62/37.12 197.35/504.56 2.76/96.75 20.32/151.34
DHEvo(Ours) 2.28/23.42 181.27/490.43 2.27/75.88 18.54/146.39

4.4 Ablation Study

To verify the effectiveness of each component of our method, we conduct an ablation analysis on four combinatorial optimization datasets. Our framework contains two main components: the data-algorithm co-evolution mechanism and the MA-Evolution System. We ablate each in turn and report results on the validation dataset.

Analysis on data-algorithm co-evolution We conduct a comprehensive ablation study and comparative evaluation to assess the effectiveness of the data-algorithm co-evolution mechanism. When this mechanism is excluded, the fitness score is computed as the average relative primal gap across all training instances without any selection. As a result, the variance of the evolved heuristics' performance increases by approximately 10% on each dataset, and the overall solution quality declines by around 10% across all datasets. This degradation arises because all training instances are treated equally during evolution, allowing complex or unrepresentative instances to dominate the optimization process, which undermines the generalization of the learned heuristics. Additionally, we replace our MA-Evolution System with the best existing LLM-based evolutionary method, EoH. As shown in Table 3, the variants augmented with our co-evolution framework achieve markedly better generalization than their original versions. In particular, the variance of the heuristics evolved by EoH decreases by nearly 30% on the setcover dataset once the co-evolution mechanism is added.

Table 3: Comparison of standard error and average relative primal gap on validation dataset, including our full method, its variant without co-evolution, EoH baseline, and EoH with co-evolution.
Method Cauctions Facilities Setcover Indset
DHEvo 2.15(2.53) 0.83(1.30) 9.74(13.42) 1.01(1.03)
DHEvo_OFF 2.33(2.79) 0.93(1.45) 10.8(13.99) 1.23(1.11)
EoH_DH 2.90(5.60) 0.84(1.47) 18.31(17.48) 1.07(1.14)
EoH 4.38(6.15) 1.96(4.36) 26.14(28.89) 1.36(1.21)

Analysis on MA-Evolution System To verify that the MA-Evolution System generates higher-quality and more diverse individual algorithms, we conduct an ablation study that removes this system from our framework and compares the result with the original version on the setcover dataset.

To evaluate the diversity of algorithms generated by the MA-Evolution System, inspired by diversity indicator metrics  (diversity1, ; diversity2, ), we introduce a diversity index defined as

$DI = \frac{H}{\log_2 N},$

where $H$ is the Shannon entropy of the score distribution over $N$ generated samples. A value closer to 1 indicates higher diversity among solutions.
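The diversity index can be computed as below. Since the paper does not specify how the continuous score distribution is discretized, the equal-width binning here is our assumption:

```python
import math
from collections import Counter

def diversity_index(scores, n_bins=10):
    """DI = H / log2(N): Shannon entropy of the (binned) score distribution,
    normalized by its maximum log2(N). Equal-width binning with n_bins is an
    assumption; the paper does not specify the discretization.
    """
    n = len(scores)
    if n < 2:
        return 0.0
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / n_bins or 1.0   # guard against all-equal scores
    counts = Counter(min(int((s - lo) / width), n_bins - 1) for s in scores)
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return h / math.log2(n)
```

Identical scores give DI = 0 (no diversity), while N samples spread over N distinct bins give DI = 1.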

Table 4: Ablation study of the MA-Evolution system in terms of average primal gap (APG), diversity index (DI), and primal gap standard deviation (PGSD).
Method APG DI PGSD
MA-Evolution OFF 9.14 0.76 8.75
MA-Evolution ON 8.00 0.88 4.78

As shown in Table 4, the algorithms generated by the MA-Evolution System achieve significantly lower average primal gaps, improving by 12.4% over those generated without it. They also show a 15.8% improvement in the diversity index, demonstrating the superior diversity of the generated heuristics.

5 Conclusion

We present a novel data-algorithm co-evolution framework for solving MILP. By iteratively identifying the most representative instances and co-evolving heuristic algorithms based on them, our method significantly improves the generalization ability of the generated heuristics within the same problem class. Unlike traditional approaches that treat training data as static, our method selects representative instances during the evolutionary process, enabling the algorithm to generalize better across diverse problem distributions. Meanwhile, heuristic algorithms are co-evolved with the instance set, ensuring mutual adaptation and continuous improvement. We also introduce a multi-agent evolutionary system to improve generation quality and solution diversity. Experimental results show that our approach significantly outperforms existing human-designed, learning-based, and LLM-based baselines in both the primal gap and solving efficiency.

References

  • [1] Songsong Liu, Jose M Pinto, and Lazaros G Papageorgiou. A tsp-based milp model for medium-term planning of single-stage continuous multiproduct plants. Industrial & Engineering Chemistry Research, 47(20):7733–7743, 2008.
  • [2] Hyunju Jeong, Heidi L Sieverding, and James J Stone. Biodiesel supply chain optimization modeled with geographical information system (gis) and mixed-integer linear programming (milp) for the northern great plains region. BioEnergy research, 12:229–240, 2019.
  • [3] Raine Jokinen, Frank Pettersson, and Henrik Saxén. An milp model for optimization of a small-scale lng supply chain along a coastline. Applied energy, 138:423–431, 2015.
  • [4] Kefan Ma, Liquan Xiao, Jianmin Zhang, and Tiejun Li. Accelerating an fpga-based sat solver by software and hardware co-design. Chinese Journal of Electronics, 28(5):953–961, 2019.
  • [5] Lou Hafer. Constraint improvements for milp-based hardware synthesis. In Proceedings of the 28th ACM/IEEE Design Automation Conference, pages 14–19, 1991.
  • [6] Zhi-Long Chen. Integrated production and outbound distribution scheduling: review and extensions. Operations research, 58(1):130–148, 2010.
  • [7] Anthony Caumond, Philippe Lacomme, Aziz Moukrim, and Nikolay Tchernev. An milp for scheduling problems in an fms with one vehicle. European Journal of Operational Research, 199(3):706–722, 2009.
  • [8] Francesco Superchi, Nathan Giovannini, Antonis Moustakis, George Pechlivanoglou, and Alessandro Bianchini. Optimization of the power output scheduling of a renewables-based hybrid power station using milp approach: The case of tilos island. Renewable Energy, 220:119685, 2024.
  • [9] Gary W Chang, YD Tsai, CY Lai, and JS Chung. A practical mixed integer linear programming based approach for unit commitment. In IEEE Power Engineering Society General Meeting., pages 221–225. IEEE, 2004.
  • [10] Fadi Agha Kassab, Berk Celik, Fabrice Locment, Manuela Sechilariu, Sheroze Liaquat, and Timothy M Hansen. Optimal sizing and energy management of a microgrid: A joint milp approach for minimization of energy cost and carbon emission. Renewable Energy, 224:120186, 2024.
  • [11] Peyman Zare, Abdolmajid Dejamkhooy, and Iraj Faraji Davoudkhani. Efficient expansion planning of modern multi-energy distribution networks with electric vehicle charging stations: A stochastic milp model. Sustainable Energy, Grids and Networks, 38:101225, 2024.
  • [12] Hoon Liong Ong and JB Moore. Worst-case analysis of two travelling salesman heuristics. Operations Research Letters, 2(6):273–277, 1984.
  • [13] Egon Balas, Stefan Schmieta, and Christopher Wallace. Pivot and shift—a mixed integer programming heuristic. Discrete Optimization, 1(1):3–12, 2004.
  • [14] Timo Berthold. Primal heuristics for mixed integer programs. PhD thesis, Zuse Institute Berlin, 2006.
  • [15] Chris Wallace. Zi round, a mip rounding heuristic. Journal of Heuristics, 16:715–722, 2010.
  • [16] Jakob Witzig and Ambros Gleixner. Conflict-driven heuristics for mixed integer programming. INFORMS Journal on Computing, 33(2):706–720, 2021.
  • [17] Ambros Gleixner, Leon Eifler, Tristan Gally, Gerald Gamrath, Patrick Gemander, Robert Lion Gottwald, Gregor Hendel, Christopher Hojny, Thorsten Koch, Matthias Miltenberger, et al. The scip optimization suite 5.0. Optimization Online, 2017.
  • [18] Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2023.
  • [19] Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. arXiv preprint arXiv:2309.03409, 2023.
  • [20] Elliot Meyerson, Mark J Nelson, Herbie Bradley, Adam Gaier, Arash Moradi, Amy K Hoover, and Joel Lehman. Language model crossover: Variation through few-shot prompting. ACM Transactions on Evolutionary Learning, 4(4):1–40, 2024.
  • [21] Angelica Chen, David Dohan, and David So. Evoprompting: Language models for code-level neural architecture search. Advances in neural information processing systems, 36:7787–7817, 2023.
  • [22] Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al. Mathematical discoveries from program search with large language models. Nature, 625(7995):468–475, 2024.
  • [23] Fei Liu, Yiming Yao, Ping Guo, Zhiyuan Yang, Zhe Zhao, Xi Lin, Xialiang Tong, Mingxuan Yuan, Zhichao Lu, Zhenkun Wang, et al. A systematic survey on large language models for algorithm design. arXiv preprint arXiv:2410.14716, 2024.
  • [24] Fei Liu, Xialiang Tong, Mingxuan Yuan, Xi Lin, Fu Luo, Zhenkun Wang, Zhichao Lu, and Qingfu Zhang. Evolution of heuristics: Towards efficient automatic algorithm design using large language model. arXiv preprint arXiv:2401.02051, 2024.
  • [25] Yuyan Zhou, Jie Wang, Yufei Kuang, Xijun Li, Weilin Luo, Jianye HAO, and Feng Wu. Llm4solver: Large language model for efficient algorithm design of combinatorial optimization solver. https://openreview.net/pdf?id=XTxdDEFR6D, 2024.
  • [26] Lu Jiang, Deyu Meng, Qian Zhao, Shiguang Shan, and Alexander Hauptmann. Self-paced curriculum learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.
  • [27] Mengye Ren, Eleni Triantafillou, Sachin Ravi, Jake Snell, Kevin Swersky, Joshua B Tenenbaum, Hugo Larochelle, and Richard S Zemel. Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676, 2018.
  • [28] Ryoma Sato, Makoto Yamada, and Hisashi Kashima. Learning to sample hard instances for graph algorithms. In Asian Conference on Machine Learning, pages 503–518. PMLR, 2019.
  • [29] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of annual international conference on machine learning, pages 41–48, 2009.
  • [30] Ailsa H Land and Alison G Doig. An automatic method for solving discrete programming problems. In 50 Years of Integer Programming 1958-2008: From the Early Years to the State-of-the-Art, pages 105–132. Springer, 2009.
  • [31] Thomas Bäck, David B Fogel, and Zbigniew Michalewicz. Handbook of evolutionary computation. Release, 97(1):B1, 1997.
  • [32] Zhi-Hua Zhou, Yang Yu, and Chao Qian. Evolutionary learning: Advances in theories and algorithms. Springer, 2019.
  • [33] Agoston E Eiben and Jim Smith. From evolutionary computation to the evolution of things. Nature, 521(7553):476–482, 2015.
  • [34] Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, and Ajmal Mian. A comprehensive overview of large language models. arXiv preprint arXiv:2307.06435, 2023.
  • [35] Rui Zhang, Fei Liu, Xi Lin, Zhenkun Wang, Zhichao Lu, and Qingfu Zhang. Understanding the importance of evolutionary search in automated heuristic design with large language models. In International Conference on Parallel Problem Solving from Nature, pages 185–202. Springer, 2024.
  • [36] Xingyu Wu, Sheng-hao Wu, Jibin Wu, Liang Feng, and Kay Chen Tan. Evolutionary computation in the era of large language model: Survey and roadmap. IEEE Transactions on Evolutionary Computation, 2024.
  • [37] Lei Wu, Zhanxing Zhu, et al. Towards understanding generalization of deep learning: Perspective of loss landscapes. arXiv preprint arXiv:1706.10239, 2017.
  • [38] Ali Akbari, Muhammad Awais, Manijeh Bashar, and Josef Kittler. How does loss function affect generalization performance of deep learning? application to human age estimation. In International Conference on Machine Learning, pages 141–151. PMLR, 2021.
  • [39] Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them. arXiv preprint arXiv:1912.02178, 2019.
  • [40] Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi, and Zhaopeng Tu. Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint arXiv:2305.19118, 2023.
  • [41] Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. Chateval: Towards better llm-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201, 2023.
  • [42] Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, et al. Aflow: Automating agentic workflow generation. arXiv preprint arXiv:2410.10762, 2024.
  • [43] Yunxuan Li, Yibing Du, Jiageng Zhang, Le Hou, Peter Grabowski, Yeqing Li, and Eugene Ie. Improving multi-agent debate with sparse communication topology. arXiv preprint arXiv:2406.11776, 2024.
  • [44] Tobias Achterberg. Constraint integer programming. PhD thesis, technical university of berlin, 2007.
  • [45] Max Paulus and Andreas Krause. Learning to dive in branch and bound. Advances in Neural Information Processing Systems, 36:34260–34277, 2023.
  • [46] Rui Zhang, Fei Liu, Xi Lin, Zhenkun Wang, Zhichao Lu, and Qingfu Zhang. Understanding the importance of evolutionary search in automated heuristic design with large language models. In International Conference on Parallel Problem Solving from Nature, pages 185–202. Springer, 2024.
  • [47] Mark Wineberg and Franz Oppacher. The underlying similarity of diversity measures used in evolutionary computation. In Genetic and Evolutionary Computation, pages 1493–1504. Springer, 2003.
  • [48] Adel Nikfarjam, Jakob Bossek, Aneta Neumann, and Frank Neumann. Entropy-based evolutionary diversity optimisation for the traveling salesperson problem. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 600–608, 2021.

Appendix A Diving Heuristics

Diving heuristics are primal heuristics that iteratively fix variables based on LP relaxation solutions, simulating a depth-first search in the branch-and-bound tree. Given the LP relaxation of an MILP:

$z_{LP}^{\dagger} := \min_{x \in P_{LP}^{\dagger}} c^{\top} x, \qquad P_{LP}^{\dagger} = \left\{ x \in \mathbb{R}^{n} \mid Ax \leq b,\ \underline{\pi} \leq x \leq \overline{\pi} \right\},$

the algorithm starts from an LP solution $\hat{x} \in P_{LP}^{\dagger}$ and incrementally fixes fractional variables $x_j \notin \mathbb{Z}$ to integer values. At each step, the feasible region is updated with new bound constraints, and the relaxed problem is re-solved. This process emulates a depth-first traversal of the search space, aiming to quickly construct a feasible integer solution. In general, a generic diving heuristic can be described by Algorithm 2. The only difference among various diving heuristics lies in the scoring function $s(\cdot)$, which determines the variable to round and the direction of rounding at each iteration.

Algorithm 2 Generic Diving Heuristic
Require: MILP with relaxed feasible region $P^{*}$, LP solution $x^{*}$, maximum depth $d_{\text{max}}$, and a scoring function $s$ for selecting branching variables and their rounding direction
Ensure: A set of feasible solutions $\mathcal{X}$ (if found)
1: Initialize depth $d \leftarrow 1$, candidate set $\mathcal{C} \leftarrow \{ j \in \mathcal{I} \mid x^{*}_{j} \notin \mathbb{Z} \}$
2: while $d \leq d_{\text{max}}$ do
3:  $j \leftarrow \arg\max_{i \in \mathcal{C}} s(x_i)$
4:  if round up then
5:   $l_j \leftarrow \lceil x^{*}_{j} \rceil$
6:  else
7:   $u_j \leftarrow \lfloor x^{*}_{j} \rfloor$
8:  end if
9:  $P^{*} \leftarrow P^{*} \cap \{ l_j \leq x_j \leq u_j \}$
10:  if $P^{*}$ is infeasible then
11:   break
12:  end if
13:  $x^{*} \leftarrow \arg\min_{x \in P^{*}} c^{\top} x$
14:  if $x^{*}$ is roundable then
15:   $\mathcal{X} \leftarrow \mathcal{X} \cup \text{round}(x^{*})$
16:  end if
17:  $d \leftarrow d + 1$
18:  Update candidate variable index set $\mathcal{C}$
19: end while
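Algorithm 2 can be sketched in code with a pluggable score function. The `lp_solve` and `score` interfaces below are hypothetical stand-ins for a real LP solver and a diving scoring rule:

```python
import math

def generic_dive(lp_solve, score, x0, integers, d_max=50):
    """Skeleton of the generic diving loop (hypothetical interfaces).

    lp_solve(bounds) -> LP solution dict {var: value} or None if infeasible;
    score(j, x) -> (priority, round_up) for candidate variable j.
    """
    x, bounds, found = dict(x0), {}, []
    for _ in range(d_max):
        # Candidate set: integer variables with fractional LP values.
        cands = [j for j in integers if abs(x[j] - round(x[j])) > 1e-6]
        if not cands:
            found.append(dict(x))        # integral solution reached
            break
        j = max(cands, key=lambda i: score(i, x)[0])
        if score(j, x)[1]:               # round up: tighten lower bound
            bounds[j] = (math.ceil(x[j]), bounds.get(j, (None, None))[1])
        else:                            # round down: tighten upper bound
            bounds[j] = (bounds.get(j, (None, None))[0], math.floor(x[j]))
        x = lp_solve(bounds)             # re-solve the tightened relaxation
        if x is None:                    # LP infeasible: abort the dive
            break
    return found
```

Swapping the `score` function reproduces the different diving heuristics described below, exactly as the text notes.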

Here are some diving heuristic algorithms included in SCIP.

Coefficient. This strategy selects a variable that has the smallest number of positive up-locks or down-locks. These locks represent how many constraints would prevent increasing or decreasing the variable, respectively. The variable is then fixed in the direction where fewer locks occur. If there is a tie between multiple variables, the method uses a secondary rule called fractional diving to break the tie.

Distribution. This method is based on the empirical distribution of fractional values observed in historical solutions. It favors variables that are more frequently fractional in previous LP relaxations. The idea is that such variables are likely to remain fractional and therefore more useful for branching.

Farkas. This strategy tries to construct a Farkas proof to show the infeasibility of the current LP relaxation after branching. It selects the variable whose rounding, in the direction that improves the objective, is predicted to cause the largest gain. This prediction is based on LP dual information or inference from constraint violation. The method is designed to make branching decisions that quickly lead to pruning.

Fractional. This method selects the variable that is closest to an integer value but still fractional. The measure used is $\left| x_j^* - \lfloor x_j^* + 0.5 \rfloor \right|$, which captures how far the variable's value is from the nearest integer. The selected variable is then rounded in the direction that brings it closest to an integer. This approach is simple and focuses on reducing the integrality gap.
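The fractional measure above translates directly into a score function for the generic diving loop; the `(score, round_up)` return convention is our assumption for illustration:

```python
import math

def fractional_score(x_j: float):
    """Fractional diving measure: distance to the nearest integer,
    |x - floor(x + 0.5)|, paired with the rounding direction that moves
    toward that nearest integer. Return convention is illustrative.
    """
    nearest = math.floor(x_j + 0.5)
    return abs(x_j - nearest), nearest > x_j
```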

Linesearch. This method traces a straight line (ray) from the root node LP solution to the current LP solution $x^*$. It identifies which integer hyperplane, $x_j = \lfloor x_j^* \rfloor$ or $x_j = \lceil x_j^* \rceil$, is intersected first along this ray. The variable defining that hyperplane is selected for branching. This approach can be seen as a geometric way to decide which variable will influence the search path as soon as possible.

Pseudocost. This strategy uses historical data, called pseudocosts, to guide branching. For each variable, it records the average objective improvement caused by previous up- or down-branching decisions. The variable and branching direction with the highest expected improvement are selected. This method also considers the current fractionality of the variable to refine the choice. It is widely used due to its balance between accuracy and efficiency.

Vectorlength. This method is inspired by set-partitioning problems. It evaluates the trade-off between how much rounding a variable is expected to degrade the objective, and how many constraints the variable appears in. The selected variable minimizes the ratio between the expected degradation and its constraint count. This helps prioritize variables that have a broad structural impact while limiting damage to the objective.

To guide our learned diving score function, we use variable-level features that are inspired by those employed in existing human-designed diving heuristics. These include 13 features in total, which are listed and described in Table 5.

Table 5: Description of the 13 input features used in the diving score function.
Feature Name Feature Description
mayrounddown Boolean; indicates whether the variable can be rounded down while maintaining feasibility.
mayroundup Boolean; indicates whether the variable can be rounded up while maintaining feasibility.
candsfrac Float; fractional part of the variable’s value in the LP relaxation, i.e., $|x_j^* - \lfloor x_j^* \rfloor|$.
candsol Float; value of the variable in the current LP relaxation solution.
nlocksdown Integer; number of down-locks, i.e., constraints that would be violated by decreasing the variable.
nlocksup Integer; number of up-locks, i.e., constraints that would be violated by increasing the variable.
obj Float; coefficient of the variable in the objective function.
objnorm Float; Euclidean norm of the objective function coefficient vector.
pscostdown Float; pseudocost for decreasing the variable’s value.
pscostup Float; pseudocost for increasing the variable’s value.
rootsolval Float; value of the variable in the LP relaxation at the root node.
nNonz Integer; number of nonzero entries in the variable’s column in the constraint matrix.
isBinary Boolean; TRUE if the variable is binary, i.e., has domain $\{0, 1\}$.

Appendix B More experiment details

In all experiments, we evaluate the performance of agents driven by GPT-4o mini across the various tasks. We run every experiment with three random seeds on an Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz and an NVIDIA A100 GPU.

Note: Since the code for L2DIVE is currently not open-source and its hyperparameters are unavailable, we report the performance of L2DIVE via its ratio to the best human-designed heuristic, as presented in the original article.

B.1 SCIP settings

To construct our Tuned baseline, we incorporated domain knowledge and performed a randomized search over key diving-related parameters in SCIP 7.0.2. The primary parameters that govern the invocation of individual diving heuristics are freq and freqofs. These parameters determine when and how frequently a given diving heuristic is triggered during the branch-and-bound process. By adjusting their values, we can generate diverse solver behaviors that vary the timing and intensity of heuristic application. For each diving heuristic, we independently sampled its configuration by setting freq to one of four values with equal probability: $-1$ (disabled), $\lfloor 0.5 \times \textit{freq}_{\text{default}} \rfloor$ (increased frequency), $\textit{freq}_{\text{default}}$ (default frequency), or $\lfloor 2 \times \textit{freq}_{\text{default}} \rfloor$ (reduced frequency). In parallel, we randomly set freqofs to either zero or its default value, also with equal probability. This approach allows us to sample a wide range of heuristic schedules while maintaining compatibility with established SCIP parameter semantics.
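The sampling procedure above can be sketched as follows. The parameter-name pattern follows SCIP's `heuristics/<name>/freq` convention, but the default values passed in are illustrative rather than SCIP's actual defaults, and applying the sampled configuration to a solver (e.g., via PySCIPOpt's `setParam`) is left out.

```python
import random

def sample_diving_config(defaults, seed=None):
    """Sample one randomized diving-heuristic schedule.

    defaults: dict mapping heuristic name -> (freq_default, freqofs_default);
    these defaults are assumed inputs, not values read from SCIP.
    """
    rng = random.Random(seed)
    config = {}
    for heur, (freq_def, freqofs_def) in defaults.items():
        # freq: one of four settings with equal probability
        config[f"heuristics/{heur}/freq"] = rng.choice(
            [-1, freq_def // 2, freq_def, freq_def * 2]
        )
        # freqofs: either zero or the default offset, with equal probability
        config[f"heuristics/{heur}/freqofs"] = rng.choice([0, freqofs_def])
    return config
```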

B.2 Datasets

We evaluate our method on seven benchmark datasets, including four synthetic combinatorial optimization problems and three real-world MILP tasks. The datasets are widely used in prior work and include:

  • Setcover: A classical combinatorial problem where the objective is to select a minimum number of subsets such that their union covers all elements. Instances are represented as binary matrices with rows corresponding to elements and columns to subsets. Easy instances have 500 rows and 1000 columns, while hard instances increase the size to 2000 rows and 1000 columns.

  • Cauctions: A combinatorial auction problem where bidders submit bids on bundles of items, aiming to maximize total revenue without violating item availability constraints. Easy instances contain 100 items and 500 bids, while hard instances include 300 items and 1500 bids.

  • Facilities: A capacitated facility location problem involving the selection of facility sites and the assignment of customers to minimize facility opening and service costs. Easy instances consist of 100 facilities and 100 customers, whereas hard instances have 100 facilities and 400 customers.

  • Indset: The maximum independent set problem, which seeks the largest possible set of mutually non-adjacent vertices within a graph. Easy instances feature 500 nodes with an affinity of 4, and hard instances have 1500 nodes with the same affinity.

  • LoadBalance: A server load balancing problem arising in distributed systems, modeled as an MILP.

  • NNVerify: A verification problem for neural networks, where constraints encode input-output relationships that must be satisfied.

  • MIPLIB: A diverse collection of real-world and academic instances spanning domains such as scheduling, network design, logistics, and combinatorial optimization. We selected 20 instances for experimental comparison.

The first and second experimental groups are conducted on the four synthetic datasets, focusing on diving performance and solving efficiency. The third group uses the three real-world datasets to demonstrate the effectiveness of our method in practical applications.

Table 6: Used MIPLIB instance names
air05 beasleyC3 binkar10_1 cod105 dano3_3
eil33-2 hypothyroid-k1 istanbul-no-cutoff markshare_4_0 mas76
mc11 mik-250-20-75-4 n5-3 neos-860300 neos-957323
neos-1445765 nw04 piperout-27 pk1 seymour1

Appendix C Experiments for practical MILP tasks.

C.1 Experimental settings

To evaluate the effectiveness and generalization ability of the proposed heuristic framework, we conduct experiments on three representative datasets: LoadBalance, MIPLIB, and NNVerify, which cover a broad range of MILP problem structures. Across all datasets, we adopt two standard performance metrics: the primal-dual integral (PDI), which captures convergence behavior and solution quality over time, and the solving time (T), which measures how quickly a feasible or optimal solution is found. For LoadBalance, we use 100 instances for validation and another 100 for testing, with $T_{\text{limit}} = 3600$ seconds as the standard setting and $T_{\text{limit}} = 1800$ seconds for additional robustness evaluation under tighter budgets. For MIPLIB, we select 20 relatively simple benchmark instances as a test set to evaluate generalization performance on classical MILP formulations; the instance names are listed in Table 6. For NNVerify, we evaluate on 100 testing instances derived from neural network verification problems, using a time limit of $T_{\text{limit}} = 900$ seconds and considering only instances successfully solved within the limit. To isolate the contribution of the learned diving heuristics, we perform all experiments under both cut-selection-enabled and cut-selection-disabled configurations. In all settings, the heuristics are integrated into SCIP, and the best-performing variant is selected on the validation set based on either PDI or solving time before being applied to the testing set, mirroring realistic deployment scenarios.
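As a rough illustration of the PDI metric, the sketch below integrates the relative primal-dual gap over time with a rectangle rule. The sampling scheme and gap normalization are simplifying assumptions for illustration, not SCIP's internal definition.

```python
def primal_dual_integral(samples, t_limit):
    """Approximate the primal-dual integral from bound samples.

    samples: list of (t, primal_bound, dual_bound), sorted by time t;
    each sample is held constant until the next one (rectangle rule).
    """
    pdi = 0.0
    for i, (t, primal, dual) in enumerate(samples):
        t_next = samples[i + 1][0] if i + 1 < len(samples) else t_limit
        denom = max(abs(primal), abs(dual), 1e-9)
        gap = abs(primal - dual) / denom  # relative primal-dual gap
        pdi += gap * (t_next - t)         # accumulate gap over the interval
    return pdi
```

A smaller PDI means the primal and dual bounds close faster, which is why it complements raw solving time as a convergence measure.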

C.2 Experimental results

We compare our method to the default SCIP settings, as shown in Table 7. The experimental results demonstrate that our method achieves competitive improvements across all datasets. With cutting plane selection enabled, our method achieves a 26.1% improvement in the primal-dual integral (PDI) on the LoadBalance dataset. On the NNVerify dataset, our approach more than doubles the solving efficiency regardless of whether cut selection is enabled. For the MIPLIB dataset, our method improves solving efficiency by 36% and achieves a 12% reduction in PDI compared to the baseline.

Table 7: Comparison of solving time and primal-dual integral (PDI) across different methods and datasets. The suffix -off-cut indicates that cutting plane selection is disabled, while -on-cut means cutting plane selection is enabled. All methods fail to obtain the optimal solution on the LoadBalance dataset within the time limit (3600 or 1800 seconds).
Load balancing Neural network verif MIPLIB
Solving time PDI Solving time PDI Solving time PDI
Ours-off-cut 3600 346980.53 72.42 5413.32 263.48 12101.11
Scip-off-cut 3600 347597.70 669.15 38455.17 469.22 18127.57
Ours-on-cut 1800 7305.2 35.67 2744.21 117.67 5599.62
Scip-on-cut 1800 9881.29 137.19 8210.46 184.3 6339.64

Appendix D Prompts

D.1 Prompts design

Refer to caption
Figure 4: The prompts of contextual information of MILP.

Our prompt design adopts a structured and modular format to effectively guide LLMs in performing evolutionary search within the multi-agent evolutionary framework. Each prompt is composed of three essential components, designed to provide the LLM with both domain-specific grounding and a clear operational goal.

As shown in Figure 4, background prompts contain <Introduction of MILP>, <Diving Heuristics>, <Instruction>, <in_out of Heuristic>, and <Features description>. Together, they provide sufficient background knowledge of diving heuristics for the downstream tasks. Prompts in the MA-Evolution System are modular and follow a structured template to ensure consistency across generations, as shown in Figure 5. At the core of each prompt are three elements: (1) the functional role of the agent, which defines the nature of the task (e.g., proposing a new heuristic or reviewing existing code); (2) a formal or semi-formal description of the MILP problem to ground the response in the relevant optimization context; and (3) a specification of the evolutionary operation that informs the agent’s goal in the current generation cycle.

Our evolutionary operation prompts include three main types: initialization, mutation, and crossover. Each type corresponds to a distinct stage in the evolutionary search process and is designed to guide LLMs in generating or improving heuristics for MILP diving.

Refer to caption
Figure 5: The prompts in MA-Evolution System.

Initialization The LLM is instructed to create a new scoring function from scratch. The function should assign a score and a rounding direction to each fractional variable, based only on the LP relaxation and objective function. This stage initializes the population with diverse and problem-aware heuristics.

Example prompt: Please create a new Python scoring function for a Generic Diving Heuristic. The function should assign a score and rounding direction to each fractional variable, using only information from the LP relaxation and the objective function.

Mutation The LLM receives an existing scoring function and modifies it slightly. The modification should be meaningful and aimed at improving performance or exploring nearby variants in the heuristic space. This enables local search around known good solutions.

Example prompt: Please make a small but meaningful change that may improve performance or explore alternative behavior. Ensure the result is syntactically correct and remains within the MILP context.
Original function: [insert code]

Crossover The LLM combines two existing scoring functions into a new one. It should preserve useful components from both parents while ensuring the resulting function is coherent and consistent. This enables global search by recombining successful patterns.

Example prompt: You are creating a new heuristic by combining two existing ones. Please synthesize a scoring function that inherits effective components from both parents while maintaining logical consistency.
Heuristic A: [insert code]
Heuristic B: [insert code]
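The three operations above slot into a standard generational loop. The sketch below only illustrates that control flow: `mutate`, `crossover`, and `fitness` are trivial stand-ins for the LLM-driven operators and the solver-based evaluation used in DHEvo.

```python
import random

def evolve_one_generation(population, fitness, mutate, crossover, rng, elite=2):
    """One generation of an elitist evolutionary loop.

    population: list of heuristics (any comparable representation);
    fitness, mutate, crossover: caller-supplied operators (here assumed
    cheap stubs; in DHEvo these would invoke the LLM agents and SCIP).
    """
    scored = sorted(population, key=fitness, reverse=True)
    next_gen = scored[:elite]                       # keep the best heuristics
    parents = scored[: max(2, len(scored) // 2)]    # breed from the top half
    while len(next_gen) < len(population):
        if rng.random() < 0.5:
            # Mutation: local search around a known good solution
            next_gen.append(mutate(rng.choice(parents)))
        else:
            # Crossover: global search by recombining two parents
            a, b = rng.sample(parents, 2)
            next_gen.append(crossover(a, b))
    return next_gen
```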

D.2 Example

The following is an example of our method applied within DHEvo:

Designer You are a Coder Designer. Hello and welcome to the program. Read the prompt content, propose your algorithm ideas, and provide a detailed implementation procedure, similar to the role of a project manager. The topic is stated as follows: Diving heuristics are one of the most important categories of primal heuristics in the SCIP framework for Mixed Integer Linear Programming (MILP) problems. It starts from the current LP solution and iteratively fixes an integer variable to an integral value and resolves the LP. You should create a totally new Python scoring function for me (different from the heuristics in the literature) to choose the fractional variable and corresponding rounding direction using the information of the LP relaxation and objective function. The function is used for every variable to decide the variable’s score and rounding direction. Specifically, you have these features to use in the score function: "mayrounddown" and "mayroundup" (bool, indicate whether it is possible to round variable down/up and stay feasible, it should be penalized because we need more exploration); "candsfrac" (float, fractional part of solution value of variable); "candsol" (float, solution value of variable in LP relaxation solution); "nlocksdown" and "nlocksup" (int, the number of locks for rounding down/up of a special type); "obj" (float, objective function value of variable); "objnorm" (float, the Euclidean norm of the objective function vector); "pscostdown" and "pscostup" (float, the variable’s pseudo cost value for the given change of the variable’s LP value); "rootsolval" (float, the solution of the variable in the last root node’s relaxation, if the root relaxation is not yet completely solved, zero is returned); "nNonz" (int, the number of nonzero entries in variable); "isBinary" (bool, TRUE if the variable is of binary type). Provide a brief description of the new score function’s logic and its corresponding Python code. 
The description must start with ’<start_des>’ and end with ’</end_des>’. The code must start with ’<start_code>’ and end with ’</end_code>’. The code score function must call ’myheurdiving’ that takes 13 inputs ’mayrounddown’, ’mayroundup’, ’candsfrac’, ’candsol’, ’nlocksdown’, ’nlocksup’, ’obj’, ’objnorm’, ’pscostdown’, ’pscostup’, ’rootsolval’, ’nNonz’, and ’isBinary’. The function must output the ’score’ and ’roundup’, where ’score’ is a float type indicating the variable’s score, the more the better, and the ’roundup’ is a bool type indicating whether we should round the variable up, True for rounding up. Be creative and do not give additional explanations.

Coder Your role is to create a new algorithm based on the original content and Designer‘s needs. The original content is stated as follows: Diving heuristics are one of the most important categories of primal heuristics in the SCIP framework for Mixed Integer Linear Programming (MILP) problems. It starts from the current LP solution and iteratively fixes an integer variable to an integral value and resolves the LP. You should create a new Python scoring function for me (different from the heuristics in the literature) to choose the fractional variable and corresponding rounding direction using the information of the LP relaxation and objective function. The function is used for every variable to decide the variable’s score and rounding direction. Specifically, you have these features to use in the score function: "mayrounddown" and "mayroundup" (bool, indicate whether it is possible to round variable down/up and stay feasible, it should be penalized because we need more exploration); "candsfrac" (float, fractional part of solution value of variable); "candsol" (float, solution value of variable in LP relaxation solution); "nlocksdown" and "nlocksup" (int, the number of locks for rounding down/up of a special type); "obj" (float, objective function value of variable); "objnorm" (float, the Euclidean norm of the objective function vector); "pscostdown" and "pscostup" (float, the variable’s pseudo cost value for the given change of the variable’s LP value); "rootsolval" (float, the solution of the variable in the last root node’s relaxation, if the root relaxation is not yet completely solved, zero is returned); "nNonz" (int, the number of nonzero entries in variable); "isBinary" (bool, TRUE if the variable is of binary type). Provide a brief description of the new score function’s logic and its corresponding Python code. The description must start with ’<start_des>’ and end with ’</end_des>’. The code must start with ’<start_code>’ and end with ’</end_code>’. 
The code score function must call ’myheurdiving’ that takes 13 inputs: ’mayrounddown’, ’mayroundup’, ’candsfrac’, ’candsol’, ’nlocksdown’, ’nlocksup’, ’obj’, ’objnorm’, ’pscostdown’, ’pscostup’, ’rootsolval’, ’nNonz’, and ’isBinary’. The function must output the ’score’ and ’roundup’, where ’score’ is a float type indicating the variable’s score, the more the better, and the ’roundup’ is a bool type indicating whether we should round the variable up, True for rounding up. Be creative and do not give additional explanations. Designer idea: Allow weighting coefficients for each component.

Reviewer You are a code evaluator who tries to evaluate the code, giving relevant comments in the direction of code redundancy and code logic, etc. <code>

Judger You are the final arbiter in the MA-Evolution System. Evaluate the heuristic code and review feedback to determine if it should be accepted, revised, or discarded. Output the final code and description. <historical information >

Appendix E Generated heuristics

This section presents the best heuristics generated by DHEvo.

def myheurdiving(mayrounddown, mayroundup, candsfrac, candsol,
                 nlocksdown, nlocksup, obj, objnorm,
                 pscostdown, pscostup, rootsolval,
                 nNonz, isBinary):
    score = 0.0
    roundup = False
    # Penalize if both rounding options are feasible
    if mayrounddown and mayroundup:
        score = -40
    # Evaluate candidate based on fractional part
    if candsfrac > 0.5:
        score += candsfrac * 80
        roundup = True
        if pscostup > 0.5:
            score += pscostup * 50
    else:
        score += (1 - candsfrac) * 60
        if pscostdown < -0.3:
            score -= abs(pscostdown) * 25
    # Normalize objective contribution
    score += (obj / (objnorm + 1e-6)) * 90
    # Adjust for locking counts
    score += nlocksdown * 25 - nlocksup * 15
    # Reward for non-zero entries and binary variable nature
    if nNonz > 2:
        score += nNonz * 20
    if isBinary:
        score += 50
    return score, roundup
Listing 1: Heuristic for cauctions
def myheurdiving(mayrounddown, mayroundup, candsfrac, candsol, nlocksdown, nlocksup, obj, objnorm, pscostdown, pscostup, rootsolval, nNonz, isBinary):
    score = 0.0
    roundup = False
    # Base score weighted by normalized objective contribution
    score += (obj / (objnorm + 1e-9)) * 5 if objnorm > 0 else 0
    # Penalize rounding options to encourage exploration
    score -= nlocksdown * 7 if mayrounddown else 0
    score -= nlocksup * 7 if mayroundup else 0
    # Favor large fractions away from 0.5 for exploration
    score += abs(candsfrac - 0.5) * 10
    # Adjust score based on solution value and its contribution
    score += (candsol / (1 + abs(rootsolval) * obj)) * 4 if rootsolval != 0 else 0
    # Employ pseudo costs to influence rounding decisions
    if pscostdown < 0 and mayrounddown:
        score += -pscostdown * 3  # Favor rounding down with negative pseudo costs
    if pscostup < 0 and mayroundup:
        score -= -pscostup * 3  # Discourage rounding up with negative pseudo costs
    # Determine rounding direction based on fractional part and exploration potential
    if candsfrac >= 0.7 and mayroundup:
        roundup = True
    elif candsfrac <= 0.3 and mayrounddown:
        roundup = False
    # Encourage solutions with fewer nonzero entries
    score += (1 / (nNonz + 1)) * 2 if nNonz > 0 else 0
    return score, roundup
Listing 2: Heuristic for facility
def myheurdiving(mayrounddown, mayroundup, candsfrac, candsol, nlocksdown, nlocksup, obj, objnorm, pscostdown, pscostup, rootsolval, nNonz, isBinary):
    score = 0.0
    # Strongly penalize feasible rounding options
    if mayrounddown:
        score -= 3.0
    if mayroundup:
        score -= -3.0
    # Incorporate fractional part and objective value
    score += (1.0 - candsfrac) * obj * 0.5 if mayrounddown else 0
    score += candsfrac * obj * 0.5 if mayroundup else 0
    # Adjust with pseudo costs
    score += pscostdown * candsfrac * 1.5 if mayrounddown else 0
    score += pscostup * (1 - candsfrac) * 1.5 if mayroundup else 0
    # Apply less severe penalty for distance from the root solution
    score -= abs(rootsolval - candsol) * 0.1
    # Normalize the score
    if objnorm > 0:
        score /= objnorm
    # Reward more for binary variables
    score += nlocksup * 0.3 - nlocksdown * 0.3
    if isBinary:
        score += 1.0
    # Determine rounding direction
    roundup = (score > 0) and (not isBinary or mayroundup)
    return score, roundup
Listing 3: Heuristic for indset
def myheurdiving(mayrounddown, mayroundup, candsfrac, candsol, nlocksdown, nlocksup, obj, objnorm, pscostdown, pscostup, rootsolval, nNonz, isBinary):
    score = 0.0
    # Penalties for feasible rounding options to promote exploration
    if mayrounddown:
        score -= 10.0
    if mayroundup:
        score -= 10.0
    # Favor fractional values at extremes (0 or 1)
    score += (1 - abs(candsfrac - 0.5)) * 30.0
    # Normalize impact of the objective function
    score += (obj / (objnorm + 1e-5)) * 0.5
    # Include pseudo cost adjustments for better decision-making
    score += pscostup if mayroundup else 0.0
    score -= pscostdown if mayrounddown else 0.0
    # Integrate root solution value adjusted by variable complexity
    score += rootsolval / (nNonz + 1)
    # Amplify score for binary variables to encourage decisive rounding
    if isBinary:
        score *= 2.0
    # Determine rounding direction based on computed score and pseudo costs
    roundup = (mayrounddown and (pscostup <= pscostdown or not mayrounddown))
    return score, roundup
Listing 5: Heuristic for setcover