
DHEvo: Data-Algorithm Based Heuristic Evolution for Generalizable MILP Solving

Zhihao Zhang (Harbin Institute of Technology), Siyuan Li (Harbin Institute of Technology, Nanyang Technological University), Chenxi Li (Harbin Institute of Technology), Feifan Liu (Harbin Institute of Technology), Mengjing Chen (Huawei Noah’s Ark Lab), Kai Li (Huawei Noah’s Ark Lab), Tao Zhong (Huawei Noah’s Ark Lab), Bo An (Nanyang Technological University), Peng Liu (Harbin Institute of Technology)
Abstract

Primal heuristics play a critical role in improving the efficiency of mixed-integer linear programming (MILP) solvers. As large language models (LLMs) have demonstrated superior code generation abilities, recent MILP works have leveraged evolutionary computation approaches with LLMs to generate effective primal heuristics. Although the generated heuristics achieve better solving performance than hand-crafted ones, which offer little adaptability, the advantage of current LLM-based methods is limited to a few MILP instances in one problem class, as they fail to capture the instance characteristics of the problem class (the MILP instances generated from the same mathematical model are defined as a problem class). Since MILP instances often differ significantly in structure and feature distribution, neglecting their characteristics in the evolution process results in poor generalization within the same problem class. To overcome this challenge, we propose a data-algorithm co-evolution framework (DHEvo) that iteratively selects representative instances and evolves corresponding heuristics. Starting from the initial instance distribution, we develop an LLM-based multi-agent system to generate data-code pairs simultaneously. These data-code pairs are iteratively refined based on their fitness scores, leading to the identification of the most effective heuristic over the entire problem class. Extensive experiments across diverse MILP benchmarks demonstrate that our approach significantly outperforms both human-designed heuristics and existing LLM-based methods.

1 Introduction

Mixed-Integer Linear Programming (MILP) constitutes a fundamental modeling and solution framework in operations research. It has been widely applied to a broad range of real-world problems, including supply chain optimization (liu2008tsp; jeong2019biodiesel; jokinen2015milp), hardware design (ma2019accelerating; hafer1991constraint), production scheduling (chen2010integrated; caumond2009milp; superchi2024optimization), and energy management (chang2004practical; kassab2024optimal; zare2024efficient). In practical applications, a complex MILP problem is often defined by numerous parameters, such as cost coefficients, constraints, and bounds. These can all be mathematically represented as:

$$z^{\dagger} := \min_{x \in P^{\dagger}} c^{\top}x, \qquad P^{\dagger} = \left\{ x \in \mathbb{R}^{n} \mid Ax \leq b,\ \underline{\pi} \leq x \leq \overline{\pi},\ x_{j} \in \mathbb{Z}\ \forall j \in \mathcal{I} \right\},$$

where $M^{\dagger} := (c, P^{\dagger})$, $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^{m}$, $c, x \in \mathbb{R}^{n}$, $\underline{\pi}, \overline{\pi} \in \mathbb{R}_{\infty}^{n}$, and $\mathcal{I} \subseteq \{1, \dots, n\}$ indexes the integer-constrained variables.
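As a concrete illustration, the components of this formulation can be encoded directly. The following is a minimal sketch with hypothetical data (a two-variable toy instance, not taken from the paper) that tests membership in $P^{\dagger}$:

```python
# Toy instance of the MILP form above (hypothetical data):
# min c^T x  s.t.  Ax <= b,  lower <= x <= upper,  x_j integral for j in I.
c = [1.0, 2.0]
A = [[1.0, 1.0]]          # single constraint row: x0 + x1 <= 3
b = [3.0]
lower, upper = [0.0, 0.0], [5.0, 5.0]
int_idx = [0]             # I = {0}: x0 must take an integer value

def is_feasible(x, tol=1e-6):
    """Membership test for the feasible region P^dagger defined above."""
    rows_ok = all(sum(a_i * x_i for a_i, x_i in zip(row, x)) <= b_k + tol
                  for row, b_k in zip(A, b))
    bounds_ok = all(lo - tol <= x_i <= up + tol
                    for lo, x_i, up in zip(lower, x, upper))
    integral_ok = all(abs(x[j] - round(x[j])) <= tol for j in int_idx)
    return rows_ok and bounds_ok and integral_ok

def objective(x):
    """Objective value c^T x."""
    return sum(c_i * x_i for c_i, x_i in zip(c, x))
```

For instance, `[1.0, 1.5]` is feasible (x0 is integral, the constraint and bounds hold), while `[0.5, 0.5]` violates integrality of x0.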

In real-world settings, each instance corresponds to a unique and specific problem configuration, even if it originates from the same MILP model. Moreover, the structural and statistical characteristics of these instances can vary significantly, leading to large intra-class diversity. Therefore, well-designed primal heuristics (wong1984worst; balas2004pivot; berthold2006primal; wallace2010zi; witzig2021conflict), which aim to find feasible solutions quickly, are important not only for improving solver efficiency but also for achieving robust generalization across instances within the same problem class.

Current heuristic design approaches for MILP can be broadly categorized into two types: (i) manually designed and (ii) automatically generated in an adaptive manner. Manually designed heuristics, such as those implemented in solvers like SCIP (scip) and Gurobi (gurobi), rely heavily on expert knowledge and often struggle to adapt to novel problem instances. The alternative paradigm is the dynamic generation of heuristic algorithms, which mainly combines large language models (LLMs) with evolutionary computation (EC) to automate heuristic design (yang2023large; meyerson2024language; chen2023evoprompting; romera2024mathematical; liu2024systematic; liu2024evolution; zhou2024llm4solver). These approaches aim to leverage the reasoning capabilities of LLMs to generate more effective and adaptive heuristics. Nevertheless, current methods mostly perform the evolutionary computation process on only a few instances. Consequently, the generated algorithms fail to capture the shared instance characteristics of the problem class, leading to insufficient representational ability and poor generalization across instances within the same problem class, as observed in our ablation study. In summary, both manually crafted and automatically generated heuristics struggle to balance efficiency with generalization, which stems from their limited ability to model the instance distribution and capture shared structural patterns across different instances.

To address this, we propose a data-algorithm co-evolution framework (DHEvo) that generates generalizable algorithms by iteratively evolving both the MILP data and the algorithms. We start by randomly sampling instances from a domain-specific dataset and using an LLM-based multi-agent evolution system (MA-Evolution System) to create initial data-code pairs. Inspired by few-shot learning (fewshot1; fewshot2; fewshot3; fewshot4), we assume that data-code pairs with the highest fitness (relative primal gap) are more likely to generalize well across instances within a problem class. Therefore, the pairs with the highest fitness scores are selected to form the initial population for further evolution. This process is iterated over multiple generations, gradually narrowing down to the most representative algorithms and instances.

Figure 1: Illustration of the data-algorithm co-evolution framework (DHEvo).

In summary, our contributions are as follows:

  • We propose a unified evolutionary computation framework based on data-algorithm co-evolution. It enables better approximation of the instance distribution and enhances the representational capacity of the learned heuristics, leading to improved generalization.

  • We further introduce a co-evolutionary solution tailored to MILP tasks by incorporating a multi-agent evolution system (MA-Evolution System). It refines the evolutionary process, minimizing generation errors and optimizing the quality of the resulting heuristics.

  • Extensive experiments show that our method significantly improves the generalization of diving heuristics and delivers substantial performance gains across multiple MILP datasets.

2 Background and related works

2.1 Branch-and-bound and the diving heuristic

A common approach to solving MILP problems is Branch-and-Bound (B&B) (land2009automatic), which recursively builds a search tree by selecting variables and partitioning the problem into subproblems. These subproblems are created by adding constraints based on the fractional value of the variable in the LP relaxation solution. B&B also prunes branches using objective bounds to improve efficiency. However, B&B can be computationally expensive for large-scale problems, so primal heuristics like the diving heuristic are often used to accelerate the search. Diving performs a depth-first search by iteratively rounding variables and solving the linear program until a feasible solution is found or infeasibility is proven. While existing diving heuristics are effective, they often require manual tuning and expert knowledge to design. In contrast, our approach uses evolutionary computation to automatically generate problem-specific heuristics, offering more flexibility, adaptability, and reduced reliance on expert knowledge. Our experiments show that this approach significantly improves solver performance across multiple datasets.
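The diving loop described above can be sketched as follows. This is a simplified illustration, not the paper's implementation: `solve_lp` is a hypothetical stand-in for the solver's LP-relaxation call (a real implementation would use SCIP's LP interface and would also handle infeasibility), and the variable-selection rule shown is plain fractional diving.

```python
import math

def solve_lp(fixed):
    """Stub LP relaxation (hypothetical): returns a fractional solution,
    honoring any variables already fixed by the dive."""
    base = {"x1": 0.4, "x2": 0.7, "x3": 1.0}
    sol = dict(base)
    sol.update(fixed)
    return sol

def diving_heuristic(int_vars, max_depth=10):
    """Generic diving loop: repeatedly round the most fractional integer
    variable and re-solve the LP until the solution becomes integral."""
    fixed = {}
    for _ in range(max_depth):
        sol = solve_lp(fixed)
        # collect integer variables that are still fractional
        frac = {v: sol[v] for v in int_vars
                if v not in fixed and abs(sol[v] - round(sol[v])) > 1e-6}
        if not frac:                     # all integral -> feasible dive found
            return sol
        # fractional-diving rule: pick the variable closest to 0.5 fractionality
        v = max(frac, key=lambda k: min(frac[k] - math.floor(frac[k]),
                                        math.ceil(frac[k]) - frac[k]))
        fixed[v] = round(frac[v])        # round and descend one level
    return None                          # depth limit reached without success
```

With the stub data above, the dive fixes `x1` to 0, then `x2` to 1, and terminates with an integral solution.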

2.2 Performance measurement

To evaluate the performance of MILP solvers, we use several key performance metrics: Primal-Dual Gap, Primal-Dual Integral, and Primal Gap.

Primal-Dual Gap It is a widely used measure that quantifies the difference between the primal objective value and the dual objective value at any given time during the optimization process. It gives an indication of how close the current solution $\tilde{z}$ is to an optimal solution $\tilde{z}^{*}$. Mathematically, the Primal-Dual Gap is defined as:

$$\gamma_{pd}(\tilde{z}, \tilde{z}^{*}) = \begin{cases} \dfrac{|\tilde{z} - \tilde{z}^{*}|}{\max(|\tilde{z}|, |\tilde{z}^{*}|)} & \text{if } 0 < \tilde{z}, \tilde{z}^{*} < \infty, \\ 1 & \text{otherwise.} \end{cases}$$

Primal-Dual Integral While the primal-dual gap captures a snapshot at a particular time, the primal-dual integral evaluates the solver’s progress over the entire solving process by aggregating the primal-dual gap over time. It is given by:

$$\gamma_{pdi}(t) = \int_{0}^{t} \gamma_{pd}(\tilde{z}(\tau), \tilde{z}^{*}(\tau))\, d\tau,$$

where $\gamma_{pd}(\tilde{z}(\tau), \tilde{z}^{*}(\tau))$ represents the Primal-Dual Gap at time $\tau$.

Primal Gap It is used to evaluate the effectiveness of diving heuristics, which primarily aim to improve primal performance by guiding the search toward better feasible solutions. The relative primal gap is defined as the absolute difference between the current objective value $\tilde{z}$ and the optimal value $z^{\dagger}$, normalized by the objective value of the optimal solution. The formula for the primal gap is given by:

$$\gamma_{p}(\tilde{z}) = \frac{|\tilde{z} - z^{\dagger}|}{|z^{\dagger}|},$$

where $z^{\dagger}$ is the objective value of the optimal solution obtained after presolving. In the case where $|z^{\dagger}| = 0$, we use the following modified primal gap:

$$\gamma'_{p}(\tilde{z}) = |\tilde{z} - z^{\dagger}|.$$
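Under the definitions above, the three metrics can be computed as follows. This is a straightforward sketch: the integral is approximated from a sampled time series of incumbent and bound values (the discretization is an implementation choice, not from the paper).

```python
def primal_dual_gap(z, z_star):
    """Primal-dual gap; returns 1 unless both values are finite and positive."""
    if 0 < z < float("inf") and 0 < z_star < float("inf"):
        return abs(z - z_star) / max(abs(z), abs(z_star))
    return 1.0

def primal_gap(z, z_dagger):
    """Relative primal gap against the (presolved) optimal value z_dagger,
    falling back to the absolute difference when |z_dagger| = 0."""
    if abs(z_dagger) == 0:
        return abs(z - z_dagger)
    return abs(z - z_dagger) / abs(z_dagger)

def primal_dual_integral(times, primal_vals, dual_vals):
    """Left-endpoint Riemann approximation of the primal-dual integral
    over a sampled trajectory of (time, primal, dual) values."""
    total = 0.0
    for i in range(len(times) - 1):
        total += primal_dual_gap(primal_vals[i], dual_vals[i]) * (times[i + 1] - times[i])
    return total
```

For example, a primal value of 10 against a dual bound of 8 yields a primal-dual gap of 0.2, and a negative primal value falls through to the default of 1.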

2.3 LLM for evolutionary computation

Evolutionary computation is a widely used method for solving optimization problems inspired by natural evolution (back1997handbook). Given an optimization problem, evolutionary algorithms (zhou2019evolutionary; eiben2015evolutionary) treat each candidate solution as an individual. They iteratively apply operations such as parent selection, crossover, mutation, fitness evaluation, and survivor selection to evolve the population toward better solutions.

In recent years, the capabilities of large language models have advanced significantly (naveed2023comprehensive). Researchers have started exploring the integration of LLMs into evolutionary computation frameworks to automatically generate heuristic algorithms (liu2024systematic; zhang2024understanding; wu2024evolutionary). For example, FunSearch (romera2024mathematical) developed an LLM-based evolutionary computation framework that iteratively improves the quality of generated functions. EoH (liu2024evolution) further integrates thoughts with code to generate more effective algorithms and has achieved promising results on problems such as online bin packing. However, current methods typically train on a limited set of specific instances, which prevents large language models from learning the common characteristics of the problem class. As a result, the generated algorithms excel on similar instances but generalize poorly to other instances within the same class.

3 Method

In this section, we introduce the problem setting and our key insights in subsection 3.1, and then present the proposed data-algorithm co-evolution framework in subsection 3.2. Finally, we detail the implementation process, including the specific evolutionary operation and the design of the MA-Evolution System in subsection 3.3.

3.1 Problem formulation and insight

In real-world MILP tasks, instances within the same problem class can differ significantly in their distributions, constraints, and structures. However, they often share common features, such as similar types of constraints, variable ranges, or recurring patterns in the objective function. As shown in Figure 2, we visualize 17 typical features, including variable-constraint ratios, from four combinatorial optimization datasets (detailed results can be found in the appendix). How to extract these shared features within a problem class to improve algorithm performance has become a key focus of LLM-based evolutionary computation research.

Figure 2: Visualization of instance features via t-SNE. Panel a shows the overall distribution of instances across four datasets, while panels b, c, d, and e show the distributions for Cauctions, Setcover, Facilities, and Indset, respectively. The Setcover (orange) and Facilities (green) datasets appear as single points due to their fixed numbers and proportions of variables and constraints.

Current evolutionary computation methods typically evaluate heuristics by averaging their performance over a small number of randomly selected instances. Specifically, given a heuristic algorithm $h$ and a set of instances $\mathcal{I} = \{I_1, I_2, \dots, I_k\}$, the fitness of the heuristic is calculated as:

$$\mathcal{F}(h) = \frac{1}{k} \sum_{i=1}^{k} \mathrm{Perf}(h, I_{i}),$$

where $\mathrm{Perf}(h, I_{i})$ represents the performance of the heuristic $h$ on instance $I_{i}$.

This method assumes that all instances are equally representative, which is often not the case in MILP problems, where instance structure can vary significantly. To account for this, we introduce the variance of performance $\sigma^{2}(h)$ over a broader set of instances $\mathcal{D}$. A high $\sigma^{2}(h)$ indicates that the heuristic performs inconsistently across different instances, which suggests poor generalization. Inspired by few-shot learning (fewshot1; fewshot2; fewshot3; fewshot4), extensive research (wu2017towards; akbari2021does; jiang2019fantastic) has shown that starting with "simple" or "representative" samples in complex datasets often enhances both learning efficiency and generalization. In this context, to identify the most representative instances, we hypothesize that instances with higher fitness scores in the evolutionary computation process are likely to exhibit greater structural representativeness. The rationale behind this assumption is as follows:
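For concreteness, the mean fitness $\mathcal{F}(h)$ and the variance $\sigma^{2}(h)$ discussed above can be computed as follows (a minimal sketch; `perf` stands for any scalar performance measure, such as the negative relative primal gap):

```python
import statistics

def fitness_stats(heuristic, instances, perf):
    """Mean fitness F(h) and population variance sigma^2(h) of a heuristic
    over a set of instances; `perf(h, I)` returns a scalar score."""
    scores = [perf(heuristic, I) for I in instances]
    return statistics.mean(scores), statistics.pvariance(scores)
```

A heuristic with a high mean but a large variance performs well only on a subset of instances, which is exactly the inconsistency the variance term is meant to expose.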

Insight 1: High fitness scores indicate structural representativeness. Let $I$ be an MILP instance with LP relaxation $I_{LP}$, and let $\Delta(I) = |z^{*}_{LP} - z^{*}|$ denote the integrality gap of $I$, where $z^{*}_{LP}$ and $z^{*}$ are the optimal objective values of the LP relaxation and the integer program, respectively. If the integrality gap is small, i.e., $\Delta(I) \leq \epsilon$, then heuristics trained on $I$ will achieve lower variance $\sigma^{2}(h)$ on instances with similar LP relaxation tightness $\Delta(I)$. This suggests that higher fitness scores, which often correspond to tighter LP relaxations, indicate a more representative structure that enables better generalization across similar instances.

Insight 2: Regular feasible regions enhance heuristic stability. Let $\mathcal{R}(I)$ denote the feasible region of instance $I$, and let $\nu(I)$ be a measure of the irregularity of $\mathcal{R}(I)$, such as the number of disconnected components or redundant constraints. If $\nu(I)$ is small, then heuristics trained on $I$ will perform more consistently on similar instances with small $\nu(I)$. This suggests that instances with more regular feasible regions are more stable, and thus their heuristics are likely to generalize better.

3.2 Data-Algorithm based heuristic evolution framework

As illustrated in Figure 1, our framework adopts a structured evolutionary process that tightly couples instance selection with heuristic generation and optimization. This process is designed to progressively improve both the solution quality and the generalization capability of the generated heuristics through data-algorithm co-evolution.

Algorithm 1 DHEvo framework
1: Input: problem distribution $\mathcal{D}$, population size $m$, number of instances $n$, top-$k$, total iterations $T$
2: Output: final heuristic algorithm population $\mathcal{H}^{\text{final}}$
3: Initialization: sample initial instance set $\mathcal{I}_0 = \{I_1, \dots, I_n\} \sim \mathcal{D}$
4: Generate initial algorithm population $\mathcal{H}_0 = \{h_1^{(0)}, \dots, h_m^{(0)}\}$ via the MA-Evolution System (MA-E)
5: for each $h_j^{(0)} \in \mathcal{H}_0$ do
6:     Evaluate $f(h_j^{(0)}) = \mathrm{Perf}(h_j^{(0)}, I_j)$
7: end for
8: for each $I_i \in \mathcal{I}_0$ do
9:     Generate $\mathcal{H}_i^{(1)} = \text{MA-E}(I_i, \mathcal{H}_0)$
10:    Evaluate fitness for each $h \in \mathcal{H}_i^{(1)}$
11:    Select top performer $h_i^{*} = \arg\max_{h} \mathrm{Perf}(h, I_i)$
12: end for
13: Let $\mathcal{P}^{*} \leftarrow$ top-$k$ pairs in $\mathcal{P}^{(1)}$ ranked by $\mathrm{Perf}(I_i, h_i^{*})$
14: for iteration $t = 2$ to $T$ do
15:    Re-Initialization:
16:    for each $(I_j, h_j^{*}) \in \mathcal{P}^{*}$ do
17:        Generate new candidates via prompt: $\mathcal{H}_j^{(t)} = \text{MA-E}(I_j, \mathrm{Prompt}(h_j^{*}))$
18:        Evaluate each $h \in \mathcal{H}_j^{(t)}$ and select $h_j^{(t)} = \arg\max_{h} \mathrm{Perf}(h, I_j)$
19:    end for
20:    Update $\mathcal{P}^{(t)} = \{(I_j, h_j^{(t)})\}$ and sample top-$k$ pairs $\mathcal{P}^{*} \leftarrow \mathrm{Sample}(\mathrm{Perf}(h, I_i))$
21: end for
22: Final Selection:
23: Compute $f_{\text{avg}}(h_j^{(T)}) = \frac{1}{k} \sum_{i=1}^{k} \mathrm{Perf}(h_j^{(T)}, I_i)$
24: Select top heuristics $\mathcal{H}^{\text{final}} = \{h_j^{(T)}\}_{j=1}^{k}$
25: return $\mathcal{H}^{\text{final}}$

We assign fitness scores to each instance individually rather than averaging them across the dataset, since averaging allows complex or unrepresentative instances to dominate the optimization process and ultimately undermines the generalization ability of the learned heuristics. As shown in Algorithm 1, the framework begins by selecting a subset of high-quality instance-heuristic pairs $(\mathcal{I}^{*}, \mathcal{H}^{*})$, where each heuristic in $\mathcal{H}^{*}$ demonstrates strong performance on its corresponding instance in $\mathcal{I}^{*}$. These pairs are identified by evaluating all candidate heuristics against a sampled MILP dataset and ranking them based on an instance-wise fitness score. Instances with higher scores indicate better structural representativeness for the problem class.

To balance exploration and exploitation during selection, we employ a temperature-controlled strategy. The retention probability 𝒮𝒮\mathcal{S}caligraphic_S for each candidate is modeled by a sigmoid function with temperature parameter T𝑇Titalic_T:

$$\mathcal{S} = \frac{1}{1 + \exp\left(-\frac{1}{T}\right)}.$$

Higher temperatures encourage diversity by favoring a broader range of candidates, while lower temperatures promote exploitation of already promising heuristics. Once the top-performing pairs $(\mathcal{I}^{*}, \mathcal{H}^{*})$ are selected, the heuristics in $\mathcal{H}^{*}$ are used to extract prompts for generating new candidates via our MA-Evolution System, a multi-agent LLM-based generation and evaluation module. These new heuristics are then re-evaluated on the structurally representative instances from $\mathcal{I}^{*}$, ensuring alignment with the overall problem distribution.
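As a quick check of this behavior, the retention probability can be evaluated directly. Note that, as written, $\mathcal{S}$ depends only on $T$: it approaches 1 at low temperatures (strong exploitation) and 0.5 at high temperatures (near-random retention).

```python
import math

def retention_probability(T):
    """Sigmoid retention probability from the text: S = 1 / (1 + exp(-1/T))."""
    return 1.0 / (1.0 + math.exp(-1.0 / T))
```

For example, at T = 0.1 the probability is essentially 1, while at T = 100 it is close to 0.5.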

This cycle—comprising heuristic generation, evaluation, selection, and re-initialization—is repeated for a fixed number of iterations. Through this iterative process, the framework incrementally converges toward a set of heuristics that exhibit both strong instance-level performance and robust generalization across the problem class.

At the final stage, the top k𝑘kitalic_k instance-heuristic pairs are fixed, and their aggregated performance is assessed. The resulting heuristics constitute the final evolved algorithm portfolio for solving MILP problems.
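Under hypothetical interfaces for the multi-agent generator and the performance evaluator (neither signature is specified in the paper), the overall loop of Algorithm 1 can be sketched as:

```python
def dhevo(instances, init_population, ma_evolve, perf, k=3, iterations=3):
    """Sketch of the DHEvo co-evolution loop. `ma_evolve(instance, seeds)`
    returns candidate heuristics for one instance; `perf(h, inst)` scores a
    heuristic on one instance (higher is better)."""
    # round 1: evolve one champion heuristic per sampled instance
    pairs = []
    for inst in instances:
        cands = ma_evolve(inst, init_population)
        best = max(cands, key=lambda h: perf(h, inst))
        pairs.append((inst, best))
    # keep the top-k instance-heuristic pairs by instance-wise fitness
    pairs = sorted(pairs, key=lambda p: perf(p[1], p[0]), reverse=True)[:k]
    # subsequent rounds: re-initialize from the surviving champions
    for _ in range(iterations - 1):
        new_pairs = []
        for inst, h in pairs:
            cands = ma_evolve(inst, [h])
            new_pairs.append((inst, max(cands, key=lambda c: perf(c, inst))))
        pairs = sorted(new_pairs, key=lambda p: perf(p[1], p[0]), reverse=True)[:k]
    # final selection: rank survivors by average fitness on the kept instances
    kept = [p[0] for p in pairs]
    return sorted((h for _, h in pairs),
                  key=lambda h: sum(perf(h, i) for i in kept) / len(kept),
                  reverse=True)
```

With toy numeric "heuristics" and a quadratic score, the loop returns one well-matched heuristic per retained instance, mirroring the data-algorithm co-evolution described above.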

Figure 3: Illustration of MA-Evolution System.

3.3 Framework implementation

Evolution operation Our evolutionary framework comprises four main operations: initialization, crossover, mutation, and parent selection. As shown in Figure 3, we implement initialization, crossover, and mutation through tailored prompts and use them to generate candidate individuals. Unlike traditional LLM-based evolutionary approaches, we leverage the MA-Evolution System to perform both crossover and mutation, enabling more targeted and problem-aware generation of new individuals. Specific initialization prompts are employed only in the first generation to produce the initial population; in subsequent generations, the initialization step reuses high-quality algorithms obtained from the previous round. During crossover, parent heuristics are combined via dedicated prompts to create new candidate algorithms, and mutation then slightly alters these candidates to explore nearby solutions. To balance exploration and exploitation during parent selection, we adopt fitness-proportional selection (zhou2019evolutionary, ), assigning selection probabilities to individuals based on their fitness scores.

MA-Evolution System To generate high-quality heuristics, we propose a multi-agent evolution system inspired by multi-agent systems (mad1, ; mad2, ; mad3, ; mad4, ). As shown in Figure 3, the process includes three stages. In the first stage, the Designer agent receives the MILP task context, the existing code, and the specified evolutionary operation, and produces a high-level design plan and procedural outline for a new heuristic. In the second stage, the Coder agent implements the algorithm based on the Designer's plan; the Reviewer agent then checks the code by compiling it and performing logical analysis, providing feedback and suggestions, and the two agents iteratively improve the code over several rounds of interaction. In the final stage, if no consensus is reached, the Judge agent reviews the full interaction history and feedback and makes a final decision on the output code and its description.
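The three-stage Designer → Coder ⇄ Reviewer → Judge pipeline can be sketched as follows. The `llm(role, prompt)` interface and the "OK" consensus signal are hypothetical simplifications; the actual system exchanges richer structured messages:

```python
def ma_evolution(task_context, parent_code, operation, llm, max_rounds=3):
    """Sketch of the MA-Evolution pipeline (hypothetical llm interface).

    llm(role, prompt) -> str. Role names follow the paper; the prompt
    templates and the "OK" consensus convention are placeholders.
    """
    # Stage 1: the Designer drafts a high-level plan.
    plan = llm("Designer",
               f"Task: {task_context}\nOperation: {operation}\nCode: {parent_code}")
    # Stage 2: the Coder implements; the Reviewer critiques iteratively.
    code = llm("Coder", f"Implement this plan:\n{plan}")
    for _ in range(max_rounds):
        review = llm("Reviewer", f"Check this code:\n{code}")
        if "OK" in review:            # consensus reached
            return code
        code = llm("Coder", f"Revise per feedback:\n{review}\n{code}")
    # Stage 3: no consensus, so the Judge makes the final call.
    return llm("Judge", f"Pick the final version given the history:\n{code}")
```

The early return on Reviewer approval keeps LLM calls bounded, while the Judge fallback guarantees the pipeline always emits a candidate heuristic.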

Prompt engineering Our prompts are constructed from three essential elements: the designated role of the large language model within the MA-Evolution System, contextual information about the MILP problem, and evolution-specific operations intrinsic to evolutionary computation, such as mutation. The full prompt information is presented in the appendix.
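Combining the three elements into a prompt can be sketched as below; the template wording is illustrative, not the paper's actual prompt (which is given in the appendix):

```python
def build_prompt(role: str, milp_context: str, operation: str) -> str:
    """Assemble a prompt from the three elements described above.

    Hypothetical template: role in the MA-Evolution System, MILP background,
    and the evolutionary operation to perform.
    """
    return (
        f"You are the {role} in a multi-agent heuristic-design system.\n"
        f"MILP background: {milp_context}\n"
        f"Evolutionary operation to perform: {operation}\n"
    )
```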

4 Experiments

4.1 Experimental settings

To demonstrate the superiority of our method on the diving task, we conduct three sets of experiments across six MILP datasets. 1) The first set studies the diving performance of our method and compares it against existing diving heuristics. 2) The second set evaluates the solving efficiency of our method against default SCIP (achterberg2007constraint, ), tuned SCIP, and EoH (liu2024evolution, ). 3) The third set evaluates the effectiveness of our method on real-world datasets: we replace the default diving heuristics in the solver with our proposed method to improve overall performance on practical MILP tasks. The results of this benchmark experiment are reported in the Appendix.

4.2 Experiments for diving performance

Experimental setup In this first set of experiments, we evaluate a total of 12 publicly available diving methods on four combinatorial optimization benchmarks: cauctions, setcover, facilities, and indset. These heuristics include seven human-designed algorithms integrated into the open-source solver SCIP (i.e., coefficient, fractional, linesearch, pseudocost, distributional, vectorlength, and farkas (witzig2021conflict, )), a learning-based GNN method L2DIVE (paulus2023learning, ), and four heuristics generated via large language models using evolutionary computation: LLM4Solver (zhou2024llm4solver, ), FunSearch (romera2024mathematical, ), EoH (liu2024evolution, ), and HillClimb (HillClimb, ).

Table 1: The average relative primal gap (standard error in parentheses) of different diving heuristics. The results compare our method with other LLM-based evolutionary approaches, as well as the human-designed and learning-based baselines.
Category Method Cauctions Facilities Setcover Indset
LLM-based Evolution DHEvo(Ours) 1.92 (2.45) 0.70 (1.40) 9.74 (7.35) 1.12 (1.31)
LLM4Solver (zhou2024llm4solver, ) 2.50 (3.50) 0.85 (1.42) 18.33 (19.26) 1.13 (1.15)
Funsearch (romera2024mathematical, ) 3.04 (7.35) 1.18 (3.06) 77.99 (83.89) 1.61 (3.75)
HillClimb (HillClimb, ) 6.10 (60.30) 0.75 (1.40) 81.55 (343.17) 1.61 (3.75)
EoH (liu2024evolution, ) 3.15 (3.15) 0.80 (1.47) 20.39 (19.70) 0.92 (1.06)
Hand-crafted Heuristics Coefficient 23.67 (2.14) 3.20 (3.76) 68.58 (345.99) 4.23 (14.42)
Distributional 47.80 (71.56) 1.46 (2.12) 75.79 (325.90) 2.57 (10.59)
Farkas 23.32 (0.89) 1.04 (1.64) 8.13 (8.22) -
Pseudocost 22.51 (2.30) 1.06 (1.23) 23.56 (30.31) 3.31 (2.98)
Linesearch 22.95 (0.90) 13.80 (10.94) 68.59 (346.00) 3.31 (3.10)
Vectorlength 42.93 (83.57) 13.93 (10.61) 68.59 (346.01) 8.89 (7.61)
Learning-based L2DIVE (paulus2023learning, ) 2.60 0.71 3.58 1.37

To independently investigate the quality of heuristic algorithms, diving is applied solely at the root node of each instance; all other components of the solver, such as branching, cutting planes, and other primal heuristics, are disabled. For fitness evaluation during evolution, we generate 50 instances each for setcover, cauctions, and indset, and 25 instances for facilities, and we test the discovered diving heuristics on 100 instances each. For fairness, all LLM-based evolutionary algorithms are trained on the same dataset mentioned above and use the same LLM API interface. Moreover, we ensure that the prompts used for these methods are aligned in terms of task context, including MILP-specific background and diving-related objectives, to match the design of our proposed method.
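The relative primal gap used throughout the evaluation can be computed as below. This follows a common convention (e.g., gap = 100 when the objective signs differ, 0 when both are zero); the paper's exact normalization may differ slightly:

```python
def relative_primal_gap(obj: float, best_known: float) -> float:
    """Relative primal gap in percent, under a common convention.

    Returns 0 when both objectives are zero, 100 when the signs differ,
    and |obj - best| / max(|obj|, |best|) * 100 otherwise. This is one
    standard definition, not necessarily the paper's exact one.
    """
    if obj == best_known == 0.0:
        return 0.0
    if obj * best_known < 0:
        return 100.0
    return abs(obj - best_known) / max(abs(obj), abs(best_known)) * 100.0
```

Averaging this quantity over the test instances gives the "average relative primal gap" reported in Table 1.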

Experimental results We compare our method to various baselines in Table 1. Under the metric of average primal gap, our method consistently achieves strong results across all datasets. In particular, on the Indset dataset, our approach improves over the best manually designed heuristic by 56.04%. Compared to other LLM-based algorithm design methods, our approach also achieves state-of-the-art performance. For example, on the Setcover dataset, our method surpasses the best LLM-based baseline by 61.8%. More importantly, in terms of performance variance, our method achieves the lowest variance across all four datasets. Notably, on the setcover dataset, our approach reduces variance by 46.9% compared to the best-performing LLM-based algorithm design method.

Considering both average performance and variance, our method consistently outperforms all baselines across the four combinatorial optimization datasets. This demonstrates the strong effectiveness and robustness of our approach in generating high-quality heuristics that generalize well across diverse problem instances.

4.3 Experiments for solving efficiency

Experimental setup To compare the solving efficiency of our method, we use three baselines: default SCIP, tuned SCIP (with adjusted freq and freqofs), and EoH. We run experiments on the same four combinatorial optimization benchmark datasets. For each dataset, we randomly generate 1000 instances and select the 100 most challenging ones for testing. We set a time limit of $T_{\text{limit}} = 900$ seconds per instance and use the primal-dual integral to evaluate solving efficiency.
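The primal-dual integral accumulates the gap between the primal and dual bounds over solving time. A simplified sketch, assuming a piecewise-constant gap trace (SCIP's internal PDI additionally normalizes the gap, which we omit here):

```python
def primal_dual_integral(events, t_limit):
    """Integrate a piecewise-constant gap trace over time.

    `events` is a time-sorted list of (time, gap) pairs: each gap holds from
    its event time until the next event (or until t_limit). Simplified
    sketch of the metric, without SCIP's exact normalization.
    """
    total = 0.0
    for i, (t, gap) in enumerate(events):
        t_next = events[i + 1][0] if i + 1 < len(events) else t_limit
        total += gap * (t_next - t)
    return total
```

A smaller integral means the gap closed earlier, so lower PDI values in Table 2 indicate better anytime performance.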

Experimental results We compare our method to the default, tuned SCIP settings and EoH, as shown in Table 2. Results demonstrate that our approach not only improves solution quality but also leads to better solving efficiency. On the challenging facility dataset, our method outperforms the current state-of-the-art by 6.7% in solving time and 2.8% in primal-dual integral.

Table 2: Performance comparison of our method, EoH, default SCIP, and tuned SCIP. Each cell reports the solving time and the primal-dual integral (PDI) as time / PDI.
Method Cauctions Facility Setcover Indset
Default SCIP 4.08/55.87 301.20/506.71 2.43/117.65 21.07/230.33
Tuned SCIP 2.73/24.21 201.64/553.15 2.33/77.02 22.71/167.43
EoH (liu2024evolution, ) 2.62/37.12 197.35/504.56 2.76/96.75 20.32/151.34
DHEvo(Ours) 2.28/23.42 181.27/490.43 2.27/75.88 18.54/146.39

4.4 Ablation Study

To verify the effectiveness of each component of our method, we conduct an ablation analysis on four combinatorial optimization datasets. Our framework contains two main components: the data-algorithm co-evolution mechanism and the MA-Evolution System. We ablate each in turn and report results on the validation dataset.

Analysis on data-algorithm co-evolution We conduct a comprehensive ablation study and comparative evaluation to assess the effectiveness of the data-algorithm co-evolution mechanism. When this mechanism is excluded, the fitness score is computed as the average relative primal gap across all training instances without any selection. As a result, the variance of the evolved heuristics' performance increases by approximately 10% on each dataset, and the overall solution quality declines by around 10% across all datasets. This degradation arises because all training instances are treated equally during evolution, allowing complex or unrepresentative instances to dominate the optimization process, which undermines the generalization of the learned heuristics. Additionally, we replace our MA-Evolution System with the best existing LLM-based evolutionary method, EoH. As shown in Table 3, the variants augmented with our co-evolution framework achieve markedly better generalization than their original versions. In particular, the variance of the heuristics evolved by EoH decreases by nearly 30% on the setcover dataset once the co-evolution mechanism is added.

Table 3: Comparison of standard error and average relative primal gap on validation dataset, including our full method, its variant without co-evolution, EoH baseline, and EoH with co-evolution.
Method Cauctions Facilities Setcover Indset
DHEvo 2.15(2.53) 0.83(1.30) 9.74(13.42) 1.01(1.03)
DHEvo_OFF 2.33(2.79) 0.93(1.45) 10.8(13.99) 1.23(1.11)
EoH_DH 2.90(5.60) 0.84(1.47) 18.31(17.48) 1.07(1.14)
EoH 4.38(6.15) 1.96(4.36) 26.14(28.89) 1.36(1.21)

Analysis on MA-Evolution System To verify that the MA-Evolution System generates higher-quality and more diverse individual algorithms, we conduct an ablation study that removes this system from our framework and compares the result with the original version on the setcover dataset.

To evaluate the diversity of algorithms generated by the MA-Evolution System, inspired by diversity indicator metrics  (diversity1, ; diversity2, ), we introduce a diversity index defined as

$DI = \frac{H}{\log_2 N},$

where $H$ is the Shannon entropy of the score distribution over $N$ generated samples. A value closer to 1 indicates higher diversity among solutions.
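The diversity index can be computed as below. Since the paper does not specify how the continuous score distribution is discretized, the equal-width binning here is our assumption:

```python
import math
from collections import Counter

def diversity_index(scores, n_bins=10):
    """DI = H / log2(N): Shannon entropy of the (binned) score distribution,
    normalized by its maximum log2(N). Equal-width binning with n_bins is an
    assumption; the paper does not specify the discretization.
    """
    n = len(scores)
    if n < 2:
        return 0.0
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / n_bins or 1.0   # guard against all-equal scores
    counts = Counter(min(int((s - lo) / width), n_bins - 1) for s in scores)
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return h / math.log2(n)
```

Identical scores give DI = 0 (no diversity), while N samples spread over N distinct bins give DI = 1.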

Table 4: Ablation study of the MA-Evolution system in terms of average primal gap (APG), diversity index (DI), and primal gap standard deviation (PGSD).
Method APG DI PGSD
MA-Evolution OFF 9.14 0.76 8.75
MA-Evolution ON 8.00 0.88 4.78

As shown in Table 4, the algorithms generated by the MA-Evolution System achieve significantly lower average primal gaps, improving by 12.4% over those generated without it. They also show a 15.8% improvement in the diversity index, demonstrating the superior diversity of the generated heuristics.

5 Conclusion

We present a novel data-algorithm co-evolution framework for solving MILP. By iteratively identifying the most representative instances and co-evolving heuristic algorithms based on them, our method significantly improves the generalization ability of the generated heuristics within the same problem class. Unlike traditional approaches that treat training data as static, our method selects representative instances during the evolutionary process, enabling the algorithm to generalize better across diverse problem distributions. Meanwhile, heuristic algorithms are co-evolved with the instance set, ensuring mutual adaptation and continuous improvement. We also introduce a multi-agent evolutionary system to improve generation quality and solution diversity. Experimental results show that our approach significantly outperforms existing human-designed, learning-based, and LLM-based baselines in both the primal gap and solving efficiency.

References

  • [1] Songsong Liu, Jose M Pinto, and Lazaros G Papageorgiou. A tsp-based milp model for medium-term planning of single-stage continuous multiproduct plants. Industrial & Engineering Chemistry Research, 47(20):7733–7743, 2008.
  • [2] Hyunju Jeong, Heidi L Sieverding, and James J Stone. Biodiesel supply chain optimization modeled with geographical information system (gis) and mixed-integer linear programming (milp) for the northern great plains region. BioEnergy research, 12:229–240, 2019.
  • [3] Raine Jokinen, Frank Pettersson, and Henrik Saxén. An milp model for optimization of a small-scale lng supply chain along a coastline. Applied energy, 138:423–431, 2015.
  • [4] Kefan Ma, Liquan Xiao, Jianmin Zhang, and Tiejun Li. Accelerating an fpga-based sat solver by software and hardware co-design. Chinese Journal of Electronics, 28(5):953–961, 2019.
  • [5] Lou Hafer. Constraint improvements for milp-based hardware synthesis. In Proceedings of the 28th ACM/IEEE Design Automation Conference, pages 14–19, 1991.
  • [6] Zhi-Long Chen. Integrated production and outbound distribution scheduling: review and extensions. Operations research, 58(1):130–148, 2010.
  • [7] Anthony Caumond, Philippe Lacomme, Aziz Moukrim, and Nikolay Tchernev. An milp for scheduling problems in an fms with one vehicle. European Journal of Operational Research, 199(3):706–722, 2009.
  • [8] Francesco Superchi, Nathan Giovannini, Antonis Moustakis, George Pechlivanoglou, and Alessandro Bianchini. Optimization of the power output scheduling of a renewables-based hybrid power station using milp approach: The case of tilos island. Renewable Energy, 220:119685, 2024.
  • [9] Gary W Chang, YD Tsai, CY Lai, and JS Chung. A practical mixed integer linear programming based approach for unit commitment. In IEEE Power Engineering Society General Meeting., pages 221–225. IEEE, 2004.
  • [10] Fadi Agha Kassab, Berk Celik, Fabrice Locment, Manuela Sechilariu, Sheroze Liaquat, and Timothy M Hansen. Optimal sizing and energy management of a microgrid: A joint milp approach for minimization of energy cost and carbon emission. Renewable Energy, 224:120186, 2024.
  • [11] Peyman Zare, Abdolmajid Dejamkhooy, and Iraj Faraji Davoudkhani. Efficient expansion planning of modern multi-energy distribution networks with electric vehicle charging stations: A stochastic milp model. Sustainable Energy, Grids and Networks, 38:101225, 2024.
  • [12] Hoon Liong Ong and JB Moore. Worst-case analysis of two travelling salesman heuristics. Operations Research Letters, 2(6):273–277, 1984.
  • [13] Egon Balas, Stefan Schmieta, and Christopher Wallace. Pivot and shift—a mixed integer programming heuristic. Discrete Optimization, 1(1):3–12, 2004.
  • [14] Timo Berthold. Primal heuristics for mixed integer programs. PhD thesis, Zuse Institute Berlin, 2006.
  • [15] Chris Wallace. Zi round, a mip rounding heuristic. Journal of Heuristics, 16:715–722, 2010.
  • [16] Jakob Witzig and Ambros Gleixner. Conflict-driven heuristics for mixed integer programming. INFORMS Journal on Computing, 33(2):706–720, 2021.
  • [17] Ambros Gleixner, Leon Eifler, Tristan Gally, Gerald Gamrath, Patrick Gemander, Robert Lion Gottwald, Gregor Hendel, Christopher Hojny, Thorsten Koch, Matthias Miltenberger, et al. The scip optimization suite 5.0. Optimization Online, 2017.
  • [18] Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2023.
  • [19] Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. arXiv preprint arXiv:2309.03409, 2023.
  • [20] Elliot Meyerson, Mark J Nelson, Herbie Bradley, Adam Gaier, Arash Moradi, Amy K Hoover, and Joel Lehman. Language model crossover: Variation through few-shot prompting. ACM Transactions on Evolutionary Learning, 4(4):1–40, 2024.
  • [21] Angelica Chen, David Dohan, and David So. Evoprompting: Language models for code-level neural architecture search. Advances in neural information processing systems, 36:7787–7817, 2023.
  • [22] Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al. Mathematical discoveries from program search with large language models. Nature, 625(7995):468–475, 2024.
  • [23] Fei Liu, Yiming Yao, Ping Guo, Zhiyuan Yang, Zhe Zhao, Xi Lin, Xialiang Tong, Mingxuan Yuan, Zhichao Lu, Zhenkun Wang, et al. A systematic survey on large language models for algorithm design. arXiv preprint arXiv:2410.14716, 2024.
  • [24] Fei Liu, Xialiang Tong, Mingxuan Yuan, Xi Lin, Fu Luo, Zhenkun Wang, Zhichao Lu, and Qingfu Zhang. Evolution of heuristics: Towards efficient automatic algorithm design using large language model. arXiv preprint arXiv:2401.02051, 2024.
  • [25] Yuyan Zhou, Jie Wang, Yufei Kuang, Xijun Li, Weilin Luo, Jianye HAO, and Feng Wu. Llm4solver: Large language model for efficient algorithm design of combinatorial optimization solver. https://openreview.net/pdf?id=XTxdDEFR6D, 2024.
  • [26] Lu Jiang, Deyu Meng, Qian Zhao, Shiguang Shan, and Alexander Hauptmann. Self-paced curriculum learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.
  • [27] Mengye Ren, Eleni Triantafillou, Sachin Ravi, Jake Snell, Kevin Swersky, Joshua B Tenenbaum, Hugo Larochelle, and Richard S Zemel. Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676, 2018.
  • [28] Ryoma Sato, Makoto Yamada, and Hisashi Kashima. Learning to sample hard instances for graph algorithms. In Asian Conference on Machine Learning, pages 503–518. PMLR, 2019.
  • [29] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of annual international conference on machine learning, pages 41–48, 2009.
  • [30] Ailsa H Land and Alison G Doig. An automatic method for solving discrete programming problems. In 50 Years of Integer Programming 1958-2008: From the Early Years to the State-of-the-Art, pages 105–132. Springer, 2009.
  • [31] Thomas Bäck, David B Fogel, and Zbigniew Michalewicz. Handbook of evolutionary computation. Release, 97(1):B1, 1997.
  • [32] Zhi-Hua Zhou, Yang Yu, and Chao Qian. Evolutionary learning: Advances in theories and algorithms. Springer, 2019.
  • [33] Agoston E Eiben and Jim Smith. From evolutionary computation to the evolution of things. Nature, 521(7553):476–482, 2015.
  • [34] Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, and Ajmal Mian. A comprehensive overview of large language models. arXiv preprint arXiv:2307.06435, 2023.
  • [35] Rui Zhang, Fei Liu, Xi Lin, Zhenkun Wang, Zhichao Lu, and Qingfu Zhang. Understanding the importance of evolutionary search in automated heuristic design with large language models. In International Conference on Parallel Problem Solving from Nature, pages 185–202. Springer, 2024.
  • [36] Xingyu Wu, Sheng-hao Wu, Jibin Wu, Liang Feng, and Kay Chen Tan. Evolutionary computation in the era of large language model: Survey and roadmap. IEEE Transactions on Evolutionary Computation, 2024.
  • [37] Lei Wu, Zhanxing Zhu, et al. Towards understanding generalization of deep learning: Perspective of loss landscapes. arXiv preprint arXiv:1706.10239, 2017.
  • [38] Ali Akbari, Muhammad Awais, Manijeh Bashar, and Josef Kittler. How does loss function affect generalization performance of deep learning? application to human age estimation. In International Conference on Machine Learning, pages 141–151. PMLR, 2021.
  • [39] Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them. arXiv preprint arXiv:1912.02178, 2019.
  • [40] Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi, and Zhaopeng Tu. Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint arXiv:2305.19118, 2023.
  • [41] Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. Chateval: Towards better llm-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201, 2023.
  • [42] Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, et al. Aflow: Automating agentic workflow generation. arXiv preprint arXiv:2410.10762, 2024.
  • [43] Yunxuan Li, Yibing Du, Jiageng Zhang, Le Hou, Peter Grabowski, Yeqing Li, and Eugene Ie. Improving multi-agent debate with sparse communication topology. arXiv preprint arXiv:2406.11776, 2024.
  • [44] Tobias Achterberg. Constraint integer programming. PhD thesis, technical university of berlin, 2007.
  • [45] Max Paulus and Andreas Krause. Learning to dive in branch and bound. Advances in Neural Information Processing Systems, 36:34260–34277, 2023.
  • [46] Rui Zhang, Fei Liu, Xi Lin, Zhenkun Wang, Zhichao Lu, and Qingfu Zhang. Understanding the importance of evolutionary search in automated heuristic design with large language models. In International Conference on Parallel Problem Solving from Nature, pages 185–202. Springer, 2024.
  • [47] Mark Wineberg and Franz Oppacher. The underlying similarity of diversity measures used in evolutionary computation. In Genetic and Evolutionary Computation, pages 1493–1504. Springer, 2003.
  • [48] Adel Nikfarjam, Jakob Bossek, Aneta Neumann, and Frank Neumann. Entropy-based evolutionary diversity optimisation for the traveling salesperson problem. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 600–608, 2021.

Appendix A Diving Heuristics

Diving heuristics are primal heuristics that iteratively fix variables based on LP relaxation solutions, simulating a depth-first search in the branch-and-bound tree. Given the LP relaxation of an MILP:

$z_{LP}^{\dagger} := \min_{x \in P_{LP}^{\dagger}} c^{\top} x, \qquad P_{LP}^{\dagger} = \left\{ x \in \mathbb{R}^{n} \mid Ax \leq b,\ \underline{\pi} \leq x \leq \overline{\pi} \right\},$

the algorithm starts from an LP solution $\hat{x} \in P_{LP}^{\dagger}$ and incrementally fixes fractional variables $x_j \notin \mathbb{Z}$ to integer values. At each step, the feasible region is updated with new bound constraints, and the relaxed problem is re-solved. This process emulates a depth-first traversal of the search space, aiming to quickly construct a feasible integer solution. In general, a generic diving heuristic can be described by Algorithm 2. The only difference among various diving heuristics lies in the scoring function $s(\cdot)$, which determines the variable to round and the direction of rounding at each iteration.

Algorithm 2 Generic Diving Heuristic
Require: MILP with relaxed feasible region $P^{*}$, LP solution $x^{*}$, maximum depth $d_{\text{max}}$, and a scoring function $s$ for selecting branching variables and their rounding direction
Ensure: A set of feasible solutions $\mathcal{X}$ (if found)
1: Initialize depth $d \leftarrow 1$, candidate set $\mathcal{C} \leftarrow \{ j \in \mathcal{I} \mid x^{*}_{j} \notin \mathbb{Z} \}$
2: while $d \leq d_{\text{max}}$ do
3:  $j \leftarrow \arg\max_{i \in \mathcal{C}} s(x_i)$
4:  if round up then
5:   $l_j \leftarrow \lceil x^{*}_{j} \rceil$
6:  else
7:   $u_j \leftarrow \lfloor x^{*}_{j} \rfloor$
8:  end if
9:  $P^{*} \leftarrow P^{*} \cap \{ l_j \leq x_j \leq u_j \}$
10:  if $P^{*}$ is infeasible then
11:   break
12:  end if
13:  $x^{*} \leftarrow \arg\min_{x \in P^{*}} c^{\top} x$
14:  if $x^{*}$ is roundable then
15:   $\mathcal{X} \leftarrow \mathcal{X} \cup \text{round}(x^{*})$
16:  end if
17:  $d \leftarrow d + 1$
18:  Update candidate variable index set $\mathcal{C}$
19: end while
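Algorithm 2 can be sketched in code with a pluggable score function. The `lp_solve` and `score` interfaces below are hypothetical stand-ins for a real LP solver and a diving scoring rule:

```python
import math

def generic_dive(lp_solve, score, x0, integers, d_max=50):
    """Skeleton of the generic diving loop (hypothetical interfaces).

    lp_solve(bounds) -> LP solution dict {var: value} or None if infeasible;
    score(j, x) -> (priority, round_up) for candidate variable j.
    """
    x, bounds, found = dict(x0), {}, []
    for _ in range(d_max):
        # Candidate set: integer variables with fractional LP values.
        cands = [j for j in integers if abs(x[j] - round(x[j])) > 1e-6]
        if not cands:
            found.append(dict(x))        # integral solution reached
            break
        j = max(cands, key=lambda i: score(i, x)[0])
        if score(j, x)[1]:               # round up: tighten lower bound
            bounds[j] = (math.ceil(x[j]), bounds.get(j, (None, None))[1])
        else:                            # round down: tighten upper bound
            bounds[j] = (bounds.get(j, (None, None))[0], math.floor(x[j]))
        x = lp_solve(bounds)             # re-solve the tightened relaxation
        if x is None:                    # LP infeasible: abort the dive
            break
    return found
```

Swapping the `score` function reproduces the different diving heuristics described below, exactly as the text notes.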

Here are some diving heuristic algorithms included in SCIP.

Coefficient. This strategy selects a variable that has the smallest number of positive up-locks or down-locks. These locks represent how many constraints would prevent increasing or decreasing the variable, respectively. The variable is then fixed in the direction where fewer locks occur. If there is a tie between multiple variables, the method uses a secondary rule called fractional diving to break the tie.

Distribution. This method is based on the empirical distribution of fractional values observed in historical solutions. It favors variables that are more frequently fractional in previous LP relaxations. The idea is that such variables are likely to remain fractional and therefore more useful for branching.

Farkas. This strategy tries to construct a Farkas proof to show the infeasibility of the current LP relaxation after branching. It selects the variable whose rounding, in the direction that improves the objective, is predicted to cause the largest gain. This prediction is based on LP dual information or inference from constraint violation. The method is designed to make branching decisions that quickly lead to pruning.

Fractional. This method selects the variable that is closest to an integer value but still fractional. The measure used is $\left| x_j^* - \lfloor x_j^* + 0.5 \rfloor \right|$, which captures how far the variable's value is from the nearest integer. The selected variable is then rounded in the direction that brings it closest to an integer. This approach is simple and focuses on reducing the integrality gap.
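The fractional measure above translates directly into a score function for the generic diving loop; the `(score, round_up)` return convention is our assumption for illustration:

```python
import math

def fractional_score(x_j: float):
    """Fractional diving measure: distance to the nearest integer,
    |x - floor(x + 0.5)|, paired with the rounding direction that moves
    toward that nearest integer. Return convention is illustrative.
    """
    nearest = math.floor(x_j + 0.5)
    return abs(x_j - nearest), nearest > x_j
```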

Linesearch. This method traces a straight line (ray) from the root node LP solution to the current LP solution $x^*$. It identifies which integer hyperplane, $x_j = \lfloor x_j^* \rfloor$ or $x_j = \lceil x_j^* \rceil$, is intersected first along this ray. The variable defining that hyperplane is selected for branching. This approach can be seen as a geometric way to decide which variable will influence the search path as soon as possible.

Pseudocost. This strategy uses historical data, called pseudocosts, to guide branching. For each variable, it records the average objective improvement caused by previous up- or down-branching decisions. The variable and branching direction with the highest expected improvement are selected. This method also considers the current fractionality of the variable to refine the choice. It is widely used due to its balance between accuracy and efficiency.

Vectorlength. This method is inspired by set-partitioning problems. It evaluates the trade-off between how much rounding a variable is expected to degrade the objective, and how many constraints the variable appears in. The selected variable minimizes the ratio between the expected degradation and its constraint count. This helps prioritize variables that have a broad structural impact while limiting damage to the objective.

To guide our learned diving score function, we use variable-level features that are inspired by those employed in existing human-designed diving heuristics. These include 13 features in total, which are listed and described in Table 5.

Table 5: Description of the 13 input features used in the diving score function.
Feature Name Feature Description
mayrounddown Boolean; indicates whether the variable can be rounded down while maintaining feasibility.
mayroundup Boolean; indicates whether the variable can be rounded up while maintaining feasibility.
candsfrac Float; fractional part of the variable’s value in the LP relaxation, i.e., $|x_j^* - \lfloor x_j^* \rfloor|$.
candsol Float; value of the variable in the current LP relaxation solution.
nlocksdown Integer; number of down-locks, i.e., constraints that would be violated by decreasing the variable.
nlocksup Integer; number of up-locks, i.e., constraints that would be violated by increasing the variable.
obj Float; coefficient of the variable in the objective function.
objnorm Float; Euclidean norm of the objective function coefficient vector.
pscostdown Float; pseudocost for decreasing the variable’s value.
pscostup Float; pseudocost for increasing the variable’s value.
rootsolval Float; value of the variable in the LP relaxation at the root node.
nNonz Integer; number of nonzero entries in the variable’s column in the constraint matrix.
isBinary Boolean; TRUE if the variable is binary, i.e., has domain $\{0, 1\}$.

Appendix B More experiment details

In all experiments, we evaluate the performance of agents driven by GPT-4o mini across the various tasks. We run every experiment with three random seeds on an Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz and an NVIDIA A100 GPU.

Note: Since the code for L2DIVE is currently not open-source and its hyperparameters are unavailable, we report the performance of L2DIVE via its ratio to the best human-designed heuristic, as presented in the original article.

B.1 SCIP settings

To construct our Tuned baseline, we incorporated domain knowledge and performed a randomized search over key diving-related parameters in SCIP 7.0.2. The primary parameters that govern the invocation of individual diving heuristics are freq and freqofs. These parameters determine when and how frequently a given diving heuristic is triggered during the branch-and-bound process. By adjusting their values, we can generate diverse solver behaviors that vary the timing and intensity of heuristic application. For each diving heuristic, we independently sampled its configuration by setting freq to one of four values with equal probability: $-1$ (disabled), $\lfloor 0.5 \times \textit{freq}_{\text{default}} \rfloor$ (increased frequency), $\textit{freq}_{\text{default}}$ (default frequency), or $\lfloor 2 \times \textit{freq}_{\text{default}} \rfloor$ (reduced frequency). In parallel, we randomly set freqofs to either zero or its default value, also with equal probability. This approach allows us to sample a wide range of heuristic schedules while maintaining compatibility with established SCIP parameter semantics.
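The sampling procedure above can be sketched as follows. The parameter-name pattern follows SCIP's `heuristics/<name>/freq` convention, but the default values passed in are illustrative rather than SCIP's actual defaults, and applying the sampled configuration to a solver (e.g., via PySCIPOpt's `setParam`) is left out.

```python
import random

def sample_diving_config(defaults, seed=None):
    """Sample one randomized diving-heuristic schedule.

    defaults: dict mapping heuristic name -> (freq_default, freqofs_default);
    these defaults are assumed inputs, not values read from SCIP.
    """
    rng = random.Random(seed)
    config = {}
    for heur, (freq_def, freqofs_def) in defaults.items():
        # freq: one of four settings with equal probability
        config[f"heuristics/{heur}/freq"] = rng.choice(
            [-1, freq_def // 2, freq_def, freq_def * 2]
        )
        # freqofs: either zero or the default offset, with equal probability
        config[f"heuristics/{heur}/freqofs"] = rng.choice([0, freqofs_def])
    return config
```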

B.2 Datasets

We evaluate our method on seven benchmark datasets, including four synthetic combinatorial optimization problems and three real-world MILP tasks. The datasets are widely used in prior work and include:

  • Setcover: A classical combinatorial problem where the objective is to select a minimum number of subsets such that their union covers all elements. Instances are represented as binary matrices with rows corresponding to elements and columns to subsets. Easy instances have 500 rows and 1000 columns, while hard instances increase the size to 2000 rows and 1000 columns.

  • Cauctions: A combinatorial auction problem where bidders submit bids on bundles of items, aiming to maximize total revenue without violating item availability constraints. Easy instances contain 100 items and 500 bids, while hard instances include 300 items and 1500 bids.

  • Facilities: A capacitated facility location problem involving the selection of facility sites and the assignment of customers to minimize facility opening and service costs. Easy instances consist of 100 facilities and 100 customers, whereas hard instances have 100 facilities and 400 customers.

  • Indset: The maximum independent set problem, which seeks the largest possible set of mutually non-adjacent vertices within a graph. Easy instances feature 500 nodes with an affinity of 4, and hard instances have 1500 nodes with the same affinity.

  • LoadBalance: A server load balancing problem arising in distributed systems, modeled as an MILP.

  • NNVerify: A verification problem for neural networks, where constraints encode input-output relationships that must be satisfied.

  • MIPLIB: A diverse collection of real-world and academic instances spanning domains such as scheduling, network design, logistics, and combinatorial optimization. We selected 20 instances for experimental comparison.

The first and second experimental groups are conducted on the four synthetic datasets, focusing on diving performance and solving efficiency. The third group uses the three real-world datasets to demonstrate the effectiveness of our method in practical applications.

Table 6: Used MIPLIB instance names
air05 beasleyC3 binkar10_1 cod105 dano3_3
eil33-2 hypothyroid-k1 istanbul-no-cutoff markshare_4_0 mas76
mc11 mik-250-20-75-4 n5-3 neos-860300 neos-957323
neos-1445765 nw04 piperout-27 pk1 seymour1

Appendix C Experiments for practical MILP tasks.

C.1 Experimental settings

To evaluate the effectiveness and generalization ability of the proposed heuristic framework, we conduct experiments on three representative datasets: LoadBalance, MIPLIB, and NNVerify, which cover a broad range of MILP problem structures. Across all datasets, we adopt two standard performance metrics: the primal-dual integral (PDI), which captures convergence behavior and solution quality over time, and the solving time (T), which measures how quickly a feasible or optimal solution is found. For LoadBalance, we use 100 instances for validation and another 100 for testing, with $T_{\text{limit}} = 3600$ seconds as the standard setting and $T_{\text{limit}} = 1800$ seconds for additional robustness evaluation under tighter budgets. For MIPLIB, we select 20 relatively simple benchmark instances as a test set to evaluate generalization performance on classical MILP formulations; the instance names are listed in Table 6. For NNVerify, we evaluate on 100 testing instances derived from neural network verification problems, using a time limit of $T_{\text{limit}} = 900$ seconds and considering only instances successfully solved within the limit. To isolate the contribution of the learned diving heuristics, we perform all experiments under both cut-selection-enabled and cut-selection-disabled configurations. In all settings, the heuristics are integrated into SCIP, and the best-performing variant is selected on the validation set based on either PDI or solving time before being applied to the testing set, mirroring realistic deployment scenarios.
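As a rough illustration of the PDI metric, the sketch below integrates the relative primal-dual gap over time with a rectangle rule. The sampling scheme and gap normalization are simplifying assumptions for illustration, not SCIP's internal definition.

```python
def primal_dual_integral(samples, t_limit):
    """Approximate the primal-dual integral from bound samples.

    samples: list of (t, primal_bound, dual_bound), sorted by time t;
    each sample is held constant until the next one (rectangle rule).
    """
    pdi = 0.0
    for i, (t, primal, dual) in enumerate(samples):
        t_next = samples[i + 1][0] if i + 1 < len(samples) else t_limit
        denom = max(abs(primal), abs(dual), 1e-9)
        gap = abs(primal - dual) / denom  # relative primal-dual gap
        pdi += gap * (t_next - t)         # accumulate gap over the interval
    return pdi
```

A smaller PDI means the primal and dual bounds close faster, which is why it complements raw solving time as a convergence measure.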

C.2 Experimental results

We compare our method to the default SCIP settings, as shown in Table 7. The experimental results demonstrate that our method achieves competitive improvements across all datasets. With cutting plane selection enabled, our method achieves a 26.1% improvement in the primal-dual integral (PDI) on the LoadBalance dataset. On the NNVerify dataset, our approach more than doubles the solving efficiency regardless of whether cut selection is enabled. For the MIPLIB dataset, our method improves solving efficiency by 36% and achieves a 12% reduction in PDI compared to the baseline.

Table 7: Comparison of solving time and primal-dual integral (PDI) across different methods and datasets. The suffix -off-cut indicates that cutting plane selection is disabled, while -on-cut means cutting plane selection is enabled. All methods fail to obtain the optimal solution on the LoadBalance dataset within the time limit (3600 or 1800 seconds).
Load balancing Neural network verif MIPLIB
Solving time PDI Solving time PDI Solving time PDI
Ours-off-cut 3600 346980.53 72.42 5413.32 263.48 12101.11
Scip-off-cut 3600 347597.70 669.15 38455.17 469.22 18127.57
Ours-on-cut 1800 7305.2 35.67 2744.21 117.67 5599.62
Scip-on-cut 1800 9881.29 137.19 8210.46 184.3 6339.64

Appendix D Prompts

D.1 Prompts design

Refer to caption
Figure 4: The prompts of contextual information of MILP.

Our prompt design adopts a structured and modular format to effectively guide LLMs in performing evolutionary search within the multi-agent evolutionary framework. Each prompt is composed of three essential components, designed to provide the LLM with both domain-specific grounding and a clear operational goal.

As shown in Figure 4, background prompts contain <Introduction of MILP>, <Diving Heuristics>, <Instruction>, <in_out of Heuristic>, and <Features description>. Together, they provide sufficient background knowledge of diving heuristics for the downstream tasks. Prompts in the MA-Evolution System are modular and follow a structured template to ensure consistency across generations, as shown in Figure 5. At the core of each prompt are three elements: (1) the functional role of the agent, which defines the nature of the task (e.g., proposing a new heuristic or reviewing existing code); (2) a formal or semi-formal description of the MILP problem to ground the response in the relevant optimization context; and (3) a specification of the evolutionary operation that informs the agent’s goal in the current generation cycle.

Our evolutionary operation prompts include three main types: initialization, mutation, and crossover. Each type corresponds to a distinct stage in the evolutionary search process and is designed to guide LLMs in generating or improving heuristics for MILP diving.

Refer to caption
Figure 5: The prompts in MA-Evolution System.

Initialization The LLM is instructed to create a new scoring function from scratch. The function should assign a score and a rounding direction to each fractional variable, based only on the LP relaxation and objective function. This stage initializes the population with diverse and problem-aware heuristics.

Example prompt: Please create a new Python scoring function for a Generic Diving Heuristic. The function should assign a score and rounding direction to each fractional variable, using only information from the LP relaxation and the objective function.

Mutation The LLM receives an existing scoring function and modifies it slightly. The modification should be meaningful and aimed at improving performance or exploring nearby variants in the heuristic space. This enables local search around known good solutions.

Example prompt: Please make a small but meaningful change that may improve performance or explore alternative behavior. Ensure the result is syntactically correct and remains within the MILP context.
Original function: [insert code]

Crossover The LLM combines two existing scoring functions into a new one. It should preserve useful components from both parents while ensuring the resulting function is coherent and consistent. This enables global search by recombining successful patterns.

Example prompt: You are creating a new heuristic by combining two existing ones. Please synthesize a scoring function that inherits effective components from both parents while maintaining logical consistency.
Heuristic A: [insert code]
Heuristic B: [insert code]
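The three operations above slot into a standard generational loop. The sketch below only illustrates that control flow: `mutate`, `crossover`, and `fitness` are trivial stand-ins for the LLM-driven operators and the solver-based evaluation used in DHEvo.

```python
import random

def evolve_one_generation(population, fitness, mutate, crossover, rng, elite=2):
    """One generation of an elitist evolutionary loop.

    population: list of heuristics (any comparable representation);
    fitness, mutate, crossover: caller-supplied operators (here assumed
    cheap stubs; in DHEvo these would invoke the LLM agents and SCIP).
    """
    scored = sorted(population, key=fitness, reverse=True)
    next_gen = scored[:elite]                       # keep the best heuristics
    parents = scored[: max(2, len(scored) // 2)]    # breed from the top half
    while len(next_gen) < len(population):
        if rng.random() < 0.5:
            # Mutation: local search around a known good solution
            next_gen.append(mutate(rng.choice(parents)))
        else:
            # Crossover: global search by recombining two parents
            a, b = rng.sample(parents, 2)
            next_gen.append(crossover(a, b))
    return next_gen
```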

D.2 Example

The following is an example of our method applied within DHEvo:

Designer You are a Coder Designer. Hello and welcome to the program. Read the prompt content, propose your algorithm ideas, and provide a detailed implementation procedure, similar to the role of a project manager. The topic is stated as follows: Diving heuristics are one of the most important categories of primal heuristics in the SCIP framework for Mixed Integer Linear Programming (MILP) problems. It starts from the current LP solution and iteratively fixes an integer variable to an integral value and resolves the LP. You should create a totally new Python scoring function for me (different from the heuristics in the literature) to choose the fractional variable and corresponding rounding direction using the information of the LP relaxation and objective function. The function is used for every variable to decide the variable’s score and rounding direction. Specifically, you have these features to use in the score function: "mayrounddown" and "mayroundup" (bool, indicate whether it is possible to round variable down/up and stay feasible, it should be penalized because we need more exploration); "candsfrac" (float, fractional part of solution value of variable); "candsol" (float, solution value of variable in LP relaxation solution); "nlocksdown" and "nlocksup" (int, the number of locks for rounding down/up of a special type); "obj" (float, objective function value of variable); "objnorm" (float, the Euclidean norm of the objective function vector); "pscostdown" and "pscostup" (float, the variable’s pseudo cost value for the given change of the variable’s LP value); "rootsolval" (float, the solution of the variable in the last root node’s relaxation, if the root relaxation is not yet completely solved, zero is returned); "nNonz" (int, the number of nonzero entries in variable); "isBinary" (bool, TRUE if the variable is of binary type). Provide a brief description of the new score function’s logic and its corresponding Python code. 
The description must start with ’<start_des>’ and end with ’</end_des>’. The code must start with ’<start_code>’ and end with ’</end_code>’. The code score function must call ’myheurdiving’ that takes 13 inputs ’mayrounddown’, ’mayroundup’, ’candsfrac’, ’candsol’, ’nlocksdown’, ’nlocksup’, ’obj’, ’objnorm’, ’pscostdown’, ’pscostup’, ’rootsolval’, ’nNonz’, and ’isBinary’. The function must output the ’score’ and ’roundup’, where ’score’ is a float type indicating the variable’s score, the more the better, and the ’roundup’ is a bool type indicating whether we should round the variable up, True for rounding up. Be creative and do not give additional explanations.

Coder Your role is to create a new algorithm based on the original content and Designer‘s needs. The original content is stated as follows: Diving heuristics are one of the most important categories of primal heuristics in the SCIP framework for Mixed Integer Linear Programming (MILP) problems. It starts from the current LP solution and iteratively fixes an integer variable to an integral value and resolves the LP. You should create a new Python scoring function for me (different from the heuristics in the literature) to choose the fractional variable and corresponding rounding direction using the information of the LP relaxation and objective function. The function is used for every variable to decide the variable’s score and rounding direction. Specifically, you have these features to use in the score function: "mayrounddown" and "mayroundup" (bool, indicate whether it is possible to round variable down/up and stay feasible, it should be penalized because we need more exploration); "candsfrac" (float, fractional part of solution value of variable); "candsol" (float, solution value of variable in LP relaxation solution); "nlocksdown" and "nlocksup" (int, the number of locks for rounding down/up of a special type); "obj" (float, objective function value of variable); "objnorm" (float, the Euclidean norm of the objective function vector); "pscostdown" and "pscostup" (float, the variable’s pseudo cost value for the given change of the variable’s LP value); "rootsolval" (float, the solution of the variable in the last root node’s relaxation, if the root relaxation is not yet completely solved, zero is returned); "nNonz" (int, the number of nonzero entries in variable); "isBinary" (bool, TRUE if the variable is of binary type). Provide a brief description of the new score function’s logic and its corresponding Python code. The description must start with ’<start_des>’ and end with ’</end_des>’. The code must start with ’<start_code>’ and end with ’</end_code>’. 
The code score function must call ’myheurdiving’ that takes 13 inputs: ’mayrounddown’, ’mayroundup’, ’candsfrac’, ’candsol’, ’nlocksdown’, ’nlocksup’, ’obj’, ’objnorm’, ’pscostdown’, ’pscostup’, ’rootsolval’, ’nNonz’, and ’isBinary’. The function must output the ’score’ and ’roundup’, where ’score’ is a float type indicating the variable’s score, the more the better, and the ’roundup’ is a bool type indicating whether we should round the variable up, True for rounding up. Be creative and do not give additional explanations. Designer idea: Allow weighting coefficients for each component.

Reviewer You are a code evaluator who tries to evaluate the code, giving relevant comments in the direction of code redundancy and code logic, etc. <code>

Judger You are the final arbiter in the MA-Evolution System. Evaluate the heuristic code and review feedback to determine if it should be accepted, revised, or discarded. Output the final code and description. <historical information >

Appendix E Generated heuristics

This section presents the best heuristics generated by DHEvo.

def myheurdiving(mayrounddown, mayroundup, candsfrac, candsol,
                 nlocksdown, nlocksup, obj, objnorm,
                 pscostdown, pscostup, rootsolval,
                 nNonz, isBinary):
    score = 0.0
    roundup = False
    # Penalize if both rounding options are feasible
    if mayrounddown and mayroundup:
        score = -40
    # Evaluate candidate based on fractional part
    if candsfrac > 0.5:
        score += candsfrac * 80
        roundup = True
        if pscostup > 0.5:
            score += pscostup * 50
    else:
        score += (1 - candsfrac) * 60
        if pscostdown < -0.3:
            score -= abs(pscostdown) * 25
    # Normalize objective contribution
    score += (obj / (objnorm + 1e-6)) * 90
    # Adjust for locking counts
    score += nlocksdown * 25 - nlocksup * 15
    # Reward for non-zero entries and binary variable nature
    if nNonz > 2:
        score += nNonz * 20
    if isBinary:
        score += 50
    return score, roundup
Listing 1: Heuristic for cauctions
def myheurdiving(mayrounddown, mayroundup, candsfrac, candsol, nlocksdown, nlocksup, obj, objnorm, pscostdown, pscostup, rootsolval, nNonz, isBinary):
    score = 0.0
    roundup = False
    # Base score weighted by normalized objective contribution
    score += (obj / (objnorm + 1e-9)) * 5 if objnorm > 0 else 0
    # Penalize rounding options to encourage exploration
    score -= nlocksdown * 7 if mayrounddown else 0
    score -= nlocksup * 7 if mayroundup else 0
    # Favor large fractions away from 0.5 for exploration
    score += abs(candsfrac - 0.5) * 10
    # Adjust score based on solution value and its contribution
    score += (candsol / (1 + abs(rootsolval) * obj)) * 4 if rootsolval != 0 else 0
    # Employ pseudo costs to influence rounding decisions
    if pscostdown < 0 and mayrounddown:
        score += -pscostdown * 3  # Favor rounding down with negative pseudo costs
    if pscostup < 0 and mayroundup:
        score -= -pscostup * 3  # Discourage rounding up with negative pseudo costs
    # Determine rounding direction based on fractional part and exploration potential
    if candsfrac >= 0.7 and mayroundup:
        roundup = True
    elif candsfrac <= 0.3 and mayrounddown:
        roundup = False
    # Encourage solutions with fewer nonzero entries
    score += (1 / (nNonz + 1)) * 2 if nNonz > 0 else 0
    return score, roundup
Listing 2: Heuristic for facility
def myheurdiving(mayrounddown, mayroundup, candsfrac, candsol, nlocksdown, nlocksup, obj, objnorm, pscostdown, pscostup, rootsolval, nNonz, isBinary):
    score = 0.0
    # Strongly penalize feasible rounding options
    if mayrounddown:
        score -= 3.0
    if mayroundup:
        score -= -3.0
    # Incorporate fractional part and objective value
    score += (1.0 - candsfrac) * obj * 0.5 if mayrounddown else 0
    score += candsfrac * obj * 0.5 if mayroundup else 0
    # Adjust with pseudo costs
    score += pscostdown * candsfrac * 1.5 if mayrounddown else 0
    score += pscostup * (1 - candsfrac) * 1.5 if mayroundup else 0
    # Apply less severe penalty for distance from the root solution
    score -= abs(rootsolval - candsol) * 0.1
    # Normalize the score
    if objnorm > 0:
        score /= objnorm
    # Reward more for binary variables
    score += nlocksup * 0.3 - nlocksdown * 0.3
    if isBinary:
        score += 1.0
    # Determine rounding direction
    roundup = (score > 0) and (not isBinary or mayroundup)
    return score, roundup
Listing 3: Heuristic for indset
def myheurdiving(mayrounddown, mayroundup, candsfrac, candsol, nlocksdown, nlocksup, obj, objnorm, pscostdown, pscostup, rootsolval, nNonz, isBinary):
    score = 0.0
    # Penalties for feasible rounding options to promote exploration
    if mayrounddown:
        score -= 10.0
    if mayroundup:
        score -= 10.0
    # Favor fractional values at extremes (0 or 1)
    score += (1 - abs(candsfrac - 0.5)) * 30.0
    # Normalize impact of the objective function
    score += (obj / (objnorm + 1e-5)) * 0.5
    # Include pseudo cost adjustments for better decision-making
    score += pscostup if mayroundup else 0.0
    score -= pscostdown if mayrounddown else 0.0
    # Integrate root solution value adjusted by variable complexity
    score += rootsolval / (nNonz + 1)
    # Amplify score for binary variables to encourage decisive rounding
    if isBinary:
        score *= 2.0
    # Determine rounding direction based on computed score and pseudo costs
    roundup = (mayrounddown and (pscostup <= pscostdown or not mayrounddown))
    return score, roundup
Listing 5: Heuristic for setcover