Yongqiang Yao*, Jingru Tan*, Kaihuan Liang*, Feizhao Zhang, Jiahao Hu, Yazhe Niu, Shuo Wu, Ruihao Gong📧, Dahua Lin, Ningyi Xu📧 (* denotes equal contribution, 📧 denotes corresponding author.)
This is the official implementation of our paper HBP.
Hybrid training with short + long sequences causes:
- 🚨 Workload imbalance (padding waste, uneven device utilization)
- 🚨 Imbalanced attention computation (short vs. long variance)
- 🚨 Wasted communication overhead (short data forced into SP)
- 🚨 Training instability (loss normalization bias)
Key components of Hierarchical Balance Packing (HBP):
- 📦 Hierarchical Packing Groups (16K, 32K, 128K)
- ⚖️ Balanced Packing (GreedyFill + attention-balanced batching; see the sketch after this list)
- 🔄 Dynamic Training Pipeline (adaptive SP + curriculum learning)
- 📏 Stable Loss Normalizer (equal token contribution across batches)
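To make the first two components concrete, here is a deliberately simplified sketch of the idea: samples are routed to the smallest group whose capacity they fit (16K / 32K / 128K), and each group is then packed greedily while balancing an attention-cost proxy. The helper names, the first-fit-decreasing strategy, and the sum-of-squared-lengths cost model are illustrative assumptions, not the repository's exact algorithm.

```python
from typing import Dict, List

# Illustrative sketch of hierarchical balanced packing: route each sample to the
# smallest group it fits, then greedily fill packs while keeping an
# attention-cost proxy balanced. Not the repository's actual implementation.

GROUP_CAPACITIES = [16 * 1024, 32 * 1024, 128 * 1024]  # assumed group lengths


def attention_cost(pack: List[int]) -> int:
    # With packed (block-diagonal) attention, compute grows roughly with the
    # sum of squared per-sample lengths, not the square of the pack length.
    return sum(l * l for l in pack)


def greedy_fill(lengths: List[int], capacity: int) -> List[List[int]]:
    """First-fit-decreasing style packing: place each sample into the pack
    with the smallest current attention cost that still has room."""
    packs: List[List[int]] = []
    for length in sorted(lengths, reverse=True):
        candidates = [p for p in packs if sum(p) + length <= capacity]
        if candidates:
            min(candidates, key=attention_cost).append(length)
        else:
            packs.append([length])
    return packs


def assign_groups(lengths: List[int]) -> Dict[int, List[List[int]]]:
    """Route each sample to the smallest hierarchical group it fits in,
    then pack each group independently with its own capacity."""
    buckets: Dict[int, List[int]] = {cap: [] for cap in GROUP_CAPACITIES}
    for length in lengths:
        cap = next(c for c in GROUP_CAPACITIES if length <= c)
        buckets[cap].append(length)
    return {cap: greedy_fill(samples, cap) for cap, samples in buckets.items() if samples}


if __name__ == "__main__":
    demo = [1200, 3000, 15000, 60000, 800, 25000, 500]
    for cap, packs in assign_groups(demo).items():
        print(cap, [(sum(p), attention_cost(p)) for p in packs])
```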
| Problem | HBP Solution |
|---|---|
| Workload imbalance: excessive padding in batches and uneven computation across devices | Hierarchical Packing Groups: assigns data to multi-level groups (16K, 32K, 128K) with optimal configs |
| Imbalanced attention computation: mixing short and long sequences causes high variance in attention cost | Balanced Packing: GreedyFill + attention-balanced batching to equalize the computation load |
| Wasted communication overhead: short data forced into costly sequence parallelism (SP) | Optimized Grouping: short and long data trained in separate groups, reducing unnecessary SP communication |
| Training instability: loss normalization biased by sequence-length differences | Stable Loss Normalizer + Curriculum Learning: equal token contribution and a smooth transition from short to long sequences |
✨ In short: HBP combines multi-level packing, balanced batching, adaptive SP, curriculum learning, and a stable loss normalizer to achieve up to 2.4× faster training with no performance trade-off.
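The stable loss normalizer can likewise be sketched in a few lines: token losses are summed and divided by a single count of valid tokens, so short and long sequences contribute per token rather than per sequence. This is a minimal illustration of the principle under common PyTorch conventions (e.g., the -100 ignore index), not the code used in this repository.

```python
import torch
import torch.nn.functional as F


def stable_normalized_loss(logits: torch.Tensor,
                           labels: torch.Tensor,
                           ignore_index: int = -100) -> torch.Tensor:
    """Cross-entropy summed over tokens, divided by the count of valid tokens.

    Summing first and normalizing by one token count (instead of taking a
    per-sequence or per-micro-batch mean) keeps every token's contribution
    equal, no matter how short and long sequences are packed together.
    """
    loss_sum = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=ignore_index,
        reduction="sum",          # defer normalization to the token count
    )
    valid_tokens = (labels != ignore_index).sum().clamp(min=1)
    return loss_sum / valid_tokens
```

In multi-GPU training, the token count would additionally be aggregated across data-parallel ranks (with gradient averaging adjusted accordingly) so the normalizer is truly global; that step is omitted here for brevity.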
```bash
pip install -r requirements.txt
```
Calculate sample info

```bash
# sample_info_path in the config (./cache/data_llama3_1_128k_info.pkl) is the path
# where the generated sample info will be written
sh ./tools/cal_length.sh ./config/llama3_1_8b_128k_isf.yaml
```
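If you want to sanity-check the cached sample info before launching training, you can load the pickle directly; its exact structure is not documented here, so treat the access pattern below as an assumption to adapt.

```python
import pickle

# Load the cache produced by cal_length.sh. The structure of the stored object
# is an assumption here -- inspect it and adapt any downstream access.
with open("./cache/data_llama3_1_128k_info.pkl", "rb") as f:
    sample_info = pickle.load(f)

print(type(sample_info))
try:
    print("num entries:", len(sample_info))
except TypeError:
    pass
```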
Run ISF

```bash
sh ./scripts/train.sh ./config/llama3_1_8b_128k_isf.yaml 4
```
Run HBP

```bash
sh ./scripts/train.sh ./config/llama3_1_8b_128k_hbp.yaml 4
```
We conducted a comprehensive evaluation of model performance using OpenCompass, covering general and long-context tasks including MMLU, MMLU-Pro, CMMLU, BBH, MATH, GPQA Diamond, GSM8K, HellaSwag, MathBench, HumanEval, MBPP, IFEval, DROP, RULER, and NeedleBench. We use LongBench-Cite to measure citation quality and response correctness in long-context QA scenarios, and LongBench-Write to measure long-output quality and output length.
We use this commit of OpenCompass for evaluation. First, add the SGLang API (provided in eval/sglang_api.py) to opencompass/opencompass/models.
Then, use the evaluation scripts to start the service and perform the evaluation:
```bash
sh eval/run_auto.sh /path/to/model mode_name num_node
```
If you need to evaluate long-context tasks, replace the config_filename in run_request_sg.sh.
This repository is released under the Apache-2.0 license.
We learned a lot from the following projects when developing HBP.