Ensemble-Hub is an open-source toolkit for large language model (LLM) ensemble inference. It is designed to support and unify multiple ensemble strategies for LLMs, including existing methods such as LLM-Blender, GaC, and UniTE. The project is under active development.
| Why? | How? |
|---|---|
| Boost answer quality by letting several LLMs compete. | Each round, every generator writes a short segment → a reward model (Qwen2.5-Math-PRM-7B) scores them → the best segment is kept. |
| Stay fast and memory-friendly with model caching. | ModelPool loads each generator/reward model once, then reuses it for every call (CLI, notebook, or API). |
| Provide plug-and-play usage for research and services. | Python `EnsembleFramework` class or a production-grade FastAPI server (`ensemblehub/api.py`). |
- Unlimited generators – mix and match multiple models (HF and vLLM backends supported).
- Reward-guided selection – uses a reward model (e.g. Qwen2.5-Math-PRM-7B) to score candidates and pick the best output each round.
- EOS-based early stop – if a model outputs its end-of-sequence token, the loop exits early.
- Context accumulation – optionally carry forward previously chosen segments into the next round (builds a running conversation context).
- Clean prompt template – minimal prompt format with no extraneous instructions (no stray “600 words” artifacts).
- Singleton caches – models load once and are reused on repeated calls (even across API requests).
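In code, the reward-guided loop described above looks roughly like the sketch below. This is illustrative only, not the actual Ensemble-Hub API: `ensemble_generate`, the `generate`/`score` methods, and the `"</s>"` EOS marker are placeholder assumptions.

```python
# Illustrative sketch of the reward-guided segment loop (placeholder objects,
# not the real Ensemble-Hub classes).

def ensemble_generate(generators, reward_model, prompt, max_rounds=32):
    context = prompt
    for _ in range(max_rounds):
        # Each generator proposes a short continuation of the running context.
        candidates = [g.generate(context, max_new_tokens=64) for g in generators]
        # The reward model (e.g. Qwen2.5-Math-PRM-7B) scores every candidate.
        scores = [reward_model.score(context, c) for c in candidates]
        best = candidates[scores.index(max(scores))]
        # Context accumulation: the winning segment is fed into the next round.
        context += best
        # EOS-based early stop: exit once the selected segment ends the answer.
        if best.endswith("</s>") or not best:
            break
    return context
```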
Ensemble-Hub supports multiple ensemble strategies that can be easily configured.

Model selection:

- `zscore`: Statistical selection based on perplexity and confidence scores
- `all`: Use all available models (no selection)
- `random`: Randomly select a subset of models

Output aggregation:

- `reward_based`: Reward-based selection using scoring models (default)
- `progressive`: Length- or token-based model switching during generation
  - Length-based: switch models based on output length thresholds
  - Token-based: switch models when encountering special tokens
- `random`: Random selection from model outputs
- `loop`: Round-robin cycling through models
- `gac`: GaC token-level aggregation
- `distribution`: Distribution-based token aggregation
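To make the `zscore` strategy concrete, the snippet below keeps models whose prompt perplexity sits at or below the group average; the exact statistic, threshold, and combination with confidence scores used by Ensemble-Hub may differ.

```python
# Illustrative sketch of z-score model selection (the exact statistic and
# threshold used by Ensemble-Hub's `zscore` selector may differ).
import statistics

def zscore_select(perplexities, threshold=0.0):
    """Keep models whose perplexity z-score is at or below `threshold`.

    perplexities: dict mapping model name -> perplexity on the prompt.
    Returns the names of the selected models.
    """
    values = list(perplexities.values())
    mean, stdev = statistics.mean(values), statistics.pstdev(values) or 1.0
    # Lower perplexity -> negative z-score -> more confident model.
    return [name for name, ppl in perplexities.items()
            if (ppl - mean) / stdev <= threshold]

# Example: the 7B model is noticeably more confident, so only it is kept.
print(zscore_select({"qwen-0.5b": 18.2, "qwen-1.5b": 14.9, "qwen-7b": 9.3}))
```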
Ensemble-Hub/
├── ensemblehub/ # Main package
│ ├── api/ # FastAPI server module
│ │ ├── __main__.py # Command line entry point
│ │ └── app.py # FastAPI application
│ ├── ensemble_methods/ # Ensemble method implementations
│ │ ├── ensemble.py # Unified ensemble framework
│ │ ├── model_selection/ # Model selection strategies
│ │ │ ├── base.py # Base selector interface
│ │ │ ├── statistical.py # Z-score, random selection
│ │ │ └── learned.py # LLM-Blender, meta-learning
│ │ └── output_aggregation/ # Output aggregation methods
│ │ ├── token_level/ # Token-level aggregation (GAC, distribution)
│ │ ├── sentence_level/ # Sentence-level aggregation
│ │ │ ├── loop_selector.py # Round-robin selection
│ │ │ ├── random_selector.py # Random selection
│ │ │ ├── reward_based.py # Reward-based selection
│ │ │ └── progressive_selector.py # Progressive selection
│ │ └── response_level/ # Response-level aggregation
│ ├── generators/ # Model generators (HF, vLLM backends)
│ │ ├── base.py # Base generator interface
│ │ ├── hf.py # Hugging Face transformers
│ │ ├── vllm.py # vLLM backend
│ │ └── pool.py # Generator pool management
│ ├── scorers/ # Reward models and scoring
│ │ └── base.py # Base scorer interface
│ ├── inference.py # High-level inference pipeline
│ └── utils.py # Utility functions
├── data/ # Datasets (AIME, GSM8K, MATH, etc.)
├── docs/ # Documentation
│ ├── api_usage.md # Complete API usage guide
│ ├── benchmark_single_model.md # Single model benchmarking
│ └── progressive_selector_usage.md # Progressive selector guide
├── examples/ # Usage examples
│ └── test_single_model.py # Single model testing
├── scripts/ # Utility scripts
│ ├── vllm_infer.py # vLLM inference script
│ └── grader.py # Answer grading
├── requirements.txt # Dependencies
└── README.md # You're here!
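The `base.py` modules define the extension points for new backends and scorers. The classes below only sketch the rough shape implied by this layout; the real interfaces in `ensemblehub/generators/base.py` and `ensemblehub/scorers/base.py` may differ.

```python
# Rough shape of the base interfaces suggested by the project layout above
# (illustrative, not the actual class definitions).
from abc import ABC, abstractmethod

class BaseGenerator(ABC):
    """Implemented by the HF (hf.py) and vLLM (vllm.py) backends."""

    @abstractmethod
    def generate(self, prompt: str, max_new_tokens: int = 256) -> str:
        """Return a continuation of `prompt`."""

class BaseScorer(ABC):
    """Implemented by reward models such as Qwen2.5-Math-PRM-7B."""

    @abstractmethod
    def score(self, prompt: str, candidate: str) -> float:
        """Return a scalar quality score for `candidate` given `prompt`."""
```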
conda create -n ensemble python=3.12
conda activate ensemble
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]" --no-build-isolation
cd ..
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
cd ..
git clone https://github.com/Fzkuji/Ensemble-Hub.git
cd Ensemble-Hub
pip install -r requirements.txt
Note: The inference script now supports both YAML configuration files and command-line arguments.
Using YAML configuration (recommended):
python -m ensemblehub.inference \
--config examples/all_progressive.yaml \
--input_path data/AIME2024/aime/aime24.json \
--output_path saves/aime24.jsonl \
--max_examples 10 \
--batch_size 1
Using command-line arguments only:
python -m ensemblehub.inference \
--input_path data/AIME2024/aime/aime24.json \
--output_path saves/aime24.jsonl \
--max_examples 500 \
--batch_size 4 \
--output_aggregation_method progressive \
--max_tokens 2048 \
--model_specs "Qwen/Qwen2.5-0.5B-Instruct:hf:auto" \
--model_specs "Qwen/Qwen2.5-1.5B-Instruct:hf:auto"
Under the hood: models are loaded once → the reward model scores each round → loop stops when the selected segment ends with an EOS token.
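The load-once behaviour amounts to a simple process-wide cache keyed by model spec, roughly like this (names are illustrative; see `ensemblehub/generators/pool.py` for the real implementation):

```python
# Minimal sketch of the load-once model cache (illustrative names only).
_MODEL_CACHE = {}

def get_model(spec: str):
    """Return a cached model for `spec`, loading it only on the first call."""
    if spec not in _MODEL_CACHE:
        _MODEL_CACHE[spec] = load_model(spec)   # expensive: weights loaded once
    return _MODEL_CACHE[spec]

def load_model(spec: str):
    # Placeholder for the actual HF / vLLM loading logic.
    raise NotImplementedError
```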
# Start with example configuration
python ensemblehub/api.py examples/all_loop.yaml
# Or use progressive ensemble
python ensemblehub/api.py examples/all_progressive.yaml
# Start with default settings
python ensemblehub/api.py
Evaluate a single model directly with lm-evaluation-harness:
lm_eval --model hf \
--tasks arc_challenge_chat \
--model_args pretrained=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
--batch_size 2 \
--num_fewshot 5
Or run the evaluation through the Ensemble-Hub API proxy:
# Start API server
python ensemblehub/api.py examples/all_loop.yaml
# Run evaluation in another terminal
export OPENAI_API_KEY=dummy_key
lm_eval --model openai-completions \
--tasks arc_challenge_chat \
--model_args model=ensemble,base_url=http://localhost:8000/v1/completions,tokenizer_backend=None \
--batch_size 2 \
--num_fewshot 5
# For longer completions (e.g. MBPP) extend the generation budget
lm_eval --model openai-completions \
--tasks mbpp \
--model_args model=ensemble,base_url=http://localhost:8000/v1/completions,tokenizer_backend=None,max_gen_toks=1024 \
--batch_size 2 \
--limit 1 \
--confirm_run_unsafe_code
Note: Server configuration is controlled via environment variables API_HOST and API_PORT.
# Health check
curl http://localhost:8000/status
# Chat completion
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "ensemble", "messages": [{"role": "user", "content": "Hello"}]}'
# Text completion
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "ensemble", "prompt": "Hello", "max_tokens": 50}'
- Multi-model inference
- HuggingFace backend
- FastAPI server with OpenAI-compatible endpoints
- Ray Serve integration
- Command line configuration for ensemble methods
- LM-evaluation-harness compatibility
- Reward model selection
- vLLM backends
- API support for closed-source models
- Streaming API interface (SSE)
- Advanced scorer aggregation methods
- Enable Thinking Mode: Refactored the `enable_thinking` parameter to be configured at model initialization instead of at generation time. This allows better integration with LLaMA-Factory's template system and supports reasoning models like DeepSeek-R1.
- Consistent Length Handling: Updated tokenizer calls to use `cutoff_len` from DataArguments for consistent max_length handling across all generation methods.
- API Improvements: Added an `--enable_thinking` command-line flag for easy configuration of reasoning models.
Apache-2.0. See the LICENSE file for details.
Relies on DeepSeek and Qwen model weights, Hugging Face Transformers, LLaMA-Factory, and the incredible open-source community.