Repo for SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting (ISCA25)
Paper: https://arxiv.org/pdf/2504.08850
SpecEE is implemented on top of the HuggingFace framework for the cloud scenario and on top of llama.cpp for the edge scenario. We modified parts of both codebases to support SpecEE, which will enable faster research iteration in the future.
Left: SpecEE & Right: llama.cpp
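For orientation, the core idea: a lightweight per-layer predictor watches the hidden states during decoding and, guided by candidate tokens speculated by a draft model (EAGLE), decides whether the next token can be emitted early, skipping the remaining layers. Below is a minimal Python sketch of that loop for a single token position; all names (`layers`, `lm_head`, `predictors`, `draft_tokens`) are illustrative assumptions, not the repo's actual API.

```python
import torch

def generate_one_token(hidden, layers, lm_head, predictors, draft_tokens, threshold=0.5):
    # Sketch only: walk the transformer layer by layer instead of always
    # running the full stack. `hidden` is the hidden state of one position.
    for layer, predictor in zip(layers, predictors):
        hidden = layer(hidden)
        # Probe only the logits of the draft model's candidate tokens, which
        # is far cheaper than scoring the full vocabulary at every layer.
        candidate_logits = lm_head(hidden)[..., draft_tokens]
        # A tiny classifier decides whether this layer's hidden state already
        # determines the next token; if so, exit early.
        if torch.sigmoid(predictor(candidate_logits)).item() > threshold:
            return draft_tokens[candidate_logits.argmax(-1)]
    # No layer was confident enough: fall back to the full-depth prediction.
    return lm_head(hidden).argmax(-1)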
For the edge scenario, please enter the SpecEE_PC directory.
torch~=2.2.1
numpy~=1.26.4
sentencepiece~=0.2.0
transformers>=4.45.1,<5.0.0
gguf>=0.1.0
protobuf>=4.21.0,<5.0.0
cd SpecEE_PC
conda create -n specee python=3.12
conda activate specee
pip install -r ./requirements/requirements-convert_hf_to_gguf.txt
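- Download the SpecEE-7b-chat-hf weights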
huggingface-cli login
huggingface-cli download YYDH2333/SpecEE-7b-chat-hf
- Download Llama-2-7b-chat-hf and EAGLE-llama2-chat-7B
huggingface-cli login
huggingface-cli download meta-llama/Llama-2-7b-chat-hf
huggingface-cli download yuhuili/EAGLE-llama2-chat-7B
- Copy the Llama-2-7b-chat-hf weights to ./specee_weights
cp Llama-2-7b-chat-hf/config.json specee_weights/
cp Llama-2-7b-chat-hf/generation_config.json specee_weights/
cp Llama-2-7b-chat-hf/model-00001-of-00002.safetensors specee_weights/
cp Llama-2-7b-chat-hf/model-00002-of-00002.safetensors specee_weights/
cp Llama-2-7b-chat-hf/special_tokens_map.json specee_weights/
cp Llama-2-7b-chat-hf/tokenizer.json specee_weights/
cp Llama-2-7b-chat-hf/tokenizer.model specee_weights/
cp Llama-2-7b-chat-hf/tokenizer_config.json specee_weights/
- Run ./specee_weights/add_eagle.py
python ./specee_weights/add_eagle.py
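Conceptually, this merge step attaches the EAGLE draft-model tensors to the copied Llama-2 checkpoint so that the GGUF conversion below packs both into a single file. A hypothetical sketch of such a merge (the real logic lives in ./specee_weights/add_eagle.py; file names and the key prefix here are assumptions):

```python
from safetensors.torch import load_file, save_file

# Assumed file locations -- adjust to where the EAGLE weights were downloaded.
eagle = load_file("EAGLE-llama2-chat-7B/model.safetensors")
base = load_file("specee_weights/model-00002-of-00002.safetensors")
# Prefix the draft tensors so they cannot collide with base-model keys.
base.update({f"eagle.{k}": v for k, v in eagle.items()})
save_file(base, "specee_weights/model-00002-of-00002.safetensors")
```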
- Convert safetensors to gguf
python convert_hf_to_gguf.py Llama-2-7b-chat-hf --outtype f16 --outfile models/Llama-2-7b-chat-hf.gguf --verbose
python convert_hf_to_gguf.py specee_weights --outtype f16 --outfile models/SpecEE-7b-chat-hf.gguf --verbose
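Then build llama.cpp with CUDA support: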
make GGML_CUDA=1 -j$(nproc)
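Then run inference with the SpecEE GGUF model. In the llama.cpp command below, -ngl sets how many layers are offloaded to the GPU, -t the number of CPU threads, -n the number of tokens to generate, -c the context size, and -s the random seed; --top_k 0 --temp 0 makes decoding greedy.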
CUDA_VISIBLE_DEVICES=0 \
./llama-cli \
-m models/SpecEE-7b-chat-hf.gguf \
-p "Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions." \
-e -ngl 16 -t 4 -n 256 -c 512 -s 8 --top_k 0 --temp 0
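Evaluate SpecEE on the benchmark datasets with the provided scripts: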
python eval_on_alpaca_dataset.py
python eval_on_gsm8k_dataset.py
python eval_on_humaneval_dataset.py
python eval_on_mt-bench_dataset.py
python eval_on_qa_dataset.py
python eval_on_sum_dataset.py
For the cloud scenario, please enter the SpecEE-cloud directory.
First, set up the environments. We need three environments to evaluate SpecEE with HuggingFace and AWQ.
cd SpecEE-cloud
conda create -n SpecEE python==3.10
conda activate SpecEE
pip install -r requirements.txt
cd SpecEE-cloud
conda create -n awq python==3.10
conda activate awq
pip install -r requirements_awq.txt
cd SpecEE-cloud
conda create -n specee_awq python==3.10
conda activate specee_awq
pip install -r requirements_awq.txt
cd AutoAWQ-0.2.6
pip install -e .
Second, download the required models.
huggingface-cli login
huggingface-cli download meta-llama/Llama-2-7b-chat-hf
huggingface-cli download TheBloke/Llama-2-7B-Chat-AWQ
huggingface-cli download yuhuili/EAGLE-llama2-chat-7B
Our code can evaluate both the speed and the accuracy of SpecEE on top of HuggingFace.
First, activate the SpecEE environment.
conda activate SpecEE
An example command to evaluate speed:
CUDA_VISIBLE_DEVICES=0 python EEInference.py --base-model-path meta-llama/Llama-2-7b-chat-hf --draft-model-path yuhuili/EAGLE-llama2-chat-7B --dataset mt_bench --task speed --predictor-path [the local path of ./llama-7b]
An example command to evaluate accuracy:
CUDA_VISIBLE_DEVICES=0 python EEInference.py --base-model-path meta-llama/Llama-2-7b-chat-hf --draft-model-path yuhuili/EAGLE-llama2-chat-7B --dataset sst2 --task accuracy --predictor-path [the local path of ./llama-7b]
You can replace the model path information in run_hf_specee.sh to replicate all AE (artifact evaluation) experiments in one go.
./run_hf_specee.sh
Our AE code can also evaluate speed and accuracy on top of AWQ.
First, activate the SpecEE environment.
conda activate SpecEE
An example command:
CUDA_VISIBLE_DEVICES=0 python EEInference_awq.py --base-model-path TheBloke/Llama-2-7B-Chat-AWQ --draft-model-path yuhuili/EAGLE-llama2-chat-7B --dataset commonsenseqa --task accuracy --predictor-path [the local path of ./llama-7b]
Then, to measure the AWQ baseline speed, activate the awq environment.
conda activate awq
You can get all AWQ speed results via:
CUDA_VISIBLE_DEVICES=0 python AwqInference.py --base-model-path TheBloke/Llama-2-7B-Chat-AWQ
Next, activate the specee_awq environment.
conda activate specee_awq
You can get all AWQ+SpecEE speed results via:
CUDA_VISIBLE_DEVICES=0 python AwqEEInference.py --base-model-path TheBloke/Llama-2-7B-Chat-AWQ --draft-model-path yuhuili/EAGLE-llama2-chat-7B
The speed evaluation results of AWQ and AWQ+SpecEE are written to raw_awq.json and specee_awq.json, respectively.
Run calculate_awq_speedup.py to get the speedup ratio of AWQ+SpecEE over the AWQ baseline:
python calculate_awq_speedup.py
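The speedup is simply the ratio of the two measured throughputs. A minimal sketch of that computation, assuming both JSON files map benchmark names to decoding speeds (the actual key layout may differ):

```python
import json

with open("raw_awq.json") as f:
    raw = json.load(f)        # assumed layout: {"benchmark": tokens_per_second}
with open("specee_awq.json") as f:
    specee = json.load(f)

for name in raw:
    print(f"{name}: {specee[name] / raw[name]:.2f}x speedup")
```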