This repository offers a lightweight and flexible solution for evaluating models on the GSM8K benchmark. The results are generally consistent with those obtained using lm-evaluation-harness.
The 8-shot prompt is taken from the lm-evaluation-harness `gsm8k-cot` task.
```bash
python eval_gsm8k_few_shot.py --model <model_name>
```
| Model | Accuracy (%) | Harness Accuracy (%) |
|---|---|---|
| Mistral-7B-v0.1 | 41.02 | 42.99 (1.36) |
| Llama-2-7b-hf | 13.72 | 14.33 (0.97) |
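GSM8K scoring comes down to comparing the final number in the model's completion against the number following `####` in the reference answer. The sketch below illustrates one common way to implement that extraction-and-match step; it is a hypothetical illustration, not this repository's actual code.

```python
import re

def extract_answer(text: str) -> str | None:
    """Return the last number in a completion (GSM8K-style scoring)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

def is_correct(completion: str, reference: str) -> bool:
    """Compare the model's final number to the gold answer after '####'."""
    gold = reference.split("####")[-1].strip().replace(",", "")
    pred = extract_answer(completion)
    return pred is not None and pred == gold
```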
Zero-shot evaluation with majority voting (8 samples at temperature 0.2):

```bash
python eval_gsm8k_zero_shot.py --model <model_name> --use_majority_vote --temp 0.2 --n_votes 8
```
| Model | Accuracy (%) | Harness Accuracy (%) |
|---|---|---|
| Mistral-7B-v0.1 | 47.84 | 44.96 (1.37) |
The same, but with temperature 0.4:

```bash
python eval_gsm8k_zero_shot.py --model <model_name> --use_majority_vote --temp 0.4 --n_votes 8
```
| Model | Accuracy (%) |
|---|---|
| Mistral-7B-v0.1 | 50.57 |
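Both runs above use majority voting (self-consistency): `--n_votes` completions are sampled at the given temperature, and the most frequent extracted answer is taken as the prediction. A minimal sketch of that aggregation step, reusing the hypothetical `extract_answer` helper above:

```python
from collections import Counter

def majority_vote(completions: list[str]) -> str | None:
    """Return the most common extracted answer across sampled completions."""
    answers = [a for c in completions if (a := extract_answer(c)) is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None

# e.g. with n_votes completions sampled at temperature 0.2:
# prediction = majority_vote(sampled_completions)
```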
Zero-shot evaluation with the chain-of-thought prompt "Let's think step by step." added before the model answers the question:

```bash
python eval_gsm8k_zero_shot.py --model <model_name> --use_cot_prompt
```
| Model | Accuracy (%) | Harness Accuracy (%) |
|---|---|---|
| Mistral-7B-v0.1 | 22.06 | 15.85 (1.01) |
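The trigger phrase is quoted above; the exact template the script uses is not shown here, but a typical zero-shot CoT prompt places the phrase between the question and the answer, along these lines (a hypothetical sketch):

```python
def build_cot_prompt(question: str) -> str:
    # Hypothetical template: the trigger phrase goes after the question,
    # so the model "thinks step by step" before stating a final answer.
    return f"Question: {question}\nAnswer: Let's think step by step."
```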
Plain zero-shot evaluation (no chain-of-thought prompt, no majority voting):

```bash
python eval_gsm8k_zero_shot.py --model <model_name>
```
| Model | Accuracy (%) |
|---|---|
| Mistral-7B-v0.1 | 10.31 |