# lm-eval v0.4.9 Release Notes

## Key Improvements
- **Enhanced Backend Support:**
  - SGLang Generate API by @baberabb in #2997
  - vLLM enhancements: added support for the `enable_thinking` argument (#2947) and data parallel for V1 (#3011) by @anmarques and @baberabb (see the sketch after this list)
  - Chat template improvements: extended vLLM chat template support (#2902) and fixed HF chat template resolution (#2992) by @anmarques and @fxmarty-amd
- **Multimodal Capabilities:**
  - Audio modality support for Qwen2 Audio models by @artemorloff in #2689
  - Image processing improvements: added image resizing support (#2958) and enabled multimodal usage with API models (#2981) by @artemorloff and @baberabb
  - ChartQA multimodal task implementation by @baberabb in #2544
- **Performance & Reliability:**
  - Quantization support added via `quantization_config` by @jerryzh168 in #2842
  - Memory optimization: use `yaml.CLoader` for faster YAML loading by @giuliolovisotto in #2777
  - Bug fixes: resolved MMLU generative metric aggregation (#2761) and context-length handling issues (#2972)
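For those who want to try the new backend options, here is a minimal sketch of a vLLM run combining the new `enable_thinking` argument (#2947) with V1 data parallelism (#3011). The model name is a placeholder and the exact argument spellings are assumptions inferred from the PR titles; consult the linked PRs for the authoritative interface.

```bash
# Hypothetical sketch: run GSM8K on the vLLM backend with the new options.
# `enable_thinking` and `data_parallel_size` are assumed to pass through
# --model_args like other vLLM engine options.
lm_eval --model vllm \
  --model_args pretrained=Qwen/Qwen3-8B,enable_thinking=True,data_parallel_size=2 \
  --tasks gsm8k \
  --batch_size auto
```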
## New Benchmarks & Tasks

### Code Evaluation
- HumanEval Instruct - Instruction-following code generation benchmark by @baberabb in #2650
- MBPP Instruct - Instruction-based Python programming evaluation by @baberabb in #2995
### Language Modeling
- C4 Dataset Support - Added perplexity evaluation on C4 web crawl dataset by @Zephyr271828 in #2889
### Long Context Benchmarks
- Long-context task suite by @baberabb in #2629
### Mathematical & Reasoning
- GSM8K Platinum - Enhanced mathematical reasoning benchmark by @Qubitium in #2771
- MastermindEval - Logic reasoning evaluation by @whoisjones in #2788
- JSONSchemaBench - Structured output evaluation by @Saibo-creator in #2865
### Llama Reference Implementations
- Task variants for Multilingual MMLU, MMLU CoT, GSM8K, and ARC Challenge based on Llama evaluation standards by @anmarques in #2797, #2826, #2829
### Multilingual Expansion

**Asian Languages:**
- Korean MMLU (KMMLU) multiple-choice task by @Aprilistic in #2849
- MMLU-ProX extended evaluation by @heli-qi in #2811
- KBL 2025 Dataset - Updated Korean benchmark evaluation by @abzb1 in #3000
**European Languages:**
- NorEval - Norwegian evaluation benchmark by @vmkhlv in #2919

**African Languages:**
- AfroBench - Evaluation suite spanning multiple African languages by @JessicaOjo in #2825
- Darija tasks - Moroccan dialect benchmarks (DarijaMMLU, DarijaHellaSwag, Darija_Bench) by @hadi-abdine in #2521
**Arabic Languages:**
- Arab Culture task for cultural understanding by @bodasadallah in #3006
### Domain-Specific Benchmarks
- CareQA - Healthcare evaluation benchmark by @PabloAgustin in #2714
- ACPBench & ACPBench Hard - Reasoning about action, change, and planning by @harshakokel in #2807, #2980
- INCLUDE tasks - Multilingual evaluation suite built on regional knowledge by @agromanou in #2769
- Cocoteros VA dataset by @sgs97ua in #2787
### Social & Bias Evaluation
- Various social bias tasks for fairness assessment by @oskarvanderwal in #1185
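The new benchmarks are run through the standard task interface. A minimal sketch follows; the task identifiers below (e.g. `gsm8k_platinum`, `careqa`) are assumptions based on the benchmark names, so verify the registered names with `lm_eval --tasks list` first.

```bash
# Hypothetical sketch: the task names are assumptions; confirm them with
# `lm_eval --tasks list` before running.
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct \
  --tasks gsm8k_platinum,careqa \
  --batch_size 8
```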
## Technical Enhancements
- Fine-grained evaluation: added an `--examples` argument for efficient multi-prompt evaluation by @felipemaiapolo and @mirianfsilva in #2520
- Improved tokenization: better handling of `add_bos_token` initialization by @baberabb in #2781
- Memory management: enhanced softmax computations with a `softmax_dtype` argument for `HFLM` by @Avelina9X in #2921 (see the sketch after this list)
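A minimal sketch of the new `HFLM` option. The `softmax_dtype=float32` spelling and value are assumptions based on the PR title (#2921); the `--examples` file format is defined in #2520 and not shown here.

```bash
# Hypothetical sketch: coerce log-softmax computations to float32 to avoid
# numerical issues when the model itself runs in bfloat16 (see #2921).
lm_eval --model hf \
  --model_args pretrained=EleutherAI/pythia-1.4b,dtype=bfloat16,softmax_dtype=float32 \
  --tasks lambada_openai
```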
## Critical Bug Fixes
- Collating Queries Fix - Resolved an error when collating queries with different continuation lengths that caused evaluation failures by @ameyagodbole in #2987
- Mutual Information Metric - Fixed an `acc_mutual_info` calculation bug that affected metric accuracy by @baberabb in #3035
## Breaking Changes & Important Updates
- MMLU dataset migration: switched to the `cais/mmlu` dataset source by @baberabb in #2918
- Default parameter updates: increased `max_gen_toks` to 2048 and `max_length` to 8192 for MMLU Pro tests by @dazipe in #2824
- Temperature defaults: set the default temperature to 0.0 for the vLLM and SGLang backends by @baberabb in #2819 (see the note after this list)
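If your generative evaluations relied on the previous sampling behavior of the vLLM or SGLang backends, restore it explicitly via generation kwargs. A minimal sketch; the temperature value is only illustrative, not a recommendation.

```bash
# Hypothetical sketch: override the new greedy default (temperature=0.0)
# by passing generation kwargs explicitly.
lm_eval --model vllm \
  --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.3 \
  --tasks gsm8k \
  --gen_kwargs temperature=0.8,do_sample=True
```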
We extend our heartfelt thanks to all contributors who made this release possible, including 46 first-time contributors who brought fresh perspectives and valuable improvements to the evaluation harness.
## What's Changed
- fix mmlu (generative) metric aggregation by @wangcho2k in #2761
- Bugfix by @baberabb in #2762
- fix verbosity typo by @baberabb in #2765
- docs: Fix typos in README.md by @ruivieira in #2778
- initialize tokenizer with `add_bos_token` by @baberabb in #2781
- improvement: Use yaml.CLoader to load yaml files when available by @giuliolovisotto in #2777
- Consistency Fix: Filter new leaderboard_math_hard dataset to "Level 5" only by @perlitz in #2773
- Fix for mc2 calculation by @kdymkiewicz in #2768
- New healthcare benchmark: careqa by @PabloAgustin in #2714
- Capture gen_kwargs from CLI in squad_completion by @ksurya in #2727
- humaneval instruct by @baberabb in #2650
- Update evaluator.py by @zhuzeyuan in #2786
- change piqa dataset path (uses parquet rather than dataset script) by @baberabb in #2790
- use verify_certificate flag in batch requests by @daniel-salib in #2785
- add audio modality (qwen2 audio only) by @artemorloff in #2689
- Add various social bias tasks by @oskarvanderwal in #1185
- update pre-commit by @baberabb in #2799
- Update Legacy OpenLLM leaderboard to use "train" split for ARC fewshot by @Avelina9X in #2802
- Add INCLUDE tasks by @agromanou in #2769
- Add support for token-based auth for watsonx models by @kiersten-stokes in #2796
- add version by @baberabb in #2808
- Add cocoteros_va dataset by @sgs97ua in #2787
- Add MastermindEval by @whoisjones in #2788
- Add loncxt tasks by @baberabb in #2629
- [hf-multimodal] pass kwargs to self.processor by @baberabb in #2667
- [MM] Chartqa by @baberabb in #2544
- Allow writing config to wandb by @ksurya in #2736
- [change] group -> tag on afrimgsm, afrimmlu, afrixnli dataset by @jd730 in #2813
- Clean up README and pyproject.toml by @kiersten-stokes in #2814
- Llama3 mmlu correction by @anmarques in #2797
- Add Markdown linter by @kiersten-stokes in #2818
- Configure the pad tokens for Qwen when using vLLM by @zhangruoxu in #2810
- fix typo in humaneval by @baberabb in #2820
- default temp=0.0 for vllm and sglang by @baberabb in #2819
- Fixes to mmlu_pro_llama by @anmarques in #2816
- Add MMLU-ProX task by @heli-qi in #2811
- Quick fix for mmlu_pro_llama by @anmarques in #2827
- Fix: tj-actions/changed-files is compromised by @Tautorn in #2828
- Multilingual MMLU for Llama instruct models by @anmarques in #2826
- bbh - changed dataset to parquet version by @baberabb in #2845
- Fix typo in longbench metrics by @djwackey in #2854
- Add kmmlu multiple-choice(accuracy) task #2848 by @Aprilistic in #2849
- Adding ACPBench task by @harshakokel in #2807
- add Darija (Moroccan dialects) tasks including darijammlu, darijahellaswag and darija_bench by @hadi-abdine in #2521
- Increase default max_gen_toks to 2048 and max_length to 8192 for MMLU Pro tests by @dazipe in #2824
- doc by @baberabb in #2857
- Fix: ACPBench Link by @harshakokel in #2860
- Adds MMLU CoT, gsm8k and arc_challenge for llama instruct by @anmarques in #2829
- [leaderboard] math - sync with repo by @baberabb in #2817
- Update supported models by @danielholanda in #2866
- Add JSONSchemaBench: A Benchmark for Evaluating Structured Output from LLMs by @Saibo-creator in #2865
- leaderboard - add subtask scores by @baberabb in #2867
- Fix the deps of longbench from jeiba to jieba by @houseroad in #2873
- Optimization for evalita-llm rouge computation by @m-resta in #2878
- Update authentications methods, add support for deployment_id for IBM watsonx_ai by @Medokins in #2877
- Add GSM8K Platinum by @Qubitium in #2771
- Add `--examples` argument for fine-grained task evaluation in `lm-evaluation-harness`; this feature is the first step towards efficient multi-prompt evaluation with PromptEval [1,2] by @felipemaiapolo in #2520
- Extend support for chat template in vLLM by @anmarques in #2902
- tasks README: fix dead link by @dtrifiro in #2899
- Add support for quantization_config by @jerryzh168 in #2842
- Fix a typo in README for tasks by @eldarkurtic in #2910
- fix resolve_hf_chat_template version by @baberabb in #2917
- mmlu - switch dataset to cais/mmlu; fix tests by @baberabb in #2918
- init pixels before tokenizer creation by @artemorloff in #2911
- Longbench bugfix by @baberabb in #2895
- Added softmax_dtype argument to HFLM to coerce log_softmax computations by @Avelina9X in #2921
- [bbh] use np.nan for numpy > 2.0 by @baberabb in #2937
- Add support for enable_thinking argument in vllm model by @anmarques in #2947
- Added NorEval, a novel Norwegian benchmark by @vmkhlv in #2919
- Fix import error for eval_logger in score utils by @annafontanaa in #2940
- Include all test files in sdist by @booxter in #2634
- Change citation name by @StellaAthena in #2956
- [vllm] add warning on truncation by @baberabb in #2962
- fix: type error while checking context length by @llsj14 in #2972
- Fix import error for deepcopy by @kiersten-stokes in #2969
- Pin unitxt to most recent minor version to avoid test failures by @kiersten-stokes in #2970
- mmlu pro generation_kwargs until Q: -> Question: by @yoonniverse in #2945
- AfroBench: How Good are Large Language Models on African Languages? by @JessicaOjo in #2825
- Added C4 Support by @Zephyr271828 in #2889
- Fixed a bug in MMLU-Pro utils.py that threw an index error if one choice was removed by @sleepingcat4 in #2870
- Add question suffix before the <|assistant|> tag by @TingchenFu in #2876
- Add device arg to model_args passed to LLM object in VLLM model class by @momentino in #2879
- paws-x fix formatting by @baberabb in #2759
- Delete scripts/cost_estimate.py by @StellaAthena in #2985
- Adding ACPBench Hard tasks by @harshakokel in #2980
- [SGLANG] Add the SGLANG generate API by @baberabb in #2997
- fix example notebook by @baberabb in #2998
- Log tokenized request warning only once by @RobGeada in #3002
- [Add Dataset Update] KBL 2025 by @abzb1 in #3000
- Output path fix by @Niccolo-Ajroldi in #2993
- use images with api models by @baberabb in #2981
- Adding resize images support by @artemorloff in #2958
- Revert "feat: add question suffix (#2876)" by @baberabb in #3007
- [hotfix] modify multimodal check in evaluate by @baberabb in #3013
- [Fix] Update `resolve_hf_chat_template` arguments by @fxmarty-amd in #2992
- Fix error in collating queries with different continuation lengths (fixes #2984) by @ameyagodbole in #2987
- [vllm] data parallel for V1 by @baberabb in #3011
- add arab_culture task by @bodasadallah in #3006
- chore: clean up and extend .gitignore rules by @e1washere in #3030
- Enable text-only evals for VLM models by @ysulsky in #2999
- [Fix] acc_mutual_info metric calculation bug by @baberabb in #3035
- Fix: fix vllm issue with DP>1 by @younesbelkada in #3025
- add Mbpp instruct by @baberabb in #2995
- remove prints by @baberabb in #3041
- [longbench] fix metric calculation by @baberabb in #2983
- Fallback to super implementation in `fewshot_context` for Unitxt tasks by @kiersten-stokes in #3023
- Fix Typo in README and Comment in utils_mcq.py by @vtjl10 in #3057
- fix longbench citation by @baberabb in #3061
- mmlu task: update README.md by @annafontanaa in #3070
- Fix typos in docstrings in instructions.py by @maximevtush in #3060
- bump version to `0.4.9` by @baberabb in #3073
## New Contributors
- @wangcho2k made their first contribution in #2761
- @ruivieira made their first contribution in #2778
- @perlitz made their first contribution in #2773
- @kdymkiewicz made their first contribution in #2768
- @PabloAgustin made their first contribution in #2714
- @ksurya made their first contribution in #2727
- @zhuzeyuan made their first contribution in #2786
- @daniel-salib made their first contribution in #2785
- @oskarvanderwal made their first contribution in #1185
- @Avelina9X made their first contribution in #2802
- @agromanou made their first contribution in #2769
- @whoisjones made their first contribution in #2788
- @jd730 made their first contribution in #2813
- @anmarques made their first contribution in #2797
- @zhangruoxu made their first contribution in #2810
- @heli-qi made their first contribution in #2811
- @Tautorn made their first contribution in #2828
- @djwackey made their first contribution in #2854
- @Aprilistic made their first contribution in #2849
- @harshakokel made their first contribution in #2807
- @hadi-abdine made their first contribution in #2521
- @dazipe made their first contribution in #2824
- @danielholanda made their first contribution in #2866
- @Saibo-creator made their first contribution in #2865
- @houseroad made their first contribution in #2873
- @felipemaiapolo made their first contribution in #2520
- @dtrifiro made their first contribution in #2899
- @jerryzh168 made their first contribution in #2842
- @vmkhlv made their first contribution in #2919
- @annafontanaa made their first contribution in #2940
- @booxter made their first contribution in #2634
- @llsj14 made their first contribution in #2972
- @yoonniverse made their first contribution in #2945
- @Zephyr271828 made their first contribution in #2889
- @sleepingcat4 made their first contribution in #2870
- @TingchenFu made their first contribution in #2876
- @momentino made their first contribution in #2879
- @abzb1 made their first contribution in #3000
- @Niccolo-Ajroldi made their first contribution in #2993
- @fxmarty-amd made their first contribution in #2992
- @ameyagodbole made their first contribution in #2987
- @e1washere made their first contribution in #3030
- @ysulsky made their first contribution in #2999
- @younesbelkada made their first contribution in #3025
- @vtjl10 made their first contribution in #3057
- @maximevtush made their first contribution in #3060
**Full Changelog**: v0.4.8...v0.4.9