# lm-eval v0.4.9 Release Notes

## Key Improvements
- **Enhanced Backend Support:**
  - SGLang Generate API by @baberabb in #2997
  - vLLM enhancements: added support for the `enable_thinking` argument (#2947) and data parallel for V1 (#3011) by @anmarques and @baberabb (see the sketch after this list)
  - Chat template improvements: extended vLLM chat template support (#2902) and fixed HF chat template resolution (#2992) by @anmarques and @fxmarty-amd
- **Multimodal Capabilities:**
  - Audio modality support for Qwen2 Audio models by @artemorloff in #2689
  - Image processing improvements: added image resizing support (#2958) and enabled multimodal usage with API models (#2981) by @artemorloff and @baberabb
  - ChartQA multimodal task implementation by @baberabb in #2544
- **Performance & Reliability:**
  - Quantization support added via `quantization_config` by @jerryzh168 in #2842
  - Memory optimization: use `yaml.CLoader` for faster YAML loading by @giuliolovisotto in #2777
  - Bug fixes: resolved MMLU generative metric aggregation (#2761) and context-length handling issues (#2972)
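For those who want to try the new backend options, here is a minimal sketch of a vLLM run combining the new `enable_thinking` argument (#2947) with V1 data parallelism (#3011). The model name is a placeholder and the exact argument spellings are assumptions inferred from the PR titles; consult the linked PRs for the authoritative interface.

```bash
# Hypothetical sketch: run GSM8K on the vLLM backend with the new options.
# `enable_thinking` and `data_parallel_size` are assumed to pass through
# --model_args like other vLLM engine options.
lm_eval --model vllm \
  --model_args pretrained=Qwen/Qwen3-8B,enable_thinking=True,data_parallel_size=2 \
  --tasks gsm8k \
  --batch_size auto
```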
## New Benchmarks & Tasks

### Code Evaluation
- HumanEval Instruct - Instruction-following code generation benchmark by @baberabb in #2650
- MBPP Instruct - Instruction-based Python programming evaluation by @baberabb in #2995
### Language Modeling
- C4 Dataset Support - Added perplexity evaluation on C4 web crawl dataset by @Zephyr271828 in #2889
### Long Context Benchmarks
- Long-context task suite by @baberabb in #2629
### Mathematical & Reasoning
- GSM8K Platinum - Enhanced mathematical reasoning benchmark by @Qubitium in #2771
- MastermindEval - Logic reasoning evaluation by @whoisjones in #2788
- JSONSchemaBench - Structured output evaluation by @Saibo-creator in #2865
### Llama Reference Implementations
- Task variants for Multilingual MMLU, MMLU CoT, GSM8K, and ARC Challenge based on Llama evaluation standards by @anmarques in #2797, #2826, #2829
### Multilingual Expansion

**Asian Languages:**
- Korean MMLU (KMMLU) multiple-choice task by @Aprilistic in #2849
- MMLU-ProX extended evaluation by @heli-qi in #2811
- KBL 2025 Dataset - Updated Korean benchmark evaluation by @abzb1 in #3000
**European Languages:**
- NorEval - Norwegian evaluation benchmark by @vmkhlv in #2919

**African Languages:**
- AfroBench - Evaluation suite spanning multiple African languages by @JessicaOjo in #2825
- Darija tasks - Moroccan dialect benchmarks (DarijaMMLU, DarijaHellaSwag, Darija_Bench) by @hadi-abdine in #2521
**Arabic Languages:**
- Arab Culture task for cultural understanding by @bodasadallah in #3006
### Domain-Specific Benchmarks
- CareQA - Healthcare evaluation benchmark by @PabloAgustin in #2714
- ACPBench & ACPBench Hard - Reasoning about action, change, and planning by @harshakokel in #2807, #2980
- INCLUDE tasks - Multilingual evaluation suite built on regional knowledge by @agromanou in #2769
- Cocoteros VA dataset by @sgs97ua in #2787
### Social & Bias Evaluation
- Various social bias tasks for fairness assessment by @oskarvanderwal in #1185
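The new benchmarks are run through the standard task interface. A minimal sketch follows; the task identifiers below (e.g. `gsm8k_platinum`, `careqa`) are assumptions based on the benchmark names, so verify the registered names with `lm_eval --tasks list` first.

```bash
# Hypothetical sketch: the task names are assumptions; confirm them with
# `lm_eval --tasks list` before running.
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct \
  --tasks gsm8k_platinum,careqa \
  --batch_size 8
```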
## Technical Enhancements
- Fine-grained evaluation: added an `--examples` argument for efficient multi-prompt evaluation by @felipemaiapolo and @mirianfsilva in #2520
- Improved tokenization: better handling of `add_bos_token` initialization by @baberabb in #2781
- Memory management: enhanced softmax computations with a `softmax_dtype` argument for `HFLM` by @Avelina9X in #2921 (see the sketch after this list)
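A minimal sketch of the new `HFLM` option. The `softmax_dtype=float32` spelling and value are assumptions based on the PR title (#2921); the `--examples` file format is defined in #2520 and not shown here.

```bash
# Hypothetical sketch: coerce log-softmax computations to float32 to avoid
# numerical issues when the model itself runs in bfloat16 (see #2921).
lm_eval --model hf \
  --model_args pretrained=EleutherAI/pythia-1.4b,dtype=bfloat16,softmax_dtype=float32 \
  --tasks lambada_openai
```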
## Critical Bug Fixes
- Collating Queries Fix - Resolved an error when collating queries with different continuation lengths that caused evaluation failures by @ameyagodbole in #2987
- Mutual Information Metric - Fixed an `acc_mutual_info` calculation bug that affected metric accuracy by @baberabb in #3035
## Breaking Changes & Important Updates
- MMLU dataset migration: switched to the `cais/mmlu` dataset source by @baberabb in #2918
- Default parameter updates: increased `max_gen_toks` to 2048 and `max_length` to 8192 for MMLU Pro tests by @dazipe in #2824
- Temperature defaults: set the default temperature to 0.0 for the vLLM and SGLang backends by @baberabb in #2819 (see the note after this list)
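If your generative evaluations relied on the previous sampling behavior of the vLLM or SGLang backends, restore it explicitly via generation kwargs. A minimal sketch; the temperature value is only illustrative, not a recommendation.

```bash
# Hypothetical sketch: override the new greedy default (temperature=0.0)
# by passing generation kwargs explicitly.
lm_eval --model vllm \
  --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.3 \
  --tasks gsm8k \
  --gen_kwargs temperature=0.8,do_sample=True
```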
We extend our heartfelt thanks to all contributors who made this release possible, including 46 first-time contributors who brought fresh perspectives and valuable improvements to the evaluation harness.
## What's Changed
- fix mmlu (generative) metric aggregation by @wangcho2k in #2761
- Bugfix by @baberabb in #2762
- fix verbosity typo by @baberabb in #2765
- docs: Fix typos in README.md by @ruivieira in #2778
- initialize tokenizer with `add_bos_token` by @baberabb in #2781
- improvement: Use yaml.CLoader to load yaml files when available by @giuliolovisotto in #2777
- Consistency Fix: Filter new leaderboard_math_hard dataset to "Level 5" only by @perlitz in #2773
- Fix for mc2 calculation by @kdymkiewicz in #2768
- New healthcare benchmark: careqa by @PabloAgustin in #2714
- Capture gen_kwargs from CLI in squad_completion by @ksurya in #2727
- humaneval instruct by @baberabb in #2650
- Update evaluator.py by @zhuzeyuan in #2786
- change piqa dataset path (uses parquet rather than dataset script) by @baberabb in #2790
- use verify_certificate flag in batch requests by @daniel-salib in #2785
- add audio modality (qwen2 audio only) by @artemorloff in #2689
- Add various social bias tasks by @oskarvanderwal in #1185
- update pre-commit by @baberabb in #2799
- Update Legacy OpenLLM leaderboard to use "train" split for ARC fewshot by @Avelina9X in #2802
- Add INCLUDE tasks by @agromanou in #2769
- Add support for token-based auth for watsonx models by @kiersten-stokes in #2796
- add version by @baberabb in #2808
- Add cocoteros_va dataset by @sgs97ua in #2787
- Add MastermindEval by @whoisjones in #2788
- Add loncxt tasks by @baberabb in #2629
- [hf-multimodal] pass kwargs to self.processor by @baberabb in #2667
- [MM] Chartqa by @baberabb in #2544
- Allow writing config to wandb by @ksurya in #2736
- [change] group -> tag on afrimgsm, afrimmlu, afrixnli dataset by @jd730 in #2813
- Clean up README and pyproject.toml by @kiersten-stokes in #2814
- Llama3 mmlu correction by @anmarques in #2797
- Add Markdown linter by @kiersten-stokes in #2818
- Configure the pad tokens for Qwen when using vLLM by @zhangruoxu in #2810
- fix typo in humaneval by @baberabb in #2820
- default temp=0.0 for vllm and sglang by @baberabb in #2819
- Fixes to mmlu_pro_llama by @anmarques in #2816
- Add MMLU-ProX task by @heli-qi in #2811
- Quick fix for mmlu_pro_llama by @anmarques in #2827
- Fix: tj-actions/changed-files is compromised by @Tautorn in #2828
- Multilingual MMLU for Llama instruct models by @anmarques in #2826
- bbh - changed dataset to parquet version by @baberabb in #2845
- Fix typo in longbench metrics by @djwackey in #2854
- Add kmmlu multiple-choice(accuracy) task #2848 by @Aprilistic in #2849
- Adding ACPBench task by @harshakokel in #2807
- add Darija (Moroccan dialects) tasks including darijammlu, darijahellaswag and darija_bench by @hadi-abdine in #2521
- Increase default max_gen_toks to 2048 and max_length to 8192 for MMLU Pro tests by @dazipe in #2824
- doc by @baberabb in #2857
- Fix: ACPBench Link by @harshakokel in #2860
- Adds MMLU CoT, gsm8k and arc_challenge for llama instruct by @anmarques in #2829
- [leaderboard] math - sync with repo by @baberabb in #2817
- Update supported models by @danielholanda in #2866
- Add JSONSchemaBench: A Benchmark for Evaluating Structured Output from LLMs by @Saibo-creator in #2865
- leaderboard - add subtask scores by @baberabb in #2867
- Fix the deps of longbench from jeiba to jieba by @houseroad in #2873
- Optimization for evalita-llm rouge computation by @m-resta in #2878
- Update authentications methods, add support for deployment_id for IBM watsonx_ai by @Medokins in #2877
- Add GSM8K Platinum by @Qubitium in #2771
- Add `--examples` argument for fine-grained task evaluation in `lm-evaluation-harness`; this feature is the first step towards efficient multi-prompt evaluation with PromptEval [1,2] by @felipemaiapolo in #2520
- Extend support for chat template in vLLM by @anmarques in #2902
- tasks README: fix dead link by @dtrifiro in #2899
- Add support for quantization_config by @jerryzh168 in #2842
- Fix a typo in README for tasks by @eldarkurtic in #2910
- fix resolve_hf_chat_template version by @baberabb in #2917
- mmlu - switch dataset to cais/mmlu; fix tests by @baberabb in #2918
- init pixels before tokenizer creation by @artemorloff in #2911
- Longbench bugfix by @baberabb in #2895
- Added softmax_dtype argument to HFLM to coerce log_softmax computations by @Avelina9X in #2921
- [bbh] use np.nan for numpy > 2.0 by @baberabb in #2937
- Add support for enable_thinking argument in vllm model by @anmarques in #2947
- Added NorEval, a novel Norwegian benchmark by @vmkhlv in #2919
- Fix import error for eval_logger in score utils by @annafontanaa in #2940
- Include all test files in sdist by @booxter in #2634
- Change citation name by @StellaAthena in #2956
- [vllm] add warning on truncation by @baberabb in #2962
- fix: type error while checking context length by @llsj14 in #2972
- Fix import error for deepcopy by @kiersten-stokes in #2969
- Pin unitxt to most recent minor version to avoid test failures by @kiersten-stokes in #2970
- mmlu pro generation_kwargs until Q: -> Question: by @yoonniverse in #2945
- AfroBench: How Good are Large Language Models on African Languages? by @JessicaOjo in #2825
- Added C4 Support by @Zephyr271828 in #2889
- Fixed a bug in MMLU-Pro utils.py that threw an index error if one choice was removed by @sleepingcat4 in #2870
- Add question suffix before the <|assistant|> tag by @TingchenFu in #2876
- Add device arg to model_args passed to LLM object in VLLM model class by @momentino in #2879
- paws-x fix formatting by @baberabb in #2759
- Delete scripts/cost_estimate.py by @StellaAthena in #2985
- Adding ACPBench Hard tasks by @harshakokel in #2980
- [SGLANG] Add the SGLANG generate API by @baberabb in #2997
- fix example notebook by @baberabb in #2998
- Log tokenized request warning only once by @RobGeada in #3002
- [Add Dataset Update] KBL 2025 by @abzb1 in #3000
- Output path fix by @Niccolo-Ajroldi in #2993
- use images with api models by @baberabb in #2981
- Adding resize images support by @artemorloff in #2958
- Revert "feat: add question suffix (#2876)" by @baberabb in #3007
- [hotfix] modify multimodal check in evaluate by @baberabb in #3013
- [Fix] Update `resolve_hf_chat_template` arguments by @fxmarty-amd in #2992
- Fix error in collating queries with different continuation lengths (fixes #2984) by @ameyagodbole in #2987
- [vllm] data parallel for V1 by @baberabb in #3011
- add arab_culture task by @bodasadallah in #3006
- chore: clean up and extend .gitignore rules by @e1washere in #3030
- Enable text-only evals for VLM models by @ysulsky in #2999
- [Fix] acc_mutual_info metric calculation bug by @baberabb in #3035
- Fix: fix vllm issue with DP>1 by @younesbelkada in #3025
- add Mbpp instruct by @baberabb in #2995
- remove prints by @baberabb in #3041
- [longbench] fix metric calculation by @baberabb in #2983
- Fallback to super implementation in `fewshot_context` for Unitxt tasks by @kiersten-stokes in #3023
- Fix Typo in README and Comment in utils_mcq.py by @vtjl10 in #3057
- fix longbench citation by @baberabb in #3061
- mmlu task: update README.md by @annafontanaa in #3070
- Fix typos in docstrings in instructions.py by @maximevtush in #3060
- bump version to `0.4.9` by @baberabb in #3073
## New Contributors
- @wangcho2k made their first contribution in #2761
- @ruivieira made their first contribution in #2778
- @perlitz made their first contribution in #2773
- @kdymkiewicz made their first contribution in #2768
- @PabloAgustin made their first contribution in #2714
- @ksurya made their first contribution in #2727
- @zhuzeyuan made their first contribution in #2786
- @daniel-salib made their first contribution in #2785
- @oskarvanderwal made their first contribution in #1185
- @Avelina9X made their first contribution in #2802
- @agromanou made their first contribution in #2769
- @whoisjones made their first contribution in #2788
- @jd730 made their first contribution in #2813
- @anmarques made their first contribution in #2797
- @zhangruoxu made their first contribution in #2810
- @heli-qi made their first contribution in #2811
- @Tautorn made their first contribution in #2828
- @djwackey made their first contribution in #2854
- @Aprilistic made their first contribution in #2849
- @harshakokel made their first contribution in #2807
- @hadi-abdine made their first contribution in #2521
- @dazipe made their first contribution in #2824
- @danielholanda made their first contribution in #2866
- @Saibo-creator made their first contribution in #2865
- @houseroad made their first contribution in #2873
- @felipemaiapolo made their first contribution in #2520
- @dtrifiro made their first contribution in #2899
- @jerryzh168 made their first contribution in #2842
- @vmkhlv made their first contribution in #2919
- @annafontanaa made their first contribution in #2940
- @booxter made their first contribution in #2634
- @llsj14 made their first contribution in #2972
- @yoonniverse made their first contribution in #2945
- @Zephyr271828 made their first contribution in #2889
- @sleepingcat4 made their first contribution in #2870
- @TingchenFu made their first contribution in #2876
- @momentino made their first contribution in #2879
- @abzb1 made their first contribution in #3000
- @Niccolo-Ajroldi made their first contribution in #2993
- @fxmarty-amd made their first contribution in #2992
- @ameyagodbole made their first contribution in #2987
- @e1washere made their first contribution in #3030
- @ysulsky made their first contribution in #2999
- @younesbelkada made their first contribution in #3025
- @vtjl10 made their first contribution in #3057
- @maximevtush made their first contribution in #3060
**Full Changelog**: v0.4.8...v0.4.9