
v0.4.9

Released by @baberabb on 19 Jun at 14:18

lm-eval v0.4.9 Release Notes

Key Improvements

  • Performance & Reliability:

    • Quantization support added via quantization_config by @jerryzh168 in #2842 (see the sketch after this list)
    • Memory optimization: Use yaml.CLoader for faster YAML loading by @giuliolovisotto in #2777
    • Bug fixes: Resolved MMLU generative metric aggregation (#2761) and context-length handling (#2972) issues
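As one hedged illustration of the quantization path: the sketch below preloads a 4-bit model with transformers and hands it to the harness. The checkpoint and the BitsAndBytesConfig settings are illustrative assumptions, not part of this release, and we assume HFLM accepts a preloaded model instance, as recent lm-eval versions do.

```python
# Hedged sketch: evaluate a quantized model. The checkpoint and the
# BitsAndBytesConfig settings are illustrative; we assume HFLM accepts a
# preloaded transformers model, as recent lm-eval versions do.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from lm_eval import simple_evaluate
from lm_eval.models.huggingface import HFLM

# quantization_config is a standard transformers argument; this release also
# wires quantization support through the harness (PR #2842).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",  # illustrative checkpoint
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

results = simple_evaluate(model=HFLM(pretrained=model), tasks=["hellaswag"])
print(results["results"])
```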

New Benchmarks & Tasks

Code Evaluation

  • HumanEval Instruct - Instruction-following code generation benchmark by @baberabb in #2650
  • MBPP Instruct - Instruction-based Python programming evaluation by @baberabb in #2995 (a hedged invocation sketch follows below)
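A hedged invocation sketch for the new code benchmarks. The task names humaneval_instruct and mbpp_instruct are assumptions inferred from the PR titles, and the checkpoint is illustrative; code-generation tasks execute model output, so the harness requires an explicit opt-in.

```python
# Hedged sketch: the task names below are assumed from the PR titles and may
# differ from the registered names; check `lm_eval --tasks list` to confirm.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=bigcode/starcoder2-3b",  # illustrative checkpoint
    tasks=["humaneval_instruct", "mbpp_instruct"],  # assumed task names
    confirm_run_unsafe_code=True,  # code benchmarks execute generated code
)
for task, metrics in results["results"].items():
    print(task, metrics)
```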

Language Modeling

  • C4 Dataset Support - Added perplexity evaluation on the C4 web-crawl dataset by @Zephyr271828 in #2889

Long Context Benchmarks

  • RULER and LongBench - Long-context evaluation suites by @baberabb in #2629 (see the sketch below)
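A hedged sketch for the long-context suites. The group names ruler and longbench are assumed from the PR title, and the checkpoint and context length are illustrative; the model itself must support the requested window.

```python
# Hedged sketch: "ruler" and "longbench" group names are assumed, and the
# checkpoint is illustrative. max_length raises the context window the
# harness will feed the model.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=meta-llama/Llama-3.1-8B-Instruct,"  # illustrative
        "max_length=32768"
    ),
    tasks=["ruler"],  # or "longbench"
)
```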

Mathematical & Reasoning

  • Llama Reference Implementations - Task variants for Multilingual MMLU, MMLU CoT, GSM8K, and ARC Challenge following Llama evaluation standards, by @anmarques in #2797, #2826, #2829

Multilingual Expansion

European Languages:

  • NorEval - Comprehensive Norwegian benchmark by @vmkhlv in #2919

African Languages:

  • AfroBench - Evaluation suite spanning multiple African languages by @JessicaOjo in #2825
  • Darija tasks - Moroccan Arabic (Darija) benchmarks: DarijaMMLU, DarijaHellaSwag, and Darija_Bench by @hadi-abdine in #2521


Technical Enhancements

  • Fine-grained evaluation: Added an --examples argument for efficient multi-prompt evaluation by @felipemaiapolo and @mirianfsilva in #2520
  • Improved tokenization: Better handling of add_bos_token initialization by @baberabb in #2781
  • Memory management: Added a softmax_dtype argument to HFLM to control the precision of softmax computations by @Avelina9X in #2921 (see the sketch after this list)
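A small sketch of the new HFLM knob. We assume softmax_dtype is accepted as an HFLM constructor argument per the PR, with float32 shown as the target dtype; the checkpoint and task are illustrative.

```python
# Hedged sketch: run the model in bfloat16 but take the (log-)softmax over
# logits in float32. softmax_dtype is assumed to be an HFLM constructor
# kwarg per PR #2921; checkpoint and task are illustrative.
from lm_eval import simple_evaluate
from lm_eval.models.huggingface import HFLM

lm = HFLM(
    pretrained="EleutherAI/pythia-1.4b",
    dtype="bfloat16",
    softmax_dtype="float32",
)
results = simple_evaluate(model=lm, tasks=["lambada_openai"])
```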

Critical Bug Fixes

  • Collating Queries Fix - Resolved an error when collating queries with different continuation lengths that caused evaluation failures, by @ameyagodbole in #2987
  • Mutual Information Metric - Fixed a bug in the acc_mutual_info calculation that affected metric accuracy, by @baberabb in #3035

Breaking Changes & Important Updates

  • MMLU dataset migration: Switched to the cais/mmlu dataset source by @baberabb in #2918
  • Default parameter updates: Increased max_gen_toks to 2048 and max_length to 8192 for MMLU Pro tasks by @dazipe in #2824
  • Temperature defaults: Set the default temperature to 0.0 for the vLLM and SGLang backends by @baberabb in #2819 (see the sketch after this list)
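Because these defaults moved, results can shift silently between harness versions. A hedged sketch of pinning generation settings explicitly: gen_kwargs is an existing simple_evaluate parameter, though whether max_gen_toks is honored inside it is an assumption, and the values shown simply mirror the new defaults.

```python
# Hedged sketch: pin generation settings instead of relying on backend
# defaults, which changed in this release. Passing max_gen_toks through
# gen_kwargs is an assumption; temperature=0.0 mirrors the new default.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="vllm",
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct",  # illustrative
    tasks=["mmlu_pro"],
    gen_kwargs="temperature=0.0,max_gen_toks=2048",
)
```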

We extend our heartfelt thanks to all contributors who made this release possible, including 43 first-time contributors who brought fresh perspectives and valuable improvements to the evaluation harness.

Full Changelog: v0.4.8...v0.4.9