Stars
Ongoing research training transformer models at scale
Matplotlib styles for scientific plotting
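A minimal sketch of how such scientific Matplotlib styles are typically applied (assumes the scienceplots package; the "no-latex" variant is used here so the snippet runs without a LaTeX installation — the data and file name are illustrative):

```python
import matplotlib.pyplot as plt
import numpy as np
import scienceplots  # importing registers the styles with Matplotlib

# "science" is the base preset; "ieee" is another documented option
plt.style.use(["science", "no-latex"])

x = np.linspace(0, 2 * np.pi, 200)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
fig.savefig("example.pdf")
```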
Code for the Paper: "STAMP Your Content: Proving Dataset Membership via Watermarked Rephrasings"
Code release to accompany the paper "Persistent Pre-training Poisoning of LLMs"
[NeurIPS D&B '25] The one-stop repository for large language model (LLM) unlearning. Supports TOFU, MUSE, WMDP, and many unlearning methods with easy feature extensibility.
Open-source framework for the research and development of foundation models.
PyTorch building blocks for the OLMo ecosystem
The code for creating the iGSM datasets in the papers "Physics of Language Models Part 2.1, Grade-School Math and the Hidden Reasoning Process" (arXiv 2407.20311) and "Physics of Language Models Part 2…
What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets
EleutherAI/nanoGPT-mup
Forked from karpathy/nanoGPT. The simplest, fastest repository for training/finetuning medium-sized GPTs.
A library for unit scaling in PyTorch
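A concept sketch of the unit-scaling idea in plain PyTorch, not the library's own API: weights are drawn with unit variance and the matrix product is rescaled by 1/sqrt(fan_in) so activations stay near unit variance at initialization. The function and shapes below are illustrative assumptions:

```python
import math
import torch

def unit_scaled_linear(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # x: (batch, fan_in); weight: (fan_out, fan_in) with ~unit-variance entries
    fan_in = weight.shape[1]
    return (x @ weight.t()) / math.sqrt(fan_in)

x = torch.randn(32, 1024)
w = torch.randn(512, 1024)      # plain unit-variance init, no extra scaling
y = unit_scaled_linear(x, w)
print(y.std())                  # roughly 1.0 at initialization
```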
An extremely fast Python package and project manager, written in Rust.
Train transformer language models with reinforcement learning.
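A minimal supervised fine-tuning sketch with trl, following its documented quickstart pattern (assumes a recent trl release; the model checkpoint, dataset, and step count are placeholder choices, not tied to any starred repo):

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# illustrative dataset and model; swap in your own
dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    args=SFTConfig(output_dir="sft-out", max_steps=100),
    train_dataset=dataset,
)
trainer.train()
```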
Fully open reproduction of DeepSeek-R1
A comprehensive repository of reasoning tasks for LLMs (and beyond)
Minimal reproduction of DeepSeek R1-Zero
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
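A minimal sketch of wrapping a PyTorch model with DeepSpeed's engine via its documented `deepspeed.initialize` entry point (the model, config values, and loss are illustrative; in practice this is launched with the `deepspeed` launcher across ranks):

```python
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 1},
}

# initialize returns (engine, optimizer, dataloader, lr_scheduler)
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

x = torch.randn(8, 1024).to(engine.device)
loss = engine(x).pow(2).mean()
engine.backward(loss)  # the engine handles loss scaling / gradient partitioning
engine.step()
```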
How much can we forget about Data Contamination? (ICML 2025)
Meta Lingua: a lean, efficient, and easy-to-hack codebase to research LLMs.
The paper list on data contamination for large language model evaluation.
A toolkit for quantitative evaluation of data attribution methods.