Convert Hugging Face datasets to the MosaicML Streaming (MDS) format for efficient cloud-based training.
pip install datasets huggingface_hub mosaicml-streaming
Batch convert an entire dataset:
python batch_to_mds.py \
--src wikimedia/wikipedia \
--out-hub bgub/wikipedia-mds-test \
--out-local ./mds-local-2/wikipedia \
--procs 10
Convert a single config/split:
python hf_to_mds_streaming.py \
--repo-id HuggingFaceFW/fineweb \
--split train \
--out-local /mnt/mds/fineweb \
--out-hub ben-gubler/fineweb-mds \
--procs 16 \
--streaming
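The internals of hf_to_mds_streaming.py are not shown here. As a rough sketch of what converting one split might involve, the example below assumes `load_dataset` from `datasets` and `MDSWriter` from `mosaicml-streaming`. The helper names `infer_mds_columns` and `convert_split` are hypothetical, and the type-to-encoding mapping is an illustrative assumption, not the script's actual logic.

```python
from typing import Any, Dict, Optional


def infer_mds_columns(sample: Dict[str, Any]) -> Dict[str, str]:
    """Map each field of one sample to an MDS encoding name (assumed mapping)."""
    columns = {}
    for key, value in sample.items():
        if isinstance(value, bool):
            columns[key] = "json"  # bool is a subclass of int, so check it first
        elif isinstance(value, int):
            columns[key] = "int"
        elif isinstance(value, float):
            columns[key] = "float64"
        elif isinstance(value, str):
            columns[key] = "str"
        elif isinstance(value, bytes):
            columns[key] = "bytes"
        else:
            columns[key] = "json"  # lists, dicts, None
    return columns


def convert_split(repo_id: str, split: str, out_dir: str,
                  config: Optional[str] = None,
                  compression: Optional[str] = "zstd") -> int:
    """Stream one Hugging Face split and write it as MDS shards; returns rows written."""
    # Imported lazily so infer_mds_columns works without these packages installed.
    from datasets import load_dataset
    from streaming import MDSWriter

    rows = iter(load_dataset(repo_id, name=config, split=split, streaming=True))
    first = next(rows)  # the first row decides the column schema
    columns = infer_mds_columns(first)

    count = 0
    with MDSWriter(out=out_dir, columns=columns, compression=compression) as writer:
        writer.write(first)
        count += 1
        for row in rows:
            writer.write(row)
            count += 1
    return count
```

Inferring the schema from the first row keeps the sketch short, but it would fail on datasets whose later rows contain `None` in a column typed as `str` or `int`; a real converter would likely derive the schema from the dataset's declared features instead.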
batch_to_mds.py - Batch convert all configs/splits:
- --src / --out-hub: Source and destination repos (required)
- --procs: Worker processes (default: 16)
- --compression: e.g., zstd, zstd:11
- --include-config / --exclude-config: Regex filters
- --dry-run: Preview without executing
- --force: Rebuild existing datasets
hf_to_mds_streaming.py - Single config/split converter (called by batch script)
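The --compression examples above (zstd, zstd:11) have an `algorithm` or `algorithm:level` shape. A minimal sketch of splitting such a value, which can be handy for checking it before a long run. `parse_compression` is a hypothetical helper and not part of either script; which algorithms and levels are actually accepted is left to the Streaming library.

```python
from typing import Optional, Tuple


def parse_compression(spec: str) -> Tuple[str, Optional[int]]:
    """Split 'zstd:11' into ('zstd', 11); a bare 'zstd' carries no explicit level."""
    algorithm, sep, level = spec.partition(":")
    if not algorithm:
        raise ValueError(f"missing algorithm in {spec!r}")
    if not sep:
        return algorithm, None
    # int() raises ValueError for an empty or non-numeric level such as 'zstd:'.
    return algorithm, int(level)


print(parse_compression("zstd"))     # ('zstd', None)
print(parse_compression("zstd:11"))  # ('zstd', 11)
```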
# Convert specific language only
python batch_to_mds.py \
--src wikimedia/wikipedia \
--out-hub your-username/wikipedia-en-mds \
--include-config "^20231101\.en$"
# Preview what would be processed
python batch_to_mds.py \
--src microsoft/orca-math-word-problems-200k \
--out-hub your-username/orca-math-mds \
--dry-run
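A pattern such as the --include-config value in the first example can be tried against config names before launching a conversion. The sketch below assumes the filters are applied with `re.search` semantics, which this README does not confirm; `select_configs` and the sample config names are hypothetical.

```python
import re
from typing import Iterable, List, Optional


def select_configs(names: Iterable[str],
                   include: Optional[str] = None,
                   exclude: Optional[str] = None) -> List[str]:
    """Keep names matching `include` (if given) and not matching `exclude` (if given)."""
    selected = []
    for name in names:
        if include and not re.search(include, name):
            continue
        if exclude and re.search(exclude, name):
            continue
        selected.append(name)
    return selected


configs = ["20231101.en", "20231101.es", "20231101.simple"]
print(select_configs(configs, include=r"^20231101\.en$"))  # ['20231101.en']
print(select_configs(configs, exclude=r"\.es$"))  # ['20231101.en', '20231101.simple']
```

The anchors `^` and `$` and the escaped dot matter here: without them, a pattern like `20231101.en` would also match any longer name that merely contains that text.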
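from streaming import StreamingDataset
from torch.utils.data import DataLoader
# Stream MDS shards from the Hub; replace the repo with your own converted dataset.
# Passing batch_size to StreamingDataset as well lets it partition samples per batch.
dataset = StreamingDataset(remote='hf://your-username/dataset-mds', batch_size=32)
dataloader = DataLoader(dataset, batch_size=32)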
MDS format provides:
- Elastic Determinism: Sample order is reproducible across hardware configs
- Fast Resumption: Resume training mid-epoch in seconds
- High Throughput: Optimized for streaming from cloud storage
- Effective Shuffling: Maintains shuffle quality while reducing download costs
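As a sketch of how the determinism and resumption points might look in practice, the example below uses `StreamingDataset` with a fixed `shuffle_seed` and the `StreamingDataLoader` class, whose `state_dict()` and `load_state_dict()` methods carry the loader's position. The helper names `make_loader` and `resume_loader`, the seed value, and the paths are hypothetical.

```python
from typing import Any, Dict


def make_loader(remote: str, local: str, batch_size: int = 32, seed: int = 9176):
    """Build a shuffled, seedable loader over an MDS dataset."""
    # Imported lazily so this module can be loaded without mosaicml-streaming.
    from streaming import StreamingDataLoader, StreamingDataset

    dataset = StreamingDataset(
        remote=remote,          # where the MDS shards live
        local=local,            # local cache directory for downloaded shards
        shuffle=True,
        shuffle_seed=seed,      # same seed gives the same sample order
        batch_size=batch_size,
    )
    return StreamingDataLoader(dataset, batch_size=batch_size)


def resume_loader(loader: Any, state: Dict[str, Any]) -> Any:
    """Restore a previously saved loader position before continuing training."""
    loader.load_state_dict(state)
    return loader
```

In a training loop, the value of `loader.state_dict()` would be saved alongside the model checkpoint; after a restart, the loader is rebuilt with the same seed and passed to `resume_loader` with the saved state.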