The file requirements.txt contains the prerequisites needed for a clean run of the scripts. Set up the environment as follows:
pip install -r requirements.txt
We have two options to replicate SlimPajama in Arabic (or any other language).
Note: if your dataset contains more than 1B lines, use the first method; otherwise, use the second.
For the first option, run the following command:
python main.py ArabicText/
It took ~2.5 days to process the 1.21T-token RedPajama dataset on a machine with 64 CPU cores; note that main.py will take longer, as it does not parallelize some steps across the data chunks. The highest RAM consumption we observed was ~1.4TB.
Point it at the folder that contains the data, and it will do the following (see the sketch after this list):
- Convert Text to JSONL
- NFC normalization
- Filter short documents
- Deduplication
- Interleave & Shuffle
- Split Dataset into Train and Holdout
- Deduplicate Train against Holdout
The folder should contain all the data files in plain-text (.txt) format.
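To make the early steps concrete, below is a minimal, simplified sketch of the core cleaning stages: conversion to JSONL, NFC normalization, short-document filtering, and MinHash-based near-deduplication (using the datasketch library). This is an illustration, not the actual main.py implementation; the length threshold, shingle size, similarity threshold, and one-document-per-file assumption are all ours:

# sketch_clean.py -- illustrative sketch only, not the pipeline's exact implementation
import json
import unicodedata
from pathlib import Path

from datasketch import MinHash, MinHashLSH

MIN_CHARS = 200          # assumed short-document threshold (illustrative)
NUM_PERM = 128           # number of MinHash permutations
SIM_THRESHOLD = 0.8      # assumed Jaccard similarity threshold (illustrative)

def minhash(text: str) -> MinHash:
    """Build a MinHash signature from 3-token shingles."""
    m = MinHash(num_perm=NUM_PERM)
    tokens = text.split()
    for i in range(max(len(tokens) - 2, 1)):
        shingle = " ".join(tokens[i:i + 3])
        m.update(shingle.encode("utf8"))
    return m

def clean_folder(data_dir: str, out_path: str) -> None:
    lsh = MinHashLSH(threshold=SIM_THRESHOLD, num_perm=NUM_PERM)
    with open(out_path, "w", encoding="utf-8") as out:
        # Assumption for this sketch: each .txt file is one document
        for idx, txt_file in enumerate(sorted(Path(data_dir).glob("*.txt"))):
            text = txt_file.read_text(encoding="utf-8")
            # NFC normalization: canonical composition of Unicode code points
            text = unicodedata.normalize("NFC", text)
            # Filter short documents
            if len(text) < MIN_CHARS:
                continue
            # Near-deduplication: skip documents similar to one already kept
            sig = minhash(text)
            if lsh.query(sig):
                continue
            lsh.insert(f"doc-{idx}", sig)
            # Convert to JSONL
            out.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    clean_folder("ArabicText/", "cleaned.jsonl")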
For the second option, the first step is to build the dataset using the text_dataset builder script:
from datasets import load_dataset
dataset = load_dataset("text_dataset", data_dir="ArabicText/", split="train")
Then run the following command to build the dataset:
python hf_load_pretrained.py --dataset_dir text_dataset \
--data_dir ArabicText \
--data_cache_dir <cache_dir> \
--save_dir pretrain_data \
--do_tokenize True \
--block_size 512 \
--tokenizer_name meta-llama/Llama-2-7b-chat-hf
This script will do the following:
- Data Loading
- Cleaning up whitespace
- Filter short documents
- NFC normalization
- Deduplication
- Interleave & Shuffle
- Split Dataset into Train and Holdout
- Deduplicate Train against Holdout
- Tokenization and Packing
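To illustrate the last step, here is a simplified sketch of tokenization and packing: documents are tokenized, concatenated, and cut into fixed-length blocks of block_size tokens. The EOS separator between documents and the drop-the-remainder behavior are common conventions we assume here, not necessarily exactly what hf_load_pretrained.py does; also note the Llama-2 tokenizer is gated on the Hugging Face Hub, so any tokenizer you have access to works for the illustration:

# sketch_pack.py -- illustrative sketch of tokenization + packing
from transformers import AutoTokenizer

BLOCK_SIZE = 512  # matches the --block_size argument above

# Gated model: requires Hugging Face authentication and access approval
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

def tokenize_and_pack(documents):
    """Tokenize documents, concatenate their ids, and cut into fixed-size blocks."""
    ids = []
    for doc in documents:
        # Append an EOS token between documents so the model sees boundaries
        ids.extend(tokenizer(doc)["input_ids"] + [tokenizer.eos_token_id])
    # Drop the trailing remainder that does not fill a full block
    n_blocks = len(ids) // BLOCK_SIZE
    return [ids[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE] for i in range(n_blocks)]

blocks = tokenize_and_pack(["First document ...", "Second document ..."])
print(len(blocks), "blocks of", BLOCK_SIZE, "tokens")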
To cite our work, please use:
@misc{cerebras2023slimpajama,
author = {Soboleva, Daria and Al-Khateeb, Faisal and Myers, Robert and Steeves, Jacob R. and Hestness, Joel and Dey, Nolan},
title = {{SlimPajama: A 627B token cleaned and deduplicated version of RedPajama}},
month = {June},
year = 2023,
howpublished = {\url{https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama}},
url = {https://huggingface.co/datasets/cerebras/SlimPajama-627B},
}
@misc{carperai_squeakily,
author = {{CarperAI}},
title = {{Squeakily: A library for squeakily cleaning and filtering language datasets}},
howpublished = {\url{https://github.com/CarperAI/squeakily}},
}