Dataset Deduplication and Cleaning for Efficient Language Model Training

Environment Setup

The file requirements.txt lists the prerequisites needed for a clean run of the scripts. Set up the environment as follows:

pip install -r requirements.txt

Note: if your dataset contains more than 1B lines, use the first (SlimPajama) method; otherwise, use the second (Hugging Face) method.

SlimPajama Method

This is the first of two options for replicating SlimPajama in Arabic or any other language. Run the following command:

python main.py ArabicText/ 

Processing the 1.21T-token RedPajama dataset took ~2.5 days on a machine with 64 CPU cores; note that main.py may take longer, since it does not parallelize some steps across the data chunks. The highest RAM consumption we observed was ~1.4 TB.

Pass it the folder that contains the data, and it will do the following:

  1. Convert Text to JSONL
  2. NFC normalization
  3. Filter short documents
  4. Deduplication
  5. Interleave & Shuffle
  6. Split Dataset into Train and Holdout
  7. Deduplicate Train against Holdout

The folder should contain all the data files in txt format.
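
For illustration, the normalization, short-document filtering, and deduplication steps amount to roughly the following sketch. This is not the actual implementation in main.py (which works on JSONL chunks and parallelizes across them, and SlimPajama-style deduplication is fuzzy/MinHash-based rather than exact); the minimum-length threshold here is an illustrative assumption.

import hashlib
import unicodedata

MIN_CHARS = 200  # illustrative threshold; not the value main.py uses

def clean_and_dedup(texts):
    """Sketch of steps 2-4: NFC-normalize, drop short docs, drop exact duplicates."""
    seen = set()
    for text in texts:
        text = unicodedata.normalize("NFC", text)              # 2. NFC normalization
        if len(text) < MIN_CHARS:                               # 3. filter short documents
            continue
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:                                       # 4. exact deduplication
            continue
        seen.add(digest)
        yield text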

Hugging Face Method

The first step is to build the dataset with the text_dataset builder script, using the following command:

from datasets import load_dataset

dataset = load_dataset("text_dataset", data_dir="ArabicText/", split="train")
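
This returns a standard datasets.Dataset; assuming the builder exposes the usual text column, you can spot-check it before running the full pipeline:

print(len(dataset))
print(dataset[0]["text"][:200])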

Then run the following to build the pretraining dataset:

python hf_load_pretrained.py --dataset_dir text_dataset \
                            --data_dir ArabicText \
                            --data_cache_dir <cache_dir> \
                            --save_dir pretrain_data \
                            --do_tokenize True \
                            --block_size 512 \
                            --tokenizer_name meta-llama/Llama-2-7b-chat-hf

This script will do the following:

  1. Data Loading
  2. Cleaning up whitespace
  3. Filter short documents
  4. NFC normalization
  5. Deduplication
  6. Interleave & Shuffle
  7. Split Dataset into Train and Holdout
  8. Deduplicate Train against Holdout
  9. Tokenization and Packing
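
The tokenization-and-packing step (9) follows the usual causal-LM recipe: tokenize the documents, concatenate the token ids, and slice them into fixed blocks of block_size tokens. A minimal sketch with transformers and datasets (illustrative only, not the exact code in hf_load_pretrained.py):

from itertools import chain
from transformers import AutoTokenizer

block_size = 512
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

def tokenize_fn(examples):
    return tokenizer(examples["text"])

def group_texts(examples):
    # Concatenate all tokenized sequences, then cut them into block_size
    # chunks, dropping the final partial block.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

tokenized = dataset.map(tokenize_fn, batched=True, remove_columns=["text"])
packed = tokenized.map(group_texts, batched=True)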

Citation

To cite our work, please use:

@misc{cerebras2023slimpajama,
  author = {Soboleva, Daria and Al-Khateeb, Faisal and Myers, Robert and Steeves, Jacob R and Hestness, Joel and Dey, Nolan},
  title = {{SlimPajama: A 627B token cleaned and deduplicated version of RedPajama}},
  month = June,
  year = 2023,
  howpublished = {\url{https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama}},
  url = {https://huggingface.co/datasets/cerebras/SlimPajama-627B},
}
@misc{carperai_squeakily,
  title = {{CarperAI/squeakily: A library for squeakily cleaning and filtering language datasets}},
  url = {https://github.com/CarperAI/squeakily},
  journal = {GitHub},
  language = {en}
}
