Dataset Deduplication and Cleaning for Efficient Language Model Training

Environment Setup

The file requirements.txt lists the prerequisites needed for a clean run of the scripts. Set up the environment as follows:

pip install -r requirements.txt

Note: if your dataset contains more than 1B lines, use the first (SlimPajama) method; otherwise, use the second (Hugging Face) method.

SlimPajama Method

This is the first of two options for replicating SlimPajama in Arabic or any other language. Run the following command:

python main.py ArabicText/ 

Processing the 1.21T-token RedPajama dataset took ~2.5 days on a machine with 64 CPU cores; note that main.py may take longer, since it does not parallelize some steps across the data chunks. The highest RAM consumption we observed was ~1.4 TB.

Pass it the folder that contains the data, and it will do the following:

  1. Convert Text to JSONL
  2. NFC normalization
  3. Filter short documents
  4. Deduplication
  5. Interleave & Shuffle
  6. Split Dataset into Train and Holdout
  7. Deduplicate Train against Holdout

The folder should contain all the data files in txt format.
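
For illustration, the normalization, short-document filtering, and deduplication steps amount to roughly the following sketch. This is not the actual implementation in main.py (which works on JSONL chunks and parallelizes across them, and SlimPajama-style deduplication is fuzzy/MinHash-based rather than exact); the minimum-length threshold here is an illustrative assumption.

import hashlib
import unicodedata

MIN_CHARS = 200  # illustrative threshold; not the value main.py uses

def clean_and_dedup(texts):
    """Sketch of steps 2-4: NFC-normalize, drop short docs, drop exact duplicates."""
    seen = set()
    for text in texts:
        text = unicodedata.normalize("NFC", text)              # 2. NFC normalization
        if len(text) < MIN_CHARS:                               # 3. filter short documents
            continue
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:                                       # 4. exact deduplication
            continue
        seen.add(digest)
        yield text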

Hugging Face Method

The first step is to build the dataset with the text_dataset builder script, using the following command:

from datasets import load_dataset

dataset = load_dataset("text_dataset", data_dir="ArabicText/", split="train")
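
This returns a standard datasets.Dataset; assuming the builder exposes the usual text column, you can spot-check it before running the full pipeline:

print(len(dataset))
print(dataset[0]["text"][:200])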

Then run the following to build the pretraining dataset:

python hf_load_pretrained.py --dataset_dir text_dataset \
                            --data_dir ArabicText \
                            --data_cache_dir <cache_dir> \
                            --save_dir pretrain_data \
                            --do_tokenize True \
                            --block_size 512 \
                            --tokenizer_name meta-llama/Llama-2-7b-chat-hf

This script will do the following:

  1. Data Loading
  2. Cleaning up whitespace
  3. Filter short documents
  4. NFC normalization
  5. Deduplication
  6. Interleave & Shuffle
  7. Split Dataset into Train and Holdout
  8. Deduplicate Train against Holdout
  9. Tokenization and Packing
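
The tokenization-and-packing step (9) follows the usual causal-LM recipe: tokenize the documents, concatenate the token ids, and slice them into fixed blocks of block_size tokens. A minimal sketch with transformers and datasets (illustrative only, not the exact code in hf_load_pretrained.py):

from itertools import chain
from transformers import AutoTokenizer

block_size = 512
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

def tokenize_fn(examples):
    return tokenizer(examples["text"])

def group_texts(examples):
    # Concatenate all tokenized sequences, then cut them into block_size
    # chunks, dropping the final partial block.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

tokenized = dataset.map(tokenize_fn, batched=True, remove_columns=["text"])
packed = tokenized.map(group_texts, batched=True)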

Citation

To cite our work, please use:

@misc{cerebras2023slimpajama,
  author = {Soboleva, Daria and Al-Khateeb, Faisal and Myers, Robert and Steeves, Jacob R and Hestness, Joel and Dey, Nolan},
  title = {{SlimPajama: A 627B token cleaned and deduplicated version of RedPajama}},
  month = June,
  year = 2023,
  howpublished = {\url{https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama}},
  url = {https://huggingface.co/datasets/cerebras/SlimPajama-627B},
}
@misc{carperai_squeakily,
  title = {{CarperAI/squeakily: A library for squeakily cleaning and filtering language datasets}},
  url = {https://github.com/CarperAI/squeakily},
  journal = {GitHub},
  language = {en}
}
