Scripts and instructions for partially replicating the original FineWeb experiments on LUMI using Megatron-LM.
These instructions assume access to LUMI and a custom container, and are unlikely to be particularly useful otherwise.
As these steps involve fairly large datasets, you should work in a project scratch directory. These instructions assume the project is project_462000353; if you are using a different project, replace that identifier with your own in the examples and Slurm scripts.
mkdir -p /scratch/project_462000353/$USER/fineweb-repro
cd /scratch/project_462000353/$USER/fineweb-repro
The rest of the instructions assume the above is your working directory unless stated otherwise.
The following module provided by CSC includes most of the required libraries, including pytorch and transformers.
module use /appl/local/csc/modulefiles
module load pytorch
These instructions assume that this module is loaded.
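To quickly confirm that the module's PyTorch is available, you can run for example:
python3 -c 'import torch; print(torch.__version__)'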
On LUMI, we'll need the ROCm fork of Megatron.
git clone https://github.com/ROCm/Megatron-LM
The latest commit when writing these instructions was 99bb7a9. If something breaks in Megatron, try to check out this specific commit.
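For example, to pin your clone to that commit:
cd Megatron-LM
git checkout 99bb7a9
cd ..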
Here we'll use the 10 billion token FineWeb sample (sample-10BT) as an example to keep download and processing relatively quick. As this sample is comparatively small, we'll simply work on a login node, downloading the data with load_dataset and saving it with to_json. If you want to use a larger sample or the entire dataset, you should probably use a compute node and e.g. datatrove (see example here).
Set the HF cache to a subdirectory of the working directory so that the download doesn't exhaust the limited space that's available in your home directory. (If you already have HF_HOME set to e.g. some shared cache directory, skip this step.)
export HF_HOME=$PWD/cache
Download in an interactive Python session (start with python3). This should take about 30 minutes.
from datasets import load_dataset
d = load_dataset('HuggingFaceFW/fineweb', 'sample-10BT', split='train')
d.to_json('fineweb-10BT.jsonl')
Check that the downloaded data has the expected number of lines.
wc -l fineweb-10BT.jsonl
This should give 14868862. To also verify the checksum (expected md5sum: 1086778b352dacb729517ca328b14c62), you can run
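md5sum fineweb-10BT.jsonl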
Megatron uses a specialized binary format for input data, and we'll use the script preprocess_data.py provided with Megatron to convert the JSONL into this format.
We'll use several workers to speed up the conversion, so we'll start an interactive session on a compute node to avoid using too much CPU on a login node.
(As of this writing the queues on the standard partition were shorter than those on small, so this uses standard, but small may be faster for you.)
srun --account=project_462000353 --partition=standard \
    --nodes=1 --cpus-per-task=32 --time=00:30:00 \
    --mem=100G --pty bash
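Once the interactive session starts, your shell is on a compute node; if in doubt, hostname should show a node name starting with nid rather than the uan prefix of the login nodes.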
The following command should be executed on the compute node and takes about 30 minutes.
python3 Megatron-LM/tools/preprocess_data.py \
    --input fineweb-10BT.jsonl \
    --output-prefix fineweb-10BT \
    --tokenizer-type HuggingFaceTokenizer \
    --tokenizer-model gpt2 \
    --append-eod \
    --log-interval 10000 \
    --workers 32
After the preprocessing completes, terminate the interactive session on the compute node and return to the login node (exit or CTRL-D).
The preprocessing should have created two files, fineweb-10BT_text_document.bin and fineweb-10BT_text_document.idx. You can check their sizes with e.g.
du -h fineweb-10BT_text_document.*
The sizes should be approximately as follows:
20G fineweb-10BT_text_document.bin
284M fineweb-10BT_text_document.idx
Let's move these to a subdirectory to keep things cleaner. (This is also expected by the training script.)
mkdir megatron-data
mv fineweb-10BT_text_document.* megatron-data/
If you haven't already, clone this repository:
git clone https://github.com/spyysalo/lumi-fineweb-replication.git
Then copy the scripts train-gpt.sh and launch.sh into your working directory:
cp lumi-fineweb-replication/train-gpt.sh .
cp lumi-fineweb-replication/launch.sh .
You may want to edit train-gpt.sh to set the account, time, number of nodes, etc. to values appropriate for your setup (see the sketch of typical header lines below). Then, you can schedule the run with either
./train-gpt.sh
or
mkdir logs
sbatch train-gpt.sh
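For reference, these values are normally set in the #SBATCH header at the top of train-gpt.sh. The exact lines in the script may differ, but the ones to look for resemble the following (the values shown here are only illustrative):
#SBATCH --account=project_462000353
#SBATCH --nodes=8
#SBATCH --time=24:00:00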
You can then use e.g. squeue --me to see the status of the scheduled job, and squeue --me --start for a (typically rough) estimate of when the job might start if it hasn't already.
Once the job is running, you can use e.g. tail -f logs/latest.out to follow the logs.
The logs should show a throughput of approximately 80 TFLOP/s/GPU (i.e. approximately 160 TFLOP/s per MI250X, as each MI250X module contains two GPU dies), and your training loss curve should closely resemble the following: