Moment [SC 25]

$ git clone https://github.com/RC4ML/Moment.git

1. Preparation of Hardware and Software [Skip this section during SC AE]

1.1 Hardware Recommended

Local bare-metal machine. Table 1

Platform	CPU-Info	#sockets	#NUMA nodes	CPU Memory	PCIe	GPUs	SSD
A	104*Intel(R) Xeon(R) Gold 5320 CPU @2.2GHZ	2	2	768GB	PCIe 4.0x16	80GB-PCIe-A100	Intel P5510, Sansumg 980 pro

1.2 Tested Software Environments

Nvidia Driver Version: 515.43.04
CUDA 11.7 - 12.4
GCC/G++ 11.4.0
OS: Ubuntu 22.04, Linux version 5.15.72 (customized, see BaM's requirements)
pytorch (according to your CUDA toolkit version), torchmetrics

$ pip install torch-cu124

dgl 1.1.0 - 2.x (according to pytorch and CUDA version)

$ pip install dgl -f https://data.dgl.ai/wheels/cu1xx/repo.html

1.3 Driver for GPU Direct Storage Access

We reuse the BaM (https://github.com/ZaidQureshi/bam) Kernel Module to enable GPU Direct Storage Access.

Step 1 Disable IOMMU in Linux

$ cat /proc/cmdline | grep iommu If either iommu=on or intel_iommu=on is found by grep, the IOMMU is enabled. Disable it by removing iommu=on and intel_iommu=on from the CMDLINE variable in /etc/default/grub and then reconfiguring GRUB. The next time you reboot, the IOMMU will be disabled.

Step 2 Compiling Nvidia Driver Kernel Symbols

$ cd /usr/src/nvidia-515.43.04/
$ sudo make

Step 3 Building BaM Project

From the project root directory, do the following:

$ git clone https://github.com/ZaidQureshi/bam.git
$ cd bam
$ git submodule update --init --recursive
$ mkdir -p build; cd build
$ cmake ..
$ make libnvm                         # builds library
$ make benchmarks                     # builds benchmark program
$ cd build/module
$ make

Step 4 Loading/Unloading the Kernel Module

Unbind the NVMe drivers according to your needs (customize unload_ssd.py):

$ sudo python /path/Moment/unload_ssd.py 
$ cd /path/BaM/build/module
$ sudo make load

Check whether it's successful This should create a /dev/libnvm* device file for each controller that isn't bound to the NVMe driver.

$ ls /dev/

The module can be unloaded from the project root directory with the following:

$ cd build/module
$ sudo make unload

2. Prepare Datasets for GNN Training

Datasets are from OGB (https://ogb.stanford.edu/), Standford-snap (https://snap.stanford.edu/), and Webgraph (https://webgraph.di.unimi.it/). Here is an example of preparing datasets for Moment.

Uk-Union Datasets

Refer to README in dataset directory for more instructions

$ bash prepare_datasets.sh

3. Run Moment

3.1 Build Moment from Source

$ bash build.sh

There are two steps to train a GNN model in Moment. In these steps, you need to change to root/sudo user for GPU Direct SSD Access.

3.2 Run Moment's Automatic Module

$ sudo python3 automatic_module.py

Customize dataset path, GNN feature dimension, number of GPUs, and number of SSDs in the automatic_module.py:

file_path = "/share/gnn_data/igb260m/IGB-Datasets/data/"  # Replace with your file path
feature_dim = 1024
num_gpu = 2
num_ssd = 6

The automatic module will execute three main steps:

3.2.1 Get hardware and run profiling

Output:

3.2.2 Solve maxflow and recommend the best hardware configuration setting

Output:

3.2.3 Run DDAk

3.3 Start Moment Server

Execute the following instruction:

$ sudo python3 moment_server.py --dataset_name igb --train_batch_size 8000 --fanout [25,10] --epoch 2

Customize the hyperparameters of the Moment server:

    argparser.add_argument('--dataset_path', type=str, default="/share/gnn_data/igb260m/IGB-Datasets/data")
    argparser.add_argument('--dataset_name', type=str, default="igb")
    argparser.add_argument('--train_batch_size', type=int, default=8000)
    argparser.add_argument('--fanout', type=list, default=[25, 10])
    argparser.add_argument('--gpu_number', type=int, default=2)
    argparser.add_argument('--epoch', type=int, default=2)
    argparser.add_argument('--ssd_number', type=int, default=6)
    argparser.add_argument('--num_queues_per_ssd', type=int, default=128)

Note that the dataset_path should be the same as the automatic module. When the system outputs the following, start training in another session:

3.4 Run Moment Training

(Optional) Configure SM Utilization of Training Backend:

$ export CUDA_VISIBLE_DEVICES=0         # Example using GPU0, adjust for other GPUs as needed
$ sudo nvidia-smi -i 0 -c EXCLUSIVE_PROCESS  # Set GPU0 to exclusive process mode
$ sudo nvidia-cuda-mps-control -d            # Start the MPS service
# ====== check =========
$ ps -ef | grep mps                     # After starting successfully, the corresponding process can be seen
# ====== configure =========
$ export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=80 # Assign the percentage of action SMs for the training backend
# ====== stop =========
$ sudo nvidia-smi -i 0 -c DEFAULT       # Restore GPU to default mode
$ echo quit | nvidia-cuda-mps-control   # Stop the MPS service

After Moment outputs "System is ready for serving", then start training by:

$ sudo python3 training_backend/moment_graphsage.py --class_num 2  --features_num 1024 --hidden_dim 256 --epoch 2

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
IOStack		IOStack
dataset		dataset
prune_tree		prune_tree
sampling_server		sampling_server
training_backend		training_backend
.gitignore		.gitignore
README.md		README.md
automatic_module.py		automatic_module.py
build.sh		build.sh
ddak.py		ddak.py
get_pcie.py		get_pcie.py
maxflow.py		maxflow.py
meta_config		meta_config
moment_server.py		moment_server.py
pcie_subset.json		pcie_subset.json
prepare_dataset.sh		prepare_dataset.sh
profiler.py		profiler.py
pruned_tree.txt		pruned_tree.txt
unload_ssd.py		unload_ssd.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Moment [SC 25]

1. Preparation of Hardware and Software [Skip this section during SC AE]

1.1 Hardware Recommended

1.2 Tested Software Environments

1.3 Driver for GPU Direct Storage Access

Step 1 Disable IOMMU in Linux

Step 2 Compiling Nvidia Driver Kernel Symbols

Step 3 Building BaM Project

Step 4 Loading/Unloading the Kernel Module

2. Prepare Datasets for GNN Training

Uk-Union Datasets

3. Run Moment

3.1 Build Moment from Source

3.2 Run Moment's Automatic Module

3.2.1 Get hardware and run profiling

3.2.2 Solve maxflow and recommend the best hardware configuration setting

3.2.3 Run DDAk

3.3 Start Moment Server

3.4 Run Moment Training

(Optional) Configure SM Utilization of Training Backend:

About

Uh oh!

Releases

Packages

Languages

RC4ML/Moment

Folders and files

Latest commit

History

Repository files navigation

Moment [SC 25]

1. Preparation of Hardware and Software [Skip this section during SC AE]

1.1 Hardware Recommended

1.2 Tested Software Environments

1.3 Driver for GPU Direct Storage Access

Step 1 Disable IOMMU in Linux

Step 2 Compiling Nvidia Driver Kernel Symbols

Step 3 Building BaM Project

Step 4 Loading/Unloading the Kernel Module

2. Prepare Datasets for GNN Training

Uk-Union Datasets

3. Run Moment

3.1 Build Moment from Source

3.2 Run Moment's Automatic Module

3.2.1 Get hardware and run profiling

3.2.2 Solve maxflow and recommend the best hardware configuration setting

3.2.3 Run DDAk

3.3 Start Moment Server

3.4 Run Moment Training

(Optional) Configure SM Utilization of Training Backend:

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages