$ git clone https://github.com/RC4ML/Moment.git
Local bare-metal machine. Table 1
Platform | CPU-Info | #sockets | #NUMA nodes | CPU Memory | PCIe | GPUs | SSD |
---|---|---|---|---|---|---|---|
A | 104*Intel(R) Xeon(R) Gold 5320 CPU @ 2.2GHz | 2 | 2 | 768GB | PCIe 4.0x16 | 80GB-PCIe-A100 | Intel P5510, Samsung 980 PRO |
- Nvidia Driver Version: 515.43.04
- CUDA 11.7 - 12.4
- GCC/G++ 11.4.0
- OS: Ubuntu 22.04, Linux version 5.15.72 (customized, see BaM's requirements)
- PyTorch (matching your CUDA toolkit version), torchmetrics
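$ pip install torch --index-url https://download.pytorch.org/whl/cu124 # CUDA 12.4 wheels; change the cuXXX suffix for other CUDA versions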
- DGL 1.1.0 - 2.x (matching your PyTorch and CUDA versions)
$ pip install dgl -f https://data.dgl.ai/wheels/cu1xx/repo.html
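Because the PyTorch, DGL, and CUDA versions have to match, it can help to print what is actually installed before building anything. The following is a minimal sketch using only the Python standard library; `installed_version` is a hypothetical helper, not part of Moment, and the distribution names `torch`, `torchmetrics`, and `dgl` are assumed to be the ones pip installed.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names are assumptions; adjust them if your wheels differ.
    for name in ("torch", "torchmetrics", "dgl"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```

CUDA-specific wheels usually carry a local version suffix such as `+cu124`, which makes a mismatch easy to spot in the printed output.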
We reuse the BaM (https://github.com/ZaidQureshi/bam) Kernel Module to enable GPU Direct Storage Access.
Check whether the IOMMU is enabled:
$ cat /proc/cmdline | grep iommu
If grep finds either iommu=on or intel_iommu=on, the IOMMU is enabled. Disable it by removing iommu=on and intel_iommu=on from the CMDLINE variable in /etc/default/grub and then reconfiguring GRUB. The IOMMU will be disabled after the next reboot.
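The same check can be scripted. The following is a minimal Python sketch, not part of Moment or BaM; the helper name `iommu_enabled` is hypothetical, and it only looks for the two flags named above.

```python
from pathlib import Path

# Flags named in the text above; other IOMMU-related parameters are not checked.
IOMMU_FLAGS = {"iommu=on", "intel_iommu=on"}


def iommu_enabled(cmdline: str) -> bool:
    """Return True if the kernel command line contains one of the IOMMU flags."""
    return any(token in IOMMU_FLAGS for token in cmdline.split())


if __name__ == "__main__":
    cmdline_path = Path("/proc/cmdline")  # present on Linux only
    if cmdline_path.exists():
        if iommu_enabled(cmdline_path.read_text()):
            print("IOMMU is enabled; remove the flag from /etc/default/grub")
        else:
            print("Neither iommu=on nor intel_iommu=on was found")
```

Matching whole tokens rather than substrings avoids false positives from values such as `iommu=off`.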
Build the Nvidia driver kernel module symbols (adjust the directory to your driver version):
$ cd /usr/src/nvidia-515.43.04/
$ sudo make
From the project root directory, do the following:
$ git clone https://github.com/ZaidQureshi/bam.git
$ cd bam
$ git submodule update --init --recursive
$ mkdir -p build; cd build
$ cmake ..
$ make libnvm # builds library
$ make benchmarks # builds benchmark program
$ cd module # you are already in bam/build; use build/module if starting from the bam root
$ make
Unbind the NVMe drivers according to your needs (customize unload_ssd.py):
$ sudo python /path/Moment/unload_ssd.py
$ cd /path/bam/build/module
$ sudo make load
Check whether loading succeeded. It should create a /dev/libnvm* device file for each controller that isn't bound to the NVMe driver.
$ ls /dev/
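Scanning the full `/dev/` listing by eye is error-prone, so a scripted check may be convenient. The following is a minimal sketch; `list_libnvm_devices` is a hypothetical helper, not part of Moment or BaM, and it assumes the device files follow the `/dev/libnvm*` naming described above.

```python
import glob
import os
from typing import List


def list_libnvm_devices(dev_dir: str = "/dev") -> List[str]:
    """Return the sorted paths in dev_dir whose names start with 'libnvm'."""
    return sorted(glob.glob(os.path.join(dev_dir, "libnvm*")))


if __name__ == "__main__":
    devices = list_libnvm_devices()
    if devices:
        print("Found:", ", ".join(devices))
    else:
        print("No /dev/libnvm* device files found; the module may not be loaded.")
```

The number of entries should equal the number of SSDs you unbound from the NVMe driver.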
To unload the module, run the following from the BaM project root directory:
$ cd build/module
$ sudo make unload
Datasets are from OGB (https://ogb.stanford.edu/), Stanford SNAP (https://snap.stanford.edu/), and WebGraph (https://webgraph.di.unimi.it/). Here is an example of preparing datasets for Moment.
Refer to the README in the dataset directory for more instructions.
$ bash prepare_datasets.sh
$ bash build.sh
Training a GNN model in Moment takes two steps. Run both as root (or with sudo), because GPU Direct SSD Access requires it.
$ sudo python3 automatic_module.py
Customize the dataset path, GNN feature dimension, number of GPUs, and number of SSDs in automatic_module.py:
file_path = "/share/gnn_data/igb260m/IGB-Datasets/data/" # Replace with your file path
feature_dim = 1024
num_gpu = 2
num_ssd = 6
The automatic module will execute three main steps:
Execute the following instruction:
$ sudo python3 moment_server.py --dataset_name igb --train_batch_size 8000 --fanout [25,10] --epoch 2
Customize the hyperparameters of the Moment server:
argparser.add_argument('--dataset_path', type=str, default="/share/gnn_data/igb260m/IGB-Datasets/data")
argparser.add_argument('--dataset_name', type=str, default="igb")
argparser.add_argument('--train_batch_size', type=int, default=8000)
argparser.add_argument('--fanout', type=list, default=[25, 10])
argparser.add_argument('--gpu_number', type=int, default=2)
argparser.add_argument('--epoch', type=int, default=2)
argparser.add_argument('--ssd_number', type=int, default=6)
argparser.add_argument('--num_queues_per_ssd', type=int, default=128)
Note that dataset_path must match the file_path configured in automatic_module.py.
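One caveat about `--fanout`: in Python's argparse, `type=list` calls `list()` on the raw argument string, so `[25,10]` becomes a list of single characters rather than two integers. Whether `moment_server.py` post-processes the value is not shown here. If you need integer fan-outs in your own scripts, a converter along these lines would work; `parse_fanout` is a hypothetical helper, not part of Moment.

```python
import argparse
from typing import List


def parse_fanout(text: str) -> List[int]:
    """Convert '[25,10]' or '25,10' into [25, 10]."""
    body = text.strip().strip("[]")
    values = [int(part) for part in body.split(",") if part.strip()]
    if not values or any(v <= 0 for v in values):
        raise argparse.ArgumentTypeError(f"invalid fanout: {text!r}")
    return values


# Hypothetical stand-alone parser, only to demonstrate the converter.
parser = argparse.ArgumentParser()
parser.add_argument("--fanout", type=parse_fanout, default=[25, 10])

args = parser.parse_args(["--fanout", "[25,10]"])
print(args.fanout)  # [25, 10]
```

Using a converter function as `type` keeps the command-line syntax shown above unchanged while producing a list of integers.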
When the system prints "System is ready for serving", start training in another session. Before that, set up NVIDIA MPS (Multi-Process Service) on the GPU as follows:
$ export CUDA_VISIBLE_DEVICES=0 # Example using GPU0, adjust for other GPUs as needed
$ sudo nvidia-smi -i 0 -c EXCLUSIVE_PROCESS # Set GPU0 to exclusive process mode
$ sudo nvidia-cuda-mps-control -d # Start the MPS service
# ====== check =========
$ ps -ef | grep mps # After starting successfully, the corresponding process can be seen
# ====== configure =========
$ export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=80 # Assign the percentage of active SMs for the training backend
# ====== stop =========
$ sudo nvidia-smi -i 0 -c DEFAULT # Restore GPU to default mode
$ echo quit | sudo nvidia-cuda-mps-control # Stop the MPS service (as the user who started it, here root)
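`ps -ef | grep mps` also matches the `grep` process itself, so a scripted check is easier to trust if it compares the executable name. The following is a minimal sketch; `mps_daemon_running` is a hypothetical helper that inspects the text produced by `ps -eo args`.

```python
import os
import shutil
import subprocess

MPS_DAEMON = "nvidia-cuda-mps-control"


def mps_daemon_running(ps_args_output: str) -> bool:
    """Return True if any command line in `ps -eo args` output runs the MPS daemon."""
    for line in ps_args_output.splitlines():
        tokens = line.split()
        # Compare only the executable name, so 'grep mps' does not match.
        if tokens and os.path.basename(tokens[0]) == MPS_DAEMON:
            return True
    return False


if __name__ == "__main__":
    if shutil.which("ps"):
        output = subprocess.run(
            ["ps", "-eo", "args"], capture_output=True, text=True, check=True
        ).stdout
        print("MPS control daemon running:", mps_daemon_running(output))
```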
After Moment outputs "System is ready for serving", start training with:
$ sudo python3 training_backend/moment_graphsage.py --class_num 2 --features_num 1024 --hidden_dim 256 --epoch 2
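If you script both sessions, the training command can be gated on the ready message. The following is a minimal sketch; `wait_for_ready` is a hypothetical helper that scans an iterable of output lines (for example, the server's stdout) for the message quoted above. The sample lines are placeholders, not real Moment output.

```python
from typing import Iterable

READY_MARKER = "System is ready for serving"  # message quoted in the text above


def wait_for_ready(lines: Iterable[str], marker: str = READY_MARKER) -> bool:
    """Consume lines until one contains the marker; return False if input ends first."""
    for line in lines:
        if marker in line:
            return True
    return False


# Placeholder lines standing in for server output; in practice `lines`
# could be the stdout of a subprocess running moment_server.py.
sample_output = ["placeholder line 1", "System is ready for serving", "placeholder line 2"]
print(wait_for_ready(sample_output))  # True
```

Because the function stops at the first match, it can be passed a live stream and will return as soon as the server reports readiness.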