This repository contains the source code for the High Scale Checkpointing Replicator for ML training jobs.
A helper shell script builds and publishes the Docker image:
deploy/docker-build.sh <REGISTRY_PATH> [<IMAGE_NAME>]
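As a rough illustration of filling in the placeholders, the sketch below assembles the build command and prints it for review rather than running it. The registry path and image name are hypothetical values, not taken from this repository; substitute your own.

```shell
#!/bin/sh
# Hypothetical values (assumptions for illustration); substitute your own.
REGISTRY_PATH="us-docker.pkg.dev/my-project/my-repo"
IMAGE_NAME="replicator"

# Assemble the command; IMAGE_NAME is the optional second argument.
BUILD_CMD="deploy/docker-build.sh $REGISTRY_PATH $IMAGE_NAME"

# Print the command for review instead of running it.
echo "$BUILD_CMD"
```

Once the printed command looks right, run it from the repository root. Publishing will most likely require that Docker is already authenticated against the target registry.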
- deploy/: Docker building
- src/replicator: Replicator source code
The Checkpoint Replicator is designed to be hosted in multiple runtime environments.
To use it on Google Kubernetes Engine (GKE), you don't have to build or deploy it yourself, because a fully managed GKE add-on is available. See the following docs (MTC stands for Multi-Tier Checkpointing):
- MaxText MTC documentation (for training on TPUs)
- NeMo MTC recipe (for training on GPUs)
The fully managed GKE hosting controller is open source as well.