GoogleCloudPlatform/checkpoint-replicator

High Scale Checkpointing Replicator

This repository contains the source code for High Scale Checkpointing Replicator for ML training jobs.

Deployment

A helper shell script builds and publishes the Docker image:

deploy/docker-build.sh <REGISTRY_PATH> [<IMAGE_NAME>]

Directory Organization

  • deploy/: Docker build files
  • src/replicator: Replicator source code

Documentation

Checkpoint Replicator is designed to be hosted in multiple runtime environments.

To use it on Google Kubernetes Engine (GKE), you don't need to build or deploy it yourself: a fully managed GKE add-on is available. See the following docs (MTC stands for Multi-Tier Checkpointing):

The fully managed GKE hosting controller is also open source.