CoMMA (Collective coMMunication Analyzer)

This is the repository for CoMMA, a library that collects collective telemetry through NCCL's profiler plugin API.

Overview

CoMMA (Collective coMMunication Analyzer) is a library that plugs into NVIDIA's NCCL profiler plugin API to collect NCCL telemetry for Google Cloud services. It uses the profiler API, available since NCCL version 2.23, to extract and export detailed collective and network events from NCCL, including operation timelines, data transfer sizes, and algorithm choices.

CoMMA is designed for low-overhead tracing, making it ideal for performance-sensitive and long-running machine learning workloads such as large language model (LLM) training. For more information, see the AI Hypercomputer documentation.

Getting Started

CoMMA is pre-installed on most GCP AI Hypercomputer container and OS images, where it is enabled by default. For more information, see the AI Hypercomputer documentation.

If you use any of these OS or container images and want to stop CoMMA from collecting NCCL telemetry, see Disable CoMMA.

Installation

If you don't use any of these images and want to install CoMMA, use one of the following methods. For full installation instructions, see the AI Hypercomputer documentation.

Installation method	Supported machine types
Use NCCL gIB image (Recommended for newer machine types)	A4X, A4 High, and A3 Ultra
Use CoMMA installer image	A4X, A4 High, and A3 Ultra
Build from source (Required for older machine types)	A3 Mega, A3 High, A3 Edge, A2 Ultra, A2 Standard, and N1 with attached GPUs

Using CoMMA outside of GCP

CoMMA can also be used in non-GCP environments with NCCL version 2.23 or later. For those use cases, we recommend either using a CoMMA installer image or building from source.

Note that CoMMA integrates with GCP services to enable GCP-specific features. When running in a non-GCP environment, disable those integrations by setting the following environment variable:

NCCL_PROFILER_USE_GPUVIZ=false
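
As a minimal sketch, the variable can be exported in the shell that launches the workload so that every process started from it inherits the setting. The launch command in the comment is a hypothetical placeholder and not part of CoMMA; on multi-node jobs, the variable typically has to be set on every node.

```shell
# Disable the GCP-specific integration for a run outside of GCP.
export NCCL_PROFILER_USE_GPUVIZ=false

# Processes started from this shell inherit the setting.
# Replace the commented line with your own launch command (hypothetical placeholder):
#   torchrun --nproc_per_node=8 train.py

# Confirm the value that child processes will see.
echo "NCCL_PROFILER_USE_GPUVIZ=${NCCL_PROFILER_USE_GPUVIZ}"
```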

Understanding CoMMA Output

You can view the raw data that CoMMA collects by exporting its output to a local file. To do so, set the following environment variable:

NCCL_PROFILER_LATENCY_FILE=/tmp/latency-%p.txt
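
A sketch of enabling the file export for a single run and then locating the results. It assumes that `%p` is expanded per process (for example, to the process ID), which the pattern suggests but this README does not state, so the glob below is only a guess at the resulting file names.

```shell
# Write CoMMA output to a local file; single quotes keep the pattern literal.
export NCCL_PROFILER_LATENCY_FILE='/tmp/latency-%p.txt'

# Run the workload from this shell so it inherits the variable, then
# list whatever files were produced. The glob assumes %p was replaced
# with a per-process value (an assumption, see above).
ls /tmp/latency-*.txt 2>/dev/null || echo "no CoMMA output files found yet"
```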

For detailed instructions on configuring the granularity of telemetry export, see the AI Hypercomputer documentation.

The output is a list of JSON objects that provide detailed information about communication operations.

Contributing

Please read CONTRIBUTING.md for details on how to contribute to this project.

Licensing

CoMMA is licensed under the terms of the Apache license. See LICENSE for more information.
