Releases: GoogleCloudPlatform/cluster-toolkit
Release v1.59.1
What's Changed
Version Updates ⏫
- Hotfix release. Update Slurm images to 6.9 > 6.10, Ubuntu20.04 > 22.04, Debian11 > 12 by @mr0re1 in #4442
Full Changelog: v1.59.0...v1.59.1
Release v1.59.0
What's Changed
Module Improvements 🔨
- Adding ips_per_nat config for A family blueprints by @vikramvs-gg in #4366
Improvements 🛠
- Symlink /var/lib/mysql to "state disk" by @mr0re1 in #4374
- update chs commit by @RachaelSTamakloe in #4395
- Implement accelerator topology by @alyssa-sm in #4404
Version Updates ⏫
- gke-node-pool module to use "google" instead of "google-beta" provider by @kadupoornima in #4368
Bug fixes 🐞
- Fix additional disks for login nodes by @annuay-google in #4403
- Fix nvidia version mismatch for service images by @harshthakkar01 in #4413
New Contributors
- @sarthakag made their first contribution in #4406
- @rachit-google made their first contribution in #4408
Full Changelog: v1.58.1...v1.59.0
v1.58.1 Hotfix: Resolve a3u/a4h slurm nvidia version mismatch error
This is a hotfix to resolve the NVIDIA driver and library version mismatch error on a3-ultragpu and a4-highgpu Slurm clusters.
What's Changed
Bug fixes 🐞
- Resolve a3u/a4h slurm nvidia version mismatch error by @RachaelSTamakloe in #4409
Full Changelog: v1.58.0...v1.58.1
Release v1.58.0
Highlights
- Support for GKE H4D instances: A new blueprint has been added for deploying GKE clusters with H4D instances
- Deprecation of Parallelstore blueprints: The blueprints for deploying Parallelstore have been deprecated and removed.
What's Changed
Key New Features 🎉
- Add Kueue 0.12.2 and make it the new default by @mwysokin in #4312
- Add GKE H4D blueprint and integration test by @SwarnaBharathiMantena in #4396
Module Improvements 🔨
- [Bugfix] Applying K8s manifests to GKE clusters via URL by @shubpal07 in #4292
- Revert "[Bugfix] Applying K8s manifests to GKE clusters via URL" by @shubpal07 in #4350
Improvements 🛠
- Install CHS on A3m and Common image by @RachaelSTamakloe in #4334
- Remove IMEX and use default GPU driver in gke-a4x by @parulbajaj01 in #4367
- Implement async suspend.py by @alyssa-sm in #4363
- Eliminate code duplication and move chs download to shared.yaml by @RachaelSTamakloe in #4364
- Add support for Kueue/TAS in gke-a4x by @parulbajaj01 in #4375
Deprecations 💤
Bug fixes 🐞
- Remove Kueue topology annotation as DWS does not work with TAS (yet) by @SwarnaBharathiMantena in #4335
- Fix bug in tensorflow example 'text input must be of type str' by @nick-stroud in #4391
New Contributors
- @PayalJakhar made their first contribution in #4353
- @jhpriy made their first contribution in #4361
- @agrawalkhushi18 made their first contribution in #4358
Full Changelog: v1.57.2...v1.58.0
v1.57.2: automate nvidia-bug-report collection on GCE COS VM
v1.57.1: Add Kueue-0.12.2 and make it the default
Release v1.57.0
Highlights
- CHS integrations to GKE blueprints A3 Mega, A3 Ultra, and A4 by @ishitachail in #4293 #4321 #4323 #4328 #4330
What's Changed
Breaking changes 🚨
As part of #4275, the install_cloud_rdma_drivers.sh startup script is removed from H4D blueprints. Users should update to this version of Cluster Toolkit, because the latest HPC VM/Slurm images have compatible versions of the RDMA packages pre-installed.
Key New Features 🎉
- CHS Integration for A3 Mega by @ishitachail in #4293
- CHS Integration for A3 Ultra by @ishitachail in #4321
- CHS Integration for A4 by @ishitachail in #4323
- CHS for A3 Ultra by @ishitachail in #4328
- CHS for a4 by @ishitachail in #4330
- [update] new weight request form URL by @fschuerm in #4320
New Modules 🧱
Improvements 🛠
- update nccl to 1.0.6 by @cboneti in #4303
- enable wait_for_rollout for kubectl dependencies by @ighosh98 in #4305
Deprecations 💤
- Revamp install_cloud_rdma_drivers startup script by @abbas1902 in #4275
Bug fixes 🐞
- enable tas for kueue v0.11.4 by @ighosh98 in #4304
- Fix TAS Flag in v0.11.4 by @ighosh98 in #4319
- Remove Kueue topology annotation as DWS does not work with TAS (yet) by @SwarnaBharathiMantena in #4336
Full Changelog: v1.56.0...v1.57.0
Release v1.56.0
What's Changed
Breaking changes 🚨
A schema change for load_bq.py was introduced in v1.56.0.
- Fix job row insertion on load_bq.py by @abbas1902 in #4257
Improvements 🛠
- SlurmGCP Resume Improvements by @alyssa-sm in #4276
Version Updates ⏫
- Bump urllib3 from 2.3.0 to 2.5.0 in /community/front-end/ofe by @dependabot in #4296
- Bump protobuf from 5.29.3 to 5.29.5 in /community/front-end/ofe by @dependabot in #4286
- Bump requests from 2.32.3 to 2.32.4 in /community/front-end/ofe by @dependabot in #4285
Bug fixes 🐞
- Fix job row insertion on load_bq.py by @abbas1902 in #4257
Full Changelog: v1.55.1...v1.56.0
v1.55.1 Hotfix: Reduce the severity of missed metadata fetches
This hotfix reduces the severity of missed metadata fetches for newly supported metadata fields in Slurm-GCP.
What's Changed
- Reduce log pollution of failed Metadata fetch by @abbas1902 in #4290
Full Changelog: v1.55.0...v1.55.1
Release v1.55.0
Highlights
- New blueprint example that lets you create a high-throughput execution environment for Google DeepMind's AlphaFold 3
- Updated A3-Ultra GCSFuse example blueprint to align with best practices
What's Changed
Key New Features 🎉
Improvements 🛠
- Removing MGLRU dependency from Google cloud cluster toolkit by @shubpal07 in #4255
- Modify reservation variable to accommodate different reservation options by @SwarnaBharathiMantena in #4253
- kubernetes provider module implementation by @ighosh98 in #4247
- Align GCSFuse configurations with best practices by @samskillman in #4263
- Information on DWS Calendar consumption option in GKE blueprint by @SwarnaBharathiMantena in #4259
Version Updates ⏫
- Bump django from 5.1.9 to 5.1.10 in /community/front-end/ofe by @dependabot in #4248
Bug fixes 🐞
- Kueue Config Integration Tests incorporating different Accelerator types for different machines by @ishitachail in #4252
Full Changelog: v1.54.0...v1.55.0