@shubpal07
Contributor

This PR refactors the Kueue installation to use the official Helm chart instead of manually maintained, static YAML files. This significantly improves maintainability, simplifies future upgrades, and aligns our toolkit with standard practices.

The previous file-based system was brittle and made updating Kueue a risky, manual process.

Implementation

This change implements a two-stage installation pattern:

Stage 1 (Helm Install): The old file-based install_kueue module is replaced with a new one that uses our generic helm_install module to deploy the official Kueue OCI chart.

Stage 2 (Kubectl Configure): The existing configure_kueue module is preserved to apply our dynamic, blueprint-specific custom resources (ResourceFlavor, ClusterQueue) after the Helm installation completes. A depends_on block was added to ensure correct ordering.

As part of this, the old kueue-*.yaml manifest files and the local variables referencing them have been removed.
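
The two-stage ordering described above can be sketched in Terraform roughly as follows. This is an illustration only, not the code in this PR: it uses the hashicorp/helm provider's helm_release resource directly rather than the Toolkit's helm_install module (whose inputs are not shown here), and the namespace, module path, variable name, and manifest_path input are all hypothetical placeholders.

```hcl
# Hypothetical input: path to the rendered blueprint-specific Kueue config.
variable "kueue_config_path" {
  type        = string
  description = "Path to a manifest containing ResourceFlavor/ClusterQueue objects"
}

# Stage 1: install Kueue from the official OCI Helm chart.
# No chart version is pinned here; a real deployment would likely pin one.
resource "helm_release" "kueue" {
  name             = "kueue"
  repository       = "oci://registry.k8s.io/kueue/charts"
  chart            = "kueue"
  namespace        = "kueue-system" # assumed namespace
  create_namespace = true
  wait             = true # block until the release's resources are ready
}

# Stage 2: apply the blueprint-specific custom resources.
# The source path and the manifest_path input are placeholders.
module "configure_kueue" {
  source        = "./configure_kueue"
  manifest_path = var.kueue_config_path

  # Ensure the Kueue CRDs and controller exist before applying
  # ResourceFlavor and ClusterQueue objects.
  depends_on = [helm_release.kueue]
}
```

The explicit depends_on matters because nothing in the Stage 2 inputs references the Helm release, so Terraform would otherwise be free to apply both stages in parallel.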

Validation

Successfully deployed and validated end-to-end using the gke-a3-ultragpu blueprint.

Helm Release: Verified helm list shows the kueue release as deployed.

Custom Config: Verified kubectl get resourceflavor,clusterqueue shows the correct, dynamically configured resources.

Performance: A final NCCL benchmark ran successfully, confirming expected hardware performance (~800 GB/s bus bandwidth).

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@shubpal07 shubpal07 requested a review from vikramvs-gg August 19, 2025 06:26
@shubpal07 shubpal07 self-assigned this Aug 19, 2025
@shubpal07 shubpal07 requested review from a team and samskillman as code owners August 19, 2025 06:26
@shubpal07 shubpal07 added the release-module-improvements label (Added to release notes under the "Module Improvements" heading.) Aug 19, 2025
@shubpal07 shubpal07 requested a review from bytetwin October 6, 2025 07:51
@shubpal07 shubpal07 force-pushed the shubham/kueue-on-helm branch 2 times, most recently from 01cc91b to e707661 Compare October 6, 2025 15:02
This commit refactors the Kueue scheduler installation, migrating it from manually managed static YAML files to the official Kueue Helm chart.
@shubpal07 shubpal07 force-pushed the shubham/kueue-on-helm branch from e707661 to 3dba1f2 Compare October 7, 2025 09:46
@shubpal07 shubpal07 requested a review from bytetwin October 7, 2025 09:56

PR-test-gke-a2-highgpu-kueue (hpc-toolkit-dev): Successful in 35m
PR-test-gke-a3-megagpu (hpc-toolkit-dev): Successful in 424m
PR-test-gke-a3-ultragpu (hpc-toolkit-dev): Successful in 42m

@shubpal07 shubpal07 merged commit 9cd6551 into GoogleCloudPlatform:develop Oct 9, 2025
16 of 65 checks passed