Support NGC Pytorch containers #741

@terrykong

Right now NeMo RL uses uv, and that has worked for us so far. With newer fixes and next-gen hardware, we'll have to support PyTorch coming from NGC containers as opposed to the official wheels on the torch index.

As an example, we won't be able to use torch 2.8 wheels (when they come out) because they do not ship with a new enough NCCL for Blackwell.

This requires some design, but initial thoughts on directions to try (in order of complexity) are:

  1. Add a second Dockerfile (w/ NGC PyTorch base) and a switch in NeMo RL to use a "non-isolated" environment
    • this would mean we try to install everything in the system environment
    • may need to think about how to ensure libraries like vllm do not override the system torch
    • need a switch in nemo-rl to use PY_EXECUTABLES.SYSTEM everywhere
  2. Keep the existing isolated environments, but somehow constrain the uv resolver to not install torch (and its transitive dependencies, e.g., PyPI nccl/nvidia wheels)
  • This path should not be made the default in NeMo RL, since that would prevent us from being "pip installable"
  • Ideally we keep this functionality in the main branch.
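
To make direction 1 a bit more concrete, here is a rough sketch of what the switch and the "do not override system torch" guard could look like. Everything named here is an assumption for illustration only: the `NEMO_RL_USE_SYSTEM_PYTHON` environment variable, the `select_py_executable` and `pin_installed` helpers, and the `"uv run python"` placeholder are hypothetical and are not existing NeMo RL APIs. The guard simply pins whatever version of a package is already installed, so that a later install step could be given those pins as constraints.

```python
import os
import sys
from importlib import metadata

# Hypothetical switch name; not an existing NeMo RL setting.
USE_SYSTEM_ENV_VAR = "NEMO_RL_USE_SYSTEM_PYTHON"


def select_py_executable(isolated_executable: str, environ=None) -> str:
    """Return the interpreter command a worker should be launched with.

    When the (hypothetical) switch is set to "1", use the interpreter that is
    already running, i.e. the container's system Python with its own torch.
    Otherwise fall back to the isolated (uv-managed) command passed in.
    """
    environ = os.environ if environ is None else environ
    if environ.get(USE_SYSTEM_ENV_VAR, "0") == "1":
        return sys.executable
    return isolated_executable


def pin_installed(packages, version_of=metadata.version) -> list[str]:
    """Build constraint lines pinning packages to their installed versions.

    Packages that are not installed are skipped, so the same list of names
    can be reused in environments that lack some of them.
    """
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={version_of(name)}")
        except metadata.PackageNotFoundError:
            continue
    return lines


if __name__ == "__main__":
    # "uv run python" is only a placeholder for the isolated command.
    print(select_py_executable("uv run python"))
    print("\n".join(pin_installed(["torch"])))
```

The lines returned by `pin_installed` could be written to a constraints file and passed to the installer (both pip and `uv pip install` accept `-c`), which is one possible way to keep a dependency such as vllm from replacing the container's torch. Whether that is sufficient for the NGC builds is untested here.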
