
Initialize and shutdown ray session in each executor #844


Open
wants to merge 7 commits into base: ray-api

Conversation

@praateekmahajan (Contributor) commented Jul 22, 2025

Description

This PR's goal is to ensure we are able to run different executor pipelines in the same Python process, e.g. a Xenna pipeline followed by a Ray Data pipeline (and vice versa). This is currently not possible without shutting Ray down, because Ray preserves environment variables from previous sessions.

Even if https://github.com/nvidia-cosmos/cosmos-xenna/pull/6/files does not merge, this PR works as expected. Xenna sets os.environ["RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES"], which means that if Ray Data runs after Xenna, Ray Data will not set CUDA_VISIBLE_DEVICES, i.e. every worker would use gpu_id=0. We solve that by forcefully setting RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES="" in the environment variables when doing ray.init inside Ray Data.
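A minimal sketch of that reset (an assumption of where the override happens; the PR may wire it differently):

```python
import os

import ray

# Clear the flag that a previous Xenna session may have left behind, so that
# Ray Data goes back to managing CUDA_VISIBLE_DEVICES per worker instead of
# every worker defaulting to gpu_id=0.
os.environ["RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES"] = ""
ray.init(ignore_reinit_error=True)
```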

Workarounds needed

However, because we now introduce ray.shutdown():

  1. Running two subsequent Ray Data pipelines will fail due to [data] Failed to submit task to actor after ray.shutdown() and re-ray.init() in data pipeline for an existing cluster ray-project/ray#54841.
  2. Whenever we create a Ray cluster inside the process, it'll be killed by the first pipeline's shutdown. To avoid that, we can use subprocess.run to create our cluster (see scenario 4 below).

See the following scenarios for a clearer understanding:

  1. You start a Ray cluster outside the process and run pipelines in the same process:

```bash
ray start --head --port 1234
```

```python
os.environ["RAY_ADDRESS"] = "localhost:1234"
ray.init(ignore_reinit_error=True)  # this will connect to 1234
run_pipeline("xenna")
ray.shutdown()

ray.init(ignore_reinit_error=True)  # this will connect to 1234
run_pipeline("ray data")
ray.shutdown()
```
  2. You start a Ray cluster inside the process and run pipelines in the same process:

```python
ray.init(ignore_reinit_error=True)  # this will start a new cluster
run_pipeline("xenna")
ray.shutdown()  # this will kill the ray cluster

ray.init(ignore_reinit_error=True)  # this will start a new cluster
run_pipeline("xenna")
ray.shutdown()
```
  3. You start a Ray cluster explicitly inside the new process:

```python
ray.init(ignore_reinit_error=True)  # this will start a new cluster A

ray.init(ignore_reinit_error=True)  # this will connect to cluster A
run_pipeline("xenna")
ray.shutdown()  # this will kill cluster A

ray.init(ignore_reinit_error=True)  # this will start cluster B
run_pipeline("xenna")
ray.shutdown()
```
  4. You use subprocess to start a Ray cluster:

```python
subprocess.run(["ray", "start", "--head", "--port", "1234"])  # this will start a new cluster A

ray.init(ignore_reinit_error=True)  # this will connect to cluster A @ 1234
run_pipeline("xenna")
ray.shutdown()  # this will only exit the session; cluster A keeps running

ray.init(ignore_reinit_error=True)  # this will connect to cluster A @ 1234
run_pipeline("xenna")
ray.shutdown()
```

Usage

# Add snippet demonstrating usage
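Pending that snippet, a hypothetical sketch of the flow this PR enables (`run_pipeline` is the placeholder used in the scenarios above, not this repo's API):

```python
# Hypothetical sketch: with this PR, each executor calls ray.init() and
# ray.shutdown() itself, so back-to-back executors start from a clean
# environment without any manual session management by the caller.
run_pipeline("xenna")
run_pipeline("ray data")
```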

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

@abhinavg4 (Contributor) left a comment:

Left some minor comments. Should add a bunch of TODOs but apart from that looks nice.

```python
output_tasks: list[Task] = []
try:
    # Initialize ray
    ray.init(ignore_reinit_error=True)
```
abhinavg4 (Contributor):
NIT: Can we please add a comment here explaining why we are doing this again, since getting the client might also ray init/ray start?

abhinavg4 (Contributor):
Also, should this have the loguru serializer?

praateekmahajan (Contributor, Author):
This does have a loguru serializer a few lines above. Added a comment too.

```python
    output_tasks = self._dataset_to_tasks(current_dataset)
    logger.info(f"Pipeline completed. Final results: {len(output_tasks)} tasks")
finally:
    ray.shutdown()
```
abhinavg4 (Contributor):
Same as above, a comment here would be useful. If you have made an issue out of our findings, we can just link that here.

```diff
@@ -113,13 +114,14 @@ def execute(self, stages: list[ProcessingStage], initial_tasks: list[Task] | Non
         logger.info(f"Execution mode: {exec_mode.name}")

         try:
-            # Run the pipeline
+            # Run the pipeline (this will initialize ray)
```
abhinavg4 (Contributor):
My recommendation would be to add ray.init here as well, along with the Ray loguru serializer, so that we are not dependent on Xenna and the call is exactly the same across executors.
If we add ray.init here, the ray.init inside Xenna will effectively be useless, and I think that is what we want?

praateekmahajan (Contributor, Author):
Yup, made that change, and in fact that helps us solve this PR without Xenna changes too (along with one more change).

```python
from ray.cluster_utils import Cluster


def find_free_port():
```
abhinavg4 (Contributor):
I think this function might be present in the actual code too. Can we add a TODO to use that and remove this?

praateekmahajan (Contributor, Author):
The reason we find the port ourselves is so that we can subsequently connect to that port using ray.init (instead of address="auto", to avoid connecting to another cluster in case multiple are running).
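For reference, a common sketch of such a helper (binding port 0 so the OS picks an unused one; the PR's actual implementation may differ):

```python
import socket


def find_free_port() -> int:
    # Ask the OS for an unused port by binding to port 0, then release it.
    # The returned port can be passed to `ray start --port` and later to
    # ray.init(), so we connect to exactly the cluster we started.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]
```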

```python
        str(2 * ONE_GB),
        "--block",
    ],
    env={**os.environ, "RAY_MAX_LIMIT_FROM_API_SERVER": "40000", "RAY_MAX_LIMIT_FROM_DATA_SOURCE": "40000"},
```
abhinavg4 (Contributor):
Add a TODO to use get_client here in the future?

praateekmahajan (Contributor, Author):
Added, I hope we can. The only nuance is that we need to know the PID of the process that was started by ray start, so that we can kill it without doing ray stop, which might kill all Ray processes.
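A sketch of that idea (assuming `ray start --head --block` is kept in the foreground so the handle we hold is the process to kill; the PR's test helper may differ):

```python
import subprocess

# `--block` keeps `ray start` in the foreground, so this Popen handle's PID
# is the one process we need to kill. Calling `ray stop` instead would stop
# every Ray process on the machine.
head = subprocess.Popen(["ray", "start", "--head", "--port", "1234", "--block"])
try:
    ...  # run pipelines against the cluster at port 1234
finally:
    head.terminate()  # kill only the cluster we started
    head.wait()
```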
