Adding get client feature #834

abhinavg4 · 2025-07-20T07:47:04Z

Description

This PR adds a get client function.

It sets up prometheus and graphana for the user and also allows customizing almost any parameter for Ray cluster.

Usage

get_client()

# Add snippet demonstrating usage

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

…ersion control. Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>

…ation - Introduced `grafana.ini` for Grafana security and paths configuration. - Added `xenna_graphana_dashboard.json` for a comprehensive dashboard setup. - Created provisioning files for dashboards and data sources to streamline Grafana setup. - Implemented a script to launch Prometheus and Grafana with necessary configurations. This setup enhances the monitoring capabilities for Ray metrics, providing a structured approach to visualize and manage performance data. Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>

…tions - Added support for starting Ray through the script with the `--start_ray` option. - Implemented a help function to guide users on script usage and available options. - Improved error handling for unknown command line arguments. This update streamlines the process of launching monitoring tools for Ray metrics, allowing for a more flexible and user-friendly experience. Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>

Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>

….gitignore by removing the .metrics/ entry. This streamlines the project by eliminating obsolete files and ensuring that unnecessary directories are not tracked in version control. Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>

- Expanded the docstring for the get_ray_client function to include additional arguments such as executor, ray_temp_dir, prometheus_web_port, grafana_web_port, ray_dashboard_host, num_gpus, num_cpus, and enable_object_spilling. - This enhancement improves clarity and usability for developers utilizing the function, ensuring they understand the parameters required for proper configuration. Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>

ayushdg · 2025-07-21T19:40:16Z

ray-curator/ray_curator/core/client.py

+    ray_dashboard_host: str = DEFAULT_RAY_DASHBOARD_HOST,
+    num_gpus: int | None = None,
+    num_cpus: int | None = None,
+    enable_object_spilling: bool = False,


Do we want to expose kwargs here to also pass in kwargs to ray init?

We need to use ray start so there is no 1-1 mapping of the python and the command line arguments. I can use something like onto_slurm for making RayConfig and we can add stuff to it if required .

ayushdg · 2025-07-21T19:41:43Z

ray-curator/ray_curator/core/client.py

+        downloaded, file_name = download_prometheus()
+        if not downloaded:
+            logger.error("Failed to download Prometheus.")
+            sys.exit(1)


Should Curator be responsible for downloading prometheus? Maybe it's a requirement for users to do instead?

Hmmm, we can have a parameter, but I think in the spirit of making things e2e, I would prefer we do download

ayushdg · 2025-07-21T19:44:32Z

ray-curator/ray_curator/core/client.py

+        prometheus_config_path = os.path.join(DEFAULT_NEMO_CURATOR_METRICS_PATH, "prometheus.yml")
+        with open(prometheus_config_path, "w") as f:
+            f.write(
+                PROMETHEUS_YAML_TEMPLATE.format(
+                    service_discovery_path=os.path.join(ray_temp_dir, "prom_metrics_service_discovery.json")
+                )
+            )
+
+        prometheus_cmd = [
+            f"{prometheus_dir}/prometheus",
+            "--config.file",
+            str(prometheus_config_path),
+            "--web.enable-lifecycle",
+            f"--web.listen-address=:{prometheus_web_port}",


might be cleaner to write a separate method responsible for this logic. Similar to download_prometheus() something like start_prometheus()?

ayushdg · 2025-07-21T19:44:49Z

ray-curator/ray_curator/core/client.py

+            arch = platform.machine()
+            if arch not in ("x86_64", "amd64"):
+                logger.warning(
+                    "Automatic Grafana installation is only tested on x86_64/amd64 architectures. "
+                    "Please install Grafana manually if the following steps fail."
+                )


Similar comment as above

ayushdg · 2025-07-21T19:45:05Z

ray-curator/ray_curator/core/client.py

+                    "Please install Grafana manually if the following steps fail."
+                )
+
+            grafana_version = "12.0.2"


Should this go to constants?

ayushdg · 2025-07-21T19:45:51Z

ray-curator/ray_curator/core/client.py

+            msg = "num_gpus is only supported for Xenna. For other executors, use CUDA_VISIBLE_DEVICES"
+            raise ValueError(
+                msg,
+            )


Can you elaborate?

This will change

ayushdg · 2025-07-21T19:47:30Z

ray-curator/ray_curator/core/client.py

+    else:
+        ray_dashboard_port = get_next_free_port(ray_dashboard_port)
+        ray_metrics_port = get_next_free_port(ray_metrics_port)
+        ray_port = get_next_free_port(ray_port)


This should reuse the args provided from the spec/function args

ayushdg · 2025-07-21T19:50:04Z

ray-curator/ray_curator/core/utils.py

+    if executor == "Xenna":
+        from cosmos_xenna.ray_utils.cluster import API_LIMIT
+
+        # We need to set this env var to avoid ray from setting CUDA_VISIBLE_DEVICES.
+        # We set these manually in Xenna because we allocate the gpus manually instead of relying on ray's mechanisms.
+        # This will *only* get picked up from here if the cluster is started from this script. In the case of previously
+        # existing clusters, this needs to be set in the processes that set up the cluster.
+        os.environ["RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES"] = "0"
+        # These need to be set to allow listing debug info about more than 10k actors.
+        os.environ["RAY_MAX_LIMIT_FROM_API_SERVER"] = str(API_LIMIT)
+        os.environ["RAY_MAX_LIMIT_FROM_DATA_SOURCE"] = str(API_LIMIT)
+        os.environ["XENNA_RAY_METRICS_PORT"] = str(ray_metrics_port)


There may be a possibility of moving some of these args like RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES into actor/stage runtime_env args in the xenna wrapper or executor so that the get_client call is executor agnostic.

ayushdg · 2025-07-21T19:50:35Z

ray-curator/ray_curator/core/utils.py

+    ip_address = socket.gethostbyname(socket.gethostname())
+    ray_command = ["ray", "start", "--head"]
+    ray_command.extend(["--node-ip-address", ip_address])
+    ray_command.extend(["--port", str(ray_port)])
+    ray_command.extend(["--metrics-export-port", str(ray_metrics_port)])
+    ray_command.extend(["--dashboard-host", ray_dashboard_host])
+    ray_command.extend(["--dashboard-port", str(ray_dashboard_port)])
+    ray_command.extend(["--temp-dir", ray_temp_dir])
+    ray_command.extend(["--disable-usage-stats"])
+    if enable_object_spilling:
+        ray_command.extend(
+            [
+                "--system-config",
+                '{"local_fs_capacity_threshold": 0.95, "object_spilling_config": "{ "type": "filesystem", "params": {"directory_path": "/tmp/ray_spill", "buffer_size": 1000000 } }"}',


Can we not use ray.init() instead of starting via CLI?

I initially had ray init only but I think it does not support some of the features we need. Same was observed with cosmos curate:

https://github.com/nvidia-cosmos/cosmos-curate/blob/21a1f6b7414a7f3ab3ab7b9369689543e47bb85e/cosmos_curate/scripts/onto_slurm.py#L23

sarahyurick

I was able to run the quickstart and access the dashboards. I was wondering if you can create a user doc for this as well?

sarahyurick · 2025-07-21T19:35:38Z

ray-curator/ray_curator/core/client.py

+    GRAPHANA_DASHBOARD_YAML_TEMPLATE,
+    GRAPHANA_DATASOURCE_YAML_TEMPLATE,
+    GRAPHANA_INI_TEMPLATE,


Is it Graphana or Grafana?

praateekmahajan · 2025-07-24T21:44:39Z

ray-curator/tests/test_get_ray_client.py

+
+    # TODO: This will kill all the ray processes. Find a better way to do this.
+    # Cleanup : stop any Ray processes we started
+    subprocess.run(["ray", "stop", "--force"], check=False, capture_output=True)  # noqa: S603, S607


yup this todo might be important.. would hate for it to kill my existing job if i end up running it on my cluster while i have the job.. i had to hack it in conftest in this PR too #844

praateekmahajan

Took a high level look it mostly looks good.. here are some nits

get_client by default could only get client. and we grafana prometheus behind a boolean which is false.
let's refactor it such that grafana prometheus are behind functions so get_client is super trivial to follow.. at the end of the day get_client should get client and everything else is nice to have
instead of core move it to core/client; similar for test let's follow the same nested structure instead of parent level
I love your test for test_two_clusters_metrics_visible. If we can add similarly more tests that'll be great too!

abhinavg4 added 6 commits July 20, 2025 07:43

Add .metrics/ directory to .gitignore to exclude metrics files from v…

bdf284e

…ersion control. Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>

Adding get client

687befc

Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>

abhinavg4 requested review from praateekmahajan, ayushdg, suiyoubi and sarahyurick July 20, 2025 07:57

ayushdg requested changes Jul 21, 2025

View reviewed changes

sarahyurick reviewed Jul 21, 2025

View reviewed changes

sarahyurick mentioned this pull request Jul 21, 2025

Add prometheus and graphana dashboards #793

Closed

3 tasks

praateekmahajan reviewed Jul 24, 2025

View reviewed changes

praateekmahajan requested changes Jul 24, 2025

View reviewed changes

Adding get client feature #834

Are you sure you want to change the base?

Adding get client feature #834

Uh oh!

Conversation

abhinavg4 commented Jul 20, 2025

Description

Usage

Checklist

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sarahyurick left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

praateekmahajan left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!