这是indexloc提供的服务,不要输入任何密码
Skip to content

Adding get client feature #834

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: ray-api
Choose a base branch
from
Open

Adding get client feature #834

wants to merge 6 commits into from

Conversation

abhinavg4
Copy link
Contributor

Description

This PR adds a get client function.

It sets up prometheus and graphana for the user and also allows customizing almost any parameter for Ray cluster.

Usage

get_client()

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

…ersion control.

Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
…ation

- Introduced `grafana.ini` for Grafana security and paths configuration.
- Added `xenna_graphana_dashboard.json` for a comprehensive dashboard setup.
- Created provisioning files for dashboards and data sources to streamline Grafana setup.
- Implemented a script to launch Prometheus and Grafana with necessary configurations.

This setup enhances the monitoring capabilities for Ray metrics, providing a structured approach to visualize and manage performance data.

Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
…tions

- Added support for starting Ray through the script with the `--start_ray` option.
- Implemented a help function to guide users on script usage and available options.
- Improved error handling for unknown command line arguments.

This update streamlines the process of launching monitoring tools for Ray metrics, allowing for a more flexible and user-friendly experience.

Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
….gitignore by removing the .metrics/ entry. This streamlines the project by eliminating obsolete files and ensuring that unnecessary directories are not tracked in version control.

Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
- Expanded the docstring for the get_ray_client function to include additional arguments such as executor, ray_temp_dir, prometheus_web_port, grafana_web_port, ray_dashboard_host, num_gpus, num_cpus, and enable_object_spilling.
- This enhancement improves clarity and usability for developers utilizing the function, ensuring they understand the parameters required for proper configuration.

Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
ray_dashboard_host: str = DEFAULT_RAY_DASHBOARD_HOST,
num_gpus: int | None = None,
num_cpus: int | None = None,
enable_object_spilling: bool = False,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to expose kwargs here to also pass in kwargs to ray init?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to use ray start so there is no 1-1 mapping of the python and the command line arguments. I can use something like onto_slurm for making RayConfig and we can add stuff to it if required .

Comment on lines +156 to +159
downloaded, file_name = download_prometheus()
if not downloaded:
logger.error("Failed to download Prometheus.")
sys.exit(1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should Curator be responsible for downloading prometheus? Maybe it's a requirement for users to do instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, we can have a parameter, but I think in the spirit of making things e2e, I would prefer we do download

Comment on lines +174 to +187
prometheus_config_path = os.path.join(DEFAULT_NEMO_CURATOR_METRICS_PATH, "prometheus.yml")
with open(prometheus_config_path, "w") as f:
f.write(
PROMETHEUS_YAML_TEMPLATE.format(
service_discovery_path=os.path.join(ray_temp_dir, "prom_metrics_service_discovery.json")
)
)

prometheus_cmd = [
f"{prometheus_dir}/prometheus",
"--config.file",
str(prometheus_config_path),
"--web.enable-lifecycle",
f"--web.listen-address=:{prometheus_web_port}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be cleaner to write a separate method responsible for this logic. Similar to download_prometheus() something like start_prometheus()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100%

Comment on lines +221 to +226
arch = platform.machine()
if arch not in ("x86_64", "amd64"):
logger.warning(
"Automatic Grafana installation is only tested on x86_64/amd64 architectures. "
"Please install Grafana manually if the following steps fail."
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment as above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

"Please install Grafana manually if the following steps fail."
)

grafana_version = "12.0.2"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this go to constants?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup

Comment on lines +328 to +331
msg = "num_gpus is only supported for Xenna. For other executors, use CUDA_VISIBLE_DEVICES"
raise ValueError(
msg,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will change

else:
ray_dashboard_port = get_next_free_port(ray_dashboard_port)
ray_metrics_port = get_next_free_port(ray_metrics_port)
ray_port = get_next_free_port(ray_port)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should reuse the args provided from the spec/function args

Comment on lines +73 to +84
if executor == "Xenna":
from cosmos_xenna.ray_utils.cluster import API_LIMIT

# We need to set this env var to avoid ray from setting CUDA_VISIBLE_DEVICES.
# We set these manually in Xenna because we allocate the gpus manually instead of relying on ray's mechanisms.
# This will *only* get picked up from here if the cluster is started from this script. In the case of previously
# existing clusters, this needs to be set in the processes that set up the cluster.
os.environ["RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES"] = "0"
# These need to be set to allow listing debug info about more than 10k actors.
os.environ["RAY_MAX_LIMIT_FROM_API_SERVER"] = str(API_LIMIT)
os.environ["RAY_MAX_LIMIT_FROM_DATA_SOURCE"] = str(API_LIMIT)
os.environ["XENNA_RAY_METRICS_PORT"] = str(ray_metrics_port)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There may be a possibility of moving some of these args like RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES into actor/stage runtime_env args in the xenna wrapper or executor so that the get_client call is executor agnostic.

Comment on lines +86 to +99
ip_address = socket.gethostbyname(socket.gethostname())
ray_command = ["ray", "start", "--head"]
ray_command.extend(["--node-ip-address", ip_address])
ray_command.extend(["--port", str(ray_port)])
ray_command.extend(["--metrics-export-port", str(ray_metrics_port)])
ray_command.extend(["--dashboard-host", ray_dashboard_host])
ray_command.extend(["--dashboard-port", str(ray_dashboard_port)])
ray_command.extend(["--temp-dir", ray_temp_dir])
ray_command.extend(["--disable-usage-stats"])
if enable_object_spilling:
ray_command.extend(
[
"--system-config",
'{"local_fs_capacity_threshold": 0.95, "object_spilling_config": "{ "type": "filesystem", "params": {"directory_path": "/tmp/ray_spill", "buffer_size": 1000000 } }"}',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not use ray.init() instead of starting via CLI?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially had ray init only but I think it does not support some of the features we need. Same was observed with cosmos curate:

https://github.com/nvidia-cosmos/cosmos-curate/blob/21a1f6b7414a7f3ab3ab7b9369689543e47bb85e/cosmos_curate/scripts/onto_slurm.py#L23

Copy link
Contributor

@sarahyurick sarahyurick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was able to run the quickstart and access the dashboards. I was wondering if you can create a user doc for this as well?

Comment on lines +25 to +27
GRAPHANA_DASHBOARD_YAML_TEMPLATE,
GRAPHANA_DATASOURCE_YAML_TEMPLATE,
GRAPHANA_INI_TEMPLATE,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it Graphana or Grafana?


# TODO: This will kill all the ray processes. Find a better way to do this.
# Cleanup : stop any Ray processes we started
subprocess.run(["ray", "stop", "--force"], check=False, capture_output=True) # noqa: S603, S607
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup this todo might be important.. would hate for it to kill my existing job if i end up running it on my cluster while i have the job.. i had to hack it in conftest in this PR too #844

Copy link
Contributor

@praateekmahajan praateekmahajan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a high level look it mostly looks good.. here are some nits

  1. get_client by default could only get client. and we grafana prometheus behind a boolean which is false.
  2. let's refactor it such that grafana prometheus are behind functions so get_client is super trivial to follow.. at the end of the day get_client should get client and everything else is nice to have
  3. instead of core move it to core/client; similar for test let's follow the same nested structure instead of parent level
  4. I love your test for test_two_clusters_metrics_visible. If we can add similarly more tests that'll be great too!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants