-
Notifications
You must be signed in to change notification settings - Fork 154
Adding get client feature #834
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: ray-api
Are you sure you want to change the base?
Conversation
…ersion control. Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
…ation - Introduced `grafana.ini` for Grafana security and paths configuration. - Added `xenna_graphana_dashboard.json` for a comprehensive dashboard setup. - Created provisioning files for dashboards and data sources to streamline Grafana setup. - Implemented a script to launch Prometheus and Grafana with necessary configurations. This setup enhances the monitoring capabilities for Ray metrics, providing a structured approach to visualize and manage performance data. Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
…tions - Added support for starting Ray through the script with the `--start_ray` option. - Implemented a help function to guide users on script usage and available options. - Improved error handling for unknown command line arguments. This update streamlines the process of launching monitoring tools for Ray metrics, allowing for a more flexible and user-friendly experience. Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
….gitignore by removing the .metrics/ entry. This streamlines the project by eliminating obsolete files and ensuring that unnecessary directories are not tracked in version control. Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
- Expanded the docstring for the get_ray_client function to include additional arguments such as executor, ray_temp_dir, prometheus_web_port, grafana_web_port, ray_dashboard_host, num_gpus, num_cpus, and enable_object_spilling. - This enhancement improves clarity and usability for developers utilizing the function, ensuring they understand the parameters required for proper configuration. Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
ray_dashboard_host: str = DEFAULT_RAY_DASHBOARD_HOST, | ||
num_gpus: int | None = None, | ||
num_cpus: int | None = None, | ||
enable_object_spilling: bool = False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to expose kwargs here to also pass in kwargs to ray init?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to use ray start so there is no 1-1 mapping of the python and the command line arguments. I can use something like onto_slurm for making RayConfig and we can add stuff to it if required .
downloaded, file_name = download_prometheus() | ||
if not downloaded: | ||
logger.error("Failed to download Prometheus.") | ||
sys.exit(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should Curator be responsible for downloading prometheus? Maybe it's a requirement for users to do instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, we can have a parameter, but I think in the spirit of making things e2e, I would prefer we do download
prometheus_config_path = os.path.join(DEFAULT_NEMO_CURATOR_METRICS_PATH, "prometheus.yml") | ||
with open(prometheus_config_path, "w") as f: | ||
f.write( | ||
PROMETHEUS_YAML_TEMPLATE.format( | ||
service_discovery_path=os.path.join(ray_temp_dir, "prom_metrics_service_discovery.json") | ||
) | ||
) | ||
|
||
prometheus_cmd = [ | ||
f"{prometheus_dir}/prometheus", | ||
"--config.file", | ||
str(prometheus_config_path), | ||
"--web.enable-lifecycle", | ||
f"--web.listen-address=:{prometheus_web_port}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might be cleaner to write a separate method responsible for this logic. Similar to download_prometheus()
something like start_prometheus()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
100%
arch = platform.machine() | ||
if arch not in ("x86_64", "amd64"): | ||
logger.warning( | ||
"Automatic Grafana installation is only tested on x86_64/amd64 architectures. " | ||
"Please install Grafana manually if the following steps fail." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
"Please install Grafana manually if the following steps fail." | ||
) | ||
|
||
grafana_version = "12.0.2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this go to constants?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup
msg = "num_gpus is only supported for Xenna. For other executors, use CUDA_VISIBLE_DEVICES" | ||
raise ValueError( | ||
msg, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will change
else: | ||
ray_dashboard_port = get_next_free_port(ray_dashboard_port) | ||
ray_metrics_port = get_next_free_port(ray_metrics_port) | ||
ray_port = get_next_free_port(ray_port) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should reuse the args provided from the spec/function args
if executor == "Xenna": | ||
from cosmos_xenna.ray_utils.cluster import API_LIMIT | ||
|
||
# We need to set this env var to avoid ray from setting CUDA_VISIBLE_DEVICES. | ||
# We set these manually in Xenna because we allocate the gpus manually instead of relying on ray's mechanisms. | ||
# This will *only* get picked up from here if the cluster is started from this script. In the case of previously | ||
# existing clusters, this needs to be set in the processes that set up the cluster. | ||
os.environ["RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES"] = "0" | ||
# These need to be set to allow listing debug info about more than 10k actors. | ||
os.environ["RAY_MAX_LIMIT_FROM_API_SERVER"] = str(API_LIMIT) | ||
os.environ["RAY_MAX_LIMIT_FROM_DATA_SOURCE"] = str(API_LIMIT) | ||
os.environ["XENNA_RAY_METRICS_PORT"] = str(ray_metrics_port) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There may be a possibility of moving some of these args like RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES
into actor/stage runtime_env args in the xenna wrapper or executor so that the get_client
call is executor agnostic.
ip_address = socket.gethostbyname(socket.gethostname()) | ||
ray_command = ["ray", "start", "--head"] | ||
ray_command.extend(["--node-ip-address", ip_address]) | ||
ray_command.extend(["--port", str(ray_port)]) | ||
ray_command.extend(["--metrics-export-port", str(ray_metrics_port)]) | ||
ray_command.extend(["--dashboard-host", ray_dashboard_host]) | ||
ray_command.extend(["--dashboard-port", str(ray_dashboard_port)]) | ||
ray_command.extend(["--temp-dir", ray_temp_dir]) | ||
ray_command.extend(["--disable-usage-stats"]) | ||
if enable_object_spilling: | ||
ray_command.extend( | ||
[ | ||
"--system-config", | ||
'{"local_fs_capacity_threshold": 0.95, "object_spilling_config": "{ "type": "filesystem", "params": {"directory_path": "/tmp/ray_spill", "buffer_size": 1000000 } }"}', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we not use ray.init()
instead of starting via CLI?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I initially had ray init only but I think it does not support some of the features we need. Same was observed with cosmos curate:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was able to run the quickstart and access the dashboards. I was wondering if you can create a user doc for this as well?
GRAPHANA_DASHBOARD_YAML_TEMPLATE, | ||
GRAPHANA_DATASOURCE_YAML_TEMPLATE, | ||
GRAPHANA_INI_TEMPLATE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it Graphana or Grafana?
|
||
# TODO: This will kill all the ray processes. Find a better way to do this. | ||
# Cleanup : stop any Ray processes we started | ||
subprocess.run(["ray", "stop", "--force"], check=False, capture_output=True) # noqa: S603, S607 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup this todo might be important.. would hate for it to kill my existing job if i end up running it on my cluster while i have the job.. i had to hack it in conftest in this PR too #844
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a high level look it mostly looks good.. here are some nits
get_client
by default could only get client. and we grafana prometheus behind a boolean which is false.- let's refactor it such that grafana prometheus are behind functions so
get_client
is super trivial to follow.. at the end of the day get_client should get client and everything else is nice to have - instead of core move it to core/client; similar for test let's follow the same nested structure instead of parent level
- I love your test for
test_two_clusters_metrics_visible
. If we can add similarly more tests that'll be great too!
Description
This PR adds a get client function.
It sets up prometheus and graphana for the user and also allows customizing almost any parameter for Ray cluster.
Usage
get_client()
# Add snippet demonstrating usage
Checklist