Description
I am encountering data loading throughput issues while training a large model on Google Cloud Platform (GCP). Here's some context:
I am utilizing Vertex AI pipelines for my training process. According to GCP documentation, Vertex AI custom training jobs automatically mount GCS (Google Cloud Storage) buckets using GCSFuse. Upon debugging my training setup, I've identified that the bottleneck in data loading seems to be related to GCSFuse, leading to data starvation and subsequent drops in GPU utilization.
I've come across performance tips that discuss caching as a potential solution. However, since Vertex AI configures GCSFuse automatically, it's unclear how to enable caching.
Should I configure caching at runtime when running the training job?
When building the Docker image that contains my code to run as a custom job, should I mount the bucket manually and specify a cache directory? And if I did, wouldn't that mount be reconfigured by Vertex AI when the job is submitted?
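To make the second question concrete, here is a rough sketch of what I mean by "mounting manually with a cache directory" from the container entrypoint. The bucket name, mount point, cache directory, and cache size are placeholders, and I am assuming the gcsfuse --config-file flag plus the cache-dir / file-cache config keys described in the GCSFuse docs (gcsfuse >= 2.0); it is unclear whether this would coexist with the automatic /gcs mount that Vertex AI sets up.

```python
import pathlib
import subprocess

# Placeholder values; adjust to the actual bucket and a fast local disk.
BUCKET = "my-training-data"
MOUNT_POINT = "/mnt/gcs/my-training-data"
CACHE_DIR = "/tmp/gcsfuse-cache"  # ideally local SSD or tmpfs

# GCSFuse file-cache configuration (keys assumed from the GCSFuse docs;
# check which keys the installed gcsfuse version actually supports).
GCSFUSE_CONFIG = f"""\
cache-dir: {CACHE_DIR}
file-cache:
  max-size-mb: 50000              # cap the on-disk cache at ~50 GB
  cache-file-for-range-read: true
"""


def mount_with_cache() -> None:
    """Mount the bucket with GCSFuse using a file-cache config file."""
    pathlib.Path(MOUNT_POINT).mkdir(parents=True, exist_ok=True)
    pathlib.Path(CACHE_DIR).mkdir(parents=True, exist_ok=True)

    config_path = pathlib.Path("/tmp/gcsfuse_config.yaml")
    config_path.write_text(GCSFUSE_CONFIG)

    # Mount the bucket; requires gcsfuse to be installed in the image.
    subprocess.run(
        ["gcsfuse", f"--config-file={config_path}", BUCKET, MOUNT_POINT],
        check=True,
    )


if __name__ == "__main__":
    mount_with_cache()
```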
Additional context
I am running distributed training on a 4-node setup within Vertex AI pipelines. Each worker node is an n1-highmem-16 machine equipped with 2 GPUs.
I am using google_cloud_pipeline_components.v1.custom_job.create_custom_training_job_from_component to create the custom training job.
In my code, I'm simply replacing gs:// with /gcs/ as per the GCP documentation for Vertex AI.
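For context, this is roughly how I create the custom job and translate the paths. The display name, data directory, and accelerator type are placeholders, and the keyword arguments follow the documented signature of create_custom_training_job_from_component, though they may differ slightly between versions of google_cloud_pipeline_components.

```python
from kfp import dsl
from google_cloud_pipeline_components.v1.custom_job import (
    create_custom_training_job_from_component,
)


@dsl.component
def train(data_dir: str):
    # Vertex AI mounts gs://<bucket> at /gcs/<bucket>, so translate the URI.
    fuse_dir = data_dir.replace("gs://", "/gcs/", 1)
    print(f"Reading training data from {fuse_dir}")
    # ... actual training loop goes here ...


# Placeholder resource settings matching my 4-node, 2-GPU-per-node setup.
custom_train_op = create_custom_training_job_from_component(
    train,
    display_name="large-model-training",
    replica_count=4,                      # 4 worker nodes
    machine_type="n1-highmem-16",
    accelerator_type="NVIDIA_TESLA_T4",   # placeholder GPU type
    accelerator_count=2,                  # 2 GPUs per node
)
```

The data loading inside the component goes through the automatic GCSFuse mount, which is where the throughput bottleneck shows up.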
Type of issue
Information - request