
GCSFuse performance on Vertex AI custom training job #1830

@miguelalba96

Description


I am encountering data loading throughput issues while training a large model on Google Cloud Platform (GCP). Here's some context:

I am utilizing Vertex AI pipelines for my training process. According to GCP documentation, Vertex AI custom training jobs automatically mount GCS (Google Cloud Storage) buckets using GCSFuse. Upon debugging my training setup, I've identified that the bottleneck in data loading seems to be related to GCSFuse, leading to data starvation and subsequent drops in GPU utilization.

I've come across performance tips that discuss caching as a potential solution. However, since Vertex AI configures GCSFuse automatically, it's unclear how to enable caching.

Should I configure caching at runtime when running the training job?
When building the Docker image that contains my training code for the custom job, should I instead mount the bucket manually and specify a cache directory? Won't that be overridden by Vertex AI when the job is submitted?
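
Concretely, this is what I have in mind for the second option, as a rough sketch run at the start of my container entrypoint. The bucket name, mount point, and cache directory are placeholders, and the file-cache flags assume a gcsfuse release that ships the file cache feature (the binary would also need to be installed in my image):

```python
import os
import subprocess

BUCKET = "my-training-bucket"       # placeholder bucket name
MOUNT_POINT = "/mnt/gcs-cached"     # separate from the /gcs/ mount Vertex AI creates
CACHE_DIR = "/tmp/gcsfuse-cache"    # local disk used for the file cache

os.makedirs(MOUNT_POINT, exist_ok=True)
os.makedirs(CACHE_DIR, exist_ok=True)

# Mount the bucket myself with caching enabled, instead of relying only on the
# automatic /gcs/ mount. Flag names may differ depending on the gcsfuse version.
subprocess.run(
    [
        "gcsfuse",
        "--implicit-dirs",
        f"--cache-dir={CACHE_DIR}",
        "--file-cache-max-size-mb=-1",  # -1 lets the cache grow until the disk is full
        BUCKET,
        MOUNT_POINT,
    ],
    check=True,
)

# The training code would then read from MOUNT_POINT instead of /gcs/<bucket>.
```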

Additional context

I am running distributed training on a 4-node setup within Vertex AI pipelines. Each worker node is an n1-highmem-16 machine equipped with 2 GPUs.

I am using google_cloud_pipeline_components.v1.custom_job.create_custom_training_job_from_component to create the custom training job.
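
For reference, the job is created roughly like this (a trimmed sketch; the component name, display name, and accelerator type are placeholders, and the argument names are from my pipeline code, so they may not match the latest version of the library exactly):

```python
from google_cloud_pipeline_components.v1.custom_job import (
    create_custom_training_job_from_component,
)

# `train_op` is my training component (placeholder name).
training_job_op = create_custom_training_job_from_component(
    train_op,
    display_name="distributed-training",    # placeholder
    replica_count=4,                        # 4 worker nodes
    machine_type="n1-highmem-16",
    accelerator_type="NVIDIA_TESLA_V100",   # placeholder; 2 GPUs per node
    accelerator_count=2,
)
```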

In my code, I'm simply replacing gs:// with /gcs/ as per the GCP documentation for Vertex AI.
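
i.e. something along these lines (illustrative only; the bucket and object names are made up):

```python
def to_fuse_path(gcs_uri: str) -> str:
    """Map a gs:// URI to the path where Vertex AI mounts the bucket via GCSFuse."""
    assert gcs_uri.startswith("gs://")
    return gcs_uri.replace("gs://", "/gcs/", 1)

# gs://my-training-bucket/shards/train-00000.tar -> /gcs/my-training-bucket/shards/train-00000.tar
print(to_fuse_path("gs://my-training-bucket/shards/train-00000.tar"))
```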

Type of issue

Information - request

Metadata

Labels

p2, question (Customer Issue: question about how to use tool)
