Description
I am encountering data loading throughput issues while training a large model on Google Cloud Platform (GCP). Here's some context:
I am utilizing Vertex AI pipelines for my training process. According to GCP documentation, Vertex AI custom training jobs automatically mount GCS (Google Cloud Storage) buckets using GCSFuse. Upon debugging my training setup, I've identified that the bottleneck in data loading seems to be related to GCSFuse, leading to data starvation and subsequent drops in GPU utilization.
I've come across performance tips that discuss caching as a potential solution. However, since Vertex AI configures GCSFuse automatically, it's unclear how to enable caching.
Should I configure caching at runtime when running the training job?
When building the Docker image that contains my code to run as a custom job, should I mount the bucket manually and specify a cache directory? And if I did, wouldn't that mount be reconfigured by Vertex AI when the job is submitted?
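To make the second question concrete, here is a rough sketch of what I mean by "mounting manually with a cache directory" from the container entrypoint. The bucket name, mount point, cache directory, and cache size are placeholders, and I am assuming the gcsfuse --config-file flag plus the cache-dir / file-cache config keys described in the GCSFuse docs (gcsfuse >= 2.0); it is unclear whether this would coexist with the automatic /gcs mount that Vertex AI sets up.

```python
import pathlib
import subprocess

# Placeholder values; adjust to the actual bucket and a fast local disk.
BUCKET = "my-training-data"
MOUNT_POINT = "/mnt/gcs/my-training-data"
CACHE_DIR = "/tmp/gcsfuse-cache"  # ideally local SSD or tmpfs

# GCSFuse file-cache configuration (keys assumed from the GCSFuse docs;
# check which keys the installed gcsfuse version actually supports).
GCSFUSE_CONFIG = f"""\
cache-dir: {CACHE_DIR}
file-cache:
  max-size-mb: 50000              # cap the on-disk cache at ~50 GB
  cache-file-for-range-read: true
"""


def mount_with_cache() -> None:
    """Mount the bucket with GCSFuse using a file-cache config file."""
    pathlib.Path(MOUNT_POINT).mkdir(parents=True, exist_ok=True)
    pathlib.Path(CACHE_DIR).mkdir(parents=True, exist_ok=True)

    config_path = pathlib.Path("/tmp/gcsfuse_config.yaml")
    config_path.write_text(GCSFUSE_CONFIG)

    # Mount the bucket; requires gcsfuse to be installed in the image.
    subprocess.run(
        ["gcsfuse", f"--config-file={config_path}", BUCKET, MOUNT_POINT],
        check=True,
    )


if __name__ == "__main__":
    mount_with_cache()
```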
Additional context
I am running distributed training on a 4-node setup within Vertex AI pipelines. Each worker node is an n1-highmem-16 machine equipped with 2 GPUs.
I am using google_cloud_pipeline_components.v1.custom_job.create_custom_training_job_from_component to create the custom training job.
In my code, I'm simply replacing gs:// with /gcs/ as per the GCP documentation for Vertex AI.
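For context, this is roughly how I create the custom job and translate the paths. The display name, data directory, and accelerator type are placeholders, and the keyword arguments follow the documented signature of create_custom_training_job_from_component, though they may differ slightly between versions of google_cloud_pipeline_components.

```python
from kfp import dsl
from google_cloud_pipeline_components.v1.custom_job import (
    create_custom_training_job_from_component,
)


@dsl.component
def train(data_dir: str):
    # Vertex AI mounts gs://<bucket> at /gcs/<bucket>, so translate the URI.
    fuse_dir = data_dir.replace("gs://", "/gcs/", 1)
    print(f"Reading training data from {fuse_dir}")
    # ... actual training loop goes here ...


# Placeholder resource settings matching my 4-node, 2-GPU-per-node setup.
custom_train_op = create_custom_training_job_from_component(
    train,
    display_name="large-model-training",
    replica_count=4,                      # 4 worker nodes
    machine_type="n1-highmem-16",
    accelerator_type="NVIDIA_TESLA_T4",   # placeholder GPU type
    accelerator_count=2,                  # 2 GPUs per node
)
```

The data loading inside the component goes through the automatic GCSFuse mount, which is where the throughput bottleneck shows up.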
Type of issue
Information - request