Add flatten_structure option to GCSToS3Operator for flexible path control #56134

@HaJunYoo

Description

Add a flatten_structure parameter to GCSToS3Operator that removes directory structure from transferred files, uploading only the filename to the S3 destination path.

Use case/motivation

Current Behavior:
The GCSToS3Operator always preserves the full GCS object path (including the prefix) when uploading to S3, regardless of the keep_directory_structure setting.

For example:

GCSToS3Operator(
    gcs_bucket="my-bucket",
    prefix="data/2025/01/15/file.parquet",
    dest_s3_key="s3://target-bucket/processed/2025/01/15/",
)
# GCS files: "data/2025/01/15/file.parquet"
# Results in: s3://target-bucket/processed/2025/01/15/data/2025/01/15/file.parquet
#                                                     ^^^^^^^^^^^^^^^^
#                                                     Unwanted path duplication!

This causes unwanted path duplication whenever users want to reorganize the directory structure during a transfer.
Today, the only ways to avoid it are intermediate buckets or other complex workarounds.

Desired Behavior:
With flatten_structure=True, only the filename would be appended to the destination path, which eliminates the path duplication:

GCSToS3Operator(
    gcs_bucket="my-bucket",
    prefix="data/2025/01/15/file.parquet",
    dest_s3_key="s3://target-bucket/processed/2025/01/15/",
    flatten_structure=True,
)
# GCS files: "data/2025/01/15/file.parquet"
# Results in: s3://target-bucket/processed/2025/01/15/file.parquet
#                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#                                Clean, organized path!

Implementation:

import os

def _transform_file_path(self, file_path: str) -> str:
    # When flattening, keep only the filename and drop all directories.
    if self.flatten_structure:
        return os.path.basename(file_path)
    return file_path
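
To make the intended effect concrete, here is a minimal standalone sketch of how such a transform could feed into destination key construction. It is a simplified model of the behavior described in this issue, not the operator's actual code: transform_file_path and build_dest_key are hypothetical free functions, and posixpath is used instead of os.path on the assumption that GCS object names always use "/" as the separator.

```python
import posixpath


def transform_file_path(file_path: str, flatten_structure: bool) -> str:
    """Return only the filename when flattening, else the unchanged object path."""
    if flatten_structure:
        return posixpath.basename(file_path)
    return file_path


def build_dest_key(dest_prefix: str, file_path: str, flatten_structure: bool = False) -> str:
    """Join a destination prefix (hypothetical helper) with the transformed path."""
    return posixpath.join(dest_prefix, transform_file_path(file_path, flatten_structure))


gcs_object = "data/2025/01/15/file.parquet"
dest_prefix = "processed/2025/01/15/"

# Without flattening, the full GCS object path is appended to the prefix.
print(build_dest_key(dest_prefix, gcs_object))
# With flattening, only the filename is appended.
print(build_dest_key(dest_prefix, gcs_object, flatten_structure=True))
```

The first call reproduces the duplicated path from the "Current Behavior" example, and the second yields the path shown under "Desired Behavior".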

This feature enables:

  • Flexible path reorganization during cross-cloud transfers
  • Cleaner S3 directory structures without GCS-specific paths
  • Simplified integration with legacy systems expecting flat structures
  • Elimination of post-processing scripts
  • Reduced storage complexity and improved performance in S3 LIST operations
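
One consequence to keep in mind when a prefix matches several objects: flattening can map objects from different directories to the same S3 key. The sketch below is a hypothetical pre-flight check, not something the operator provides; find_flatten_collisions is an illustrative name, and it assumes object names use "/" as the separator.

```python
import posixpath
from collections import defaultdict


def find_flatten_collisions(object_names: list[str]) -> dict[str, list[str]]:
    """Map each filename shared by two or more objects to those objects."""
    by_name: dict[str, list[str]] = defaultdict(list)
    for name in object_names:
        by_name[posixpath.basename(name)].append(name)
    # Keep only filenames that more than one object would flatten to.
    return {base: paths for base, paths in by_name.items() if len(paths) > 1}


objects = [
    "data/2025/01/15/file.parquet",
    "data/2025/01/16/file.parquet",
    "data/2025/01/16/other.parquet",
]
print(find_flatten_collisions(objects))
```

Here the two file.parquet objects would land on the same key, so a caller might choose to fail early or fall back to preserving the directory structure.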

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct
