Spark writers should set the sort_order_id data file entry in manifests to the write ordering #13634

@jbewing

Apache Iceberg version

1.9.0

Query engine

Spark

Please describe the bug 🐞

I'm writing files with Spark 3.5.6 and Iceberg 1.9.0. When Spark writes data files in an ordered manner (for example, when it is not using the fanout writer), I would expect the sort_order_id field to be set for the written data files in the manifests (Data File Entry Manifest Spec). I would expect this when the table declares a sort order, and honestly even when it doesn't. Currently, this field is never set when writing data files with Spark.

Per the Iceberg Table Spec on Sorting:

A data or delete file is associated with a sort order by the sort order's id within a manifest. Therefore, the table must declare all the sort orders for lookup. A table could also be configured with a default sort order id, indicating how the new data should be sorted by default. Writers should use this default sort order to sort the data on write, but are not required to if the default order is prohibitively expensive, as it would be for streaming writes.
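To make the expectation concrete, here is a minimal sketch of the decision rule I have in mind, written as plain Python rather than actual Iceberg or Spark code. `WriteContext` and `expected_sort_order_id` are hypothetical names used only for illustration. The assumption is that a writer records the table's default sort order id only when the rows really were written in that order, and otherwise leaves the field unset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WriteContext:
    """Hypothetical description of one write; not an Iceberg or Spark class."""
    default_sort_order_id: int          # the table's default sort order id
    rows_sorted_by_default_order: bool  # False for e.g. a fanout writer


def expected_sort_order_id(ctx: WriteContext) -> Optional[int]:
    """Return the id a data file entry would carry, or None to leave it unset."""
    if ctx.rows_sorted_by_default_order:
        # The file content follows the table's default order, so record that id.
        return ctx.default_sort_order_id
    # Ordering is not guaranteed; leave the optional field unset.
    return None
```

With this rule, an ordered write to a table whose default sort order id is 1 would produce data file entries with sort_order_id 1, while a fanout write to the same table would leave the field unset, which is what Spark does today in both cases.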

I realize that this field is optional, so it is not required to be set. However, setting it could unlock performance optimizations in the future. For example, I have a feature in an Iceberg fork that I'd love to contribute after this one: it reports file ordering to Spark during scans by implementing the SupportsReportOrdering interface, which lets the query optimizer eliminate redundant sorts.

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time


Labels: bug
