Description
Apache Iceberg version
1.9.0
Query engine
Spark
Please describe the bug 🐞
I'm writing files with Spark 3.5.6 and Iceberg 1.9.0. When Spark writes data files in an ordered manner (for example, without the fanout writer), I would expect the sort_order_id field to be set for those data files in the manifests (Data File Entry Manifest Spec). I would expect this when the table declares a sort order, and honestly even when it doesn't. Currently, this field is never set when writing data files with Spark.
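One way to observe this is to read sort_order_id from the table's `files` metadata table. The snippet below is a minimal sketch, not a definitive reproduction: the helper names are hypothetical, `spark` is assumed to be an active SparkSession with an Iceberg catalog configured, and because I haven't pinned down whether an unset value surfaces as null or as 0 (the id the spec reserves for the unsorted order), both are treated as "no sort order recorded".

```python
def sort_order_id_query(table: str) -> str:
    """Build a query against the table's `files` metadata table."""
    return f"SELECT file_path, sort_order_id FROM {table}.files"


def files_without_sort_order(spark, table: str) -> list:
    """Return paths of data files with no usable sort_order_id.

    Assumption: `spark` is an active SparkSession with an Iceberg catalog.
    Both null and 0 (the unsorted order) count as "not recorded" here.
    """
    rows = spark.sql(sort_order_id_query(table)).collect()
    return [r["file_path"] for r in rows if r["sort_order_id"] in (None, 0)]
```

For a table written in sorted order, I would expect `files_without_sort_order(spark, "db.tbl")` to come back empty once this is fixed (`db.tbl` is a placeholder name).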
Per the Iceberg Table Spec on Sorting:
A data or delete file is associated with a sort order by the sort order's id within a manifest. Therefore, the table must declare all the sort orders for lookup. A table could also be configured with a default sort order id, indicating how the new data should be sorted by default. Writers should use this default sort order to sort the data on write, but are not required to if the default order is prohibitively expensive, as it would be for streaming writes.
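To make the association in that passage concrete, here is a small illustrative model. It is a sketch under simplifying assumptions, not Iceberg's actual classes: `DataFileEntry` and `resolve_sort_order` are hypothetical names, and a sort order is reduced to a tuple of column names (real sort orders also carry transforms, directions, and null ordering).

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# The spec reserves order id 0 for the unsorted order.
UNSORTED_ORDER_ID = 0


@dataclass(frozen=True)
class DataFileEntry:
    """Simplified stand-in for a data file entry in a manifest."""
    file_path: str
    sort_order_id: Optional[int] = None  # optional field; None means "not set"


def resolve_sort_order(
    entry: DataFileEntry,
    declared_orders: Dict[int, Tuple[str, ...]],
) -> Optional[Tuple[str, ...]]:
    """Look up a file's sort order among the orders the table declares.

    Returns None when the file carries no sort_order_id or the id is unknown.
    """
    if entry.sort_order_id is None:
        return None
    return declared_orders.get(entry.sort_order_id)


# Hypothetical table: the unsorted order plus one order on `event_ts`.
declared = {UNSORTED_ORDER_ID: (), 1: ("event_ts",)}
stamped = DataFileEntry("data/00000.parquet", sort_order_id=1)
unstamped = DataFileEntry("data/00001.parquet")
```

With the current Spark write path every file looks like `unstamped`, so a reader has nothing to look up even when the data was in fact written sorted.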
I realize that this is an optional field, so it's not required to be set. However, setting it could unlock performance optimizations in the future. For example, after this one I'd love to contribute a feature from an Iceberg fork that reports file ordering to Spark during scans by implementing the SupportsReportOrdering interface, which lets the query optimizer eliminate redundant sorts.
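As a rough illustration of why the field matters for that follow-up, a scan could only consider reporting an ordering when every file it reads is stamped with the same real sort order. The function below is a hypothetical sketch of that precondition, not the logic in the fork. It is also only a necessary condition: files that are each sorted do not form one sorted stream when concatenated, so a real implementation would need more (for example, one file per task or a merge).

```python
from typing import Iterable, Optional


def common_sort_order_id(sort_order_ids: Iterable[Optional[int]]) -> Optional[int]:
    """Return the sort order id shared by all files, or None.

    None is returned when there are no files, when any file has no
    sort_order_id, when the ids differ, or when the shared id is 0
    (the unsorted order).
    """
    ids = list(sort_order_ids)
    if not ids or any(i is None for i in ids):
        return None
    first = ids[0]
    if first == 0 or any(i != first for i in ids):
        return None
    return first
```

Since Spark never sets the field today, a check like this would always return None, and no redundant sort could be eliminated.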
Willingness to contribute
- I can contribute a fix for this bug independently
- I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- I cannot contribute a fix for this bug at this time