Description
Apache Iceberg version
1.9.2 (latest release)
Query engine
Spark
Please describe the bug 🐞
🧭 Problem Summary
My environment:
- iceberg: 1.7.x to 1.9.x
- spark: 3.5.3
- jdk: openjdk version "17.0.2" 2022-01-18
- linux: 5.12.5-1.el7.elrepo.x86_64
When querying an Iceberg table partitioned by (year, month) with a Parquet bloom filter enabled on a STRING column (resource_id), the query returns 0 rows on Linux (Iceberg 1.9.x + Spark 3.5.3), but returns the correct rows on Windows or after downgrading to Iceberg 1.7.x.
The query results are therefore incorrect, and the behavior is platform-dependent.
📦 Table DDL
CREATE TABLE IF NOT EXISTS iceberg_catalog.test.xxh (
    date_time TIMESTAMP,
    operate_type INT,
    resource_id STRING,
    year INT,
    month INT,
    day INT
)
USING iceberg
PARTITIONED BY (year, month)
TBLPROPERTIES (
    'write.distribution-mode' = 'hash',
    'write.metadata.delete-after-commit.enabled' = 'true',
    'write.metadata.previous-versions-max' = '2',
    'write.parquet.bloom-filter-enabled.column.resource_id' = 'true',
    'write.parquet.compression-codec' = 'zstd',
    'write.target-file-size-bytes' = '4294967296'
);
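For illustration, a lookup of roughly the following shape is the kind of query that exercises the bloom filter on resource_id. This is only a sketch and not the exact query from this report: the partition values and the resource_id literal are placeholders.

```sql
-- Hypothetical lookup; all literal values below are placeholders.
-- An equality predicate on resource_id is the kind of filter that
-- can be checked against the Parquet bloom filter on that column.
SELECT date_time, operate_type, resource_id
FROM iceberg_catalog.test.xxh
WHERE year = 2025
  AND month = 1
  AND resource_id = 'example-resource-id';
```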
On Linux, Iceberg v1.7.2 returns the correct result, but v1.9.2 does not.
On Windows, Iceberg v1.9.2 returns the correct result.
Spark versions later than 3.5.3 with Iceberg 1.7.1 produce a different error.
When I use this code, the query works well. But I know how to set vectorization-enabled in SQL.
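The snippet referred to here is not preserved in this text. Assuming it disabled Iceberg's vectorized Parquet reads (for example through the `vectorization-enabled` read option), a rough SQL counterpart might look like the sketch below. The table name comes from the DDL above; whether either setting actually avoids the 0-row result has not been verified here.

```sql
-- Assumption: the workaround is to turn off vectorized Parquet reads.

-- Option 1: per table, through the Iceberg table property.
ALTER TABLE iceberg_catalog.test.xxh
SET TBLPROPERTIES ('read.parquet.vectorization.enabled' = 'false');

-- Option 2: per session, through the Iceberg Spark SQL configuration.
SET spark.sql.iceberg.vectorization.enabled = false;
```

The table property changes the default for every reader of the table, while the session setting only affects the current Spark session, which makes it the less invasive choice for a quick check.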
This is my data file.
Willingness to contribute
- I can contribute a fix for this bug independently
- I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- I cannot contribute a fix for this bug at this time