How about introducing a lightweight filter strategy based on the column used in `Dataset#repartition`?

Suppose the data is built like this: `df.repartition(20, col("id")).write.parquet(path)`

When filtering like this: `filter(col("id") === 123)`, we can prune 19 of the 20 repartitioned files without any overhead.

It's very simple to implement: we don't need to create an index. We just call the same hash function that `Dataset#repartition` uses and pick the matching file in `listFilesWithIndexSupport`.
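To make the idea concrete, here is a minimal sketch of how the target partition could be computed. It assumes that `repartition(n, col)` assigns each row to partition `pmod(hash(col), n)`, where `hash` is the Murmur3-based function in `org.apache.spark.sql.functions`, and it checks that assumption against `spark_partition_id()`. The object and method names (`RepartitionPruningSketch`, `expectedPartitionId`, `partitionIdFor`, `mismatches`) are hypothetical and only for illustration; this is not the proposed implementation.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, hash, lit, pmod, spark_partition_id}

// Hypothetical helper object, for illustration only.
object RepartitionPruningSketch {

  // Assumption: repartition(n, col(c)) places a row in partition pmod(hash(c), n).
  def expectedPartitionId(column: String, numPartitions: Int): Column =
    pmod(hash(col(column)), lit(numPartitions))

  // Partition id predicted for a single bigint literal, e.g. the value in
  // filter(col("id") === 123). The literal must have the column's data type,
  // because the hash of an int differs from the hash of a long.
  def partitionIdFor(spark: SparkSession, value: Long, numPartitions: Int): Int =
    spark.range(1)
      .select(pmod(hash(lit(value)), lit(numPartitions)))
      .first()
      .getInt(0)

  // Counts rows whose actual partition after repartition differs from the
  // predicted one; 0 means the assumption held for this sample.
  def mismatches(spark: SparkSession, numPartitions: Int): Long =
    spark.range(0, 1000).toDF("id")
      .repartition(numPartitions, col("id"))
      .withColumn("actual", spark_partition_id())
      .withColumn("expected", expectedPartitionId("id", numPartitions))
      .filter(col("actual") =!= col("expected"))
      .count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("repartition-pruning-sketch")
      .getOrCreate()
    println(s"mismatching rows: ${mismatches(spark, 20)}")
    println(s"predicted partition for id = 123: ${partitionIdFor(spark, 123L, 20)}")
    spark.stop()
  }
}
```

If the written file names carry the partition id (the `part-NNNNN` prefix), the predicted id should be enough to select the single file to read. Two caveats would need handling: a partition that receives no rows may not produce a file at all, and the pruning is only valid for files that were actually written with the same column and partition count.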

I'm almost done with it, but I have a small concern about the entry point that enables this. (Currently we create the index when we find that none exists, so there seems to be no clean way to inject this, or we would need to implement a new MetastoreSupport.)

If you are OK with this feature, I can open a PR first. Looking forward to your advice.

How about introduce a light filter strategy that based on the col used in Dataset#repartition. #88