How about introducing a lightweight filter strategy based on the column used in Dataset#repartition? #88

@Aaaaaaron

Suppose the data is built like this: df.repartition(20, col("id")).write.parquet(path)

When filtering like this: filter(col("id") === 123), we can prune 19 of the 20 repartitioned files without any overhead.

It's very simple to implement: we don't need to create an index. We just call the same hash function that Dataset#repartition uses and pick out the matching file in listFilesWithIndexSupport.
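To make the idea concrete, here is a minimal sketch, not the proposed implementation. It assumes that hash repartitioning assigns each row to pmod(hash(key), numPartitions), which can be evaluated through the public hash and pmod functions, and that output files are named part-<5-digit partition id>-... . The names RepartitionPruningSketch, partitionIdOf, targetPartition and pruneFiles are hypothetical, and nothing here is wired into listFilesWithIndexSupport.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{hash, lit, pmod}

// Hypothetical helper object, for illustration only.
object RepartitionPruningSketch {
  // Assumption: Spark writes files as "part-<5-digit partition id>-<suffix>".
  private val PartFile = """^part-(\d{5})-.*""".r

  // Extract the partition id from a file name, if it follows that pattern.
  def partitionIdOf(fileName: String): Option[Int] = fileName match {
    case PartFile(id) => Some(id.toInt)
    case _            => None
  }

  // Evaluate pmod(hash(value), numPartitions) with Spark's own functions.
  // The literal must have the same type as the repartition column,
  // because e.g. the Int 123 and the Long 123L hash differently.
  def targetPartition(spark: SparkSession, value: Any, numPartitions: Int): Int =
    spark.range(1)
      .select(pmod(hash(lit(value)), lit(numPartitions)))
      .head()
      .getInt(0)

  // Keep the file for the target partition; files whose names cannot be
  // parsed are kept as well, so pruning stays conservative.
  def pruneFiles(fileNames: Seq[String], target: Int): Seq[String] =
    fileNames.filter(name => partitionIdOf(name).forall(_ == target))
}
```

For the example above, usage would look like pruneFiles(names, targetPartition(spark, 123, 20)). Two caveats worth checking before relying on this: an empty partition writes no file, so the result can be empty, and the approach only holds if the files were not rewritten or compacted after the repartitioned write.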

I'm almost done with it, but I have a small concern about the entry point that enables it. (Currently an index is created whenever none is found, so there seems to be no clean way to inject this, other than perhaps implementing a new MetastoreSupport.)

If you are OK with this feature, I can open a PR first. Looking forward to your advice.
