[feature request] Support reading equality delete files #1210

Open
kevinjqliu opened this issue Sep 26, 2024 · 4 comments
@kevinjqliu
Contributor

kevinjqliu commented Sep 26, 2024

Feature Request / Improvement

Only positional deletes are supported right now:

positional_delete_entries = SortedList(key=lambda entry: entry.sequence_number or INITIAL_SEQUENCE_NUMBER)

Let's also add support for reading equality deletes.

Position delete PR for reference: apache/iceberg#6775
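
As a rough illustration of where equality deletes could sit next to the existing positional delete handling, here is a minimal sketch that buckets manifest entries by content type and orders each bucket by sequence number. `Entry`, `Content`, and `split_entries` are hypothetical stand-ins, not PyIceberg classes. The content codes (0, 1, 2) follow the Iceberg spec, and `INITIAL_SEQUENCE_NUMBER = 0` is an assumption made for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, List, Optional

# Assumption: stands in for the constant referenced in the snippet above.
INITIAL_SEQUENCE_NUMBER = 0


class Content(Enum):
    # Content codes as defined for data files in the Iceberg spec.
    DATA = 0
    POSITION_DELETES = 1
    EQUALITY_DELETES = 2


@dataclass(frozen=True)
class Entry:
    # Hypothetical stand-in for a manifest entry.
    path: str
    content: Content
    sequence_number: Optional[int]


def _sequence_key(entry: Entry) -> int:
    # Same key as the SortedList above: a missing sequence number sorts first.
    return entry.sequence_number or INITIAL_SEQUENCE_NUMBER


def split_entries(entries: Iterable[Entry]) -> Dict[Content, List[Entry]]:
    """Group entries by content type, each group ordered by sequence number."""
    buckets: Dict[Content, List[Entry]] = {content: [] for content in Content}
    for entry in entries:
        buckets[entry.content].append(entry)
    for bucket in buckets.values():
        bucket.sort(key=_sequence_key)
    return buckets
```

Keeping equality deletes in their own ordered bucket means a planner can later pick, for each data file, only the delete files whose sequence number is high enough to apply.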

@kevinjqliu kevinjqliu assigned kevinjqliu and unassigned kevinjqliu Sep 26, 2024
@Zyiqin-Miranda

Thanks @kevinjqliu, I can work on this issue

@sungwy
Collaborator

sungwy commented Sep 27, 2024

This will be a fantastic addition to PyIceberg! Thank you for raising this issue @kevinjqliu and @Zyiqin-Miranda 🎉

@Zyiqin-Miranda

Thanks @kevinjqliu and @sungwy. I'm starting to add equality delete support to the current plan_files function. I'm not sure whether the current _InclusiveMetricsEvaluator can be used directly to determine whether an equality delete file is relevant to a data file.
It seems Iceberg Java uses canContainEqDeletesForFile instead.
My understanding is that position deletes can use lower_bound == upper_bound on the file_path column to quickly filter out irrelevant files, but equality deletes don't have this advantage, so an equality delete can be relevant to any data file within the same partition. Thanks in advance for any insights!
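
To make the file_path shortcut concrete, here is a minimal sketch of the bounds check for position deletes. `PositionDeleteBounds` and `may_reference` are hypothetical names for this example, and the bounds are assumed to be already decoded to strings. When lower == upper the delete file targets exactly one data file; the range check below covers that case and stays conservative otherwise.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PositionDeleteBounds:
    # Hypothetical: decoded lower/upper bounds of the file_path column
    # recorded for one position delete file.
    lower: Optional[str]
    upper: Optional[str]


def may_reference(bounds: PositionDeleteBounds, data_file_path: str) -> bool:
    """Return False only when the delete file cannot mention data_file_path."""
    if bounds.lower is None or bounds.upper is None:
        # Without column stats we cannot rule the delete file out.
        return True
    return bounds.lower <= data_file_path <= bounds.upper
```

An equality delete file has no file_path column, so no comparable check exists for it; any pruning would have to rely on partition, sequence number, or metrics on the equality columns.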

@kevinjqliu
Contributor Author

The Equality Delete Files and Scan Planning sections of the Iceberg spec are good docs for this.

My general understanding is that equality deletes written with an unpartitioned spec are applied to all data files (across all partitions, if partitioned); otherwise they apply only to data files in the same partition.

Position delete files must be applied to data files from the same commit, when the data and delete file data sequence numbers are equal. This allows deleting rows that were added in the same commit.
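
The sequence number rules can be sketched as follows, based on my reading of the Scan Planning section of the spec. `FileInfo` and both functions are hypothetical names for this example. The position delete check models only the sequence number comparison; its partition and path checks are left out.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FileInfo:
    # Hypothetical summary of a data file or delete file.
    sequence_number: int
    spec_id: int
    partition: Tuple[str, ...]
    unpartitioned_spec: bool = False


def position_delete_sequence_ok(data: FileInfo, delete: FileInfo) -> bool:
    # Less than or equal: a position delete may target rows from the same commit.
    return data.sequence_number <= delete.sequence_number


def equality_delete_applies(data: FileInfo, delete: FileInfo) -> bool:
    # Strictly less than: an equality delete never applies to the same commit.
    if data.sequence_number >= delete.sequence_number:
        return False
    if delete.unpartitioned_spec:
        # Written with an unpartitioned spec: treated as a global delete.
        return True
    return data.spec_id == delete.spec_id and data.partition == delete.partition
```

For example, a data file and a delete file that both carry sequence number 5 pass the position delete check but fail the equality delete check.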
