
[Iceberg][Converter] Equality to Position Delete Converter #471

@Zyiqin-Miranda

Description


A converter that uses the Pyiceberg library and Ray cluster compute to convert Iceberg equality deletes to position deletes.
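For context, here is a minimal sketch of the core conversion idea, assuming PyArrow tables and a single identifier column for simplicity. The function name, parameters, and column names are illustrative, not the converter's actual API: each data row whose identifier value appears in an equality delete becomes a (file_path, pos) row in a position delete table.

```python
import pyarrow as pa
import pyarrow.compute as pc

def equality_to_position_deletes(
    data_table: pa.Table,  # rows of one data file, in row-ordinal order
    eq_deletes: pa.Table,  # equality delete rows applying to that file
    identifier_col: str,   # single identifier column, for simplicity
    file_path: str,        # path of the data file being converted
) -> pa.Table:
    # A data row is deleted when its identifier value appears in the
    # equality delete table; its row ordinal becomes the position delete.
    deleted = pc.is_in(
        data_table[identifier_col],
        value_set=eq_deletes[identifier_col].combine_chunks(),
    )
    positions = pc.indices_nonzero(deleted).cast(pa.int64())
    # Iceberg position delete schema: (file_path: string, pos: long).
    return pa.table({
        "file_path": pa.array([file_path] * len(positions), pa.string()),
        "pos": positions,
    })
```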

Initial PR merged

Tracking all feature-level TODOs in this issue. This is the converter's to-be-implemented feature list, kept here for future PR reference.

P0. [Done] Support multiple identifier columns via column concatenation, plus the related memory estimation change. PR
P0. [Done] Verify that position deletes written out can be read by Spark, likely via a unit test setup using the 2.0 Docker image. PR
P0. Construct equality delete tables using Spark, probably in a unit test using the 2.0 Docker image. [Updated]: Spark can't write equality deletes, so Pyiceberg is used to add equality deletes for testing.
P0. Any model changes we might need for the new 2.0 storage model, e.g., converting only certain partitions, reading a "delta", etc. [Updated]: Deprioritized to P2.
P0. [Done] Daft SHA-1 hash support (see the hashing sketch after this list).
P0. [Done] Verify correct deduplication based on identifier columns when there are multiple records in either the original data files or the equality delete files.
P0. [Done] Add test cases for two partition specs with bucket transform.
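The multi-column identifier and SHA-1 items above could look roughly like the sketch below, using PyArrow plus Python's hashlib rather than Daft's actual hash expression. The function name and the "\x1f" separator are illustrative assumptions: concatenate the identifier columns into one key per row, then hash each key to a fixed 20-byte digest so dedup keys stay small regardless of how wide the identifier columns are.

```python
import hashlib
import pyarrow as pa
import pyarrow.compute as pc

def sha1_key_column(table: pa.Table, identifier_cols: list[str]) -> pa.Array:
    # Concatenate the identifier columns into one string key per row,
    # separated by a character unlikely to appear in the data.
    parts = [pc.cast(table[c], pa.string()) for c in identifier_cols]
    joined = pc.binary_join_element_wise(*parts, "\x1f").combine_chunks()
    # Reduce each key to a fixed 20-byte SHA-1 digest.
    digests = [hashlib.sha1(k.as_py().encode("utf-8")).digest() for k in joined]
    return pa.array(digests, pa.binary(20))

t = pa.table({"id": [1, 2, 2], "region": ["us", "eu", "eu"]})
keys = sha1_key_column(t, ["id", "region"])
# Rows 1 and 2 share a digest, so deduplication keeps only one of them.
```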

P1. Handle the PyArrow limit that a chunked array cannot exceed 2 GB.
P1. [Done] Dynamic memory estimation for the file_path column, based on the length of the file path / file-prefix override (a rough sizing sketch follows this list).
P1. Currently assuming one node can fit one hash bucket; adjust the number of data files downloaded in parallel in the convert function. [Updated]: If enforce_merge_key_uniqueness is set, downloading data files in parallel won't help in this case.
P1. [Done] Investigate Pyiceberg REPLACE snapshot committing. The self-implemented REPLACE snapshot commit is currently not working as expected; "correct" here means that Spark can read the REPLACE snapshot. For now, we reuse Pyiceberg's OVERWRITE snapshot committing strategy.
P1. Investigate REPLACE snapshot committing with using_starting_sequence to avoid conflicts. Not certain we need this, so deprioritized within P1. [Updated]: No current use case needs this feature.
P1. Support merging/compacting small position delete files.
P1. Spark read performance for position deletes. Position deletes can be correctly matched to their corresponding data files by setting lower_bounds == upper_bounds == file_path, even with multiple data files; merge-on-read then avoids scanning the whole partition's position deletes into memory.
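A back-of-envelope version of that dynamic file_path estimate, assuming the Arrow string layout (one int32 offset per row plus a shared UTF-8 data buffer) and ignoring validity bitmaps and buffer padding. The function name and constants are illustrative, not the converter's actual estimator:

```python
def estimate_pos_delete_bytes(num_rows: int, file_path: str) -> int:
    # pos column: one int64 per deleted row.
    pos_bytes = 8 * num_rows
    # file_path column: the UTF-8 path repeated per row, plus a 4-byte
    # int32 offset per row in the Arrow string layout.
    path_bytes = (len(file_path.encode("utf-8")) + 4) * num_rows
    return pos_bytes + path_bytes
```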

P2. Any model changes we might need for the new 2.0 storage model, e.g., converting only certain partitions, reading a "delta", etc. (moved from P0).

Labels

P1: Resolve if not working on P0 (< 2 weeks)
V2: Related to DeltaCAT V2 native catalog support
enhancement: New feature or request
iceberg: This issue is related to Apache Iceberg catalog support
