
[Iceberg][Converter] Equality to Position Delete Converter #471

@Zyiqin-Miranda

Description


A converter that uses the Pyiceberg library and Ray cluster compute to convert Iceberg equality deletes to position deletes.
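For context, here is a minimal sketch of the core conversion idea, assuming PyArrow tables and a single identifier column for simplicity. The function name, parameters, and column names are illustrative, not the converter's actual API: each data row whose identifier value appears in an equality delete becomes a (file_path, pos) row in a position delete table.

```python
import pyarrow as pa
import pyarrow.compute as pc

def equality_to_position_deletes(
    data_table: pa.Table,  # rows of one data file, in row-ordinal order
    eq_deletes: pa.Table,  # equality delete rows applying to that file
    identifier_col: str,   # single identifier column, for simplicity
    file_path: str,        # path of the data file being converted
) -> pa.Table:
    # A data row is deleted when its identifier value appears in the
    # equality delete table; its row ordinal becomes the position delete.
    deleted = pc.is_in(
        data_table[identifier_col],
        value_set=eq_deletes[identifier_col].combine_chunks(),
    )
    positions = pc.indices_nonzero(deleted).cast(pa.int64())
    # Iceberg position delete schema: (file_path: string, pos: long).
    return pa.table({
        "file_path": pa.array([file_path] * len(positions), pa.string()),
        "pos": positions,
    })
```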

Initial PR merged

Tracking all feature-level TODOs in this issue. This is the converter's to-be-implemented feature list, kept here for future PR reference.

P0. [Done] Support multiple identifier columns via column concatenation, plus the related memory estimation change. PR
P0. [Done] Verify that position deletes written out can be read by Spark, likely via a unit test setup using the 2.0 Docker image. PR
P0. Construct equality delete tables using Spark, probably in a unit test using the 2.0 Docker image. [Updated]: Spark can't write equality deletes, so Pyiceberg is used to add equality deletes for testing.
P0. Any model changes we might need for the new 2.0 storage model, e.g., converting only certain partitions, reading a "delta", etc. [Updated]: Deprioritized to P2.
P0. [Done] Daft SHA-1 hash support (see the hashing sketch after this list).
P0. [Done] Verify correct deduplication based on identifier columns when there are multiple records in either the original data files or the equality delete files.
P0. [Done] Add test cases for two partition specs with bucket transform.
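The multi-column identifier and SHA-1 items above could look roughly like the sketch below, using PyArrow plus Python's hashlib rather than Daft's actual hash expression. The function name and the "\x1f" separator are illustrative assumptions: concatenate the identifier columns into one key per row, then hash each key to a fixed 20-byte digest so dedup keys stay small regardless of how wide the identifier columns are.

```python
import hashlib
import pyarrow as pa
import pyarrow.compute as pc

def sha1_key_column(table: pa.Table, identifier_cols: list[str]) -> pa.Array:
    # Concatenate the identifier columns into one string key per row,
    # separated by a character unlikely to appear in the data.
    parts = [pc.cast(table[c], pa.string()) for c in identifier_cols]
    joined = pc.binary_join_element_wise(*parts, "\x1f").combine_chunks()
    # Reduce each key to a fixed 20-byte SHA-1 digest.
    digests = [hashlib.sha1(k.as_py().encode("utf-8")).digest() for k in joined]
    return pa.array(digests, pa.binary(20))

t = pa.table({"id": [1, 2, 2], "region": ["us", "eu", "eu"]})
keys = sha1_key_column(t, ["id", "region"])
# Rows 1 and 2 share a digest, so deduplication keeps only one of them.
```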

P1. Handle the PyArrow limit that a chunked array cannot exceed 2 GB.
P1. [Done] Dynamic memory estimation for the file_path column, based on the length of the file path / file-prefix override (a rough sizing sketch follows this list).
P1. Currently assuming one node can fit one hash bucket; adjust the number of data files downloaded in parallel in the convert function. [Updated]: If enforce_merge_key_uniqueness is set, downloading data files in parallel won't help in this case.
P1. [Done] Investigate Pyiceberg REPLACE snapshot committing. The self-implemented REPLACE snapshot commit is currently not working as expected; "correct" here means that Spark can read the REPLACE snapshot. For now, we reuse Pyiceberg's OVERWRITE snapshot committing strategy.
P1. Investigate REPLACE snapshot committing with using_starting_sequence to avoid conflicts. Not certain we need this, so deprioritized within P1. [Updated]: No current use case needs this feature.
P1. Support merging/compacting small position delete files.
P1. Spark read performance for position deletes. Position deletes can be correctly matched to their corresponding data files by setting lower_bounds == upper_bounds == file_path, even with multiple data files; merge-on-read then avoids scanning the whole partition's position deletes into memory.
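A back-of-envelope version of that dynamic file_path estimate, assuming the Arrow string layout (one int32 offset per row plus a shared UTF-8 data buffer) and ignoring validity bitmaps and buffer padding. The function name and constants are illustrative, not the converter's actual estimator:

```python
def estimate_pos_delete_bytes(num_rows: int, file_path: str) -> int:
    # pos column: one int64 per deleted row.
    pos_bytes = 8 * num_rows
    # file_path column: the UTF-8 path repeated per row, plus a 4-byte
    # int32 offset per row in the Arrow string layout.
    path_bytes = (len(file_path.encode("utf-8")) + 4) * num_rows
    return pos_bytes + path_bytes
```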

P2. Any model changes we might need for the new 2.0 storage model, e.g., converting only certain partitions, reading a "delta", etc. (moved from P0).

Labels

P1: Resolve if not working on P0 (< 2 weeks)
V2: Related to DeltaCAT V2 native catalog support
enhancement: New feature or request
iceberg: This issue is related to Apache Iceberg catalog support
