@Zyiqin-Miranda Zyiqin-Miranda commented May 2, 2025

Summary

This PR makes two changes:

  1. Adds support for concatenating the hash column values of multiple identifier columns, so that value equality can be evaluated across all of them when enforcing identifier column uniqueness.
  2. Refactors the parquet file to Iceberg data file conversion to run as Ray distributed tasks. Testing at scale showed that the PyIceberg function parquet_file_to_iceberg_data_file calls parquet.download_metadata and filesystem._get_file_info() under the hood, and that these calls are performance bottlenecks, so the conversion functions now run as Ray distributed tasks.
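As a rough illustration of the fan-out pattern behind the second item, the sketch below runs a per-file lookup concurrently instead of one file at a time. It is not the code in this PR: it uses the standard library's `concurrent.futures` as a stand-in for Ray so that it runs without a cluster, and `describe_file` and `describe_files_in_parallel` are hypothetical names. With Ray, the per-file function would presumably be decorated with `@ray.remote` and the results collected with `ray.get`.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def describe_file(path: str) -> Dict[str, object]:
    # Hypothetical stand-in for the per-file work (metadata download,
    # file info lookup) that dominates the conversion at scale.
    return {"path": path, "size_bytes": os.path.getsize(path)}


def describe_files_in_parallel(
    paths: List[str], max_workers: int = 8
) -> List[Dict[str, object]]:
    # Each file is handled by an independent task; results come back in
    # the same order as the input paths.
    if not paths:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(describe_file, paths))
```

Because each file's lookup is independent of the others, the work parallelizes without coordination between tasks, which is what makes it a natural fit for distributed tasks.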

Rationale

Explain the reasoning behind the changes and their benefits to the project.

Changes

  1. Adds logic to concatenate hash column values across multiple identifier columns.
  2. Adds unit tests for the multiple identifier column case.
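A minimal sketch of the concatenation idea, assuming rows are plain dictionaries: the values of all identifier columns are joined into one string, hashed, and the hash is used to decide whether two rows share the same identifier. The function names, the SHA-1 choice, and the length-prefixed join are illustrative assumptions, not the converter's actual implementation.

```python
import hashlib
from typing import Dict, List, Sequence


def identifier_key(row: Dict[str, object], identifier_columns: Sequence[str]) -> str:
    # Length-prefix each value so that ("ab", "c") and ("a", "bc") do not
    # collapse into the same concatenated string (an illustrative choice).
    parts = []
    for col in identifier_columns:
        value = str(row[col])
        parts.append(f"{len(value)}:{value}")
    return hashlib.sha1("".join(parts).encode("utf-8")).hexdigest()


def find_superseded_positions(
    rows: List[Dict[str, object]], identifier_columns: Sequence[str]
) -> List[int]:
    # A row is superseded when a later row carries the same combined
    # identifier; only the last occurrence of each identifier survives.
    last_seen: Dict[str, int] = {}
    superseded: List[int] = []
    for position, row in enumerate(rows):
        key = identifier_key(row, identifier_columns)
        if key in last_seen:
            superseded.append(last_seen[key])
        last_seen[key] = position
    return sorted(superseded)
```

For example, with identifier columns `["id", "region"]`, two rows that share `id` but differ in `region` are treated as distinct, while a later row matching both columns supersedes the earlier one.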

Impact

Discuss any potential impacts the changes may have on existing functionalities.

Testing

Describe how the changes have been tested, including both automated and manual testing strategies.
If this is a bugfix, explain how the fix has been tested to ensure the bug is resolved without introducing new issues.

Regression Risk

If this is a bugfix, assess the risk of regression caused by this fix and steps taken to mitigate it.

Checklist

  • Unit tests covering the changes have been added

    • If this is a bugfix, regression tests have been added
  • E2E testing has been performed

Additional Notes

Any additional information or context relevant to this PR.

@Zyiqin-Miranda Zyiqin-Miranda force-pushed the multiple-pk-support branch from 8f1760e to f4ad467 Compare May 5, 2025 00:28
@Zyiqin-Miranda Zyiqin-Miranda changed the title from "[Converter] Multiple pk support" to "[Converter] Multiple pk support; Refactor parquet file to Iceberg data file conversion functions to be Ray distributed tasks." May 5, 2025
@Zyiqin-Miranda Zyiqin-Miranda force-pushed the multiple-pk-support branch from f4ad467 to 5c994a6 Compare May 5, 2025 05:22
@pdames pdames left a comment

LGTM!

@Zyiqin-Miranda Zyiqin-Miranda merged commit d9cca4f into ray-project:2.0 May 9, 2025
3 checks passed
@Zyiqin-Miranda Zyiqin-Miranda deleted the multiple-pk-support branch May 9, 2025 23:23
rnapark pushed a commit to rnapark/deltacat that referenced this pull request Aug 17, 2025
3 participants