Skip to content

Conversation

@Zyiqin-Miranda
Copy link
Member

Summary

This PR adds two additional features for customized converter job options, mainly:

  1. File location override: Allow position delete produced to be written to customized location.
    Note this change does not directly override the Iceberg location provider implementation because currently DeltaCAT converter did not go through Iceberg write path. Instead, it rely on UuidBlockWritePathProvider for customized write path options.
  2. data_file_statistics_from_parquet_metadata() method override: As an workaround for injecting reserved column stats instead of breaking the current logic of getting all columns from schema.

Rationale

Explain the reasoning behind the changes and their benefits to the project.

Changes

List the major changes made in this pull request.

Impact

Discuss any potential impacts the changes may have on existing functionalities.

Testing

Describe how the changes have been tested, including both automated and manual testing strategies.
If this is a bugfix, explain how the fix has been tested to ensure the bug is resolved without introducing new issues.

Regression Risk

If this is a bugfix, assess the risk of regression caused by this fix and steps taken to mitigate it.

Checklist

  • Unit tests covering the changes have been added

    • If this is a bugfix, regression tests have been added
  • E2E testing has been performed

Additional Notes

Any additional information or context relevant to this PR.

@Zyiqin-Miranda Zyiqin-Miranda requested a review from pdames April 8, 2025 21:54
Copy link
Member

@pdames pdames left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

# Safe limit ONLY considering CPU limit, typically 32 for a 8x-large worker
DEFAULT_MAX_PARALLEL_DATA_FILE_DOWNLOAD = 30

DEFAULT_ICEBERG_NAMESPACE = "default"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about recycling the exising DEFAULT_NAMESPACE in deltacat/constansts.py and changing its current value from "DEFAULT" to "default"? (I have the same change already staged in my workspace, to standardize on existing lowercase conventions for default names)

Copy link
Member Author

@Zyiqin-Miranda Zyiqin-Miranda May 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reusing same constant sounds good, merging this PR in first and addressing in next PR #542

@Zyiqin-Miranda Zyiqin-Miranda merged commit 4a3c279 into ray-project:2.0 May 4, 2025
3 checks passed
rnapark pushed a commit to rnapark/deltacat that referenced this pull request Aug 17, 2025
…ocation; Additional code clean-up (ray-project#524)

Co-authored-by: Miranda <yiqin121@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants