Conversation

anilmenon14
Contributor

This pull request covers the following four workflows, which were tested internally (on Databricks on Azure and AWS) after building the package in a local environment (a combined sketch follows the list):

  • Load an existing table in Unity Catalog and append to it without a schema change: df.write_deltalake(uc_table, mode='append') on an existing UC table retrieved using unity.load_table(table_name)
  • Load an existing table in Unity Catalog and overwrite it without a schema change: df.write_deltalake(uc_table, mode='overwrite') on an existing UC table retrieved using unity.load_table(table_name)
  • Load an existing table in Unity Catalog and overwrite it with a schema change: df.write_deltalake(uc_table, mode='overwrite', schema_mode='overwrite') on an existing UC table retrieved using unity.load_table(table_name)
  • Create a new table in Unity Catalog using the Daft engine and populate it with data: register a new table in UC without any schema using unity.load_table(table_name, storage_path="<some_valid_cloud_storage_path>") and then df.write_deltalake(uc_table, mode='overwrite', schema_mode='overwrite')
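
A minimal sketch of the four workflows above, assuming a reachable Unity Catalog endpoint; the endpoint, token, table names, and storage path are placeholders, not values from this PR:

```python
import daft
from daft.unity_catalog import UnityCatalog

# Placeholder endpoint and token for the Unity Catalog connection.
unity = UnityCatalog(endpoint="https://<workspace-url>", token="<token>")
df = daft.from_pydict({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# 1. Append to an existing UC table (no schema change).
uc_table = unity.load_table("my_catalog.my_schema.my_table")
df.write_deltalake(uc_table, mode="append")

# 2. Overwrite an existing UC table (no schema change).
df.write_deltalake(uc_table, mode="overwrite")

# 3. Overwrite an existing UC table, changing its schema.
df.write_deltalake(uc_table, mode="overwrite", schema_mode="overwrite")

# 4. Register a new UC table without a schema, then populate it.
new_table = unity.load_table(
    "my_catalog.my_schema.new_table",
    storage_path="s3://my-bucket/path/to/table",  # placeholder path
)
df.write_deltalake(new_table, mode="overwrite", schema_mode="overwrite")
```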

A few notes:

  • deltalake (0.22.3) does not support writes to tables with deletion vectors enabled. To avoid CommitFailedError: Unsupported reader features required: [DeletionVectors] when appending to an existing table, ensure the tables being written to do not have deletion vectors enabled.

  • The httpx==0.27.2 dependency pin is due to a defect in unitycatalog-python that has affected Daft in all previous versions as well; this PR fixes it.

  • If schema updates are performed by Daft, readers will immediately see the new schema, since the Delta log is self-contained. However, for the schema to update in the Unity Catalog UI, you will need to run REPAIR TABLE catalog.schema.table SYNC METADATA; from Databricks compute so that the UC metadata matches the Delta log (see the first snippet after this list).

  • In this version, appending to an existing table after changing its schema is not supported; only overwrites are.

  • For AWS, the environment variable AWS_S3_ALLOW_UNSAFE_RENAME=true needed to be set, e.g. via export AWS_S3_ALLOW_UNSAFE_RENAME=true (see the second snippet after this list).

    • There appears to be a defect with the allow_unsafe_rename parameter of df.write_deltalake, as it did not work during internal testing. This could be a new issue to log once confirmed.
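
As a hedged illustration of the REPAIR TABLE note above, run from Databricks compute (e.g. a notebook where spark is predefined); the table name is a placeholder:

```python
# Refresh Unity Catalog metadata to match the Delta log after a
# Daft-side schema change; replace the table name with your own.
spark.sql("REPAIR TABLE my_catalog.my_schema.my_table SYNC METADATA")
```

And a sketch of setting the AWS rename workaround from Python instead of the shell, assuming it happens before any write is issued:

```python
import os

# Equivalent to `export AWS_S3_ALLOW_UNSAFE_RENAME=true`; must be set
# before df.write_deltalake is called.
os.environ["AWS_S3_ALLOW_UNSAFE_RENAME"] = "true"
```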

@anilmenon14 anilmenon14 changed the title Unity Catalog writes using daft.DataFrame.write_deltalake() feat:Unity Catalog writes using daft.DataFrame.write_deltalake() Dec 9, 2024
@anilmenon14 anilmenon14 changed the title feat:Unity Catalog writes using daft.DataFrame.write_deltalake() feat: Unity Catalog writes using daft.DataFrame.write_deltalake() Dec 9, 2024
@github-actions github-actions bot added the feat label Dec 9, 2024

codspeed-hq bot commented Dec 9, 2024

CodSpeed Performance Report

Merging #3522 will degrade performance by 13.01%

Comparing anilmenon14:uc-write-deltalake (cb4a2ee) with main (f23ee37)

Summary

⚡ 1 improvement
❌ 1 regression
✅ 15 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

| Benchmark | main | anilmenon14:uc-write-deltalake | Change |
| --- | --- | --- | --- |
| test_iter_rows_first_row[100 Small Files] | 189.2 ms | 217.5 ms | -13.01% |
| test_show[100 Small Files] | 23.8 ms | 15.5 ms | +53.55% |


codecov bot commented Dec 9, 2024

Codecov Report

Attention: Patch coverage is 16.00000% with 21 lines in your changes missing coverage. Please review.

Project coverage is 77.70%. Comparing base (f23ee37) to head (cb4a2ee).
Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| daft/unity_catalog/unity_catalog.py | 5.88% | 16 Missing ⚠️ |
| daft/dataframe/dataframe.py | 37.50% | 5 Missing ⚠️ |
Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #3522   +/-   ##
=======================================
  Coverage   77.69%   77.70%           
=======================================
  Files         710      710           
  Lines       86941    86964   +23     
=======================================
+ Hits        67552    67572   +20     
- Misses      19389    19392    +3     
| Files with missing lines | Coverage Δ |
| --- | --- |
| daft/dataframe/dataframe.py | 85.32% <37.50%> (-0.42%) ⬇️ |
| daft/unity_catalog/unity_catalog.py | 21.95% <5.88%> (+11.65%) ⬆️ |

... and 5 files with indirect coverage changes

@kevinzwang kevinzwang self-requested a review December 9, 2024 19:14
@kevinzwang
Member

> In this version, appending to an existing table after changing its schema is not supported; only overwrites are.

Is this because the local table metadata still stores the old schema? We should file an issue for this.

Member

@kevinzwang kevinzwang left a comment


Looks great! Just one quick change to an error message.

@anilmenon14
Contributor Author

anilmenon14 commented Dec 11, 2024

> In this version, appending to an existing table after changing its schema is not supported; only overwrites are.
>
> Is this because the local table metadata still stores the old schema? We should file an issue for this.

Currently, the reason it does not work is the safety check in daft.DataFrame.write_deltalake that prevents this from happening, particularly this block of code. I didn't want to change this behavior in this PR, to avoid any risk to other use cases that use this writer.

Without knowing too much of the history, I assume this safety check is in place because delta-rs likely has (or had) some limitation.
I did locate a PR in the delta-rs project that appears to mention support for what we are seeking; however, from what I see in the delta-rs writer, this relevant block of code explicitly prevents mode='append' combined with schema_mode='overwrite'. I'm keen to hear your thoughts on this. I agree that it would be a great feature for Daft to handle schema evolution on appends to Delta tables.

Updating error message to be descriptive about the table name passed along.

Co-authored-by: Kev Wang <[email protected]>
@anilmenon14
Contributor Author

> Looks great! Just one quick change to an error message.

Thanks @kevinzwang for flagging this.

@kevinzwang
Member

> Currently, the reason it does not work is the safety check in daft.DataFrame.write_deltalake that prevents this from happening, particularly this block of code. I didn't want to change this behavior in this PR, to avoid any risk to other use cases that use this writer.

Huh, I would expect that to work, because it fetches the updated schema from the table during write_deltalake. In any case, I don't see that blocking this PR, although if you don't mind, @anilmenon14, could you file an issue for that?

In the meantime, I'll merge this in. Thank you for working on this!

@kevinzwang kevinzwang merged commit dd2dc23 into Eventual-Inc:main Dec 11, 2024
39 of 41 checks passed
@anilmenon14
Contributor Author

> In any case, I don't see that blocking this PR, although if you don't mind, @anilmenon14, could you file an issue for that?

Absolutely, I will get a new issue logged for that. I will also fork a local copy, remove the safety check on the mode, and see what I encounter; I'll describe what I see in the issue.
Thanks for the review, @kevinzwang !
