feat: Unity Catalog writes using daft.DataFrame.write_deltalake() #3522
Conversation
CodSpeed Performance Report: merging #3522 will degrade performance by 13.01%.
Codecov Report

@@           Coverage Diff           @@
##             main    #3522    +/-  ##
=======================================
  Coverage   77.69%   77.70%
=======================================
  Files         710      710
  Lines       86941    86964      +23
=======================================
+ Hits        67552    67572      +20
- Misses      19389    19392       +3
…rage_options' to address mypy typing checks
Is this because the local table metadata still stores the old schema? We should file an issue for this.
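(Illustrative only: one way to probe the stale-schema hypothesis with the delta-rs Python API, assuming a `DeltaTable` handle opened before the schema-changing overwrite; the path is a placeholder.)

```python
from deltalake import DeltaTable

# Placeholder path -- a handle opened before another writer changed the schema.
dt = DeltaTable("s3://my-bucket/path/to/table")
print(dt.schema())  # may still show the old schema

# Pull any new commits from the Delta log, then re-read the schema.
dt.update_incremental()
print(dt.schema())  # should now reflect the overwritten schema
```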
Looks great! Just one quick change to an error message.
Currently, the reason it does not work is the safety check in place there. Without knowing too much of the history, I assume this safety check exists because delta-rs likely has (or had) some limitation.
Updating error message to be descriptive about the table name passed along. Co-authored-by: Kev Wang <[email protected]>
Thanks @kevinzwang for flagging this.
Huh, I would expect that to work because it fetches the updated schema from the table during the write. In the meantime, I'll merge this in. Thank you for working on this!
Absolutely. I will get a new issue logged for that. I will also fork a local copy and remove the safety check to verify.
This pull request covers the four workflows below, which were tested internally (on Databricks, on both Azure and AWS) after building the package in a local environment; a combined sketch follows the list:
1. `df.write_deltalake(uc_table, mode='append')`: append to an existing table in UC retrieved using `unity.load_table(table_name)`.
2. `df.write_deltalake(uc_table, mode='overwrite')`: overwrite an existing table in UC retrieved using `unity.load_table(table_name)`.
3. `df.write_deltalake(uc_table, mode='overwrite', schema_mode='overwrite')`: overwrite an existing table, with a schema change, in UC retrieved using `unity.load_table(table_name)`.
4. `unity.load_table(table_name, storage_path="<some_valid_cloud_storage_path>")` followed by `df.write_deltalake(uc_table, mode='overwrite', schema_mode='overwrite')`.
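For reference, a combined sketch of those four workflows. It assumes a `UnityCatalog` client as in Daft's docs; the endpoint, token, table names, and storage path are placeholders, not values from this PR:

```python
import daft
from daft.unity_catalog import UnityCatalog

# Placeholder endpoint and token -- substitute your workspace's values.
unity = UnityCatalog(
    endpoint="https://<workspace>.cloud.databricks.com",
    token="<personal-access-token>",
)

df = daft.from_pydict({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# 1. Append to an existing UC table.
uc_table = unity.load_table("my_catalog.my_schema.my_table")
df.write_deltalake(uc_table, mode="append")

# 2. Overwrite an existing UC table.
df.write_deltalake(uc_table, mode="overwrite")

# 3. Overwrite an existing UC table, changing its schema as well.
df2 = df.with_column("extra", daft.col("id") * 2)
df2.write_deltalake(uc_table, mode="overwrite", schema_mode="overwrite")

# 4. Load a table with an explicit storage path, then overwrite it.
uc_table2 = unity.load_table(
    "my_catalog.my_schema.other_table",
    storage_path="s3://my-bucket/path/to/table",
)
df.write_deltalake(uc_table2, mode="overwrite", schema_mode="overwrite")
```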
A few notes:

- `deltalake` (0.22.3) does not support writes to tables that have deletion vectors enabled. For appends to an existing table, to avoid `CommitFailedError: Unsupported reader features required: [DeletionVectors]`, ensure the tables being written to do not have deletion vectors enabled (see the pre-check sketch after these notes).
- The pinned dependency `httpx==0.27.2` is due to a defect in unitycatalog-python that also affects Daft in all previous versions; it is fixed as part of this PR.
- If schema updates are performed by Daft, readers will immediately see the new schema, since the Delta log is self-contained. However, for the schema to update in the Unity Catalog UI, you will need to run `REPAIR TABLE catalog.schema.table SYNC METADATA;` from Databricks compute so that UC metadata matches what is in the Delta log.
- In this version, appending to an existing table after changing its schema is not supported; only overwrites are supported.
- For AWS, I needed to set the environment variable `export AWS_S3_ALLOW_UNSAFE_RENAME=true`, as the `allow_unsafe_rename` parameter of `df.write_deltalake` did not work during internal testing. This could be a new issue to log, once confirmed.
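As a possible guard for the deletion-vector limitation above, here is a minimal pre-check sketch using the delta-rs Python API; the table URI is a placeholder, and since the feature name's casing may vary, the comparison is case-insensitive:

```python
from deltalake import DeltaTable

# Placeholder URI -- substitute the table's actual location and credentials.
dt = DeltaTable("s3://my-bucket/path/to/table")

# Inspect the table's reader features before writing with Daft.
features = dt.protocol().reader_features or []
if any(f.lower() == "deletionvectors" for f in features):
    raise RuntimeError(
        "Table has deletion vectors enabled; deltalake 0.22.3 cannot write to it."
    )
```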