feat: Unity Catalog writes using daft.DataFrame.write_deltalake() (#3522)
This pull request covers the following four workflows, which were tested
internally (on Databricks on Azure and AWS) after building the package
in a local environment:
- Load an existing table in Unity Catalog and append to it without a
schema change: `df.write_deltalake(uc_table, mode='append')`, where
`uc_table` is retrieved using `unity.load_table(table_name)`.
- Load an existing table in Unity Catalog and overwrite it without a
schema change: `df.write_deltalake(uc_table, mode='overwrite')`, where
`uc_table` is retrieved using `unity.load_table(table_name)`.
- Load an existing table in Unity Catalog and overwrite it with a schema
change: `df.write_deltalake(uc_table, mode='overwrite',
schema_mode='overwrite')`, where `uc_table` is retrieved using
`unity.load_table(table_name)`.
- Create a new table in Unity Catalog using the Daft engine and populate
it with data: register a new table in UC without any schema using
`unity.load_table(table_name,
new_table_storage_path="<some_valid_cloud_storage_path>")`, then write
with `df.write_deltalake(uc_table, mode='overwrite',
schema_mode='overwrite')`.
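The four workflows above can be sketched as small helper functions. This is a minimal sketch, not code from this PR: the helper names are hypothetical, `unity` is assumed to be a Unity Catalog client exposing `load_table`, `df` is assumed to be a `daft.DataFrame`, and the `new_table_storage_path` keyword follows the `load_table` docstring in this commit.

```python
# Hypothetical helpers illustrating the four tested workflows.
# Assumptions: `unity` exposes load_table(...) as described in this commit,
# and `df` is a daft.DataFrame with write_deltalake(...).


def append_rows(unity, df, table_name):
    """Workflow 1: append to an existing UC table, schema unchanged."""
    uc_table = unity.load_table(table_name)
    return df.write_deltalake(uc_table, mode="append")


def overwrite_rows(unity, df, table_name, replace_schema=False):
    """Workflows 2 and 3: overwrite an existing UC table.

    With replace_schema=True the table schema is replaced as well.
    """
    uc_table = unity.load_table(table_name)
    if replace_schema:
        return df.write_deltalake(
            uc_table, mode="overwrite", schema_mode="overwrite"
        )
    return df.write_deltalake(uc_table, mode="overwrite")


def create_and_populate(unity, df, table_name, storage_path):
    """Workflow 4: register a new external table, then write data to it."""
    uc_table = unity.load_table(
        table_name, new_table_storage_path=storage_path
    )
    return df.write_deltalake(
        uc_table, mode="overwrite", schema_mode="overwrite"
    )
```

Because `write_deltalake` is blocking, each helper returns only after the write has executed.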
A few notes:
- `deltalake` (0.22.3) does not support writes to tables with deletion
vectors enabled. For appends to an existing table, to avoid
`CommitFailedError: Unsupported reader features required:
[DeletionVectors]`, ensure the table being written to does not have
deletion vectors enabled.
- The `httpx==0.27.2` pin is due to a defect in unitycatalog-python that
also affects all previous versions of Daft. This PR fixes it.
- If Daft performs a schema update, readers immediately see the new
schema because the Delta log is self-contained. The Unity Catalog UI,
however, does not reflect the change until you run `REPAIR TABLE
catalog.schema.table SYNC METADATA;` from Databricks compute, which
updates UC metadata to match the Delta log.
- In this version, appending to an existing table after changing its
schema is not supported. Only overwrites are supported.
- For AWS, we needed to set an environment variable: `export
AWS_S3_ALLOW_UNSAFE_RENAME=true`.
- There appears to be a defect with the `allow_unsafe_rename` parameter
of `df.write_deltalake`: it did not work during internal testing. This
could be logged as a new issue once confirmed.
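Two of the notes above lend themselves to a pre-flight check before writing. The sketch below is illustrative only: both function names are hypothetical, and it assumes the table's Delta properties are available as a plain dict (for example, what `deltalake.DeltaTable(...).metadata().configuration` is expected to return) and that deletion vectors are switched on through the `delta.enableDeletionVectors` table property.

```python
import os


def deletion_vectors_enabled(configuration):
    """Return True if the table properties turn deletion vectors on.

    `configuration` is assumed to be a dict of Delta table properties.
    A missing key is treated as "not enabled".
    """
    value = configuration.get("delta.enableDeletionVectors", "false")
    return str(value).lower() == "true"


def allow_unsafe_rename_for_s3(env=os.environ):
    """Python equivalent of `export AWS_S3_ALLOW_UNSAFE_RENAME=true`.

    Sets the variable for the current process only, so it must run
    before the write is started.
    """
    env["AWS_S3_ALLOW_UNSAFE_RENAME"] = "true"
```

A caller might refuse to append when `deletion_vectors_enabled(...)` returns `True`, rather than waiting for the `CommitFailedError` at commit time.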
---------
Co-authored-by: Kev Wang <[email protected]>
This call is **blocking** and will execute the DataFrame when called
  Args:
-     table (Union[str, pathlib.Path, DataCatalogTable, deltalake.DeltaTable]): Destination `Delta Lake Table <https://delta-io.github.io/delta-rs/api/delta_table/>`__ or table URI to write dataframe to.
+     table (Union[str, pathlib.Path, DataCatalogTable, deltalake.DeltaTable, UnityCatalogTable]): Destination `Delta Lake Table <https://delta-io.github.io/delta-rs/api/delta_table/>`__ or table URI to write dataframe to.
      partition_cols (List[str], optional): How to subpartition each partition further. If table exists, expected to match table's existing partitioning scheme, otherwise creates the table with specified partition columns. Defaults to None.
      mode (str, optional): Operation mode of the write. `append` will add new data, `overwrite` will replace table with new data, `error` will raise an error if table already exists, and `ignore` will not write anything if table already exists. Defaults to "append".
      schema_mode (str, optional): Schema mode of the write. If set to `overwrite`, allows replacing the schema of the table when doing `mode=overwrite`. Schema mode `merge` is currently not supported.
"""Loads an existing Unity Catalog table. If the table is not found, and information is provided in the method to create a new table, a new table will be attempted to be registered.
+
+ Args:
+     table_name (str): Name of the table in Unity Catalog in the form of dot-separated, 3-level namespace
+     new_table_storage_path (str, optional): Cloud storage path URI to register a new external table using this path. Unity Catalog will validate if the path is valid and authorized for the principal, else will raise an exception.