Write operation #29
Conversation
Codecov Report
@@ Coverage Diff @@
##             main      #29       +/-   ##
===========================================
- Coverage   92.85%   73.28%   -19.57%
===========================================
  Files           2        4        +2
  Lines         112      307      +195
===========================================
+ Hits          104      225      +121
- Misses          8       82       +74
Test coverage is obviously still very bad, but I would love to see this get in soon, even before we cover all cases.
@rjzamora I assume you built the dask vanilla-parquet writer. This may interest you as well
# TODO: This is applying a potentially stricter schema control than what
# Delta requires but if this passes, it should be good to go
schema = validate_compatible(schemas)
I took this code from https://github.com/data-engineering-collective/plateau where I've built a similar thing before. The challenge here is to get a schema from all the partitions that best represents the transaction. This is particularly a struggle for partitions with null columns, because pyarrow then does not give a proper schema. This method effectively looks at all written schemas, deduplicates them, and merges them into a super-schema, i.e. it fills null columns with a proper type.
It also raises if incompatible schemas are detected. Incompatible in this sense means, for example, that the same column holds an integer in one partition and a float in another. This may be stricter than what Delta requires, and a lot of it is already covered by dask, but I used this regardless since we need the schema deduplication.
I'm very open to throwing this out again down the road, but this gets us started quickly since it is battle-proven code.
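For illustration, a minimal sketch of the merge idea (merge_schemas is a hypothetical name standing in for plateau's validate_compatible, which handles many more edge cases):

import pyarrow as pa

def merge_schemas(schemas):
    # Deduplicate first: most partitions share the exact same schema.
    unique = []
    for s in schemas:
        if not any(s.equals(u) for u in unique):
            unique.append(s)

    merged = {}
    for schema in unique:
        for field in schema:
            seen = merged.get(field.name)
            if seen is None or pa.types.is_null(seen.type):
                # First sighting, or we only had an all-null placeholder:
                # take this partition's (possibly concrete) type.
                merged[field.name] = field
            elif pa.types.is_null(field.type) or field.type == seen.type:
                # An all-null partition or an identical type is compatible.
                continue
            else:
                # e.g. int64 in one partition, double in another
                raise ValueError(
                    f"Incompatible types for column {field.name!r}: "
                    f"{seen.type} vs {field.type}"
                )
    return pa.schema(list(merged.values()))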
def test_roundtrip(tmpdir, with_index):
    dtypes = {
        "str": object,
        # FIXME: Categorical data does not work
There are a couple of problems around dtypes. We can open dedicated issues for these once the PR is merged.
raise NotImplementedError()

written = df.map_partitions(
    _write_partition,
Why can't we use write_deltalake from deltatable for writing a partition?
We want to separate the writing of the parquet files from the commit to the log. The entire to_deltalake is supposed to be one transaction. Using deltatable.write_deltalake would create a commit per partition. Not only is that not what a transaction is supposed to be, it would also almost guarantee write conflicts due to concurrent writes.
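A minimal sketch of that shape (hypothetical helper names; a plain JSON file stands in for the real Delta log commit): each partition task writes its parquet file and returns the file metadata, and a single downstream task performs the one commit.

import json
import os
import uuid

import dask
import pyarrow as pa
import pyarrow.parquet as pq

def _write_partition(pdf, table_uri):
    # Write one partition's parquet file; return the metadata the commit
    # needs (the real code builds delta-rs "add actions" here).
    path = os.path.join(table_uri, f"part-{uuid.uuid4().hex}.parquet")
    pq.write_table(pa.Table.from_pandas(pdf), path)
    return {"path": path, "rows": len(pdf)}

@dask.delayed
def _commit(table_uri, add_actions):
    # Exactly one commit for the whole write, no matter how many partitions.
    with open(os.path.join(table_uri, "_commit.json"), "w") as f:
        json.dump(add_actions, f)

def to_deltalake(df, table_uri):
    os.makedirs(table_uri, exist_ok=True)
    parts = [dask.delayed(_write_partition)(p, table_uri) for p in df.to_delayed()]
    return _commit(table_uri, parts)  # fan-in: a single transaction

The actual PR wires this fan-in up through a HighLevelGraph / Scalar rather than dask.delayed, but the transactional shape is the same.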
dask_deltatable/write.py
raise DeltaProtocolError(
    "This table's min_writer_version is "
    f"{table.protocol().min_writer_version}, "
    "but this method only supports version 2."
)
"but this method only supports version 2." | |
f"but this method only supports version {MAX_SUPPORTED_WRITER_VERSION}." |
dask_deltatable/write.py
partitioning = None
if mode == "overwrite":
    # FIXME: There are a couple of checks that are not migrated yet
    raise NotImplementedError()
-     raise NotImplementedError()
+     raise NotImplementedError("mode='overwrite' is not implemented")
dask_deltatable/write.py
format="parquet", | ||
partitioning=partitioning, | ||
# It will not accept a schema if using a RBR | ||
schema=schema if not isinstance(data, RecordBatchReader) else None, |
Why would data be a RecordBatchReader if you just created it as pa.Table.from_pandas on line 218?
Looks good! Currently, there are a couple of cases that are not covered by this PR. I would either open issues to fix those cases later, or add tests that are xfailed, so they serve as a reminder.
Closes #4
It effectively translates/copies most of https://github.com/delta-io/delta-rs/blob/e5dd8e2167b94e6856aa531d878584397d5bea69/python/deltalake/writer.py#L142-L342 to dask. I omitted the overwrite path for now, mostly for simplicity's sake. The only genuine addition here is how this has to be wired together with dask (the HLG / map_partition / Scalar foo).
There is a lot of overlap, and I have to hook into some internal flagged APIs of delta-rs. Hopefully this can be cleaned up eventually (cc @MrPowers)