Read test cases checklist #7
@wjones127 - Your list looks great. I'd add a Delta Lake table that's constructed with different save modes to the list:

```python
df = spark.range(0, 3)
df.write.format("delta").save("/tmp/delta-table")

df2 = spark.range(4, 6)
df2.write.mode("overwrite").format("delta").save("/tmp/delta-table")
```

This test will make sure that the Delta Lake reader isn't just reading all the Parquet files.
Some notes for implementing each of these (see the creation sketch after this list):

- A Delta Lake table with all data types: example from delta-rs tests: https://github.com/delta-io/delta-rs/blob/fae50cca528446e27c5401818a4f31b5a97e8ad2/python/tests/conftest.py#L30-L53
- A table with a checkpoint: set
- A table which has had a schema change: overwrite with
- A table with stats as struct: turn
- A table with id-based column mapping: set
- A table with name-based column mapping: set
- A table with multi-part checkpoint: use setting https://github.com/delta-io/delta/pull/946/files
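For the truncated notes above, one plausible way to create a few of these fixtures with PySpark is sketched below. The table properties used (`delta.checkpointInterval`, `delta.checkpoint.writeStatsAsStruct`, `delta.columnMapping.mode`) and the `overwriteSchema` option are standard Delta Lake settings, but mapping them to these specific list items is my assumption; the paths and values are only illustrative.

```python
# Illustrative fixture creation; assumes a SparkSession with Delta Lake enabled.

# A table with a checkpoint: a low checkpoint interval forces a checkpoint
# file to be written after only a couple of commits.
spark.sql("""
    CREATE TABLE delta.`/tmp/table-with-checkpoint` (id LONG) USING DELTA
    TBLPROPERTIES ('delta.checkpointInterval' = '2')
""")
for i in range(4):
    spark.range(i * 10, i * 10 + 10).write.format("delta") \
        .mode("append").save("/tmp/table-with-checkpoint")

# A table which has had a schema change: overwrite with a different schema.
spark.range(0, 3).write.format("delta").save("/tmp/table-with-schema-change")
spark.range(0, 3).withColumnRenamed("id", "new_id") \
    .write.format("delta").mode("overwrite") \
    .option("overwriteSchema", "true") \
    .save("/tmp/table-with-schema-change")

# A table with stats as struct: write checkpoint stats as a struct column.
spark.sql("""
    ALTER TABLE delta.`/tmp/table-with-checkpoint`
    SET TBLPROPERTIES ('delta.checkpoint.writeStatsAsStruct' = 'true')
""")

# A table with name-based column mapping (id-based mapping would use 'id'
# instead of 'name'); column mapping requires reader v2 / writer v5.
spark.sql("""
    CREATE TABLE delta.`/tmp/table-with-name-mapping` (id LONG, value STRING)
    USING DELTA
    TBLPROPERTIES (
        'delta.columnMapping.mode' = 'name',
        'delta.minReaderVersion' = '2',
        'delta.minWriterVersion' = '5'
    )
""")
```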
@wjones127 - are you cool with separate reference tables for "A Delta Lake table with all data types"? I think this will make it more obvious which types aren't supported by each connector. Suppose a connector doesn't support 5 data types. One failing test wouldn't explain the gap as clearly as 5 failing tests would. Thoughts?
IMO that doesn't seem fully necessary. But perhaps we can separate the primitive types from the nested types (struct, list, map).
@wjones127 - separating the primitive types from the complex types seems like a nice balance 👍
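To make that split concrete, here is one possible shape for the two reference tables; the column names, values, and paths are my own examples, not from the thread:

```python
# A reference table covering only primitive types.
primitives = spark.sql("""
    SELECT
        CAST(1 AS INT)                 AS int_col,
        CAST(1 AS BIGINT)              AS long_col,
        CAST(1.5 AS DOUBLE)            AS double_col,
        CAST(1.5 AS DECIMAL(10, 2))    AS decimal_col,
        'a'                            AS string_col,
        true                           AS bool_col,
        DATE'2022-01-01'               AS date_col,
        TIMESTAMP'2022-01-01 00:00:00' AS timestamp_col
""")
primitives.write.format("delta").save("/tmp/reference-primitive-types")

# A separate reference table covering the nested types.
nested = spark.sql("""
    SELECT
        named_struct('a', 1, 'b', 'x') AS struct_col,
        array(1, 2, 3)                 AS list_col,
        map('k1', 1, 'k2', 2)          AS map_col
""")
nested.write.format("delta").save("/tmp/reference-nested-types")
```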
These table ideas look very good to me.
I will think of more and keep adding to this thread. :D
What is our philosophy of test cases? Do we care about each individual feature? Or are we collecting a set of cases that have maximal coverage of important common and corner cases? I'm assuming the latter for this draft list.
Reader protocol v1:
Reader protocol v2: