
feat: read schema from parquet files in datafusion scans #1266

Merged (7 commits, Apr 14, 2023)

Conversation

roeap (Collaborator) commented Apr 8, 2023

Description

This PR updates table scans with datafusion to read the file schema from the parquet file referenced by the latest add action of the table. This works around issues where the schema we derive from table metadata does not match the data in the parquet files - e.g. nanosecond vs. microsecond timestamps.
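As a rough sketch of the mechanism (illustrative only - local file handling and error types are assumptions, not this PR's code), reading the Arrow schema from a parquet footer with the parquet crate looks roughly like:

use std::fs::File;

use arrow::datatypes::SchemaRef;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

// Illustrative: read the Arrow schema recorded in a parquet file's footer,
// e.g. for the file referenced by the table's latest add action.
fn footer_schema(path: &str) -> Result<SchemaRef, Box<dyn std::error::Error>> {
    let file = File::open(path)?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
    Ok(builder.schema().clone())
}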

We also update the Load command to handle column selections and make it more consistent with the other operations.
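A hedged usage sketch of the updated Load command (the builder method name with_columns and the exact return shape are assumptions based on this description, not confirmed API):

use deltalake::{DeltaOps, DeltaTableError};

// Assumed usage: select a subset of columns when loading a table.
async fn load_columns() -> Result<(), DeltaTableError> {
    let (_table, _stream) = DeltaOps::try_from_uri("./data/my-table")
        .await?
        .load()
        .with_columns(["id", "value"])
        .await?;
    Ok(())
}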

Related Issue(s)

closes #441 (Datafusion table provider: issues with timestamp types)

Documentation

@github-actions github-actions bot added binding/rust Issues for the Rust crate rust labels Apr 8, 2023
@roeap roeap changed the title from "[WIP] feat: read schema from parquet files in datafusion scans" to "feat: read schema from parquet files in datafusion scans" Apr 8, 2023
@roeap roeap marked this pull request as ready for review April 8, 2023 10:22
rust/src/operations/transaction/state.rs (review thread, outdated)
///
/// This will construct a schema derived from the parquet schema of the latest data file,
/// and fields for partition columns from the schema defined in table metadata.
pub async fn physical_arrow_schema(
Collaborator:

I don't think we can guarantee this schema is consistent across all parquet files in the table; different writers may have written to the table with different physical types for timestamps. IMO this should be handled in the scan of each Parquet file. That is, we should cast the physical type to microsecond timestamps as needed.
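For illustration, the per-file coercion described here could lean on the arrow cast kernel - a sketch under that assumption, not code from this PR:

use arrow::array::ArrayRef;
use arrow::compute::cast;
use arrow::datatypes::{DataType, TimeUnit};
use arrow::error::ArrowError;

// Sketch: coerce a timestamp column (e.g. nanosecond precision) to
// microsecond precision while scanning an individual parquet file.
fn coerce_to_micros(col: &ArrayRef) -> Result<ArrayRef, ArrowError> {
    cast(col, &DataType::Timestamp(TimeUnit::Microsecond, None))
}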

Collaborator:

In PyArrow, we handle the int96 timestamp issue by passing an argument to the reader to coerce it to microsecond precision. Maybe we could implement something similar upstream?

parquet_read_options=ParquetReadOptions(coerce_int96_timestamp_unit="ms")

roeap (Author):

There definitely are no guarantees that the file schema is consistent. Datafusion, however, needs a consistent schema. Once we get into column mappings etc., things might get even more demanding, and we may have to roll our own parquet scan, or rather start putting logic into our DeltaScan.

That said, I do believe using the schema from the latest file is an improvement over the current approach, which at least for me fails for more or less every Databricks-written table where timestamps are involved.

Not sure about the best way forward, but I'm happy to keep that logic on a private branch somewhere until we have a more general fix.

roeap (Author):

Somewhat related, so I already published it as WIP: in #1267 I did some work on the write command. There my plan was to use the same schema to validate writes. But there it would be even more confusing, since we might end up in a situation where writing with the "official" schema of the table would not be permissible. Somehow it feels very strange to me to potentially have many schemas in the same table.

I guess Spark must allow at least some flexibility in what schema it expects at write time, otherwise how would we end up in this discussion at all :D

Collaborator:

Yeah, we are definitely hitting the limits of DataFusion's scanner. I've created an issue upstream: apache/datafusion#5950

I'm fine with moving this forward; I mostly care that we have a more robust implementation in the future and have at least some momentum towards it.

wjones127 (Collaborator) left a comment:

Finished looking through. Just one other comment.

.files()
.iter()
.max_by_key(|obj| obj.modification_time)
.ok_or(DeltaTableError::Generic("No active file actions to get physical schema. Maybe the current state has not yet been loaded?".into()))?
Collaborator:

Does this error propagate to the user? Does this mean trying to scan an empty table leads to an error? I don't think it should.

roeap (Author):

It does. At the time I thought we could fail on a scan if no files have been added yet, but you are right - there are several valid scenarios where a table has no files and we should still be able to scan it.

Fixed it so we fall back to the schema from metadata.
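A minimal sketch of that fallback, built from the snippet quoted above (the two helper functions are hypothetical, not the merged code):

// Hypothetical: prefer the schema from the newest add action's parquet
// footer; with no files yet, derive the schema from table metadata.
let schema = match state.files().iter().max_by_key(|add| add.modification_time) {
    Some(add) => schema_from_parquet_footer(add).await?, // hypothetical helper
    None => arrow_schema_from_metadata(state.metadata())?, // hypothetical helper
};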

@wjones127 wjones127 merged commit 362a94e into delta-io:main Apr 14, 2023
@roeap roeap deleted the load-schema branch April 14, 2023 04:59