
[BUG] Allow for Parquet reading from files with differing schemas #2514

Merged (16 commits) on Jul 18, 2024

Conversation

@jaychia (Contributor) commented Jul 16, 2024

Fixes to allow reading Parquet files while specifying columns that do not exist in the file.

This is common when we try to "apply" a schema from some external source (e.g. a data catalog). In that case, old Parquet files may not have certain columns because the schema evolved over time. We want to make sure that reads still succeed on these files, and we get back tables with the appropriate number of rows (even if no columns were read!)

Summary of Changes

Fixes to Parquet reader

  1. Allows for reads of Parquet files with column names that may not exist in the file. These columns will just be missing from the returned Table. Note that this potentially means we get back zero-column Tables that still report a valid number of rows.

Fixes on Table

  1. Fixes Table to allow empty (num_rows=0) tables that still have columns of data. This lets us do reads of files without producing any data at all (e.g. reading only col("x") from a file that doesn't have col("x")).
  2. During reading of Parquet files, remove our FieldNotFound errors. Instead of complaining when a user-provided field is not found, we now return Table structs without the missing columns.
  3. Fix Table::new to not default to num_rows=1. Instead, we pass in num_rows explicitly and check against that.
  4. Fix Table::from_columns to receive an explicit num_rows as well. This cleans up a host of bugs where we might be naively creating tables with the wrong length when a table has no columns.
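The shape of fixes 3 and 4 can be sketched as follows. This is a hypothetical, simplified model (the `Column`, `Table`, and `TableError` types here are illustrative, not Daft's actual definitions): the caller always supplies `num_rows` explicitly, and construction fails if any column disagrees with it.

```rust
// Hypothetical sketch, not Daft's real types: a Table carries an explicit
// num_rows, so a zero-column table still has a well-defined length.
#[derive(Debug)]
struct Column {
    name: String,
    len: usize,
}

#[derive(Debug)]
struct Table {
    columns: Vec<Column>,
    num_rows: usize,
}

#[derive(Debug, PartialEq)]
enum TableError {
    LengthMismatch { column: String, expected: usize, actual: usize },
}

impl Table {
    // The caller passes num_rows explicitly; every column must match it.
    // No defaulting to num_rows=1 when the column list is empty.
    fn from_columns(columns: Vec<Column>, num_rows: usize) -> Result<Table, TableError> {
        for c in &columns {
            if c.len != num_rows {
                return Err(TableError::LengthMismatch {
                    column: c.name.clone(),
                    expected: num_rows,
                    actual: c.len,
                });
            }
        }
        Ok(Table { columns, num_rows })
    }
}

fn main() {
    // A zero-column table can still report 1000 rows, e.g. when all requested
    // columns were missing from the file but the row group had 1000 rows.
    let empty = Table::from_columns(vec![], 1000).unwrap();
    assert_eq!(empty.num_rows, 1000);
    assert!(empty.columns.is_empty());
}
```

The key design point is that length is no longer inferred from the columns, which is exactly what went wrong when a table had no columns at all.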

@github-actions github-actions bot added the bug Something isn't working label Jul 16, 2024
@jaychia jaychia force-pushed the jay/column-pruning-parquet branch 4 times, most recently from ca55235 to 8cee2b3 Compare July 16, 2024 06:32
@jaychia jaychia force-pushed the jay/column-pruning-parquet branch from 4a7bd71 to c3299f4 Compare July 16, 2024 09:09
codecov bot commented Jul 16, 2024

Codecov Report

Attention: Patch coverage is 90.63830% with 22 lines in your changes missing coverage. Please review.

Project coverage is 63.23%. Comparing base (29af6bf) to head (c613aa7).
Report is 8 commits behind head on main.


@@            Coverage Diff             @@
##             main    #2514      +/-   ##
==========================================
- Coverage   63.30%   63.23%   -0.08%     
==========================================
  Files         968      973       +5     
  Lines      108068   108506     +438     
==========================================
+ Hits        68414    68613     +199     
- Misses      39654    39893     +239     
| Files | Coverage | Δ |
|---|---|---|
| daft/table/table.py | 59.20% <100.00%> | +1.02% ⬆️ |
| src/daft-csv/src/read.rs | 99.40% <100.00%> | +<0.01% ⬆️ |
| src/daft-execution/src/task/mod.rs | 63.47% <100.00%> | -0.22% ⬇️ |
| src/daft-execution/src/test/mod.rs | 76.40% <100.00%> | -0.27% ⬇️ |
| src/daft-json/src/local.rs | 86.47% <100.00%> | +0.12% ⬆️ |
| src/daft-json/src/read.rs | 96.68% <100.00%> | +0.03% ⬆️ |
| src/daft-parquet/src/lib.rs | 50.00% <ø> | (ø) |
| src/daft-parquet/src/python.rs | 57.25% <100.00%> | +0.17% ⬆️ |
| src/daft-plan/src/source_info/file_info.rs | 76.66% <100.00%> | -0.20% ⬇️ |
| src/daft-table/src/ffi.rs | 98.18% <100.00%> | (ø) |

... and 12 more

... and 31 files with indirect coverage changes

field: String,
available_fields: Vec<String>,
path: String,
},
jaychia (Contributor, Author) commented:

I removed this because our Parquet readers will now not complain if a requested field is not found.

Instead, the readers will return a best-effort Table, with all the columns it can find from the ones requested.
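The best-effort column selection described above can be sketched roughly like this (a hypothetical helper, not the reader's actual code): the requested columns are intersected with the file's schema, and anything missing is silently dropped rather than raised as an error.

```rust
// Hypothetical sketch: keep only the requested columns that actually exist
// in the Parquet file's schema, instead of erroring on the missing ones.
fn columns_to_read(requested: &[&str], file_schema: &[&str]) -> Vec<String> {
    requested
        .iter()
        .copied()
        .filter(|c| file_schema.contains(c))
        .map(|c| c.to_string())
        .collect()
}

fn main() {
    // An old file written before "y" was added to the catalog schema.
    let file_schema = ["x"];

    let cols = columns_to_read(&["x", "y"], &file_schema);
    assert_eq!(cols, vec!["x".to_string()]); // "y" is silently dropped

    // Requesting only "y" yields no columns, but the read still succeeds,
    // and the returned table keeps the file's row count.
    assert!(columns_to_read(&["y"], &file_schema).is_empty());
}
```

In the zero-column case, the explicit `num_rows` on the returned Table (from the row-group metadata) is what keeps the result meaningful.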

let schema: SchemaRef = schema.into();
if schema.fields.len() != columns.len() {
return Err(DaftError::SchemaMismatch(format!("While building a Table, we found that the number of fields did not match between the schema and the input columns.\n {:?}\n vs\n {:?}", schema.fields.len(), columns.len())));
}
let mut num_rows = 1;
jaychia (Contributor, Author) commented:

We used to default to tables with num_rows=1, but this led to really odd behavior. Now the caller will explicitly pass in num_rows, which makes a little more sense because the caller usually has more context (e.g. callers may know that even though no columns were read, the Parquet rowgroup(s) it read had 1,000 rows).

// We discard the original self.len() because we expect aggregations to change
// the final cardinality. Aggregations on empty tables are expected to produce unit length results.
(true, _) => result_series.iter().map(|s| s.len()).max().unwrap(),
};
jaychia (Contributor, Author) commented:

The logic for num_rows here was tricky to get right, so reviewers should pay extra attention to this block. It was difficult to pin down the "correct" number of rows resulting from a call to eval_expression_list, which can contain cardinality-modifying expressions such as UDFs and aggregations.
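The rule in the snippet above can be restated as a standalone sketch (hypothetical function and parameter names, simplified from the real code): when any expression is cardinality-modifying, the result length is taken from the evaluated series themselves; otherwise it must match the input table's length.

```rust
// Hypothetical sketch of the num_rows rule for evaluating an expression list.
// `has_agg` marks whether any expression changes cardinality (e.g. an
// aggregation); `series_lens` are the lengths of the evaluated result series.
fn result_num_rows(input_len: usize, has_agg: bool, series_lens: &[usize]) -> usize {
    if has_agg {
        // Aggregations discard the input length. On an empty input they are
        // expected to produce unit-length results, so fall back to 1 if no
        // series were produced at all.
        series_lens.iter().copied().max().unwrap_or(1)
    } else {
        // Plain projections preserve the input cardinality.
        input_len
    }
}

fn main() {
    // A sum over a 1000-row table yields a single row.
    assert_eq!(result_num_rows(1000, true, &[1]), 1);
    // Plain projections keep the input cardinality.
    assert_eq!(result_num_rows(1000, false, &[1000, 1000]), 1000);
    // Aggregation over an empty table still produces a unit-length result.
    assert_eq!(result_num_rows(0, true, &[1]), 1);
}
```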

@jaychia jaychia requested a review from samster25 July 17, 2024 16:07
@jaychia jaychia merged commit 6983635 into main Jul 18, 2024
46 checks passed
@jaychia jaychia deleted the jay/column-pruning-parquet branch July 18, 2024 00:35
colin-ho added a commit that referenced this pull request Sep 27, 2024
(#2672)

Enables the `dataframe/test_creation.py` and `io/test_parquet.py` test
suite for the native executor.

Changes:
- Add `PythonStorageConfig` reading functionality (just copying the
existing logic in `materialize_scan_task`)
- Enable streaming parquet reads to read files with differing schemas:
See: #2514

---------

Co-authored-by: Colin Ho <[email protected]>
Co-authored-by: Colin Ho <[email protected]>
sagiahrac pushed a commit to sagiahrac/Daft that referenced this pull request Oct 7, 2024 (Eventual-Inc#2672; same change as above)