
[BUG] Allow for Parquet reading from files with differing schemas #2514

Merged (16 commits) on Jul 18, 2024

Conversation

@jaychia (Contributor) commented Jul 16, 2024

Fixes to allow reading Parquet files while specifying columns that do not exist in the file.

This is common when we try to "apply" a schema from some external source (e.g. a data catalog). In that case, old Parquet files may not have certain columns because the schema evolved over time. We want to make sure that reads still succeed on these files, and we get back tables with the appropriate number of rows (even if no columns were read!)

Summary of Changes

Fixes to Parquet reader

  1. Allows for reads of Parquet files with column names that may not exist in the file. These columns will just be missing from the returned Table. Note that this potentially means we get back zero-column Tables that still report a valid number of rows.

Fixes on Table

  1. Fixes Table to allow empty (num_rows=0) tables that still have columns of data. This lets us do reads of files without producing any data at all (e.g. reading only col("x") from a file that doesn't have col("x")).
  2. During reading of Parquet files, remove our FieldNotFound errors. Instead of complaining when a user-provided field is not found, we now return Table structs without the missing columns.
  3. Fix Table::new to not default to num_rows=1. Instead, we pass in num_rows explicitly and check against that.
  4. Fix Table::from_columns to receive an explicit num_rows as well. This cleans up a host of bugs where we might be naively creating tables with the wrong length when a table has no columns.
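The shape of fixes 3 and 4 can be sketched as follows. This is a hypothetical, simplified model (the `Column`, `Table`, and `TableError` types here are illustrative, not Daft's actual definitions): the caller always supplies `num_rows` explicitly, and construction fails if any column disagrees with it.

```rust
// Hypothetical sketch, not Daft's real types: a Table carries an explicit
// num_rows, so a zero-column table still has a well-defined length.
#[derive(Debug)]
struct Column {
    name: String,
    len: usize,
}

#[derive(Debug)]
struct Table {
    columns: Vec<Column>,
    num_rows: usize,
}

#[derive(Debug, PartialEq)]
enum TableError {
    LengthMismatch { column: String, expected: usize, actual: usize },
}

impl Table {
    // The caller passes num_rows explicitly; every column must match it.
    // No defaulting to num_rows=1 when the column list is empty.
    fn from_columns(columns: Vec<Column>, num_rows: usize) -> Result<Table, TableError> {
        for c in &columns {
            if c.len != num_rows {
                return Err(TableError::LengthMismatch {
                    column: c.name.clone(),
                    expected: num_rows,
                    actual: c.len,
                });
            }
        }
        Ok(Table { columns, num_rows })
    }
}

fn main() {
    // A zero-column table can still report 1000 rows, e.g. when all requested
    // columns were missing from the file but the row group had 1000 rows.
    let empty = Table::from_columns(vec![], 1000).unwrap();
    assert_eq!(empty.num_rows, 1000);
    assert!(empty.columns.is_empty());
}
```

The key design point is that length is no longer inferred from the columns, which is exactly what went wrong when a table had no columns at all.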

@github-actions github-actions bot added the bug Something isn't working label Jul 16, 2024
@jaychia jaychia force-pushed the jay/column-pruning-parquet branch 4 times, most recently from ca55235 to 8cee2b3 Compare July 16, 2024 06:32
@jaychia jaychia force-pushed the jay/column-pruning-parquet branch from 4a7bd71 to c3299f4 Compare July 16, 2024 09:09
codecov bot commented Jul 16, 2024

Codecov Report

Attention: Patch coverage is 90.63830% with 22 lines in your changes missing coverage. Please review.

Project coverage is 63.23%. Comparing base (29af6bf) to head (c613aa7).
Report is 8 commits behind head on main.


@@            Coverage Diff             @@
##             main    #2514      +/-   ##
==========================================
- Coverage   63.30%   63.23%   -0.08%     
==========================================
  Files         968      973       +5     
  Lines      108068   108506     +438     
==========================================
+ Hits        68414    68613     +199     
- Misses      39654    39893     +239     
| Files | Coverage | Δ |
|---|---|---|
| daft/table/table.py | 59.20% <100.00%> | +1.02% ⬆️ |
| src/daft-csv/src/read.rs | 99.40% <100.00%> | +<0.01% ⬆️ |
| src/daft-execution/src/task/mod.rs | 63.47% <100.00%> | -0.22% ⬇️ |
| src/daft-execution/src/test/mod.rs | 76.40% <100.00%> | -0.27% ⬇️ |
| src/daft-json/src/local.rs | 86.47% <100.00%> | +0.12% ⬆️ |
| src/daft-json/src/read.rs | 96.68% <100.00%> | +0.03% ⬆️ |
| src/daft-parquet/src/lib.rs | 50.00% <ø> | (ø) |
| src/daft-parquet/src/python.rs | 57.25% <100.00%> | +0.17% ⬆️ |
| src/daft-plan/src/source_info/file_info.rs | 76.66% <100.00%> | -0.20% ⬇️ |
| src/daft-table/src/ffi.rs | 98.18% <100.00%> | (ø) |

... and 12 more

... and 31 files with indirect coverage changes

field: String,
available_fields: Vec<String>,
path: String,
},
jaychia (Contributor, Author) commented:

I removed this because our Parquet readers will now not complain if a requested field is not found.

Instead, the readers will return a best-effort Table, with all the columns it can find from the ones requested.
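The best-effort column selection described above can be sketched roughly like this (a hypothetical helper, not the reader's actual code): the requested columns are intersected with the file's schema, and anything missing is silently dropped rather than raised as an error.

```rust
// Hypothetical sketch: keep only the requested columns that actually exist
// in the Parquet file's schema, instead of erroring on the missing ones.
fn columns_to_read(requested: &[&str], file_schema: &[&str]) -> Vec<String> {
    requested
        .iter()
        .copied()
        .filter(|c| file_schema.contains(c))
        .map(|c| c.to_string())
        .collect()
}

fn main() {
    // An old file written before "y" was added to the catalog schema.
    let file_schema = ["x"];

    let cols = columns_to_read(&["x", "y"], &file_schema);
    assert_eq!(cols, vec!["x".to_string()]); // "y" is silently dropped

    // Requesting only "y" yields no columns, but the read still succeeds,
    // and the returned table keeps the file's row count.
    assert!(columns_to_read(&["y"], &file_schema).is_empty());
}
```

In the zero-column case, the explicit `num_rows` on the returned Table (from the row-group metadata) is what keeps the result meaningful.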

let schema: SchemaRef = schema.into();
if schema.fields.len() != columns.len() {
return Err(DaftError::SchemaMismatch(format!("While building a Table, we found that the number of fields did not match between the schema and the input columns.\n {:?}\n vs\n {:?}", schema.fields.len(), columns.len())));
}
let mut num_rows = 1;
jaychia (Contributor, Author) commented:

We used to default to tables with num_rows=1, but this led to really odd behavior. Now the caller will explicitly pass in num_rows, which makes a little more sense because the caller usually has more context (e.g. callers may know that even though no columns were read, the Parquet rowgroup(s) it read had 1,000 rows).

// We discard the original self.len() because we expect aggregations to change
// the final cardinality. Aggregations on empty tables are expected to produce unit length results.
(true, _) => result_series.iter().map(|s| s.len()).max().unwrap(),
};
jaychia (Contributor, Author) commented:

The logic for num_rows here was tricky to get right, so reviewers should pay extra attention to this block. It was difficult to pin down the "correct" number of rows resulting from a call to eval_expression_list, which can contain cardinality-modifying expressions such as UDFs and aggregations.
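The rule in the snippet above can be restated as a standalone sketch (hypothetical function and parameter names, simplified from the real code): when any expression is cardinality-modifying, the result length is taken from the evaluated series themselves; otherwise it must match the input table's length.

```rust
// Hypothetical sketch of the num_rows rule for evaluating an expression list.
// `has_agg` marks whether any expression changes cardinality (e.g. an
// aggregation); `series_lens` are the lengths of the evaluated result series.
fn result_num_rows(input_len: usize, has_agg: bool, series_lens: &[usize]) -> usize {
    if has_agg {
        // Aggregations discard the input length. On an empty input they are
        // expected to produce unit-length results, so fall back to 1 if no
        // series were produced at all.
        series_lens.iter().copied().max().unwrap_or(1)
    } else {
        // Plain projections preserve the input cardinality.
        input_len
    }
}

fn main() {
    // A sum over a 1000-row table yields a single row.
    assert_eq!(result_num_rows(1000, true, &[1]), 1);
    // Plain projections keep the input cardinality.
    assert_eq!(result_num_rows(1000, false, &[1000, 1000]), 1000);
    // Aggregation over an empty table still produces a unit-length result.
    assert_eq!(result_num_rows(0, true, &[1]), 1);
}
```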

@jaychia jaychia requested a review from samster25 July 17, 2024 16:07
@jaychia jaychia merged commit 6983635 into main Jul 18, 2024
46 checks passed
@jaychia jaychia deleted the jay/column-pruning-parquet branch July 18, 2024 00:35
colin-ho added a commit that referenced this pull request Sep 27, 2024
(#2672)

Enables the `dataframe/test_creation.py` and `io/test_parquet.py` test
suite for the native executor.

Changes:
- Add `PythonStorageConfig` reading functionality (just copying the
existing logic in `materialize_scan_task`)
- Enable streaming parquet reads to read files with differing schemas:
See: #2514

---------

Co-authored-by: Colin Ho <[email protected]>
Co-authored-by: Colin Ho <[email protected]>
sagiahrac pushed a commit to sagiahrac/Daft that referenced this pull request Oct 7, 2024 (Eventual-Inc#2672; same change as above)