Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide Arrow Schema Hint to Parquet Reader - Alternative 2 #5939

Merged
merged 3 commits into from
Jul 2, 2024

Conversation

efredine
Copy link
Contributor

@efredine efredine commented Jun 22, 2024

Which issue does this PR close?

Closes #5657. Alternative to #5671.

Rationale for this change

The parquet reader automatically uses an embedded arrow schema to hint type inference for decode. In particular if the hinted type is compatible with the underlying parquet type, it performs a cast. However, in situations where the writer was not an arrow writer the schema is not available and the arrow types are inferred from the parquet schema. This is not always desirable.

For example, you may want to cast an INT64 column to a Timestamp column or a Timestamp column to a different timezone. This PR allows a schema to be provided for use with type hinting.

What changes are included in this PR?

This is a draft PR for discussion. There are some API alternatives that can be considered. These are described in the section on user facing changes.

Currently, this PR adds supplied_schema field to ArrowReaderOptions and modifies the try_new method of ArrowReaderMetadata to use the supplied_schema instead of the metadata when one is provided. This option defaults to None so it should be backwards compatible?

The option is provided with a new with_schema method on ArrowReaderOptions.

Adds tests showing a simple success case. Once the approach is finalized, these can be expanded to cover off all the valid cast scenarios using an approach similar to run_single_column_read_tests.

It also adds test showing two failure scenarios:

  1. The supplied_schema must have exactly the same number of columns as the parquet schema or an error is thrown.
  2. The final inferred schema must be contained by the supplied schema.

Are there any user-facing changes?

A with_schema method is added to the ArrowReaderOptions. It is intended to be used as follows:

let file = File::open("file.parquet");
let schema = Arc::new(Schema::new(vec![Fields::new("col", DataType::Int32, false)]));
let options = ArrowReaderOptions::new().with_schema(schema);
// This may fail as described above.
let builder = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options).unwrap();
let reader = builder.build().unwrap();

It need to be provided as an option before the builder is constructed so it can be utilized to provide type hints when the metadata is being read.

There are different approaches that could be taken with the supplied_schema:

  1. Currently, it needs to provide an complete schema for the entire file. It seems like it might be possible for it be treated like a partial schema that would override the the inferred schema or any schema in the metadata. This would be more ergonomic if only wanting to update a subset of fields is needed.
  2. If the supplied type hints can't be utilized the API doesn't need to return an error and just use the type from the Parquet schema. This is what happens with the type hints derived from the metadata, so it might be better to make it consistent.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 22, 2024
@efredine efredine changed the title Provide Arrow Schema Hint to Parquet Reader #5671 - Alternative 2 Provide Arrow Schema Hint to Parquet Reader - Alternative 2 Jun 22, 2024
@efredine efredine marked this pull request as draft June 22, 2024 17:33
@alamb
Copy link
Contributor

alamb commented Jun 23, 2024

Thanks @efredine -- I started the CI on this PR

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this contribution @efredine -- I think this PR and API looks good to me 👍

Another test that might be good would be to write a parquet file using the arrow reader with one schema (perhaps a StringArray) and then read the data back using another schema (perhaps a DictionaryArray).

parquet/src/arrow/arrow_reader/mod.rs Show resolved Hide resolved
parquet/src/arrow/arrow_reader/mod.rs Outdated Show resolved Hide resolved
Eric Fredine added 2 commits June 29, 2024 10:27
Adds a more detailed error message for incompatible columns.

Adds nested fields to test_with_schema.

Adds test for incompatible nested field.

Updates documentation.
@efredine efredine marked this pull request as ready for review June 29, 2024 17:31
@efredine
Copy link
Contributor Author

I expanded the tests to include converting from an String to a Dictionary and also included a nested field. I also applied what I learnt about constructing record batches from an iterator ;-).

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @efredine -- I think this PR looks great.

I'll leave it open for another day or two to allow time for others to comment

Thank you so much

/// writer.close().unwrap();
///
/// // Read the file back.
/// // Supply a schema that interprets the Int32 column as a Timestamp.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -2620,6 +2732,275 @@ mod tests {
assert_eq!(reader.schema(), schema_without_metadata);
}

fn write_parquet_from_iter<I, F>(value: I) -> File
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very impressive set of negative tests 💯 Thank you

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, even better that you found a way to make the code change relatively small 👍

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Provide Arrow Schema Hint to Parquet Reader
3 participants