Provide Arrow Schema Hint to Parquet Reader - Alternative 2 #5939

efredine · 2024-06-22T17:25:48Z

Which issue does this PR close?

Closes #5657. Alternative to #5671.

Rationale for this change

The parquet reader automatically uses an embedded arrow schema to hint type inference for decode. In particular if the hinted type is compatible with the underlying parquet type, it performs a cast. However, in situations where the writer was not an arrow writer the schema is not available and the arrow types are inferred from the parquet schema. This is not always desirable.

For example, you may want to cast an INT64 column to a Timestamp column or a Timestamp column to a different timezone. This PR allows a schema to be provided for use with type hinting.

What changes are included in this PR?

This is a draft PR for discussion. There are some API alternatives that can be considered. These are described in the section on user facing changes.

Currently, this PR adds supplied_schema field to ArrowReaderOptions and modifies the try_new method of ArrowReaderMetadata to use the supplied_schema instead of the metadata when one is provided. This option defaults to None so it should be backwards compatible?

The option is provided with a new with_schema method on ArrowReaderOptions.

Adds tests showing a simple success case. Once the approach is finalized, these can be expanded to cover off all the valid cast scenarios using an approach similar to run_single_column_read_tests.

It also adds test showing two failure scenarios:

The supplied_schema must have exactly the same number of columns as the parquet schema or an error is thrown.
The final inferred schema must be contained by the supplied schema.

Are there any user-facing changes?

A with_schema method is added to the ArrowReaderOptions. It is intended to be used as follows:

let file = File::open("file.parquet");
let schema = Arc::new(Schema::new(vec![Fields::new("col", DataType::Int32, false)]));
let options = ArrowReaderOptions::new().with_schema(schema);
// This may fail as described above.
let builder = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options).unwrap();
let reader = builder.build().unwrap();

It need to be provided as an option before the builder is constructed so it can be utilized to provide type hints when the metadata is being read.

There are different approaches that could be taken with the supplied_schema:

Currently, it needs to provide an complete schema for the entire file. It seems like it might be possible for it be treated like a partial schema that would override the the inferred schema or any schema in the metadata. This would be more ergonomic if only wanting to update a subset of fields is needed.
If the supplied type hints can't be utilized the API doesn't need to return an error and just use the type from the Parquet schema. This is what happens with the type hints derived from the metadata, so it might be better to make it consistent.

alamb · 2024-06-23T11:03:52Z

Thanks @efredine -- I started the CI on this PR

alamb

Thank you for this contribution @efredine -- I think this PR and API looks good to me 👍

Another test that might be good would be to write a parquet file using the arrow reader with one schema (perhaps a StringArray) and then read the data back using another schema (perhaps a DictionaryArray).

parquet/src/arrow/arrow_reader/mod.rs

Adds a more detailed error message for incompatible columns. Adds nested fields to test_with_schema. Adds test for incompatible nested field. Updates documentation.

efredine · 2024-06-29T17:32:54Z

I expanded the tests to include converting from an String to a Dictionary and also included a nested field. I also applied what I learnt about constructing record batches from an iterator ;-).

alamb

Thank you @efredine -- I think this PR looks great.

I'll leave it open for another day or two to allow time for others to comment

Thank you so much

alamb · 2024-06-30T11:34:56Z

parquet/src/arrow/arrow_reader/mod.rs

+    /// writer.close().unwrap();
+    ///
+    /// // Read the file back.
+    /// // Supply a schema that interprets the Int32 column as a Timestamp.


alamb · 2024-06-30T11:36:42Z

parquet/src/arrow/arrow_reader/mod.rs

@@ -2620,6 +2732,275 @@ mod tests {
        assert_eq!(reader.schema(), schema_without_metadata);
    }

+    fn write_parquet_from_iter<I, F>(value: I) -> File


This is a very impressive set of negative tests 💯 Thank you

tustvold

This looks good to me, even better that you found a way to make the code change relatively small 👍

github-actions bot added the parquet Changes to the parquet crate label Jun 22, 2024

efredine changed the title ~~Provide Arrow Schema Hint to Parquet Reader #5671 - Alternative 2~~ Provide Arrow Schema Hint to Parquet Reader - Alternative 2 Jun 22, 2024

efredine marked this pull request as draft June 22, 2024 17:33

This was referenced Jun 23, 2024

DataFusion weekly project plan (Andrew Lamb) - June 17, 2024 apache/datafusion#10955

Closed

DataFusion weekly project plan (Andrew Lamb) - June 24, 2024 apache/datafusion#11106

Closed

alamb reviewed Jun 27, 2024

View reviewed changes

parquet/src/arrow/arrow_reader/mod.rs Show resolved Hide resolved

parquet/src/arrow/arrow_reader/mod.rs Outdated Show resolved Hide resolved

Eric Fredine added 2 commits June 29, 2024 10:27

Adds option for providing a schema to the Arrow Parquet Reader.

322c1cf

Adds more complete tests.

3b4b2a1

Adds a more detailed error message for incompatible columns. Adds nested fields to test_with_schema. Adds test for incompatible nested field. Updates documentation.

efredine force-pushed the parquet-reader-schema branch from d310f92 to 3b4b2a1 Compare June 29, 2024 17:27

efredine marked this pull request as ready for review June 29, 2024 17:31

Add an example using showing how to use the with_schema option.

7cf6fdb

alamb approved these changes Jun 30, 2024

View reviewed changes

This was referenced Jun 30, 2024

Release arrow-rs / parquet minor version 52.1.0 #5905

Closed

Provide Arrow Schema Hint to Parquet Reader #5671

Closed

tustvold approved these changes Jul 2, 2024

View reviewed changes

tustvold merged commit 6351674 into apache:master Jul 2, 2024
16 checks passed

alamb mentioned this pull request Jul 2, 2024

Provide Arrow Schema Hint to Parquet Reader #5657

Closed

mapleFU mentioned this pull request Aug 13, 2024

GH-43660: [C++] Add a CastingGenerator to Parquet Reader that applies required casts before slicing apache/arrow#43661

Open

This pull request was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Provide Arrow Schema Hint to Parquet Reader - Alternative 2 #5939

Provide Arrow Schema Hint to Parquet Reader - Alternative 2 #5939

efredine commented Jun 22, 2024 •

edited

Loading

alamb commented Jun 23, 2024

alamb left a comment

efredine commented Jun 29, 2024

alamb left a comment

alamb Jun 30, 2024

alamb Jun 30, 2024

tustvold left a comment

Provide Arrow Schema Hint to Parquet Reader - Alternative 2 #5939

Provide Arrow Schema Hint to Parquet Reader - Alternative 2 #5939

Conversation

efredine commented Jun 22, 2024 • edited Loading

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

alamb commented Jun 23, 2024

alamb left a comment

Choose a reason for hiding this comment

efredine commented Jun 29, 2024

alamb left a comment

Choose a reason for hiding this comment

alamb Jun 30, 2024

Choose a reason for hiding this comment

alamb Jun 30, 2024

Choose a reason for hiding this comment

tustvold left a comment

Choose a reason for hiding this comment

efredine commented Jun 22, 2024 •

edited

Loading