Conversation

@mbutrovich (Contributor) commented on Dec 5, 2024

The current logic takes the data schema and the required schema from the Java side (in the scan node) and:

  1. Converts them back into a Parquet schema (Thrift-encoded)
  2. Serializes that schema to the native side
  3. Parses it into a schema descriptor
  4. Converts the descriptor to an Arrow schema

This process introduces conversion errors that are difficult to recover from (e.g. Timestamp(milli) -> INT96 -> Timestamp(nano)). This PR simplifies the schema serialization and conversion to the native side, building on what @viirya did with the partition schema (thank you for the inspiration!).
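
To make that failure mode concrete, here is a minimal Rust sketch (not Comet's actual conversion code) of why the old round trip loses the time unit: INT96 carries no logical-type annotation, so when the native side maps it back to Arrow it has to pick a unit, and the conventional choice is nanoseconds.

```rust
use arrow_schema::{DataType, TimeUnit};

// Hypothetical illustration of the old path:
//   Spark Timestamp(milli)
//     -> Parquet physical type INT96 (no logical type, so the unit is dropped)
//     -> Arrow type inferred from INT96, conventionally nanoseconds.
fn arrow_type_via_parquet_round_trip() -> DataType {
    // The original millisecond unit can no longer be recovered at this point.
    DataType::Timestamp(TimeUnit::Nanosecond, None)
}
```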

In this PR, the data schema and the required schema are serialized as Spark types. On the native side they are converted directly to Arrow types. We also now serialize more schema information (column names, nullability) than we previously did for just the partition schema.
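
As a rough sketch of the new direction (the `SparkType` enum below is a hypothetical stand-in for whatever Comet actually serializes from the JVM side, while the arrow-rs `DataType`/`Field`/`Schema` values are real), the native side can now map the serialized Spark types, column names, and nullability directly to an Arrow schema:

```rust
use arrow_schema::{DataType, Field, Schema, TimeUnit};

// Hypothetical stand-in for the Spark types serialized from the JVM side; the real
// serialized representation in Comet may differ.
enum SparkType {
    Integer,
    Long,
    TimestampMillis,
    String,
}

// Direct Spark -> Arrow mapping on the native side. Because the type never
// round-trips through a Parquet physical type such as INT96, the timestamp unit,
// column name, and nullability all survive intact.
fn to_arrow_field(name: &str, spark_type: &SparkType, nullable: bool) -> Field {
    let data_type = match spark_type {
        SparkType::Integer => DataType::Int32,
        SparkType::Long => DataType::Int64,
        SparkType::TimestampMillis => DataType::Timestamp(TimeUnit::Millisecond, None),
        SparkType::String => DataType::Utf8,
    };
    Field::new(name, data_type, nullable)
}

fn to_arrow_schema(columns: &[(&str, SparkType, bool)]) -> Schema {
    Schema::new(
        columns
            .iter()
            .map(|(name, t, nullable)| to_arrow_field(name, t, *nullable))
            .collect::<Vec<_>>(),
    )
}
```

Under this sketch, the data schema and the required schema would each be built the same way from the fields the JVM side serialized.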

@mbutrovich mbutrovich marked this pull request as draft December 5, 2024 13:52
@mbutrovich (Contributor, Author) commented:

Tests: succeeded 657, failed 99, canceled 1, ignored 47, pending 0

@mbutrovich mbutrovich marked this pull request as ready for review December 5, 2024 17:03
@andygrove andygrove merged commit e0d8077 into apache:comet-parquet-exec Dec 5, 2024
7 of 23 checks passed
@mbutrovich mbutrovich deleted the simplify_schema branch December 5, 2024 17:16