Conversation

@mbutrovich (Contributor) commented on Dec 5, 2024

The current logic takes the data schema and the required schema from the Java side (in the scan node) and:

  1. Converts them back into a Parquet schema (Thrift-encoded)
  2. Serializes that schema to the native side
  3. Parses it into a schema descriptor
  4. Converts the descriptor to an Arrow schema

This process introduces conversion errors that are difficult to recover from (e.g. Timestamp(milli) -> INT96 -> Timestamp(nano)). This PR simplifies the schema serialization and conversion to the native side, building on what @viirya did with the partition schema (thank you for the inspiration!).
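
To make that failure mode concrete, here is a minimal Rust sketch (not Comet's actual conversion code) of why the old round trip loses the time unit: INT96 carries no logical-type annotation, so when the native side maps it back to Arrow it has to pick a unit, and the conventional choice is nanoseconds.

```rust
use arrow_schema::{DataType, TimeUnit};

// Hypothetical illustration of the old path:
//   Spark Timestamp(milli)
//     -> Parquet physical type INT96 (no logical type, so the unit is dropped)
//     -> Arrow type inferred from INT96, conventionally nanoseconds.
fn arrow_type_via_parquet_round_trip() -> DataType {
    // The original millisecond unit can no longer be recovered at this point.
    DataType::Timestamp(TimeUnit::Nanosecond, None)
}
```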

In this PR, the data schema and the required schema are serialized as Spark types. On the native side they are converted directly to Arrow types. We also now serialize more schema information (column names, nullability) than we previously did for just the partition schema.
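
As a rough sketch of the new direction (the `SparkType` enum below is a hypothetical stand-in for whatever Comet actually serializes from the JVM side, while the arrow-rs `DataType`/`Field`/`Schema` values are real), the native side can now map the serialized Spark types, column names, and nullability directly to an Arrow schema:

```rust
use arrow_schema::{DataType, Field, Schema, TimeUnit};

// Hypothetical stand-in for the Spark types serialized from the JVM side; the real
// serialized representation in Comet may differ.
enum SparkType {
    Integer,
    Long,
    TimestampMillis,
    String,
}

// Direct Spark -> Arrow mapping on the native side. Because the type never
// round-trips through a Parquet physical type such as INT96, the timestamp unit,
// column name, and nullability all survive intact.
fn to_arrow_field(name: &str, spark_type: &SparkType, nullable: bool) -> Field {
    let data_type = match spark_type {
        SparkType::Integer => DataType::Int32,
        SparkType::Long => DataType::Int64,
        SparkType::TimestampMillis => DataType::Timestamp(TimeUnit::Millisecond, None),
        SparkType::String => DataType::Utf8,
    };
    Field::new(name, data_type, nullable)
}

fn to_arrow_schema(columns: &[(&str, SparkType, bool)]) -> Schema {
    Schema::new(
        columns
            .iter()
            .map(|(name, t, nullable)| to_arrow_field(name, t, *nullable))
            .collect::<Vec<_>>(),
    )
}
```

Under this sketch, the data schema and the required schema would each be built the same way from the fields the JVM side serialized.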

@mbutrovich mbutrovich marked this pull request as draft December 5, 2024 13:52
@mbutrovich (Contributor, Author) commented:

Tests: succeeded 657, failed 99, canceled 1, ignored 47, pending 0

@mbutrovich mbutrovich marked this pull request as ready for review December 5, 2024 17:03
@andygrove andygrove merged commit e0d8077 into apache:comet-parquet-exec Dec 5, 2024
7 of 23 checks passed
@mbutrovich mbutrovich deleted the simplify_schema branch December 5, 2024 17:16