[comet-parquet-exec] Use Parquet schema for scan instead of Spark schema #1103
Conversation
# Conflicts:
#   native/core/src/execution/datafusion/planner.rs
  new SparkToParquetSchemaConverter(conf).convert(scan.requiredSchema)
val dataSchemaParquet =
  new SparkToParquetSchemaConverter(conf).convert(scan.relation.dataSchema)
val projection_vector: Array[java.lang.Long] = scan.requiredSchema.fields.map(field => {
This change essentially means that any schema 'adaptation' made in SparkToParquetSchemaConverter.convert to support legacy timestamps and decimals will no longer be applied. If that matters, we will probably see it as test failures with incorrect results.
Also, Comet's Parquet file reader uses CometParquetReadSupport.clipParquetSchema to do a similar conversion, and it includes support for the Parquet field_id, which is desirable for delta sources like Iceberg.
Basically, a field_id, if present, identifies a field in a schema more precisely (in the event of field name changes).
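To make the field_id point concrete, here is a rough, self-contained sketch (names and ids are made up; this is not Comet's actual matching code) of resolving a column by field id so that a rename does not break the lookup:

```scala
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.{BINARY, INT64}
import org.apache.parquet.schema.{MessageType, Type, Types}

// A file written before a rename: the column is still called "e_mail" but carries field id 2.
val fileSchema: MessageType = Types.buildMessage()
  .required(INT64).id(1).named("customer_id")
  .optional(BINARY).id(2).named("e_mail")
  .named("spark_schema")

// Resolve the requested column by id instead of by name.
def findByFieldId(schema: MessageType, wantedId: Int): Option[Type] =
  (0 until schema.getFieldCount)
    .map(i => schema.getType(i))
    .find(f => Option(f.getId).exists(_.intValue == wantedId))

// The table may now call the column "email"; matching on id = 2 still finds it.
findByFieldId(fileSchema, wantedId = 2)
```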
The SchemaConverter seems like it could be handled in DF's SchemaAdapter. I'll look at clipParquetSchema as well, thanks!
If you just need Arrow types, can you convert the Spark types to Arrow types directly? For example, if a column is treated as a timestamp type in Spark, its Arrow type is a timestamp too.
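For reference, a minimal sketch of that direct mapping (the timestamp unit/timezone and decimal width below are assumptions, not necessarily Comet's actual conversion rules):

```scala
import org.apache.arrow.vector.types.{FloatingPointPrecision, TimeUnit}
import org.apache.arrow.vector.types.pojo.ArrowType
import org.apache.spark.sql.types._

// Map a Spark type straight to an Arrow type, skipping Parquet entirely.
def toArrowType(dt: DataType): ArrowType = dt match {
  case IntegerType    => new ArrowType.Int(32, true)
  case LongType       => new ArrowType.Int(64, true)
  case DoubleType     => new ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE)
  case StringType     => ArrowType.Utf8.INSTANCE
  case TimestampType  => new ArrowType.Timestamp(TimeUnit.MICROSECOND, "UTC")
  case d: DecimalType => new ArrowType.Decimal(d.precision, d.scale, 128)
  case other          => throw new UnsupportedOperationException(other.toString)
}
```

Spark keeps a similar mapping internally (ArrowUtils), which is presumably close to what Comet's conversion rules are based on.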
My concern is that...
- Java side parses Parquet metadata, generates a Spark schema
- Java side converts Spark schema to Arrow schema (following Comet conversion rules)
- Serialize Arrow types, native side feeds this into ParquetExec as the data schema
...may yield different results than:
- Java side serializes original Parquet metadata
- Serialize schema message
- Native side parses message, generates Arrow schema and feeds this into ParquetExec as the data schema
I guess I could exhaustively test this hypothesis with all types; a rough sketch of the JVM-visible half of that check is below.
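Something along these lines (a sketch only, not production code; it assumes a local `spark` session and a writable `tmpDir`, and the type list is illustrative rather than exhaustive):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

val conf = new Configuration()

Seq[DataType](IntegerType, LongType, DoubleType, StringType, TimestampType, DecimalType(20, 2))
  .foreach { dt =>
    val dir = s"$tmpDir/${dt.typeName}"
    spark.range(10).select(col("id").cast(dt).as("c")).write.mode("overwrite").parquet(dir)

    // The schema as actually written into the file footer.
    val partFile = new Path(dir).getFileSystem(conf).listStatus(new Path(dir))
      .map(_.getPath).filter(_.getName.endsWith(".parquet")).head
    val reader = ParquetFileReader.open(HadoopInputFile.fromPath(partFile, conf))
    val footerSchema = reader.getFileMetaData.getSchema
    reader.close()

    // The schema reconstructed from the Spark schema, as the current Java-side path does.
    val roundTripped = new SparkToParquetSchemaConverter(conf).convert(spark.read.parquet(dir).schema)

    // Timestamps and decimals are where these two are most likely to diverge.
    println(s"$dt\n  footer:    $footerSchema\n  roundtrip: $roundTripped")
  }
```

This only probes the Parquet-level round trip on the JVM side; the Arrow schemas the two paths produce would still need to be compared against the native side.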
val broadcastedHadoopConf =
  sparkContext.broadcast(new SerializableConfiguration(hadoopConf))
val sharedConf = broadcastedHadoopConf.value.value
val footer = FooterReader.readFooter(sharedConf, file)
You're right. This can never be in production code. For one, this is expensive.
In theory it's just replacing the call that currently takes place during the fileReader instantiation, but yeah I'm still curious if it's already cached somewhere. I see references within Spark to a footersCache so I'm curious to look for that as well.
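On the "already cached somewhere" angle, a minimal sketch of what memoizing the footer per file could look like on our side (FooterReader.readFooter as used in the diff above; its return type and this cache shape are assumptions on my part, and this is not the Spark footersCache):

```scala
import java.util.concurrent.ConcurrentHashMap

import org.apache.hadoop.conf.Configuration
import org.apache.parquet.hadoop.metadata.ParquetMetadata
import org.apache.spark.sql.execution.datasources.PartitionedFile

// Hypothetical cache keyed by file path so repeated planning does not re-read the footer.
// FooterReader here is the Comet helper used in the diff above.
object FooterCache {
  private val cache = new ConcurrentHashMap[String, ParquetMetadata]()

  def getOrRead(file: PartitionedFile, conf: Configuration): ParquetMetadata =
    cache.computeIfAbsent(file.filePath.toString, _ => FooterReader.readFooter(conf, file))
}
```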
I didn't know about a footersCache in Spark. Could you share a link maybe?
Ah, thanks. I don't think we ever travel this path.
Currently we take the scan schema from the plan node, serialize that back to a Parquet schema, and then parse it on the native side. This is lossy, particularly for timestamps. For example:
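Roughly (illustrative values only, not the exact output; assume a single INT64 timestamp column written without UTC adjustment, and spark.sql.parquet.outputTimestampType left at its INT96 default):

```
footer:     optional int64 ts (TIMESTAMP(MICROS,false))
via Spark:  optional int96 ts
```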
The former is the original Parquet footer, the latter is what we get after going through Spark. We need the original to handle timestamps correctly in ParquetExec.
This PR extracts some code from elsewhere (CometParquetFileFormat, CometNativeScanExec) to read the footer from the Parquet file and serialize the original metadata. We also now generate the projection vector on the Spark side, because the required columns are in Spark schema format and so will not match the Parquet schema 1:1. On the native side, we then regenerate the required schema from the Parquet schema using the projection vector (converted to a DF ProjectionMask).
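For reference, a plausible shape of the Spark-side projection vector (a sketch of the idea, not the exact body of the truncated hunk above; it assumes name-based lookup of top-level columns only):

```scala
import org.apache.parquet.schema.MessageType
import org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter

// Parquet schema for the full data schema, as in the diff above.
val dataSchemaParquet: MessageType =
  new SparkToParquetSchemaConverter(conf).convert(scan.relation.dataSchema)

// For each required (Spark) column, record its ordinal in the Parquet schema so the
// native side can turn the vector into a ProjectionMask over the file schema.
val projection_vector: Array[java.lang.Long] = scan.requiredSchema.fields.map { field =>
  java.lang.Long.valueOf(dataSchemaParquet.getFieldIndex(field.name).toLong)
}
```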