
Conversation

@mbutrovich (Contributor) commented Nov 19, 2024

Currently we take the scan schema from the plan node, serialize it back to a Parquet schema, and then parse that on the native side. The round trip is lossy, particularly for timestamps. For example:

schema: message root {
  optional int64 _0 (TIMESTAMP(MILLIS,true));
  optional int64 _1 (TIMESTAMP(MICROS,true));
  optional int64 _2 (TIMESTAMP(MILLIS,true));
  optional int64 _3 (TIMESTAMP(MILLIS,false));
  optional int64 _4 (TIMESTAMP(MICROS,true));
  optional int64 _5 (TIMESTAMP(MICROS,false));
  optional int64 _6 (INTEGER(64,true));
}

dataSchema: message spark_schema {
  optional int96 _0;
  optional int96 _1;
  optional int96 _2;
  optional int64 _3 (TIMESTAMP(MICROS,false));
  optional int96 _4;
  optional int64 _5 (TIMESTAMP(MICROS,false));
  optional int64 _6;
}

The former (schema) is the original Parquet footer; the latter (dataSchema) is what we get after round-tripping through Spark. We need the original to handle timestamps correctly in ParquetExec.
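
For reference, a minimal sketch of how this round trip could be reproduced on the JVM side (e.g. in a spark-shell), assuming Spark's internal ParquetToSparkSchemaConverter / SparkToParquetSchemaConverter and a hypothetical file path; constructors vary slightly across Spark versions:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import org.apache.spark.sql.execution.datasources.parquet.{ParquetToSparkSchemaConverter, SparkToParquetSchemaConverter}

// Illustration only: read the original footer, convert it to a Spark StructType,
// then convert that StructType back to a Parquet schema. The second message is
// what the scan currently sees, and it no longer matches the footer (e.g. int96
// vs. annotated int64 timestamps, as in the example above).
val conf = new Configuration()
val reader = ParquetFileReader.open(HadoopInputFile.fromPath(new Path("/tmp/example.parquet"), conf))
val originalSchema = reader.getFooter.getFileMetaData.getSchema // the "schema" above
reader.close()
val sparkSchema = new ParquetToSparkSchemaConverter(conf).convert(originalSchema)
val roundTripped = new SparkToParquetSchemaConverter(conf).convert(sparkSchema) // the "dataSchema" above
println(originalSchema)
println(roundTripped)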

This PR extracts some code from elsewhere (CometParquetFileFormat, CometNativeScanExec) to read the footer from the Parquet file and serialize the original metadata. We also now generate the projection vector on the Spark side, because the required columns are in Spark schema format and will not match the Parquet schema 1:1. On the native side, we then regenerate the required schema from the Parquet schema using the projection vector (converted to a DataFusion ProjectionMask).
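
As a rough illustration (not necessarily the PR's exact logic), the projection vector can be thought of as mapping each required Spark column to its ordinal in the full data schema, for example by name:

import org.apache.spark.sql.types.StructType

object ProjectionVector {
  // Hypothetical sketch: resolve each required (Spark-side) column to its ordinal
  // in the full data schema. The native side can then rebuild the required schema
  // from the Parquet footer by turning these ordinals into a ProjectionMask.
  def fromSchemas(requiredSchema: StructType, dataSchema: StructType): Array[java.lang.Long] =
    requiredSchema.fields.map(f => java.lang.Long.valueOf(dataSchema.fieldIndex(f.name).toLong))
}

Matching purely by name glosses over case sensitivity and Parquet field IDs, which is essentially the concern raised in the first review comment below.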

mbutrovich marked this pull request as draft on November 19, 2024 21:08
mbutrovich changed the title from "Use Parquet schema for scan instead of Spark schema" to "[comet-parquet-exec] Use Parquet schema for scan instead of Spark schema" on Nov 19, 2024
mbutrovich marked this pull request as ready for review on November 19, 2024 22:44
new SparkToParquetSchemaConverter(conf).convert(scan.requiredSchema)
val dataSchemaParquet =
new SparkToParquetSchemaConverter(conf).convert(scan.relation.dataSchema)
val projection_vector: Array[java.lang.Long] = scan.requiredSchema.fields.map(field => {
Contributor

This change essentially means that any schema 'adaptation' done in SparkToParquetSchemaConverter.convert to support legacy timestamps and decimals will no longer be applied. If that adaptation matters, we will probably fail tests with incorrect results.
Also, Comet's Parquet file reader uses CometParquetReadSupport.clipParquetSchema to do a similar conversion, and it includes support for the Parquet field_id, which is desirable for sources such as Delta and Iceberg. Basically, a field_id, if present, identifies a field in a schema more precisely (in the event of field name changes).
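
To make the field_id point concrete, here is a hedged sketch (not Comet's actual clipParquetSchema logic) of resolving a required Spark field against the footer by field ID when one is present, falling back to the name:

import org.apache.parquet.schema.MessageType
import org.apache.spark.sql.types.StructField

object FieldResolution {
  // Assumption: when field-id reading is enabled, Spark carries the Parquet field id
  // in the StructField metadata under this key.
  private val FieldIdKey = "parquet.field.id"

  def resolveOrdinal(field: StructField, footerSchema: MessageType): Option[Int] = {
    val ordinals = 0 until footerSchema.getFieldCount
    val byId =
      if (field.metadata.contains(FieldIdKey)) {
        val id = field.metadata.getLong(FieldIdKey).toInt
        ordinals.find(i => Option(footerSchema.getType(i).getId).exists(_.intValue == id))
      } else {
        None
      }
    // A renamed column still resolves via its id; otherwise fall back to the name.
    byId.orElse(ordinals.find(i => footerSchema.getType(i).getName == field.name))
  }
}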

Contributor Author

The SchemaConverter seems like it could be handled in DF's SchemaAdapter. I'll look at clipParquetSchema as well, thanks!

Member

If you just need Arrow types, can you just convert Spark types to Arrow types? For example, if a column is treated as a timestamp type in Spark, its Arrow type is a timestamp too.
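
For simple cases that direct mapping is indeed straightforward; a hedged sketch of such a Spark-to-Arrow type mapping using the Arrow Java API (Comet's real conversion covers many more types and follows its own rules):

import org.apache.arrow.vector.types.{FloatingPointPrecision, TimeUnit}
import org.apache.arrow.vector.types.pojo.ArrowType
import org.apache.spark.sql.types._

object SparkToArrow {
  // Minimal illustrative mapping: a Spark timestamp becomes an Arrow timestamp,
  // a Spark long becomes a 64-bit signed integer, and so on.
  def toArrowType(dt: DataType, timeZoneId: String): ArrowType = dt match {
    case TimestampType    => new ArrowType.Timestamp(TimeUnit.MICROSECOND, timeZoneId)
    case TimestampNTZType => new ArrowType.Timestamp(TimeUnit.MICROSECOND, null)
    case LongType         => new ArrowType.Int(64, true)
    case IntegerType      => new ArrowType.Int(32, true)
    case DoubleType       => new ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE)
    case StringType       => ArrowType.Utf8.INSTANCE
    case other            => throw new UnsupportedOperationException(s"Unhandled type: $other")
  }
}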

Contributor Author

My concern is that...

  1. Java side parses Parquet metadata, generates a Spark schema
  2. Java side converts Spark schema to Arrow schema (following Comet conversion rules)
  3. Serialize Arrow types, native side feeds this into ParquetExec as the data schema

...may yield different results than:

  1. Java side serializes original Parquet metadata
  2. Serialize schema message
  3. Native side parses message, generates Arrow schema and feeds this into ParquetExec as the data schema

I guess I could exhaustively test this hypothesis with all types.

val broadcastedHadoopConf =
sparkContext.broadcast(new SerializableConfiguration(hadoopConf))
val sharedConf = broadcastedHadoopConf.value.value
val footer = FooterReader.readFooter(sharedConf, file)
Contributor

You're right, this can never be in production code. For one, reading the footer again here is expensive.

Contributor Author (Nov 20, 2024)

In theory it's just replacing the call that currently takes place during fileReader instantiation, but yeah, I'm still curious whether the footer is already cached somewhere. I see references within Spark to a footersCache, so I want to look into that as well.

Contributor

I didn't know about a footersCache in Spark. Could you share a link maybe?

Contributor Author

(link shared; not captured in this excerpt)

Contributor

Ah, thanks. I don't think we ever travel this path.
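
On the cost concern above, if the extra footer read had to stay, one mitigation would be a small per-executor cache keyed on the file's path, length, and modification time, in the spirit of the footersCache mentioned earlier. A hedged sketch, not existing Comet code:

import java.util.concurrent.ConcurrentHashMap
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.metadata.ParquetMetadata

object FooterCache {
  // Key on (path, length, modification time) so a rewritten file is never served a
  // stale footer. A bounded cache (e.g. Caffeine) would be preferable to an
  // unbounded map in real code.
  private case class Key(path: String, length: Long, modTime: Long)
  private val cache = new ConcurrentHashMap[Key, ParquetMetadata]()

  def getOrRead(conf: Configuration, path: Path)(read: => ParquetMetadata): ParquetMetadata = {
    val status = path.getFileSystem(conf).getFileStatus(path)
    val key = Key(path.toString, status.getLen, status.getModificationTime)
    cache.computeIfAbsent(key, _ => read)
  }
}

Whether that is worth doing depends on whether the same footer is actually read more than once, which is what the comments above are trying to establish.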
