
Commit 4ad5153

viirya authored and liancheng committed
[SPARK-6037][SQL] Avoiding duplicate Parquet schema merging
`FilteringParquetRowInputFormat` manually merges Parquet schemas before computing splits. However, this merging is redundant because the schemas are already merged in `ParquetRelation2`, so there is no need to re-merge them in the `InputFormat`.

Author: Liang-Chi Hsieh <[email protected]>

Closes #4786 from viirya/dup_parquet_schemas_merge and squashes the following commits:

ef78a5a [Liang-Chi Hsieh] Avoiding duplicate Parquet schema merging.
1 parent 18f2098 commit 4ad5153

File tree

1 file changed: +7 −16 lines changed

sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala

Lines changed: 7 additions & 16 deletions
@@ -434,22 +434,13 @@ private[parquet] class FilteringParquetRowInputFormat
       return splits
     }
 
-    Option(globalMetaData.getKeyValueMetaData.get(RowReadSupport.SPARK_METADATA_KEY)).foreach {
-      schemas =>
-        val mergedSchema = schemas
-          .map(DataType.fromJson(_).asInstanceOf[StructType])
-          .reduce(_ merge _)
-          .json
-
-        val mergedMetadata = globalMetaData
-          .getKeyValueMetaData
-          .updated(RowReadSupport.SPARK_METADATA_KEY, setAsJavaSet(Set(mergedSchema)))
-
-        globalMetaData = new GlobalMetaData(
-          globalMetaData.getSchema,
-          mergedMetadata,
-          globalMetaData.getCreatedBy)
-    }
+    val metadata = configuration.get(RowWriteSupport.SPARK_ROW_SCHEMA)
+    val mergedMetadata = globalMetaData
+      .getKeyValueMetaData
+      .updated(RowReadSupport.SPARK_METADATA_KEY, setAsJavaSet(Set(metadata)))
+
+    globalMetaData = new GlobalMetaData(globalMetaData.getSchema,
+      mergedMetadata, globalMetaData.getCreatedBy)
 
     val readContext = getReadSupport(configuration).init(
       new InitContext(configuration,
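
For context, the block removed above did the merge itself: it read every Spark SQL schema JSON string stored under `RowReadSupport.SPARK_METADATA_KEY` in the Parquet key/value metadata and folded the parsed `StructType`s together. A minimal sketch of that merge logic, written as a hypothetical standalone helper (it assumes it runs where `StructType.merge` is visible, i.e. inside the `org.apache.spark.sql` package, as the original code did):

import org.apache.spark.sql.types.{DataType, StructType}

// Parse each per-file schema JSON string into a StructType and fold them
// into a single merged schema, the same way the removed block did.
def mergeSchemaJsons(schemaJsons: Seq[String]): StructType =
  schemaJsons
    .map(DataType.fromJson(_).asInstanceOf[StructType])
    .reduce(_ merge _)

Per the commit message, `ParquetRelation2` already performs this merge when it resolves the table schema, so the new code can simply take the already-merged schema JSON from the job configuration (`RowWriteSupport.SPARK_ROW_SCHEMA`) and install it into the global metadata instead of re-merging it for every split computation.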

Comments (0)