[HUDI-3204] Fixing partition-values being derived from partition-path instead of source columns #5364
Conversation
…mat` and overriding the behavior of adding partition columns to every row
…e to configure whether to append partition values or not
…are not persisted in data-file
Force-pushed from 6fa8f15 to ceb5add
* <li>Avoiding appending partition values to the rows read from the data file</li>
* </ol>
*/
class Spark24HoodieParquetFileFormat(private val shouldAppendPartitionValues: Boolean) extends ParquetFileFormat
The only aspects that diverge from the source are the ones using shouldAppendPartitionValues
…one (to make sure partition values appending is handled correctly)
yihua left a comment
Overall LGTM. Left a few nits.
val (tableFileFormat, formatClassName) = metaClient.getTableConfig.getBaseFileFormat match {
case HoodieFileFormat.PARQUET => (new ParquetFileFormat, "parquet")
case HoodieFileFormat.PARQUET => (sparkAdapter.createHoodieParquetFileFormat(shouldAppendPartitionColumns).get, "hoodie-parquet")
nit: create a constant for "hoodie-parquet" so it can be referenced everywhere.
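For reference, a minimal sketch of what the suggested constant could look like (object name and placement are hypothetical, not part of this PR):

```scala
// Hypothetical placement: a shared constants holder referenced by HoodieBaseRelation,
// DefaultSource, and the Spark*HoodieParquetFileFormat variants instead of repeating the literal.
object HoodieSparkFileFormats {
  // Short name of the Hudi-specific ParquetFileFormat extension
  val HoodieParquetFormatName: String = "hoodie-parquet"
}
```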
// NOTE: There are currently 2 ways partition values could be fetched:
//   - Source columns (producing the values used for physical partitioning) will be read
//     from the data file
//   - Values parsed from the actual partition pat would be appended to the final dataset
typo: "pat"
Addressed in a follow-up
sqlContext.sparkContext.hadoopConfiguration.set(SparkInternalSchemaConverter.HOODIE_VALID_COMMITS_LIST, validCommits)
val formatClassName = metaClient.getTableConfig.getBaseFileFormat match {
case HoodieFileFormat.PARQUET => if (!internalSchema.isEmptySchema) "HoodieParquet" else "parquet"
case HoodieFileFormat.PARQUET => "hoodie-parquet"
same here for constant
class SparkHoodieParquetFileFormat extends ParquetFileFormat with SparkAdapterSupport {
override def shortName(): String = "HoodieParquet"
override def shortName(): String = "hoodie-parquet"
I assume this is used by Spark to identify the format?
Correct
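For context, a hedged illustration (stock Spark, not Hudi code) of how a FileFormat advertises its identity through shortName(), which is what the override above changes to "hoodie-parquet":

```scala
import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat

// Stock Spark example: ParquetFileFormat mixes in DataSourceRegister, and shortName() is the
// identifier Spark uses to look the format up; the Hudi extension overrides it with "hoodie-parquet".
object ShortNameSketch extends App {
  println(new ParquetFileFormat().shortName()) // prints "parquet"
}
```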
…cating partition values handling)
…ry of red-herring stacktraces in the logs
*/
class Spark312HoodieParquetFileFormat(private val shouldAppendPartitionValues: Boolean) extends ParquetFileFormat {
override def buildReaderWithPartitionValues(sparkSession: SparkSession,
GitHub UI has a hard time reflecting the changes properly:
- Had to remove the top-level conditional (since this FileFormat is now used to control whether partition values will be appended)
- Did minor cleanup for things related to handling of InternalSchema to make sure those are not failing w/ NPEs
- Added changes to handle shouldAppendPartitionValues

NOTE: Copy both of those into an IDEA scratchpad to be able to compare them side by side in a more meaningful way
* <li>Schema on-read</li>
* </ol>
*/
class Spark32HoodieParquetFileFormat(private val shouldAppendPartitionValues: Boolean) extends ParquetFileFormat
Same comments as for Spark 3.1
@xiarixiaoyao please take a look as well
None
}
override def createHoodieParquetFileFormat(appendPartitionValues: Boolean): Option[ParquetFileFormat] = {
Some(new Spark312HoodieParquetFileFormat(appendPartitionValues))
Is there any reason why the class loader was used before, instead of directly creating a new instance of the class? @xushiyan do you have any context here, to make sure there is no historical workaround and we're not breaking any logic?
My hunch is that @xiarixiaoyao was using reflection to load this component to handle the case of Spark 3.0. But given that we're dropping support for it in 0.11, I just dropped the reflection and instantiated it directly.
Got it
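For illustration, a sketch contrasting the two approaches discussed here; the reflection variant is a hypothetical reconstruction of the earlier code, not a quote from it:

```scala
import org.apache.spark.sql.execution.datasources.parquet.{ParquetFileFormat, Spark312HoodieParquetFileFormat}

object FileFormatCreationSketch {
  // Hypothetical reconstruction of the earlier reflection-based loading (presumably kept to
  // tolerate classpaths, e.g. Spark 3.0, where the class might be absent)
  def createViaReflection(appendPartitionValues: Boolean): Option[ParquetFileFormat] = {
    val clazz = Class.forName(
      "org.apache.spark.sql.execution.datasources.parquet.Spark312HoodieParquetFileFormat")
    val ctor = clazz.getConstructor(classOf[Boolean])
    Some(ctor.newInstance(Boolean.box(appendPartitionValues)).asInstanceOf[ParquetFileFormat])
  }

  // Direct instantiation, as done in this PR once Spark 3.0 support is dropped
  def createDirectly(appendPartitionValues: Boolean): Option[ParquetFileFormat] =
    Some(new Spark312HoodieParquetFileFormat(appendPartitionValues))
}
```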
hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow] = {
if (hadoopConf.get(SparkInternalSchemaConverter.HOODIE_QUERY_SCHEMA, "").isEmpty) {
// fallback to origin parquet File read
super.buildReaderWithPartitionValues(sparkSession, dataSchema, partitionSchema, requiredSchema, filters, options, hadoopConf)
If shouldAppendPartitionValues is true and the existing if condition is true, can we still fall back to the original parquet file read?
shouldAppendPartitionValues is almost never true now (only in cases when we drop the source columns)
Sounds good
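One way to read the concern above, as a hypothetical distillation (the code in the diff falls back on the internal-schema check alone):

```scala
object FallbackGuardSketch {
  // Falling back to the stock Spark parquet reader is only safe when there is no internal
  // schema (schema evolution) to resolve AND the caller actually wants partition values
  // appended, since the stock reader always appends them.
  def canFallBackToStockReader(internalSchemaJson: String,
                               shouldAppendPartitionValues: Boolean): Boolean =
    internalSchemaJson.isEmpty && shouldAppendPartitionValues
}
```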
filters: Seq[Filter],
options: Map[String, String],
hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow] = {
if (hadoopConf.get(SparkInternalSchemaConverter.HOODIE_QUERY_SCHEMA, "").isEmpty) {
similar here
Responded above
def toHadoopFsRelation: HadoopFsRelation = {
// We're delegating to Spark to append partition values to every row only in cases
// when these corresponding partition-values are not persisted w/in the data file itself
val shouldAppendPartitionColumns = omitPartitionColumnsInFile
Minor: instead of "omitPartitionColumnsInFile" (present tense), maybe we can name the variable "isPartitionColumnPersistedInDataFile" (past tense).
Good call!
@nsivabalan on second thought -- this flag actually directs whether we should be omitting partition columns when we persist to data files, so I kept it as omitPartitionColumns to stay aligned with the config value
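A compact restatement of the invariant being discussed (names are illustrative, not taken from the codebase):

```scala
// If partition columns are omitted from the data files on write, their values can only be
// recovered at read time by letting Spark parse them from the partition path and append them;
// if they are persisted, appending must be suppressed so the source columns win.
final case class PartitionColumnHandling(omitPartitionColumnsInDataFiles: Boolean) {
  def shouldAppendPartitionValues: Boolean = omitPartitionColumnsInDataFiles
}
```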
object Spark32HoodieParquetFileFormat {

def pruneInternalSchema(internalSchemaStr: String, requiredSchema: StructType): String = {
Feel free to fix this in a follow-up PR if need be. Maybe we can move this to a util class and use it across adapters? I see the same exact method in the Spark312HoodieParquetFileFormat class as well.
Yeah, there's quite a bit of duplication we can eliminate. We can take it up as a follow-up for the sake of moving forward with RC3 ASAP.
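For the follow-up, a rough sketch of the suggested extraction (object name is hypothetical, and the body is a placeholder rather than the real pruning logic):

```scala
import org.apache.spark.sql.types.StructType

// Hypothetical shared helper so Spark312HoodieParquetFileFormat and Spark32HoodieParquetFileFormat
// don't each carry their own copy of the same method.
object HoodieParquetFileFormatHelper {
  def pruneInternalSchema(internalSchemaStr: String, requiredSchema: StructType): String = {
    // Placeholder: the real implementation deserializes the internal schema, prunes it down to
    // requiredSchema's fields, and serializes it back.
    internalSchemaStr
  }
}
```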
val taskContext = Option(TaskContext.get())
if (enableVectorizedReader) {
val vectorizedReader = new Spark312HoodieVectorizedParquetRecordReader(
convertTz.orNull,
Please use shouldUseInternalSchema to fall back to the original VectorizedParquetRecordReader. Spark312HoodieVectorizedParquetRecordReader is used for schema evolution.
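A hedged sketch of the suggested fallback (names follow the discussion; the actual reader constructors take many arguments, so they are passed in as thunks here):

```scala
object VectorizedReaderSelectionSketch {
  // Use the schema-evolution-aware Spark312HoodieVectorizedParquetRecordReader only when an
  // internal schema is present; otherwise fall back to Spark's stock VectorizedParquetRecordReader.
  def selectReader[R](shouldUseInternalSchema: Boolean,
                      hoodieReader: () => R,
                      stockReader: () => R): R =
    if (shouldUseInternalSchema) hoodieReader() else stockReader()
}
```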
@alexeykudinkin now we use hoodieparquetFile,
@xiarixiaoyao addressed
@alexeykudinkin thanks for addressing this.
@alexeykudinkin @nsivabalan @yihua I will put another patch to deal with schema evolution.
… instead of source columns (#5364)

- Scaffolded `Spark24HoodieParquetFileFormat` extending `ParquetFileFormat` and overriding the behavior of adding partition columns to every row
- Amended `SparkAdapter`s `createHoodieParquetFileFormat` API to be able to configure whether to append partition values or not
- Fallback to append partition values in cases when the source columns are not persisted in data-file
- Fixing HoodieBaseRelation incorrectly handling mandatory columns
val projectedRDD = if (prunedRequiredSchema.structTypeSchema != requiredSchema.structTypeSchema) {
rdd.mapPartitions { it =>
val fullPrunedSchema = StructType(prunedRequiredSchema.structTypeSchema.fields ++ partitionSchema.fields)
val unsafeProjection = generateUnsafeProjection(fullPrunedSchema, requiredSchema.structTypeSchema)
@YannByron this is the problem you're hitting with mandatory columns -- when you filter out partition columns from the schema, you actually re-order the columns relative to what the caller (Spark) was expecting. Spark was simply projecting the schema, assuming that the BaseRelation would return rows adhering to it, while the relation was actually returning rows with the columns reordered (partition columns appended at the end).
The proper fix for that was to project here back into the schema that the caller expects.
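A self-contained illustration of that fix using Spark catalyst internals directly (simplified two-column schemas; not Hudi's actual helper):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.unsafe.types.UTF8String

object ProjectionFixSketch extends App {
  // Rows coming out of the pruned reader: data column first, partition column appended last
  val producedSchema = StructType(Seq(
    StructField("value", StringType),
    StructField("partition_col", StringType)))

  // Column order the caller (Spark) asked for in requiredSchema
  val expectedSchema = StructType(Seq(
    StructField("partition_col", StringType),
    StructField("value", StringType)))

  // Project each produced row back into the order the caller expects, matching fields by name
  val producedAttrs = producedSchema.toAttributes
  val projection = UnsafeProjection.create(
    expectedSchema.map(f => producedAttrs.find(_.name == f.name).get),
    producedAttrs)

  val produced = InternalRow(UTF8String.fromString("some-value"), UTF8String.fromString("2022/04/20"))
  val projected = projection(produced)
  println(projected.getUTF8String(0)) // 2022/04/20 (partition_col)
  println(projected.getUTF8String(1)) // some-value (value)
}
```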
What is the purpose of the pull request
After we did a fallback to `HadoopFsRelation` in #5352, old issues of HUDI-3204 re-surfaced due to the fact that Spark forcibly appends partition values parsed from the actual partition path to the fetched dataset. Unfortunately, this behavior is not configurable, and therefore to address this problem we have to override the default `ParquetFileFormat` with our own modified instance that overrides this behavior: our own extension of `ParquetFileFormat` makes such behavior configurable -- the caller can configure whether it would prefer partition values to be appended to every row or read from the source columns persisted in the data file.
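As a minimal sketch of that pattern (illustrative class name; the empty-partition-schema shortcut stands in for the real overridden reader logic and is not the actual Hudi implementation):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.PartitionedFile
import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType

// Illustrative extension: the constructor flag decides whether Spark's default behavior of
// appending partition values (parsed from the partition path) to every row is kept.
class ConfigurablePartitionValuesParquetFileFormat(
    private val shouldAppendPartitionValues: Boolean) extends ParquetFileFormat {

  override def buildReaderWithPartitionValues(sparkSession: SparkSession,
                                              dataSchema: StructType,
                                              partitionSchema: StructType,
                                              requiredSchema: StructType,
                                              filters: Seq[Filter],
                                              options: Map[String, String],
                                              hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow] = {
    if (shouldAppendPartitionValues) {
      // Stock behavior: partition values are appended to every row read from the file
      super.buildReaderWithPartitionValues(sparkSession, dataSchema, partitionSchema,
        requiredSchema, filters, options, hadoopConf)
    } else {
      // Suppress appending by presenting an empty partition schema to the parent; the partition
      // source columns persisted in the data file are read via requiredSchema instead.
      super.buildReaderWithPartitionValues(sparkSession, dataSchema, StructType(Nil),
        requiredSchema, filters, options, hadoopConf)
    }
  }
}
```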
Brief change log

- Scaffolded `Spark24HoodieParquetFileFormat` extending `ParquetFileFormat` and overriding the behavior of adding partition columns to every row
- Amended `SparkAdapter`s `createHoodieParquetFileFormat` API to be able to configure whether to append partition values or not

Verify this pull request
This pull request is already covered by existing tests, such as (please describe tests).
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.