
Conversation

@alexeykudinkin (Contributor)

What is the purpose of the pull request

After we fell back to HadoopFsRelation in #5352, the old issues from HUDI-3204 re-surfaced because Spark forcibly appends partition values parsed from the actual partition path to the fetched dataset.

Unfortunately, this behavior is not configurable. To address the problem we have to override the default ParquetFileFormat with our own modified instance that makes this behavior configurable -- the caller can choose whether it prefers (a rough sketch of the two modes follows the list):

  • To append partition values parsed from the actual partition path
  • To avoid appending such partition values and instead rely on reading the source columns being used as partition values (Hudi-specific behavior)
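
As a rough illustration of the two modes above (the enum below is purely illustrative, not part of the PR):

  sealed trait PartitionValuesSource
  // Spark's default: values parsed from the partition path are appended to every row
  case object ParsedFromPartitionPath extends PartitionValuesSource
  // Hudi-specific: values are read from the source columns persisted in the data file
  case object ReadFromDataFileColumns extends PartitionValuesSource

  def resolvePartitionValuesSource(shouldAppendPartitionValues: Boolean): PartitionValuesSource =
    if (shouldAppendPartitionValues) ParsedFromPartitionPath else ReadFromDataFileColumns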

Brief change log

  • Scaffolded Spark24HoodieParquetFileFormat extending ParquetFileFormat and overriding the behavior of adding partition columns to every row
  • Amended the SparkAdapter's createHoodieParquetFileFormat API so callers can configure whether to append partition values or not (a sketch of the method follows the list)
  • Fall back to appending partition values in cases when the source columns are not persisted in the data file
  • Fixed HoodieBaseRelation incorrectly handling mandatory columns
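
For reference, a minimal sketch of the amended adapter API (only the relevant method is shown; the signature mirrors the Spark 3.1 adapter snippet quoted further down in this thread):

  import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat

  trait SparkAdapter {
    // Returns a Spark-version-specific Hudi ParquetFileFormat (None when not supported);
    // `appendPartitionValues` controls whether values parsed from the partition path are
    // appended to every row, or read from the persisted source columns instead.
    def createHoodieParquetFileFormat(appendPartitionValues: Boolean): Option[ParquetFileFormat]
  }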

Verify this pull request

This pull request is already covered by existing tests.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

* <li>Avoiding appending partition values to the rows read from the data file</li>
* </ol>
*/
class Spark24HoodieParquetFileFormat(private val shouldAppendPartitionValues: Boolean) extends ParquetFileFormat {
@alexeykudinkin (Contributor Author):

The only aspects that diverge from the source are the ones using shouldAppendPartitionValues.

Alexey Kudinkin added 2 commits April 19, 2022 12:24
…one (to make sure partition values appending is handled correctly)

@yihua left a comment


Overall LGTM. Left a few nits.


val (tableFileFormat, formatClassName) = metaClient.getTableConfig.getBaseFileFormat match {
-  case HoodieFileFormat.PARQUET => (new ParquetFileFormat, "parquet")
+  case HoodieFileFormat.PARQUET => (sparkAdapter.createHoodieParquetFileFormat(shouldAppendPartitionColumns).get, "hoodie-parquet")
Contributor:

nit: create a constant for "hoodie-parquet" so it can be referenced everywhere.
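
A possible shape for that constant (object and member names are illustrative only):

  object HoodieFormatNames {
    // Single source of truth for the Hudi Parquet format identifier used across adapters
    val HoodieParquet: String = "hoodie-parquet"
  }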

// NOTE: There are currently 2 ways partition values could be fetched:
// - Source columns (producing the values used for physical partitioning) will be read
// from the data file
// - Values parsed from the actual partition pat would be appended to the final dataset
Contributor:

typo: "pat"

@alexeykudinkin (Contributor Author):

Addressed in a follow-up

sqlContext.sparkContext.hadoopConfiguration.set(SparkInternalSchemaConverter.HOODIE_VALID_COMMITS_LIST, validCommits)
val formatClassName = metaClient.getTableConfig.getBaseFileFormat match {
-  case HoodieFileFormat.PARQUET => if (!internalSchema.isEmptySchema) "HoodieParquet" else "parquet"
+  case HoodieFileFormat.PARQUET => "hoodie-parquet"
Contributor:

same here for constant


class SparkHoodieParquetFileFormat extends ParquetFileFormat with SparkAdapterSupport {
override def shortName(): String = "HoodieParquet"
override def shortName(): String = "hoodie-parquet"
Contributor:

I assume this is used by Spark to identify the format?

@alexeykudinkin (Contributor Author):

Correct
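
For context, a hedged sketch of where the short name shows up, assuming the format class is registered through Spark's DataSourceRegister service loader (the table path below is a placeholder):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("hoodie-parquet-short-name-demo")
    .master("local[*]")
    .getOrCreate()

  // Spark matches the string below against DataSourceRegister.shortName() of the
  // implementations listed under META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
  val df = spark.read.format("hoodie-parquet").load("/tmp/hudi/table")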

@yihua self-assigned this Apr 19, 2022
*/
class Spark312HoodieParquetFileFormat(private val shouldAppendPartitionValues: Boolean) extends ParquetFileFormat {

override def buildReaderWithPartitionValues(sparkSession: SparkSession,
@alexeykudinkin (Contributor Author):

GitHub UI has a hard time reflecting the changes properly:

  1. Had to remove the top-level conditional (since this FileFormat is now used to control whether partition values will be appended)
  2. Did minor cleanup for things related to handling of InternalSchema to make sure those are not failing w/ NPEs
  3. Added changes to handle shouldAppendPartitionValues

NOTE: Copy both of these into an IDEA scratchpad to compare them side by side in a more meaningful way

* <li>Schema on-read</li>
* </ol>
*/
class Spark32HoodieParquetFileFormat(private val shouldAppendPartitionValues: Boolean) extends ParquetFileFormat {
@alexeykudinkin (Contributor Author):

Same comments as for Spark 3.1

@alexeykudinkin (Contributor Author)

@xiarixiaoyao please take a look as well

@nsivabalan added the priority:blocker (Production down; release blocker) label Apr 19, 2022
None
}
override def createHoodieParquetFileFormat(appendPartitionValues: Boolean): Option[ParquetFileFormat] = {
Some(new Spark312HoodieParquetFileFormat(appendPartitionValues))
Contributor:

Is there any reason why the class loader was used before, instead of directly creating a new instance of the class? @xushiyan do you have any context here, to make sure there is no historical workaround and we're not breaking any logic?

@alexeykudinkin (Contributor Author):

My hunch is that @xiarixiaoyao was using reflection to load this component to handle the case of Spark 3.0. But given that we're dropping support for it in 0.11, I just dropped the reflection and instantiate it directly.

Contributor:

Got it

hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow] = {
if (hadoopConf.get(SparkInternalSchemaConverter.HOODIE_QUERY_SCHEMA, "").isEmpty) {
// fallback to origin parquet File read
super.buildReaderWithPartitionValues(sparkSession, dataSchema, partitionSchema, requiredSchema, filters, options, hadoopConf)
Contributor:

If shouldAppendPartitionValues is true and the existing if condition is true, can we still fall back to the original parquet file read?

@alexeykudinkin (Contributor Author):

shouldAppendPartitionValues is almost never true now (only in cases when we drop the source columns)

Contributor:

Sounds good.
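
To make the interplay explicit, a small sketch of the condition being discussed (hypothetical helper, not the PR code): falling back to the vanilla Parquet read path is only safe when Spark's default appending of partition values is acceptable and no internal (evolved) schema has to be applied.

  // Hypothetical helper summarizing the guard discussed above
  def canFallBackToVanillaParquetRead(shouldAppendPartitionValues: Boolean,
                                      internalSchemaStr: String): Boolean =
    shouldAppendPartitionValues && internalSchemaStr.isEmpty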

filters: Seq[Filter],
options: Map[String, String],
hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow] = {
if (hadoopConf.get(SparkInternalSchemaConverter.HOODIE_QUERY_SCHEMA, "").isEmpty) {
Contributor:

similar here

@alexeykudinkin (Contributor Author):

Responded above

def toHadoopFsRelation: HadoopFsRelation = {
// We're delegating to Spark to append partition values to every row only in cases
// when these corresponding partition-values are not persisted w/in the data file itself
val shouldAppendPartitionColumns = omitPartitionColumnsInFile
Contributor:

minor: instead of "omitPartitionColumnsInFile" (present tense), maybe we can name the variable "isPartitionColumnPersistedInDataFile" (past tense).

@alexeykudinkin (Contributor Author):

Good call!

@alexeykudinkin (Contributor Author):

@nsivabalan on second thought -- this flag actually directs whether we should omit partition columns when we persist data files, so I kept it as omitPartitionColumns to be aligned with the config value.
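
A simplified sketch of that wiring (the function is hypothetical; `createFormat` stands in for sparkAdapter.createHoodieParquetFileFormat):

  import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat

  def chooseFileFormat(createFormat: Boolean => Option[ParquetFileFormat],
                       omitPartitionColumnsInFile: Boolean): ParquetFileFormat = {
    // When the source columns were dropped from the data files at write time, Spark has to
    // append partition values parsed from the partition path; otherwise they are read from
    // the persisted source columns.
    val shouldAppendPartitionColumns = omitPartitionColumnsInFile
    createFormat(shouldAppendPartitionColumns).getOrElse(new ParquetFileFormat)
  }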


object Spark32HoodieParquetFileFormat {

def pruneInternalSchema(internalSchemaStr: String, requiredSchema: StructType): String = {
Contributor:

Feel free to fix this in a follow-up PR if need be. Maybe we can move this to a util class and use it across adapters? I see the same exact method in the Spark312HoodieParquetFileFormat class as well.

@alexeykudinkin (Contributor Author):

Yeah, there's quite a bit of duplication we can eliminate. We can take it up as a follow-up for the sake of moving forward with RC3 ASAP.
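
A sketch of what the suggested extraction could look like (the object name is illustrative; the body would be the implementation currently duplicated across the Spark 3.1 and 3.2 formats):

  import org.apache.spark.sql.types.StructType

  object HoodieParquetFileFormatHelper {
    // Shared by Spark312HoodieParquetFileFormat and Spark32HoodieParquetFileFormat
    def pruneInternalSchema(internalSchemaStr: String, requiredSchema: StructType): String = {
      // ... the existing, currently duplicated implementation goes here ...
      internalSchemaStr // placeholder return so the sketch compiles
    }
  }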

val taskContext = Option(TaskContext.get())
if (enableVectorizedReader) {
val vectorizedReader = new Spark312HoodieVectorizedParquetRecordReader(
convertTz.orNull,
Contributor:

Please use shouldUseInternalSchema to fall back to the original VectorizedParquetRecordReader. Spark312HoodieVectorizedParquetRecordReader is used for schema evolution.

@xiarixiaoyao (Contributor)

@alexeykudinkin now that we use the Hoodie Parquet file format,
we may need to modify this function to reduce the impact on schema evolution:
HoodieDataSourceHelper.getConfigurationWithInternalSchema

  def getConfigurationWithInternalSchema(conf: Configuration, internalSchema: InternalSchema, tablePath: String, validCommits: String): Configuration = {
    val querySchemaString = SerDeHelper.toJson(internalSchema)
    if (!querySchemaString.isEmpty) {
      conf.set(SparkInternalSchemaConverter.HOODIE_QUERY_SCHEMA, querySchemaString)
      conf.set(SparkInternalSchemaConverter.HOODIE_TABLE_PATH, tablePath)
      conf.set(SparkInternalSchemaConverter.HOODIE_VALID_COMMITS_LIST, validCommits)
    }
    conf
  }
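
An illustrative call site with placeholder arguments (Hudi-specific imports and values are assumed): the helper only sets HOODIE_QUERY_SCHEMA when the serialized internal schema is non-empty, which is exactly what the reader-side isEmpty check quoted earlier falls back on.

  import org.apache.hadoop.conf.Configuration

  // Placeholder arguments; `internalSchema` would be the table's InternalSchema instance
  val readerConf: Configuration = HoodieDataSourceHelper.getConfigurationWithInternalSchema(
    new Configuration(),
    internalSchema,
    "/tmp/hudi/table",   // placeholder table base path
    "20220419000000")    // placeholder valid-commits list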

@alexeykudinkin (Contributor Author)

@xiarixiaoyao addressed

@xiarixiaoyao (Contributor)

@alexeykudinkin thanks for addressing this.
@alexeykudinkin @nsivabalan @yihua schema evolution is currently not working with this PR; let me find the reason, please wait a moment.


@xiarixiaoyao (Contributor)

@alexeykudinkin @nsivabalan @yihua
LGTM

I will put up another patch to deal with schema evolution.

@xushiyan merged commit f7544e2 into apache:master Apr 20, 2022
xushiyan pushed a commit that referenced this pull request Apr 20, 2022
… instead of source columns (#5364)

 - Scaffolded `Spark24HoodieParquetFileFormat` extending `ParquetFileFormat` and overriding the behavior of adding partition columns to every row
 - Amended `SparkAdapter`s `createHoodieParquetFileFormat` API to be able to configure whether to append partition values or not
 - Fallback to append partition values in cases when the source columns are not persisted in data-file
 - Fixing HoodieBaseRelation incorrectly handling mandatory columns

val projectedRDD = if (prunedRequiredSchema.structTypeSchema != requiredSchema.structTypeSchema) {
rdd.mapPartitions { it =>
val fullPrunedSchema = StructType(prunedRequiredSchema.structTypeSchema.fields ++ partitionSchema.fields)
val unsafeProjection = generateUnsafeProjection(fullPrunedSchema, requiredSchema.structTypeSchema)
@alexeykudinkin (Contributor Author):

@YannByron this is the problem you're hitting with mandatory columns: when you filter out partition columns from the schema, you actually re-order the columns relative to what the caller (Spark) was expecting. Spark simply projects the schema assuming that the BaseRelation will return rows adhering to it, while the relation was returning rows with the columns reordered (partition columns appended at the end).

The proper fix is to do the projection here, back into the schema that the caller expects.
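
A minimal, self-contained sketch of that projection step (the helper name and shape are illustrative, not the PR's code):

  import org.apache.spark.sql.catalyst.InternalRow
  import org.apache.spark.sql.catalyst.expressions.{BoundReference, UnsafeProjection}
  import org.apache.spark.sql.types.StructType

  // Reorder rows produced as "pruned data columns ++ partition columns" back into the
  // column order of the schema the caller (Spark) asked for.
  def projectToExpectedSchema(rows: Iterator[InternalRow],
                              producedSchema: StructType,
                              expectedSchema: StructType): Iterator[InternalRow] = {
    // For every field the caller expects, bind a reference to its ordinal in the schema
    // we actually produced, then generate an unsafe projection over those references.
    val boundRefs = expectedSchema.fields.map { f =>
      BoundReference(producedSchema.fieldIndex(f.name), f.dataType, f.nullable)
    }
    val projection = UnsafeProjection.create(boundRefs.toSeq)
    // NOTE: UnsafeProjection reuses the returned row buffer; consumers that buffer rows
    // need to copy them.
    rows.map(projection)
  }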
