[HUDI-5443] Fixing exception trying to read MOR table after NestedSchemaPruning rule has been applied
#7528
Conversation
.../hudi-spark-client/src/main/java/org/apache/hudi/client/common/HoodieSparkEngineContext.java (resolved review thread)
...lient/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieClientTestHarness.java (resolved review thread)
```scala
case class BaseFileOnlyRelation(override val sqlContext: SQLContext,
                                ...
                                userSchema: Option[StructType],
                                globPaths: Seq[Path])
  extends HoodieBaseRelation(sqlContext, metaClient, optParams, userSchema) with SparkAdapterSupport {
```
The primary change here is converting the class to a case class, which in turn means that all of the constructor parameters become field values requiring the corresponding annotation (e.g. `override val`).

The reason this class is converted to a case class is to avoid any in-place mutations and instead have `updatePrunedDataSchema` produce a new instance (a minimal sketch of that pattern follows below).
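A minimal sketch of the copy-based update described above; the class and field names are illustrative stand-ins, not the actual Hudi relation API:

```scala
import org.apache.spark.sql.types.StructType

// Illustrative stand-in for a relation converted to a case class (names are assumptions).
case class ExampleRelation(tableSchema: StructType,
                           prunedDataSchema: Option[StructType] = None) {

  // Instead of mutating a field in place, return a new immutable instance that the
  // planner can substitute for the original (possibly cached) relation.
  def updatePrunedDataSchema(prunedSchema: StructType): ExampleRelation =
    this.copy(prunedDataSchema = Some(prunedSchema))
}
```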
```scala
/**
 * Get all PartitionDirectories based on globPaths if specified, otherwise use the table path.
 * Will perform pruning if necessary.
 */
private def listPartitionDirectories(globPaths: Seq[Path], partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Seq[PartitionDirectory] = {
```
Combined these two methods into one (a rough sketch of the combined listing follows below).
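A rough sketch of what such a combined listing could look like, assuming Spark's `FileIndex`/`InMemoryFileIndex` APIs; the `tableFileIndex` parameter and the overall shape are assumptions for illustration, not the actual Hudi implementation:

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.execution.datasources.{FileIndex, InMemoryFileIndex, PartitionDirectory}

// Hypothetical combined listing: glob paths (if any) and the table's own file index
// both go through the same pruning code path.
def listPartitionDirectories(spark: SparkSession,
                             tableFileIndex: FileIndex, // the relation's file index (assumed)
                             globPaths: Seq[Path],
                             partitionFilters: Seq[Expression],
                             dataFilters: Seq[Expression]): Seq[PartitionDirectory] = {
  val fileIndex =
    if (globPaths.isEmpty) tableFileIndex
    else new InMemoryFileIndex(spark, globPaths, Map.empty[String, String], None)
  fileIndex.listFiles(partitionFilters, dataFilters)
}
```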
```scala
protected def getColName(f: StructField): String = {
```
Dead code
```scala
case class MergeOnReadIncrementalRelation(override val sqlContext: SQLContext,
                                          ...)
  extends MergeOnReadSnapshotRelation(sqlContext, optParams, userSchema, Seq(), metaClient) with HoodieIncrementalRelationTrait {

  override type FileSplit = HoodieMergeOnReadFileSplit
```
Same changes as other relations
```scala
case class MergeOnReadSnapshotRelation(override val sqlContext: SQLContext,
                                       ...
                                       globPaths: Seq[Path],
                                       metaClient: HoodieTableMetaClient)
  extends HoodieBaseRelation(sqlContext, metaClient, optParams, userSchema) {
```
Same changes as other relations
...park-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala (outdated, resolved review thread)
Force-pushed from 5994fda to 9339277
Force-pushed from 9339277 to 636b300
@hudi-bot run azure
@hudi-bot run azure
Force-pushed from 636b300 to 8012230
yihua left a comment:
@alexeykudinkin I left a few comments. Besides, is there any test that failed before and is now fixed? If not, could you add such a test to verify the fix?
...nt/hudi-spark-client/src/main/scala/org/apache/spark/sql/HoodieCatalystExpressionUtils.scala (outdated, resolved review thread)
```scala
        sourceExpr

      case (sourceType: StructType, targetType: StructType) =>
        val fieldValueExprs = targetType.fields.map { tf =>
```
Looks like a subset of nested fields may be taken during the projection, e.g., if the source has a {a.b, a.c, a.d} and the target has a.b, we only keep a.b instead of the whole StructType a. Does this happen or the caller of this function always makes sure the targetStructType is properly constructed to preserve the root-level field instead of a subset of nested fields? Is this a problem for projection, where the parquet and log reader can read files with a schema containing a subset of nested fields?
> Looks like a subset of nested fields may be taken during the projection, e.g., if the source has a {a.b, a.c, a.d} and the target has a.b, we only keep a.b instead of the whole StructType a. Does this happen or the caller of this function always makes sure the targetStructType is properly constructed to preserve the root-level field instead of a subset of nested fields?
It does happen. It's actually the reason for this PR -- previously it was only handling nested field projections, but NestedSchemaPruning could produce schemas w/ nested fields being pruned as well. Therefore, we need to make sure we handle this appropriately when reading log-files (by projecting records into the new schema)
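To make the scenario above concrete, here is an illustrative sketch (not code from the PR) of a source schema with nested fields `a.b`, `a.c`, `a.d`, a pruned target schema keeping only `a.b`, and a naive row-level projection between the two:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Source: struct `a` with nested fields b, c, d
val sourceSchema = StructType(Seq(
  StructField("a", StructType(Seq(
    StructField("b", StringType),
    StructField("c", IntegerType),
    StructField("d", LongType))))))

// Pruned target: only a.b survives (the kind of schema NestedSchemaPruning may produce)
val prunedSchema = StructType(Seq(
  StructField("a", StructType(Seq(
    StructField("b", StringType))))))

// Hypothetical helper: recursively keep only the fields present in the target schema.
def project(row: Row, source: StructType, target: StructType): Row =
  Row.fromSeq(target.fields.map { tf =>
    val idx = source.fieldIndex(tf.name)
    (source(idx).dataType, tf.dataType) match {
      case (s: StructType, t: StructType) => project(row.getStruct(idx), s, t)
      case _                              => row.get(idx)
    }
  })
```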
Got it
Realized that this is actually not the right approach and the problem is elsewhere:
- The problem was that we're simply not reading projected records from the Parquet files: when a non-whitelisted RecordPayload is used we fall back to reading the full record, but we were still allowing `NestedSchemaPruning` to be applied nevertheless (a sketch of such a guard follows below)
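An illustrative sketch of that kind of guard; the whitelist contents and method name are assumptions for illustration only, not the actual rule implementation:

```scala
// Only let NestedSchemaPruning rewrite the relation's read schema when the record
// payload class is known to support merging against a projected (pruned) schema.
// The set contents below are an assumption, not Hudi's actual whitelist.
val projectionCompatiblePayloads: Set[String] = Set(
  "org.apache.hudi.common.model.OverwriteWithLatestAvroPayload")

def canApplyNestedSchemaPruning(payloadClassName: String): Boolean =
  projectionCompatiblePayloads.contains(payloadClassName)
```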
...spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala (resolved review thread)
```scala
private def getHadoopConf: Configuration = {
  val conf = hadoopConfBroadcast.value.value
  new Configuration(conf)
}
```
still need the lock for concurrency?
I don't actually think we need the lock
Given that this might introduce side effects and there's little time before the code freeze of the release to verify the removal of the lock, could you keep this part the same as before? It is not essential to the PR.
I don't think there's any real explanation for why that lock might be needed:
- Underlying broadcast implementation is thread-safe (takes its own lock)
- Hadoop configuration doesn't need a lock
But even more importantly -- there's no real concurrent access to this method
This lock is from the beginning of the implementation of HoodieMergeOnReadRDD and it is likely due to the broadcast configuration issue in Spark 2.4 (#1848 (comment)). @garyli1019 could you confirm if that's the case? I'd rather keep it and keep the code safe from potentially breaking the MOR queries.
How would it break MOR queries though?
See, whichever way we turn this around, this lock doesn't really make sense:
- There's no concurrency accessing this method, each task gets its own copy of the object
- The only concurrency is w/in accessing the Broadcast shared cache, which is guarded by its own lock
Synced up offline. @alexeykudinkin and I are aligned in that we need to clean up the legacy code and remove any unnecessary code path. For this particular case, there could be a problem regarding the modification of the hadoop conf returned by this function (check: HoodieMergeOnReadRDD#compute -> LogFileIterator.logRecords -> scanLog -> FSUtils.getFs -> prepareHadoopConf -> conf.set). That's likely why the lock is put in from the beginning. So the change is going to be reverted in this PR and we'll revisit this in a separate PR to be merged after 0.13.0.
@yihua Hi Ethan, the hadoop configuration issue that I can recall was related to serialization during the broadcast. The hadoop configuration was not serializable. The initial implementation of HoodieMergeOnReadRDD somehow created a new Configuration object or used a SerializableConfiguration to solve this issue.
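For context, a minimal sketch of the pattern being discussed: Hadoop's `Configuration` is not Java-serializable on its own, so it is typically wrapped before being broadcast (Spark ships a similar `SerializableConfiguration` utility), and each task takes a fresh defensive copy before mutating it. The wrapper below is a hand-rolled stand-in, not the actual Hudi/Spark code:

```scala
import java.io.{ObjectInputStream, ObjectOutputStream}

import org.apache.hadoop.conf.Configuration

// Stand-in for Spark's SerializableConfiguration: serialize the conf through its
// Writable interface so it can be shipped to executors inside a broadcast.
class SerializableConf(@transient var value: Configuration) extends Serializable {
  private def writeObject(out: ObjectOutputStream): Unit = {
    out.defaultWriteObject()
    value.write(out)
  }
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    value = new Configuration(false)
    value.readFields(in)
  }
}

// Driver side: val hadoopConfBroadcast = sc.broadcast(new SerializableConf(conf))
// Task side:   def getHadoopConf: Configuration = new Configuration(hadoopConfBroadcast.value.value)
// The defensive copy means any later conf.set(...) in a task does not mutate shared state.
```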
...spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala (resolved review thread)
...udi-spark/src/test/scala/org/apache/spark/sql/hudi/TestNestedSchemaPruningOptimization.scala (resolved review thread)
Force-pushed from 8012230 to 9267c53
Replaced the pruned data-schema updating sequence to create a new relation instead of mutating the existing one (which might be cached)
…om it for both Snapshot and Incremental relations; Fixed signature of `updatePrunedDataSchema` to rely on overridden type decl
…dSchemaPruning` rule could be applied to MOR table; Tidying up
Tidying up
…mplement lazy semantic for every projection
Force-pushed from 74f9bc4 to 8f78916
yihua left a comment:
LGTM
…hemaPruning` rule has been applied (apache#7528) Addresses issue w/ MOR tables where after applying `NestedSchemaPruning` optimization, it would fail to read in case delta-log merging will need to happen
Change Logs
Currently, MOR tables w/ the `NestedSchemaPruning` (NSP) rule successfully applied (ie being able to prune nested schema) would fail to read in case any log-file merging would occur (an illustrative repro is sketched after this list).
- `HoodieClientTestHarness`: properly init `SparkSession` so that `HoodieSparkSessionExtensions` are injected properly
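An illustrative repro sketch of the failing scenario, assuming a Spark session (`spark`) with Hudi's Spark SQL extensions enabled; the table name, schema and values are made up for illustration:

```scala
// Create an MOR table with a nested struct column, produce a delta log file via an
// update, then query only a nested field so NestedSchemaPruning kicks in and the
// pruned schema has to be merged against the log file.
spark.sql(
  """
    |create table hudi_mor_nested (
    |  id int,
    |  item struct<name: string, price: double>,
    |  ts long
    |) using hudi
    |tblproperties (type = 'mor', primaryKey = 'id', preCombineField = 'ts')
    |""".stripMargin)

spark.sql("insert into hudi_mor_nested select 1, named_struct('name', 'apple', 'price', 1.0D), 100")
// For an MOR table the update is written to a delta log file, which is the merge path that used to fail
spark.sql("update hudi_mor_nested set ts = 200 where id = 1")

// Selecting only item.name lets NestedSchemaPruning prune item.price from the read schema
spark.sql("select id, item.name from hudi_mor_nested").show()
```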
Impact
Addresses issue w/ MOR tables where after applying the NSP optimization, it would fail to read in case delta-log merging will need to happen
Risk level (write none, low medium or high below)
Low
Documentation Update
N/A
Contributor's checklist