Conversation

@alexeykudinkin
Contributor

What is the purpose of the pull request

Refactoring MergeOnReadRDD to

  • Avoid duplication
  • Enable an optimization that avoids reading the base file with the full schema, fetching only the projected columns

Brief change log

  • Unified MOR record iteration logic within 3 iterators extending each other
  • Added an optimization that considers a) whether virtual keys are used and b) whether a non-default RecordPayload class is used, to determine whether we can do a projected read from the base file instead of a full-schema one
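The decision in the second bullet can be sketched as follows. This is an illustrative sketch only, not Hudi's actual API: `TableState`, `populatesMetaFields`, and `canDoProjectedRead` are hypothetical names, though `OverwriteWithLatestAvroPayload` is Hudi's default payload class.

```scala
// Illustrative sketch (not Hudi's actual API): decide whether the base file
// can be read with just the projected columns instead of the full schema.
object ProjectedReadSketch {

  // Hypothetical, simplified stand-in for the table state discussed in this PR
  final case class TableState(populatesMetaFields: Boolean, // false => virtual keys
                              recordPayloadClassName: String)

  // The default payload merges by overwriting with the latest record, so
  // merging never needs columns outside the projection.
  val whitelistedPayloadClasses: Set[String] =
    Set("org.apache.hudi.common.model.OverwriteWithLatestAvroPayload")

  def canDoProjectedRead(ts: TableState): Boolean =
    // a) meta fields must be populated (with virtual keys the record key would
    //    have to be re-derived from columns outside the projection), and
    // b) the payload class must be one whose merge semantics are known not to
    //    touch any column outside the projected schema
    ts.populatesMetaFields &&
      whitelistedPayloadClasses.contains(ts.recordPayloadClassName)
}
```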

Verify this pull request

This pull request is already covered by existing tests.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@alexeykudinkin alexeykudinkin changed the title [HUDI-3396] Refactoring MergeOnReadRDD to avoid duplication, fetch only projected columns [HUDI-3396][Stacked on 4877] Refactoring MergeOnReadRDD to avoid duplication, fetch only projected columns Feb 23, 2022
@alexeykudinkin alexeykudinkin force-pushed the ak/spkds-ref-3 branch 7 times, most recently from c6b213e to c7321c5 Compare March 16, 2022 19:06
@nsivabalan nsivabalan added the priority:blocker Production down; release blocker label Mar 16, 2022
@alexeykudinkin alexeykudinkin force-pushed the ak/spkds-ref-3 branch 2 times, most recently from 1f7cca1 to 59ff930 Compare March 17, 2022 21:27
@alexeykudinkin alexeykudinkin changed the title [HUDI-3396][Stacked on 4877] Refactoring MergeOnReadRDD to avoid duplication, fetch only projected columns [HUDI-3396] Refactoring MergeOnReadRDD to avoid duplication, fetch only projected columns Mar 21, 2022
@alexeykudinkin
Contributor Author

@hudi-bot run azure

Member

@vinothchandar vinothchandar left a comment


yet to read HoodieMergeOnReadRDD

case class HoodieTableState(recordKeyField: String,
preCombineFieldOpt: Option[String])
case class HoodieTableState(tablePath: String,
latestCommit: String,
Member


latestCommitTime

override type FileSplit = HoodieBaseFileSplit

override lazy val mandatoryColumns: Seq[String] = {
  if (isMetadataTable(metaClient)) {
Member


Let's remove this special-casing for the metadata table? It's dealing with an abstraction a few layers deeper than here.

Contributor Author


Good call, this is actually not required here.


// If meta fields are enabled, always prefer key from the meta field as opposed to user-specified one
// NOTE: This is historical behavior which is preserved as is
// NOTE: Record key-field is assumed singular here due to the either of
Member


Same. Can we avoid references to the metadata table from this layer?

Contributor Author


This actually refers to metadata fields, not the metadata table. Will amend.
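The historical behavior described in the quoted code comment (preferring the key from the meta field over the user-specified one whenever meta fields are enabled) can be sketched like so. This is a simplified illustration: `effectiveRecordKeyField` is a hypothetical helper, though `_hoodie_record_key` is the real name of Hudi's record-key meta field.

```scala
// Illustrative only: pick the field used as the record key.
object RecordKeySketch {
  val HoodieRecordKeyMetaField = "_hoodie_record_key"

  def effectiveRecordKeyField(metaFieldsEnabled: Boolean,
                              userSpecifiedKeyField: String): String =
    // Historical behavior preserved as-is: if meta fields are enabled,
    // always prefer the key from the meta field over the user-specified one.
    if (metaFieldsEnabled) HoodieRecordKeyMetaField else userSpecifiedKeyField
}
```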

@alexeykudinkin
Contributor Author

@hudi-bot run azure

Member

@xushiyan xushiyan left a comment


nice improvements!

val iter = mergeOnReadPartition.split match {
  case dataFileOnlySplit if dataFileOnlySplit.logFiles.isEmpty =>
    requiredSchemaFileReader(dataFileOnlySplit.dataFile.get)
    requiredSchemaFileReader.apply(dataFileOnlySplit.dataFile.get)
Member


Let's always call it baseFile instead of dataFile, while you're at it.

Comment on lines +303 to +307
maxCompactionMemoryInBytes: Long,
hadoopConf: Configuration): HoodieMergedLogRecordScanner = {
Member


Why still have maxCompactionMemoryInBytes as an arg, when it can be retrieved from hadoopConf? Having more than 5 args makes the API harder to use.

Contributor Author


Good call!

Member


this comment seems not addressed?

Contributor Author


I inlined maxCompactionMemoryInBytes into HoodieMergeOnReadRDD (previously it was passed in as an arg), but I don't think it makes sense to eliminate it from the arg list here -- it's not a simple field access but requires quite some computation.
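The trade-off the author describes (compute the memory budget once per RDD, then pass the result down as a plain argument, rather than re-deriving it from the configuration inside the scanner factory) can be illustrated with a hypothetical sketch. The config key, formula, and class names below are stand-ins, not Hudi's actual logic.

```scala
// Hypothetical sketch: a value that is non-trivial to derive from config is
// computed once (lazily) at the call site and then passed as a plain argument.
object MemoryBudgetSketch {

  // Stand-in for the "quite some computation" mentioned above;
  // the config key and formula here are illustrative only.
  def deriveMaxCompactionMemory(conf: Map[String, String]): Long = {
    val fraction = conf.getOrElse("memory.compaction.fraction", "0.6").toDouble
    (Runtime.getRuntime.maxMemory() * fraction).toLong
  }

  class MergeOnReadRddSketch(conf: Map[String, String]) {
    // Computed once per RDD, not once per scanner instantiation
    private lazy val maxCompactionMemoryInBytes: Long =
      deriveMaxCompactionMemory(conf)

    def scannerFor(logFiles: Seq[String]): String =
      buildScanner(logFiles, maxCompactionMemoryInBytes)
  }

  // The factory just consumes the precomputed value.
  def buildScanner(logFiles: Seq[String], maxMemoryBytes: Long): String =
    s"scanner(${logFiles.size} log files, budget=$maxMemoryBytes bytes)"
}
```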

Member

@xushiyan xushiyan left a comment


LGTM. I suggest testing with Spark 3.x in read/write scenarios. Also rebase on master so GitHub Actions covers some Spark 3.x quickstart tests.

// a) It does use one of the standard (and whitelisted) Record Payload classes
// then we can avoid reading and parsing the records w/ _full_ schema, and instead only
// rely on projected one, nevertheless being able to perform merging correctly
if (!whitelistedPayloadClasses.contains(tableState.recordPayloadClassName))
Member


nit: I'd prefer flipping the condition to avoid the logical negation; less mental processing.

Contributor Author


Not sure I follow how you propose to change this expr?

Member


Suggested change
- if (!whitelistedPayloadClasses.contains(tableState.recordPayloadClassName))
+ if (whitelistedPayloadClasses.contains(tableState.recordPayloadClassName))

OK, I just meant that it reads better without the negation.
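In other words, dropping the negation while preserving semantics means swapping the branches, not just removing the `!`. A small sketch; the read functions here are placeholders for what the PR's branches actually do:

```scala
// Illustrative only: two equivalent forms of the same branch.
object NegationFlipSketch {
  val whitelistedPayloadClasses: Set[String] =
    Set("org.apache.hudi.common.model.OverwriteWithLatestAvroPayload")

  // Placeholder actions standing in for the PR's real branches.
  def doProjectedRead(): String = "projected"
  def fallBackToFullSchemaRead(): String = "full-schema"

  // Negated form, as in the PR:
  def before(payloadClass: String): String =
    if (!whitelistedPayloadClasses.contains(payloadClass)) fallBackToFullSchemaRead()
    else doProjectedRead()

  // Reviewer's preference: same semantics, branches swapped, no negation.
  def after(payloadClass: String): String =
    if (whitelistedPayloadClasses.contains(payloadClass)) doProjectedRead()
    else fallBackToFullSchemaRead()
}
```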

Comment on lines +303 to +307
maxCompactionMemoryInBytes: Long,
hadoopConf: Configuration): HoodieMergedLogRecordScanner = {
Member


this comment seems not addressed?

@hudi-bot
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@xushiyan xushiyan merged commit 51034fe into apache:master Mar 25, 2022
@xushiyan
Member

@alexeykudinkin Landing this as CI passed before the "Tidying up" commit. Also, basic multi-Spark-version tests passed in GitHub Actions.
