[HUDI-3396] Refactoring MergeOnReadRDD to avoid duplication, fetch only projected columns
#4888
Conversation
@hudi-bot run azure
vinothchandar left a comment:
yet to read HoodieMergeOnReadRDD
```scala
case class HoodieTableState(recordKeyField: String,
                            preCombineFieldOpt: Option[String])

case class HoodieTableState(tablePath: String,
                            latestCommit: String,
```
nit: rename `latestCommit` to `latestCommitTime`
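For illustration, the consolidated state with the suggested rename might look like this; a sketch only, since the PR's actual field list is longer (`recordPayloadClassName` is taken from a snippet further down):

```scala
case class HoodieTableState(tablePath: String,
                            latestCommitTime: String, // renamed from latestCommit
                            recordKeyField: String,
                            preCombineFieldOpt: Option[String],
                            recordPayloadClassName: String)
```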
```scala
override type FileSplit = HoodieBaseFileSplit

override lazy val mandatoryColumns: Seq[String] = {
  if (isMetadataTable(metaClient)) {
```
Let's remove this special-casing for the metadata table? It's dealing with an abstraction a few layers deeper than here.
Good call, this is actually not required here.
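A minimal sketch of the method with the special case dropped, assuming the key fields come from the `HoodieTableState` shown above (a hypothetical simplification, not the PR's exact code):

```scala
// Mandatory columns are those needed for merging regardless of the query's
// projection: the record key plus the precombine field, when one is set
override lazy val mandatoryColumns: Seq[String] =
  Seq(tableState.recordKeyField) ++ tableState.preCombineFieldOpt.toSeq
```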
```scala
// If meta fields are enabled, always prefer key from the meta field as opposed to user-specified one
// NOTE: This is historical behavior which is preserved as is
// NOTE: Record key-field is assumed singular here due to the either of
```
Same. Can we avoid references to the metadata table from this layer?
This actually refers to metadata fields, not the MT. Will amend.
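To make the distinction concrete: the snippet deals with per-record meta fields such as `_hoodie_record_key`, not the metadata table. A hedged sketch of the key-selection logic being discussed, assuming Hudi's `HoodieTableConfig` and `HoodieRecord` APIs:

```scala
import org.apache.hudi.common.model.HoodieRecord
import org.apache.hudi.common.table.HoodieTableConfig

// When meta fields are populated, prefer the _hoodie_record_key meta field
// over the user-specified record key field (historical behavior, preserved as is)
def resolveRecordKeyField(tableConfig: HoodieTableConfig): String =
  if (tableConfig.populateMetaFields()) HoodieRecord.RECORD_KEY_METADATA_FIELD
  else tableConfig.getRecordKeyFieldProp
```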
@hudi-bot run azure
xushiyan left a comment:
nice improvements!
Resolved threads:
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileScanRDD.scala
```diff
  val iter = mergeOnReadPartition.split match {
    case dataFileOnlySplit if dataFileOnlySplit.logFiles.isEmpty =>
-     requiredSchemaFileReader(dataFileOnlySplit.dataFile.get)
+     requiredSchemaFileReader.apply(dataFileOnlySplit.dataFile.get)
```
Let's always call it baseFile instead of dataFile, while you're at it.
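For context, the surrounding match dispatches on the shape of the split. A simplified sketch of the three cases; the iterator names mirror the PR's classes, but the constructors shown here are assumptions:

```scala
val iter = mergeOnReadPartition.split match {
  case dataFileOnlySplit if dataFileOnlySplit.logFiles.isEmpty =>
    // Base ("data") file only: read it directly with the projected schema
    requiredSchemaFileReader.apply(dataFileOnlySplit.dataFile.get)

  case logFileOnlySplit if logFileOnlySplit.dataFile.isEmpty =>
    // Log files only: iterate over the merged log-record scanner's output
    new LogFileIterator(logFileOnlySplit, config)

  case split =>
    // Both present: merge base-file rows with log records on the fly
    new RecordMergingFileIterator(split, requiredSchemaFileReader, config)
}
```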
Resolved threads:
...spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieMergeOnReadRDD.scala (3 threads)
```scala
    maxCompactionMemoryInBytes: Long,
    hadoopConf: Configuration): HoodieMergedLogRecordScanner = {
```
Why still have maxCompactionMemoryInBytes as an arg, when it can be retrieved from hadoopConf? Having >5 args makes the API harder to use.
Good call!
This comment seems unaddressed?
I inlined maxCompactionMemoryInBytes into HoodieMergeOnReadRDD (previously it was passed in as an arg), but I don't think it makes sense to eliminate it from this arg list here: it's not a simple field access but requires quite some computation.
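A sketch of that inlining, assuming Hudi's existing `HoodieRealtimeRecordReaderUtils.getMaxCompactionMemoryInBytes` helper; the exact wiring in the PR may differ:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.JobConf
import org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils.getMaxCompactionMemoryInBytes

// Computed once inside HoodieMergeOnReadRDD instead of being threaded through
// as a constructor arg; a config-driven computation, not a simple field access
private lazy val maxCompactionMemoryInBytes: Long =
  getMaxCompactionMemoryInBytes(new JobConf(hadoopConf)) // hadoopConf: Configuration, assumed in scope
```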
Resolved threads:
...spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieMergeOnReadRDD.scala
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala (outdated)
Commits:
Removed duplication w/in `HoodieFileScanRDD`
…the file reader, and instead rely on explicitly provided one
Add test for fallback to read whole row … non-whitelisted Record Payload classes are used
xushiyan left a comment:
LGTM. Suggest testing with Spark 3.x in read/write scenarios. Also rebase on master so GitHub Actions cover some Spark 3.x quickstart tests.
```scala
// a) It does use one of the standard (and whitelisted) Record Payload classes
// then we can avoid reading and parsing the records w/ _full_ schema, and instead only
// rely on projected one, nevertheless being able to perform merging correctly
if (!whitelistedPayloadClasses.contains(tableState.recordPayloadClassName))
```
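For reference, such a whitelist might be declared as below; illustrative only, since `OverwriteWithLatestAvroPayload` is one standard payload class but the PR's actual set may differ:

```scala
import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload

// Payload classes with known merge semantics, for which reading only the
// projected columns is safe
private val whitelistedPayloadClasses: Set[String] =
  Set(classOf[OverwriteWithLatestAvroPayload].getName)
```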
nit: I'd prefer to flip the condition to avoid the logical negation; less mental processing.
Not sure I follow how you propose to change this expr?
```diff
- if (!whitelistedPayloadClasses.contains(tableState.recordPayloadClassName))
+ if (whitelistedPayloadClasses.contains(tableState.recordPayloadClassName))
```
OK, I just meant that without the negation it reads better.
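In other words, put the positive case first and swap the branches; a hypothetical shape, with the branch bodies invented for illustration:

```scala
if (whitelistedPayloadClasses.contains(tableState.recordPayloadClassName)) {
  // Known merge semantics: the projected schema is enough to merge correctly
  readWithProjectedSchema(split)
} else {
  // Unknown payload class: fall back to reading whole rows w/ the full schema
  readWithFullSchema(split)
}
```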
```scala
    maxCompactionMemoryInBytes: Long,
    hadoopConf: Configuration): HoodieMergedLogRecordScanner = {
```
This comment seems unaddressed?
Resolved threads:
...spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieMergeOnReadRDD.scala (2 outdated threads)
@alexeykudinkin landing this as the CI passed before the "Tidying up" commit. Also, basic multi-Spark-version tests passed in GA.
What is the purpose of the pull request
Refactoring MergeOnReadRDD to avoid duplication, fetch only projected columns.
Brief change log
Verify this pull request
This pull request is already covered by existing tests, such as (please describe tests).
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.