[WIP] [HUDI-1041] Cache the explodeRecordRDDWithFileComparisons instead of computing it twice in lookUpIndex #1721
Conversation
Codecov Report

@@             Coverage Diff              @@
##             master    #1721      +/-   ##
============================================
- Coverage     18.16%   18.15%   -0.01%
  Complexity      860      860
============================================
  Files           352      352
  Lines         15411    15410       -1
  Branches       1525     1524       -1
============================================
- Hits           2799     2798       -1
  Misses        12254    12254
  Partials        358      358

Continue to review the full report at Codecov.
|
nsivabalan left a comment:
Good one. If you don't mind, can you run a sample job (with 1M records or so) and share a screenshot of the Spark UI stages page, so we can see the difference with and without this optimization?
Will do. Thanks for reviewing.
vinothchandar left a comment:
This needs a bit more thought. The exploded RDD can be large, and caching may incur more overhead than recomputing in some cases.
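The tradeoff being raised can be sketched on the plain JVM, without Spark (all names here are illustrative, not Hudi code): caching materializes the result once at a memory cost, while recomputing pays the compute cost on every use, analogous to `rdd.persist()` versus re-evaluating the lineage.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class CacheVsRecompute {

    // Wrap an expensive computation so it runs at most once -- the
    // in-process analogue of persisting an RDD instead of recomputing it.
    public static <T> Supplier<T> memoize(Supplier<T> delegate) {
        return new Supplier<T>() {
            private T value;
            private boolean computed;
            public synchronized T get() {
                if (!computed) {
                    value = delegate.get();
                    computed = true;
                }
                return value;
            }
        };
    }

    public static void main(String[] args) {
        AtomicInteger computeCalls = new AtomicInteger();
        // Stands in for the expensive exploded-RDD computation.
        Supplier<long[]> exploded = () -> {
            computeCalls.incrementAndGet();
            return new long[1_000_000]; // the materialized result occupies real memory
        };

        // Recompute path: two consumers, two full computations, nothing retained.
        exploded.get();
        exploded.get();
        System.out.println("recompute path: " + computeCalls.get() + " computations");

        // Cached path: two consumers, one computation, result held in memory.
        computeCalls.set(0);
        Supplier<long[]> cached = memoize(exploded);
        cached.get();
        cached.get();
        System.out.println("cached path: " + computeCalls.get() + " computation");
    }
}
```

The concern in the comment above is exactly the memory line: when the cached value is large, holding it (or spilling it with MEMORY_AND_DISK_SER) can cost more than running the computation twice.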
Can you please include the JIRA number in the PR title?
vinothchandar left a comment:
@EdwinGuo @nsivabalan let's hash this out; it's an interesting one. Although it may seem like we are computing the fully exploded RDD in both places, if you look closely, we do

fileToComparisons = explodeRecordRDDWithFileComparisons(partitionToFileInfo, partitionRecordKeyPairRDD)
    .mapToPair(t -> t).countByKey();

countByKey() does not shuffle the actual data, just the counts per file. We only pay the compute cost of exploding twice, and all of this just to estimate the parallelism. Given this is only an estimate, would it be better to introduce an option to simply down-sample and estimate, rather than adding caching? e.g.

fileToComparisons = explodeRecordRDDWithFileComparisons(partitionToFileInfo, partitionRecordKeyPairRDD.sample(true, 0.1))
    .mapToPair(t -> t).countByKey();

would cut the cost down by roughly 90%. We would need to adjust the computations in the map accordingly, of course.

Even Spark's sort does a kind of reservoir sampling, so this could be a valid approach overall.

What do you both think? I am a bit concerned about caching this exploded RDD (that's why I chose to recompute to begin with).
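The sampling idea can be illustrated without Spark. In this plain-JVM sketch (names are illustrative), a deterministic 1-in-N stride sample stands in for `rdd.sample(true, 0.1)`, a HashMap tally stands in for `countByKey()`, and the counts are scaled back up by the inverse sampling fraction, which is the "adjust the computations in the map" step mentioned above:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SampledCountEstimate {

    // Count every stride-th (file, key) comparison, then scale the counts
    // back up by the inverse sampling fraction to estimate the full totals.
    public static Map<String, Long> estimateCounts(List<String> fileIds, int stride) {
        Map<String, Long> sampled = new HashMap<>();
        for (int i = 0; i < fileIds.size(); i += stride) {
            sampled.merge(fileIds.get(i), 1L, Long::sum);
        }
        sampled.replaceAll((file, count) -> count * stride); // adjust for the sampling rate
        return sampled;
    }

    public static void main(String[] args) {
        // 100 synthetic comparisons spread across three file groups.
        List<String> comparisons = new ArrayList<>();
        comparisons.addAll(Collections.nCopies(50, "file-1"));
        comparisons.addAll(Collections.nCopies(30, "file-2"));
        comparisons.addAll(Collections.nCopies(20, "file-3"));

        // A 10% sample recovers the per-file proportions at a tenth of the cost.
        System.out.println(estimateCounts(comparisons, 10));
    }
}
```

Note that with a heavily skewed key distribution a small sample can misestimate the hot files, which is the skew concern raised later in the thread; since the result only tunes parallelism, an off-by-some estimate is usually tolerable.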
I can provide some performance comparisons tomorrow. fileComparisonsRDD is computed in different patterns within findMatchingFilesForRecordKeys and computeComparisonsPerFileGroup, so yes, countByKey is light on shuffle but could be heavy on IO in some cases. I agree StorageLevel.MEMORY_AND_DISK_SER() could be heavier than recomputing in some scenarios, so let me run some performance tests and get back to you. Regarding sampling, what if some of the partitions are skewed? Would that cause more overhead than flushing the file out?
IIRC the partitionRecordKeyPairRDD would have an even distribution of keys from the precombine step, which just does a …
Ok, let me work out a sampling rate and get back with the performance results. Thanks.
Closing due to inactivity.
What is the purpose of the pull request
Cache the explodeRecordRDDWithFileComparisons instead of computing it twice in lookUpIndex.
Brief change log
Verify this pull request
(Please pick either of the following options)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
Committer checklist

- Has a corresponding JIRA in PR title & commit
- Commit message is descriptive of the change
- CI is green
- Necessary doc changes done or have another open PR
- For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.