@satishkotha
Member

What is the purpose of the pull request

Support custom clustering strategies and preserve commit time to support incremental read

Brief change log

  • Introduce a new way of running clustering, SingleSparkJobExecutionStrategy, for use cases that don't need sorting.
  • Push more logic down into the clustering strategies to avoid RDD union.
  • Make some performance improvements found after running at large scale. Avoid collecting the RDD multiple times.
  • Preserve the Hoodie commit time while rewriting the data (optional, for backward compatibility).
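
To illustrate the last item, here is a minimal sketch (not the code in this PR) of why preserving the commit time matters for incremental reads. It assumes records are plain maps and that an incremental reader selects records whose _hoodie_commit_time is greater than its begin instant. The names rewriteRecord, isVisibleToIncrementalRead, and the preserveCommitTime flag are hypothetical and used only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class PreserveCommitTimeSketch {
    // Hudi's per-record metadata field that holds the commit time.
    static final String COMMIT_TIME_FIELD = "_hoodie_commit_time";

    // Hypothetical rewrite step: copy the record and, unless preservation is
    // requested, stamp it with the clustering instant time.
    static Map<String, String> rewriteRecord(Map<String, String> record,
                                             String clusteringInstantTime,
                                             boolean preserveCommitTime) {
        Map<String, String> rewritten = new HashMap<>(record);
        if (!preserveCommitTime) {
            rewritten.put(COMMIT_TIME_FIELD, clusteringInstantTime);
        }
        return rewritten;
    }

    // Assumed incremental-read filter: keep records committed after the begin instant.
    // Instant times are fixed-width timestamp strings, so string comparison orders them.
    static boolean isVisibleToIncrementalRead(Map<String, String> record,
                                              String beginInstantTime) {
        return record.get(COMMIT_TIME_FIELD).compareTo(beginInstantTime) > 0;
    }

    public static void main(String[] args) {
        Map<String, String> record = new HashMap<>();
        record.put(COMMIT_TIME_FIELD, "20210701000000");
        record.put("key", "id-1");

        String clusteringInstant = "20210702041000"; // made-up instant for the example
        String readerCheckpoint = "20210701120000";  // reader has already consumed the record

        Map<String, String> preserved = rewriteRecord(record, clusteringInstant, true);
        Map<String, String> restamped = rewriteRecord(record, clusteringInstant, false);

        System.out.println("preserved visible: "
                + isVisibleToIncrementalRead(preserved, readerCheckpoint));
        System.out.println("restamped visible: "
                + isVisibleToIncrementalRead(restamped, readerCheckpoint));
    }
}
```

Under these assumptions, a record restamped with the clustering instant would look new to a reader that already consumed it, while a record with its original commit time would not. Keeping the flag optional matches the backward-compatibility note above.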

Verify this pull request

This change added tests and can be verified as follows:

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@satishkotha changed the title from "[HUDI-1468] Support custom clustering strategies and preserve commit …" to "[HUDI-1468] Support more flexible clustering strategies and preserve commit …" on Jul 2, 2021
@hudi-bot
Collaborator

hudi-bot commented Jul 2, 2021

CI report:

Bot commands
@hudi-bot supports the following commands:
  • @hudi-bot run travis: re-run the last Travis build
  • @hudi-bot run azure: re-run the last Azure build

@satishkotha
Member Author

@n3nash @vinothchandar this includes all the changes I made to support encryption-style use cases using the clustering framework. I still need to port some tests, but please take a look and add any comments.

@satishkotha force-pushed the sk/clusteringImprovements branch from ab7bacb to c9c9a0d on July 2, 2021 04:10
@codecov-commenter

codecov-commenter commented Jul 2, 2021

Codecov Report

Merging #3211 (56f4484) into master (6eca06d) will increase coverage by 18.25%.
The diff coverage is 60.30%.

@@              Coverage Diff              @@
##             master    #3211       +/-   ##
=============================================
+ Coverage     47.51%   65.76%   +18.25%     
+ Complexity     5429      796     -4633     
=============================================
  Files           922      101      -821     
  Lines         40968     3529    -37439     
  Branches       4105      351     -3754     
=============================================
- Hits          19464     2321    -17143     
+ Misses        19780     1070    -18710     
+ Partials       1724      138     -1586     
Flag                 Coverage Δ
hudicli              ?
hudiclient           65.76% <60.30%> (+31.18%) ⬆️
hudicommon           ?
hudiflink            ?
hudihadoopmr         ?
hudisparkdatasource  ?
hudisync             ?
huditimelineservice  ?
hudiutilities        ?

Impacted Files                                         Coverage Δ
...trategy/SparkRecentDaysClusteringPlanStrategy.java  100.00% <ø> (+24.39%) ⬆️
...SparkSelectedPartitionsClusteringPlanStrategy.java  0.00% <0.00%> (ø)
.../run/strategy/SingleSparkJobExecutionStrategy.java  0.00% <0.00%> (ø)
...ring/update/strategy/SparkAllowUpdateStrategy.java  0.00% <0.00%> (ø)
...SparkInsertOverwriteTableCommitActionExecutor.java  0.00% <ø> (ø)
...va/org/apache/hudi/client/SparkRDDWriteClient.java  72.19% <50.00%> (+0.13%) ⬆️
...strategy/SparkSizeBasedClusteringPlanStrategy.java  70.27% <70.27%> (ø)
...un/strategy/MultipleSparkJobExecutionStrategy.java  92.00% <92.00%> (ø)
...er/SparkExecuteClusteringCommitActionExecutor.java  82.85% <92.30%> (-6.57%) ⬇️
...un/strategy/SparkSortAndSizeExecutionStrategy.java  100.00% <100.00%> (ø)
... and 825 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@satishkotha force-pushed the sk/clusteringImprovements branch from c9c9a0d to 56f4484 on July 2, 2021 18:37
@codope
Member

codope commented Jul 5, 2021

@satishkotha Couple of high level questions:

@vinothchandar self-assigned this on Jul 8, 2021
@vinothchandar added the priority:blocker (Production down; release blocker) label on Jul 8, 2021
@vinothchandar
Member

@codope @satishkotha what's the next step here?
Could I help somehow to get this moving along?

@satishkotha
Member Author

> @codope @satishkotha what's the next step here?
> Could I help somehow to get this moving along?

@codope is working on adding additional tests for this PR. He mentioned he opened codope#3.
I'll review that and merge it here sometime this week or early next week.

@vinothchandar
Member

Closing in favor of #3419
