Skip to content

Conversation

@bvaradar
Copy link
Contributor

This PR is dependent on #1577 It has 2 commits: The first commit corresponds to #1577 and the second commit is for this PR.

Contains:

  1. Structured Streaming Async Compaction Support
  2. Integration tests was missing Structured Streaming. Added them to ITTestHoodieSanity with async compaction enabled for MOR tables.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vinothchandar vinothchandar self-assigned this Jun 21, 2020
@bvaradar bvaradar changed the title [WIP] [HUDI-575] Support Async Compaction for spark streaming writes to hudi table [HUDI-575] Support Async Compaction for spark streaming writes to hudi table Jun 24, 2020
Copy link
Member

@vinothchandar vinothchandar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some preliminary comments..

I rebased this PR against latest master..
Will continue to review.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible that nothing is actually scheduled here, since there is nothiing to compact?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if the user sets the writeClient config for inline = false and does not set async compaction datasource option? should we control at a single level..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

basic question.. if there are no writes , no compaction gets scheduled right? so async compaction is a no-op i.e it will check if there is some work to do, if not won't trigger anything?

Copy link
Member

@vinothchandar vinothchandar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks close. but we need to iron out the activity detection/async compaction thread shutdown

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move comments that refer to a sub-class impl to that class itself?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this does not mean there is no work for compaction right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is deleted.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more importantly, we should also renable the test in TestDataSource

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enabled it after adding timed retry logic to wait for commits

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just confirming that reuse of writeClient across batches is fine..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this worked fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this alone should be good enough to prevent the jvm from not hanging during exit? do we really need the laststart/lastend logic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this and setting daemon mode was good enough

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this will happen each trigger/ not just first time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, only for the first time when async compactor is null.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why move to COW?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-StreamingQueryManager.html seems like there are some listeners we can exploit to know of a StreamingQuery?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Fixed by setting up daemon mode for async compactor thread

@codecov-commenter
Copy link

Codecov Report

Merging #1752 into master will decrease coverage by 2.86%.
The diff coverage is 14.68%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #1752      +/-   ##
============================================
- Coverage     62.82%   59.95%   -2.87%     
- Complexity     3437     3609     +172     
============================================
  Files           401      439      +38     
  Lines         17091    19088    +1997     
  Branches       1698     1943     +245     
============================================
+ Hits          10737    11445     +708     
- Misses         5623     6850    +1227     
- Partials        731      793      +62     
Flag Coverage Δ Complexity Δ
#hudicli 67.89% <0.00%> (?) 1430.00 <0.00> (?)
#hudiclient 78.35% <0.00%> (-0.89%) 1258.00 <0.00> (ø)
#hudicommon 54.29% <ø> (+0.04%) 1486.00 <ø> (+1.00)
#hudihadoopmr 39.36% <ø> (ø) 163.00 <ø> (ø)
#hudihivesync 72.25% <ø> (ø) 121.00 <ø> (ø)
#hudispark 41.54% <18.89%> (-2.69%) 78.00 <2.00> (+2.00) ⬇️
#huditimelineservice 63.47% <ø> (ø) 47.00 <ø> (ø)
#hudiutilities 73.49% <100.00%> (-0.27%) 284.00 <0.00> (-3.00)
Impacted Files Coverage Δ Complexity Δ
...va/org/apache/hudi/async/AbstractAsyncService.java 2.17% <0.00%> (+0.04%) 1.00 <0.00> (ø)
...ava/org/apache/hudi/async/AsyncCompactService.java 0.00% <0.00%> (ø) 0.00 <0.00> (?)
...rc/main/java/org/apache/hudi/client/Compactor.java 0.00% <ø> (ø) 0.00 <0.00> (?)
.../hudi/async/SparkStreamingAsyncCompactService.java 0.00% <0.00%> (ø) 0.00 <0.00> (?)
...di/async/SparkStreamingWriterActivityDetector.java 0.00% <0.00%> (ø) 0.00 <0.00> (?)
...in/scala/org/apache/hudi/HoodieStreamingSink.scala 0.00% <0.00%> (ø) 0.00 <0.00> (ø)
...tilities/deltastreamer/SchedulerConfGenerator.java 90.90% <ø> (ø) 8.00 <0.00> (ø)
...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala 53.60% <50.00%> (-1.16%) 0.00 <0.00> (ø)
...src/main/java/org/apache/hudi/DataSourceUtils.java 54.90% <66.66%> (-0.10%) 27.00 <2.00> (+2.00) ⬇️
...main/scala/org/apache/hudi/DataSourceOptions.scala 93.81% <100.00%> (+0.26%) 0.00 <0.00> (ø)
... and 41 more

@bvaradar
Copy link
Contributor Author

bvaradar commented Aug 3, 2020

All tests passes now. Will run this in a setup for few hours before merging.

@bvaradar bvaradar force-pushed the hudi-575 branch 3 times, most recently from 8d515f9 to f3736c2 Compare August 5, 2020 02:49
@bvaradar
Copy link
Contributor Author

bvaradar commented Aug 5, 2020

For the remaining 2 questions, here is the answer:

what if the user sets the writeClient config for inline = false and does not set async compaction datasource option? should we control at a single level.. ?
After discussion, we decided to have Async compaction be enabled by default for MOR table. If async compaction is disabled by config, inline compaction will be enabled automatically.

basic question.. if there are no writes , no compaction gets scheduled right? so async compaction is a no-op i.e it will check if there is some work to do, if not won't trigger anything?
Yes, that is correct.

@bvaradar bvaradar merged commit 7a2429f into apache:master Aug 5, 2020
@bvaradar
Copy link
Contributor Author

bvaradar commented Aug 5, 2020

Merging as was previously discussed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants