
Conversation

@swuferhong
Contributor

What is the purpose of the pull request

Support an independent Flink Hudi compaction function, which can perform Hudi compaction via a standalone Flink job.
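
As a rough sketch only, a standalone compaction pipeline could be wired roughly as below. Only CompactionPlanSourceFunction and CompactionPlanEvent are names referenced later in this review; the constructor arguments, the process function, and the sink name are hypothetical placeholders, not necessarily the classes this PR adds.

// Schematic sketch of a standalone Flink compaction job; names other than
// CompactionPlanSourceFunction / CompactionPlanEvent are hypothetical placeholders.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.addSource(new CompactionPlanSourceFunction(conf))   // emits one CompactionPlanEvent per file group to compact
    .name("compaction_plan_source")
    .rebalance()                                         // spread the events across the compaction subtasks
    .process(new CompactProcessFunction(conf))           // hypothetical ProcessFunction executing the actual compaction
    .setParallelism(compactionParallelism)
    .addSink(new CompactionCommitSink(conf))             // hypothetical sink committing the compaction instant
    .setParallelism(1);                                  // single committer

env.execute("independent-hudi-compaction");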

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@codecov-commenter

codecov-commenter commented Jun 7, 2021

Codecov Report

Merging #3046 (19a1b55) into master (673d62f) will increase coverage by 44.16%.
The diff coverage is n/a.

Impacted file tree graph

@@              Coverage Diff              @@
##             master    #3046       +/-   ##
=============================================
+ Coverage      8.43%   52.60%   +44.16%     
- Complexity       62      406      +344     
=============================================
  Files            70       70               
  Lines          2880     2880               
  Branches        359      359               
=============================================
+ Hits            243     1515     +1272     
+ Misses         2616     1220     -1396     
- Partials         21      145      +124     
Flag            Coverage Δ
hudiclient      ?
hudisync        6.79% <ø> (ø)
hudiutilities   70.96% <ø> (+61.86%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
.../apache/hudi/utilities/HoodieSnapshotExporter.java 88.79% <0.00%> (+5.17%) ⬆️
...e/hudi/utilities/transform/ChainedTransformer.java 100.00% <0.00%> (+11.11%) ⬆️
...in/java/org/apache/hudi/utilities/UtilHelpers.java 64.53% <0.00%> (+30.23%) ⬆️
...ties/deltastreamer/HoodieDeltaStreamerMetrics.java 36.11% <0.00%> (+36.11%) ⬆️
...g/apache/hudi/utilities/schema/SchemaProvider.java 100.00% <0.00%> (+42.85%) ⬆️
...ities/checkpointing/InitialCheckPointProvider.java 45.45% <0.00%> (+45.45%) ⬆️
...di/utilities/sources/helpers/IncrSourceHelper.java 54.54% <0.00%> (+54.54%) ⬆️
.../hudi/utilities/sources/helpers/AvroConvertor.java 58.33% <0.00%> (+58.33%) ⬆️
...org/apache/hudi/utilities/HoodieClusteringJob.java 62.50% <0.00%> (+62.50%) ⬆️
...a/org/apache/hudi/utilities/sources/SqlSource.java 64.70% <0.00%> (+64.70%) ⬆️
... and 33 more

Member

@garyli1019 garyli1019 left a comment

@swuferhong thanks for your contribution! left some comments

Member

Missing Apache license.

Contributor Author

Ok, I will modify it.

Member

we usually don't add personal info to the codebase.

Contributor Author

Ok, I will modify it.

Member

I found this keyBy will distribute the tasks unevenly because the total number of events is relatively small. Maybe we could find a better partitioning strategy?

Contributor

This is not a bottleneck; what partitioning strategy do you suggest?

Member

For example, we could list all the file groups pending compaction, assign each of them a taskID from 0 to taskNums - 1, and partitionCustom by taskID % parallelism.
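
A minimal sketch of that suggestion, assuming each CompactionPlanEvent carries the taskID assigned while listing the pending file groups (the getTaskId() accessor is hypothetical):

// Illustrative only: distribute compaction events by taskID % numPartitions instead of
// keyBy on a low-cardinality key, so a small number of events still spreads evenly.
DataStream<CompactionPlanEvent> distributed = planEvents.partitionCustom(
    new Partitioner<Integer>() {
      @Override
      public int partition(Integer taskId, int numPartitions) {
        return taskId % numPartitions;
      }
    },
    new KeySelector<CompactionPlanEvent, Integer>() {
      @Override
      public Integer getKey(CompactionPlanEvent event) {
        return event.getTaskId(); // hypothetical: the taskID assigned when listing file groups
      }
    });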

Contributor

That's true, we can actually infer the parallelism.

Contributor Author

Hi @garyli1019, it is not easy to implement rebalance-like logic here, so we only optimized the parallelism, which is now determined by the number of compaction operations instead of being passed in through the client configuration.
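
Roughly, a one-line sketch of that idea (assuming the Avro-generated HoodieCompactionPlan#getOperations accessor; not the PR's exact code):

// Derive the compaction operator parallelism from the plan itself rather than from a client-passed option.
int compactionParallelism = Math.max(1, compactionPlan.getOperations().size());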

Member

Can we add more descriptions?

Contributor Author

Ok, thank you for your suggestion.

Member

Should we move these validation checks to a common place? Looks like this is being used elsewhere too.

Contributor Author

There is a bug in BaseScheduleCompactionActionExecutor#execute that PR #3025 wants to fix; this bug prevents the independent Hudi compaction from running. But that PR still needs discussion, so we chose to modify the subclass FlinkScheduleCompactionActionExecutor to finish compaction.

@garyli1019 garyli1019 self-assigned this Jun 8, 2021
@swuferhong swuferhong force-pushed the HUDI-1984 branch 2 times, most recently from 0054e2c to 56d6dd6 on June 8, 2021 at 09:43
Contributor

Whether to execute compaction asynchronously.

Contributor

synchronous way

Contributor

() -> doCompaction(instantTime, compactionOperation, collector)

Contributor

CompactionCommitSource => CompactionPlanSourceFunction

Contributor

/**
 * Flink hudi compaction source function.
 *
 * <p>This function reads the compaction plan as {@link CompactionOperation}s, then assigns the compaction task
 * events {@link CompactionPlanEvent} to downstream operators.
 *
 * <p>The compaction instant time is specified explicitly with strategies:
 *
 * <ul>
 *   <li>If the timeline has no inflight instants,
 *   use {@link HoodieActiveTimeline#createNewInstantTime()} as the instant time;</li>
 *   <li>If the timeline has inflight instants,
 *   use the {earliest inflight instant time - 1ms} as the instant time.</li>
 * </ul>
 */
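
A rough sketch of that instant-time selection, under the assumption of the usual Hudi timeline accessors; minusMillis is a hypothetical helper that shifts the instant timestamp string back by one millisecond:

// Sketch of the strategy documented above; error handling and exact timeline filtering omitted.
Option<HoodieInstant> firstInflight = metaClient.getActiveTimeline().filterInflights().firstInstant();
String compactionInstantTime = firstInflight.isPresent()
    ? minusMillis(firstInflight.get().getTimestamp(), 1)   // hypothetical helper: earliest inflight time - 1ms
    : HoodieActiveTimeline.createNewInstantTime();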

Contributor

Get => Gets.

Contributor

/**
   * Sets up the Avro schema string into the given configuration {@code conf}
   * by reading from the Hoodie table metadata.
   */
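
A rough sketch of that behavior, assuming Hudi's TableSchemaResolver; the FlinkOptions.SOURCE_AVRO_SCHEMA option name used below is an assumption, not confirmed by this PR:

// Sketch: read the table's Avro schema from the metadata and store it into the Flink configuration.
try {
  TableSchemaResolver schemaResolver = new TableSchemaResolver(metaClient);
  conf.setString(FlinkOptions.SOURCE_AVRO_SCHEMA, schemaResolver.getTableAvroSchema().toString()); // option name assumed
} catch (Exception e) {
  throw new HoodieException("Failed to read the table avro schema", e);
}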

Contributor

Remove it and use FlinkClientUtil.getHadoopConf() directly.

Contributor

Remove this method.

Contributor

invalid import.

@swuferhong swuferhong force-pushed the HUDI-1984 branch 2 times, most recently from 80953b2 to 9bddbd1 on June 9, 2021 at 01:30
Contributor

Add a parameter HoodieTableMetaClient metaClient and move the table name setup out as a separate method:
setUpTableName

Contributor

Unnecessary change.

Contributor

The check condition never hits; remove all these checks.

Contributor

The method should not have a return value.

@swuferhong swuferhong force-pushed the HUDI-1984 branch 3 times, most recently from efb8f15 to 7d3a329 on June 11, 2021 at 02:54
Contributor

@danny0405 danny0405 left a comment

+1, LGTM.

Member

@garyli1019 garyli1019 left a comment

LGTM
