
Conversation

@zhuanshenbsj1
Contributor

@zhuanshenbsj1 zhuanshenbsj1 commented Apr 6, 2023

Change Logs

For a MOR table in the upsert scenario, there is no need to run cleaning when online async compaction is turned off.

In scenarios with a large number of files and partitions, turning off the clean operator improves performance.
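A minimal sketch of the branching this change aims for, assuming it sits in HoodieTableSink#getSinkRuntimeProvider; the exact method and the COW fallback branch are assumptions based on the description above and the review snippet further down:

// Sketch only: simplified sink branching when deciding whether to append a clean operator.
if (OptionsResolver.needsAsyncCompaction(conf)) {
  // Online async compaction is enabled; the compaction pipeline takes care of cleaning.
  return Pipelines.compact(conf, pipeline);
} else if (OptionsResolver.isMorTable(conf)) {
  // MOR table with online compaction disabled: skip the clean operator and
  // leave cleaning to the offline compaction job.
  return Pipelines.dummySink(pipeline);
} else {
  // Assumed fallback for the remaining (COW) case, keeping the existing clean operator.
  return Pipelines.clean(conf, pipeline);
}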

Impact

Describe any public API or user-facing feature change or any performance impact.

Risk level (write none, low, medium or high below)

If medium or high, explain what verification was done to mitigate the risks.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

  return Pipelines.compact(conf, pipeline);
} else if (OptionsResolver.isMorTable(conf)) {
  return Pipelines.dummySink(pipeline);
} else {
Contributor

So what component is responsible for data cleaning then?

Contributor Author

@zhuanshenbsj1 zhuanshenbsj1 Apr 7, 2023

Offline compaction does the cleaning, consistent with clustering.

// Append mode
if (OptionsResolver.isAppendMode(conf)) {
  DataStream<Object> pipeline = Pipelines.append(conf, rowType, dataStream, context.isBounded());
  if (OptionsResolver.needsAsyncClustering(conf)) {
    return Pipelines.cluster(conf, rowType, pipeline);
  } else {
    return Pipelines.dummySink(pipeline); // If async clustering is disabled, no clean operator is added.
  }
}

Contributor

The cleaning can still take effect eventually, right? Because the table would get compacted at some point anyway.

Contributor Author

Offline clustering/compaction does the cleaning by default; adding the --clean-async-enabled config disables it.

Contributor Author

Currently the online async clean and the offline clean share the same configuration: clean.async.enabled. Adding a new configuration clean.offline.enable would make the configuration semantics clearer.
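A hypothetical sketch of how such a dedicated switch could be declared with Flink's ConfigOptions builder; the class name, option key, default value, and description below are purely illustrative, since this PR did not end up adding the option:

import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;

// Hypothetical option holder; not part of the actual patch.
public class OfflineCleanOptions {
  // Would let users toggle cleaning inside offline compaction/clustering jobs
  // independently of the online clean.async.enabled flag.
  public static final ConfigOption<Boolean> CLEAN_OFFLINE_ENABLED = ConfigOptions
      .key("clean.offline.enable")
      .booleanType()
      .defaultValue(true)
      .withDescription("Whether the offline compaction/clustering job also performs cleaning");
}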

Contributor

The Spark offline compaction job does not take care of cleaning. Could you make it clear how users can handle cleaning when they use Flink streaming ingestion together with Spark offline compaction?

Contributor Author

@zhuanshenbsj1 zhuanshenbsj1 Apr 19, 2023

Adjusted the cleaning operation in SparkRDDWriteClient#cluster/compact: when ASYNC_CLEAN is true, asynchronous cleaning is triggered in preWrite; otherwise a synchronous clean runs in autoCleanOnCommit().
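A rough sketch of the clustering side of that adjustment, mirroring the diff the reviewer comments on below; the method body is simplified and belongs inside SparkRDDWriteClient, so the members tableServiceClient and autoCleanOnCommit are taken from the comment rather than the full source:

// Simplified sketch of SparkRDDWriteClient#cluster after the described change.
public HoodieWriteMetadata<JavaRDD<WriteStatus>> cluster(String clusteringInstant, boolean shouldComplete) {
  HoodieWriteMetadata<JavaRDD<WriteStatus>> clusteringMetadata =
      tableServiceClient.cluster(clusteringInstant, shouldComplete);
  // With async clean disabled this falls back to a synchronous clean on commit;
  // with async clean enabled the cleaning is triggered earlier (in preWrite).
  autoCleanOnCommit();
  return clusteringMetadata;
}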

Contributor

We need to think through the end-to-end use case for Flink streaming ingestion and Spark offline compaction. We should add cleaning for Spark compaction and clustering first, which is a blocking change for this patch.

Contributor

We will not schedule the cleaning task if async cleaning is disabled:

if (conf.getBoolean(FlinkOptions.CLEAN_ASYNC_ENABLED) && isCleaning) {

so this should be fine?
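A minimal illustration of that guard; only the condition itself comes from the quoted code, while the enclosing class and method names here are hypothetical:

import org.apache.flink.configuration.Configuration;
import org.apache.hudi.configuration.FlinkOptions;

// Hypothetical wrapper around the quoted guard: cleaning is only scheduled when
// clean.async.enabled is true and a cleaning service is actually running.
public class CleanSchedulingSketch {
  private final boolean isCleaning;

  public CleanSchedulingSketch(boolean isCleaning) {
    this.isCleaning = isCleaning;
  }

  void scheduleCleaningIfNeeded(Configuration conf) {
    if (conf.getBoolean(FlinkOptions.CLEAN_ASYNC_ENABLED) && isCleaning) {
      // schedule the async cleaning task here (omitted)
    }
  }
}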

Contributor Author

@zhuanshenbsj1 zhuanshenbsj1 May 15, 2023

If CLEAN_ASYNC_ENABLED = true, a cleaning schedule will still be executed. I think cluster and compact should be consistent here (if async clustering is disabled for a COW table, there is no clean operator). And now both Spark and Flink clean up when executing offline jobs, unless it is explicitly disabled. What do you think?

@zhuanshenbsj1 zhuanshenbsj1 force-pushed the removeCleanForNoCompaction branch from 47cb215 to 67fc8b9 Compare April 13, 2023 02:38
@zhuanshenbsj1 zhuanshenbsj1 force-pushed the removeCleanForNoCompaction branch from 67fc8b9 to 36fc037 Compare April 13, 2023 07:54
@danny0405 danny0405 changed the title adjust HoodieTableSink for clean Eliminate cleaning tasks for flink mor table if async cleaning is disabled Apr 15, 2023
@danny0405 danny0405 changed the title Eliminate cleaning tasks for flink mor table if async cleaning is disabled [HUDI-6085] Eliminate cleaning tasks for flink mor table if async cleaning is disabled Apr 15, 2023
@danny0405 danny0405 self-assigned this Apr 15, 2023
@danny0405 danny0405 added engine:flink Flink integration area:table-service Table services area:streaming Streaming operations labels Apr 15, 2023
@zhuanshenbsj1 zhuanshenbsj1 force-pushed the removeCleanForNoCompaction branch from 36fc037 to 79ec658 Compare April 17, 2023 03:47
@hudi-bot
Collaborator

CI report:

-      return tableServiceClient.cluster(clusteringInstant, shouldComplete);
+      HoodieWriteMetadata<JavaRDD<WriteStatus>> clusteringMetadata = tableServiceClient.cluster(clusteringInstant, shouldComplete);
+      autoCleanOnCommit();
+      return clusteringMetadata;
Contributor

Do not introduce unrelated changes in one patch. If we want to add a cleaning procedure for Spark compaction and clustering, fire another PR and let's discuss it there.

Contributor Author

Removed the Spark-related modifications and migrated them to a new PR: #8505.

@zhuanshenbsj1 zhuanshenbsj1 force-pushed the removeCleanForNoCompaction branch from 9edae81 to 79ec658 Compare April 20, 2023 02:56
@zhuanshenbsj1 zhuanshenbsj1 changed the title [HUDI-6085] Eliminate cleaning tasks for flink mor table if async cleaning is disabled [HUDI-6085] Eliminate cleaning tasks for flink mor table if online async compaction is disabled Apr 20, 2023
@github-actions github-actions bot added the size:XS PR with lines of changes in <= 10 label Feb 26, 2024
@yihua
Contributor

yihua commented Mar 26, 2024

@zhuanshenbsj1 @danny0405 is this PR still relevant?

@danny0405
Contributor

cc @zhuanshenbsj1 I think we can close it, because we already have #8505.


Labels

area:streaming Streaming operations area:table-service Table services engine:flink Flink integration size:XS PR with lines of changes in <= 10

Projects

Status: ✅ Done


4 participants