
Conversation

@zhuanshenbsj1
Contributor

@zhuanshenbsj1 zhuanshenbsj1 commented Apr 6, 2023

Change Logs

For a MOR table in the upsert scenario, there is no need to run cleaning when online async compaction is turned off.

In scenarios with a large number of files and partitions, turning off the clean operator improves performance.
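A minimal sketch of the branching this change aims for, assuming it sits in HoodieTableSink#getSinkRuntimeProvider; the exact method and the COW fallback branch are assumptions based on the description above and the review snippet further down:

// Sketch only: simplified sink branching when deciding whether to append a clean operator.
if (OptionsResolver.needsAsyncCompaction(conf)) {
  // Online async compaction is enabled; the compaction pipeline takes care of cleaning.
  return Pipelines.compact(conf, pipeline);
} else if (OptionsResolver.isMorTable(conf)) {
  // MOR table with online compaction disabled: skip the clean operator and
  // leave cleaning to the offline compaction job.
  return Pipelines.dummySink(pipeline);
} else {
  // Assumed fallback for the remaining (COW) case, keeping the existing clean operator.
  return Pipelines.clean(conf, pipeline);
}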

Impact

Describe any public API or user-facing feature change or any performance impact.

Risk level (write none, low, medium or high below)

If medium or high, explain what verification was done to mitigate the risks.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

  return Pipelines.compact(conf, pipeline);
} else if (OptionsResolver.isMorTable(conf)) {
  return Pipelines.dummySink(pipeline);
} else {
Contributor

So what component is responsible for data cleaning then?

Contributor Author

@zhuanshenbsj1 zhuanshenbsj1 Apr 7, 2023

Offline compaction does the cleaning, consistent with clustering.

// Append mode
if (OptionsResolver.isAppendMode(conf)) {
  DataStream<Object> pipeline = Pipelines.append(conf, rowType, dataStream, context.isBounded());
  if (OptionsResolver.needsAsyncClustering(conf)) {
    return Pipelines.cluster(conf, rowType, pipeline);
  } else {
    return Pipelines.dummySink(pipeline); // If async clustering is disabled, no clean operator is added.
  }
}

Contributor

The cleaning can still take effect eventually, right? Because the table would get compacted at some point anyway.

Contributor Author

Offline clustering/compaction does the cleaning by default; adding the --clean-async-enabled config disables it.

Contributor Author

Currently the online async clean and the offline clean share the same configuration: clean.async.enabled. Adding a new configuration clean.offline.enable would make the configuration semantics clearer.
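A hypothetical sketch of how such a dedicated switch could be declared with Flink's ConfigOptions builder; the class name, option key, default value, and description below are purely illustrative, since this PR did not end up adding the option:

import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;

// Hypothetical option holder; not part of the actual patch.
public class OfflineCleanOptions {
  // Would let users toggle cleaning inside offline compaction/clustering jobs
  // independently of the online clean.async.enabled flag.
  public static final ConfigOption<Boolean> CLEAN_OFFLINE_ENABLED = ConfigOptions
      .key("clean.offline.enable")
      .booleanType()
      .defaultValue(true)
      .withDescription("Whether the offline compaction/clustering job also performs cleaning");
}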

Contributor

The Spark offline compaction job does not take care of cleaning. Could you make it clear how users can handle cleaning when they use Flink streaming ingestion together with Spark offline compaction?

Contributor Author

@zhuanshenbsj1 zhuanshenbsj1 Apr 19, 2023

Adjusted the cleaning operation in SparkRDDWriteClient#cluster/compact: when ASYNC_CLEAN is true, asynchronous cleaning is triggered in preWrite; otherwise a synchronous clean runs in autoCleanOnCommit().
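A rough sketch of the clustering side of that adjustment, mirroring the diff the reviewer comments on below; the method body is simplified and belongs inside SparkRDDWriteClient, so the members tableServiceClient and autoCleanOnCommit are taken from the comment rather than the full source:

// Simplified sketch of SparkRDDWriteClient#cluster after the described change.
public HoodieWriteMetadata<JavaRDD<WriteStatus>> cluster(String clusteringInstant, boolean shouldComplete) {
  HoodieWriteMetadata<JavaRDD<WriteStatus>> clusteringMetadata =
      tableServiceClient.cluster(clusteringInstant, shouldComplete);
  // With async clean disabled this falls back to a synchronous clean on commit;
  // with async clean enabled the cleaning is triggered earlier (in preWrite).
  autoCleanOnCommit();
  return clusteringMetadata;
}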

Contributor

We need to think through the end-to-end use case for Flink streaming ingestion and Spark offline compaction. We should add cleaning for Spark compaction and clustering first, which is a blocking change for this patch.

Contributor

We will not schedule the cleaning task if async cleaning is disabled:

if (conf.getBoolean(FlinkOptions.CLEAN_ASYNC_ENABLED) && isCleaning) {

so this should be fine?
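A minimal illustration of that guard; only the condition itself comes from the quoted code, while the enclosing class and method names here are hypothetical:

import org.apache.flink.configuration.Configuration;
import org.apache.hudi.configuration.FlinkOptions;

// Hypothetical wrapper around the quoted guard: cleaning is only scheduled when
// clean.async.enabled is true and a cleaning service is actually running.
public class CleanSchedulingSketch {
  private final boolean isCleaning;

  public CleanSchedulingSketch(boolean isCleaning) {
    this.isCleaning = isCleaning;
  }

  void scheduleCleaningIfNeeded(Configuration conf) {
    if (conf.getBoolean(FlinkOptions.CLEAN_ASYNC_ENABLED) && isCleaning) {
      // schedule the async cleaning task here (omitted)
    }
  }
}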

Contributor Author

@zhuanshenbsj1 zhuanshenbsj1 May 15, 2023

If CLEAN_ASYNC_ENABLED = true, a cleaning schedule will still be executed. I think cluster and compact should be consistent here (if async clustering is disabled for a COW table, there is no clean operator). And now both Spark and Flink clean up when executing offline jobs, unless it is explicitly disabled. What do you think?

@zhuanshenbsj1 zhuanshenbsj1 force-pushed the removeCleanForNoCompaction branch from 47cb215 to 67fc8b9 Compare April 13, 2023 02:38
@zhuanshenbsj1 zhuanshenbsj1 force-pushed the removeCleanForNoCompaction branch from 67fc8b9 to 36fc037 Compare April 13, 2023 07:54
@danny0405 danny0405 changed the title adjust HoodieTableSink for clean Eliminate cleaning tasks for flink mor table if async cleaning is disabled Apr 15, 2023
@danny0405 danny0405 changed the title Eliminate cleaning tasks for flink mor table if async cleaning is disabled [HUDI-6085] Eliminate cleaning tasks for flink mor table if async cleaning is disabled Apr 15, 2023
@danny0405 danny0405 self-assigned this Apr 15, 2023
@danny0405 danny0405 added engine:flink Flink integration area:table-service Table services area:streaming Streaming operations labels Apr 15, 2023
@zhuanshenbsj1 zhuanshenbsj1 force-pushed the removeCleanForNoCompaction branch from 36fc037 to 79ec658 Compare April 17, 2023 03:47
@hudi-bot
Collaborator

CI report:

-      return tableServiceClient.cluster(clusteringInstant, shouldComplete);
+      HoodieWriteMetadata<JavaRDD<WriteStatus>> clusteringMetadata = tableServiceClient.cluster(clusteringInstant, shouldComplete);
+      autoCleanOnCommit();
+      return clusteringMetadata;
Contributor

Do not introduce unrelated changes in one patch. If we want to add a cleaning procedure for Spark compaction and clustering, fire another PR and let's discuss it there.

Contributor Author

Removed the Spark-related modifications and migrated them to a new PR: #8505.

@zhuanshenbsj1 zhuanshenbsj1 force-pushed the removeCleanForNoCompaction branch from 9edae81 to 79ec658 Compare April 20, 2023 02:56
@zhuanshenbsj1 zhuanshenbsj1 changed the title [HUDI-6085] Eliminate cleaning tasks for flink mor table if async cleaning is disabled [HUDI-6085] Eliminate cleaning tasks for flink mor table if online async compaction is disabled Apr 20, 2023
@github-actions github-actions bot added the size:XS PR with lines of changes in <= 10 label Feb 26, 2024
@yihua
Contributor

yihua commented Mar 26, 2024

@zhuanshenbsj1 @danny0405 is this PR still relevant?

@danny0405
Contributor

cc @zhuanshenbsj1 I think we can close it, because we already have #8505.


Labels

area:streaming Streaming operations area:table-service Table services engine:flink Flink integration size:XS PR with lines of changes in <= 10

Projects

Status: ✅ Done


4 participants