[HUDI-5234] Streaming read skip clustering #7296

Merged
danny0405 merged 1 commit into apache:master from danny0405:HUDI-5234
Nov 25, 2022

Conversation

@danny0405
Contributor

Change Logs

Adds support for skipping clustering commits during streaming read. This makes streaming read on a table with clustering more efficient, because clustering only rewrites the original records into new files and adds no new data.
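The reasoning can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: the `Instant` class, the `streamingRead` method, and the sample timeline are hypothetical stand-ins that only model the idea that a clustering instant carries records which earlier instants have already delivered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SkipClusteringSketch {

    // Hypothetical stand-in for a timeline instant; this is not a Hudi class.
    static final class Instant {
        final String time;
        final boolean clustering;
        final List<String> records;

        Instant(String time, boolean clustering, List<String> records) {
            this.time = time;
            this.clustering = clustering;
            this.records = records;
        }
    }

    // Collects the records of every instant that an incremental read would consume.
    static List<String> streamingRead(List<Instant> timeline, boolean skipClustering) {
        List<String> out = new ArrayList<>();
        for (Instant instant : timeline) {
            if (skipClustering && instant.clustering) {
                // Clustering only rewrites records that earlier instants already delivered.
                continue;
            }
            out.addAll(instant.records);
        }
        return out;
    }

    // Assumed sample data: three write instants and one clustering instant.
    static List<Instant> sampleTimeline() {
        return Arrays.asList(
            new Instant("001", false, Arrays.asList("id1", "id2")),
            new Instant("002", false, Arrays.asList("id3")),
            // Clustering rewrites id1..id3 into new files without adding data.
            new Instant("003", true, Arrays.asList("id1", "id2", "id3")),
            new Instant("004", false, Arrays.asList("id4")));
    }

    public static void main(String[] args) {
        System.out.println(streamingRead(sampleTimeline(), false));
        System.out.println(streamingRead(sampleTimeline(), true));
    }
}
```

In this model, reading without the flag returns `id1`, `id2`, `id3` twice, while reading with the flag returns each record once and touches one instant fewer.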

Impact

No

Risk level (write none, low, medium or high below)

none

Documentation Update

The docs should be updated to cover the new option read.streaming.skip_clustering.
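As a starting point for that doc update, here is a hedged sketch of how a reader might enable the option. The class name, the helper methods, and the table path are hypothetical; the sketch only assembles the option keys into the `WITH (...)` clause of a Flink SQL table definition and does not call any Hudi or Flink API. It assumes streaming read is switched on with `read.streaming.enabled`, and it relies on the option definition in this PR, which sets the default of `read.streaming.skip_clustering` to false.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class SkipClusteringOptions {

    // Builds the table options for a streaming read that skips clustering instants.
    static Map<String, String> streamingReadOptions(String path) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("connector", "hudi");
        options.put("path", path); // hypothetical table location supplied by the caller
        options.put("read.streaming.enabled", "true");
        options.put("read.streaming.skip_clustering", "true"); // false unless set explicitly
        return options;
    }

    // Renders the options as the WITH (...) clause of a CREATE TABLE statement.
    static String toWithClause(Map<String, String> options) {
        return options.entrySet().stream()
            .map(e -> "  '" + e.getKey() + "' = '" + e.getValue() + "'")
            .collect(Collectors.joining(",\n", "WITH (\n", "\n)"));
    }

    public static void main(String[] args) {
        System.out.println(toWithClause(streamingReadOptions("file:///tmp/hudi/t1")));
    }
}
```

Keeping the options in an insertion-ordered map makes the rendered clause stable, which is convenient when the same options are reused in tests and in documentation snippets.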

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

public static void writeDataAsInsert(
List<RowData> dataBuffer,
Configuration conf) throws Exception {
InsertFunctionWrapper<RowData> funcWrapper = new InsertFunctionWrapper<>(
Contributor Author

TODO: refactor this code to use the writer wrapper.

.booleanType()
.defaultValue(false)
.withDescription("Whether to skip clustering instants for streaming read,\n"
+ "to avoid reading duplicates");
Member

Could you add the cases which cause reading duplicates in the description?

Contributor Author

Yeah, that's needed. Can you add more cases for clustering here? You can add UTs in TestInputFormat.

Member

@danny0405, OK. I will add the UTs.

@hudi-bot
Collaborator

CI report:

@danny0405 danny0405 merged commit 431ade0 into apache:master Nov 25, 2022
satishkotha pushed a commit to satishkotha/incubator-hudi that referenced this pull request Dec 12, 2022
Co-authored-by: zhuanshenbsj1 <zhuanshen_bsj@163.com>
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023
Co-authored-by: zhuanshenbsj1 <zhuanshen_bsj@163.com>