
@nsivabalan
Contributor

@nsivabalan nsivabalan commented Feb 17, 2022

What is the purpose of the pull request

Redo of #4446

After a rollback, a Hudi table may still contain data that should have been rolled back.

Root cause:

Let's first recall how the rollback plan is generated for the log blocks of a delta commit. Hudi considers two cases:

  1. Log files with no base file are assumed to consist entirely of inserted records, so they are deleted directly. The assumption here is that every inserted record is covered this way.
  2. For each file ID that the inflight commit metadata of the instant being rolled back marks as updated, a command block is appended to its log files to roll them back. This handles all updated records.

However, the first assumption does not always hold. An index that can index log files may route inserted records into an existing log file. In the current process, the inflight hoodieCommitMeta is generated before inserts are assigned to specific file groups, so it does not record which existing file groups receive those inserts.
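
To make the gap concrete, here is a minimal toy model of the two rollback cases described above. It is not Hudi code: `plan_rollback`, `uncovered`, the action names, and the file IDs are all hypothetical, and case 1 is simplified to "log files with no base file that this commit created".

```python
def plan_rollback(new_log_file_ids, updated_ids_in_inflight_meta):
    """Toy rollback planner (hypothetical): map each file ID to its rollback action."""
    plan = {}
    # Case 1: log files with no base file (assumed insert-only) are deleted.
    for file_id in new_log_file_ids:
        plan[file_id] = "DELETE_LOG_FILE"
    # Case 2: file IDs marked as updated in the inflight metadata
    # get a rollback command block appended.
    for file_id in updated_ids_in_inflight_meta:
        plan.setdefault(file_id, "APPEND_ROLLBACK_BLOCK")
    return plan


def uncovered(written_file_ids, plan):
    """File IDs the commit wrote to that the plan neither deletes nor rolls back."""
    return sorted(set(written_file_ids) - set(plan))


# The failed delta commit touched three file groups:
#   fg-1: updates to an existing file group (recorded in the inflight metadata)
#   fg-2: inserts into a brand-new log file
#   fg-3: inserts appended to an existing small log file (not recorded)
written = ["fg-1", "fg-2", "fg-3"]
plan = plan_rollback(new_log_file_ids=["fg-2"],
                     updated_ids_in_inflight_meta=["fg-1"])
leftover = uncovered(written, plan)
print(plan)
print(leftover)
```

In this sketch `fg-3` ends up in `leftover`: its inserts were appended to an existing log file, so neither case picks it up and the data survives the rollback.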

Brief change log

  1. Make the upsert partitioner generate execution workload stats that include every file group that will be written, in contrast to the workload derived from the input data alone. This covers the case where inserts are written into log files that are recognized as small files when using the HBase index.
  2. In such cases, we cannot guarantee that every log file containing only inserted data will be deleted during rollback; such files may instead be rolled back with a command block. This case is therefore handled in the compactor.
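
The difference between the input workload and the execution workload in item 1 can be sketched with a toy model. This is an illustration under assumptions, not the partitioner's actual logic: `input_workload`, `execution_workload`, the capacity-based small-file assignment, and the `fg-*` IDs are all hypothetical.

```python
from collections import Counter


def input_workload(tagged_records):
    """Stats visible from the input alone: updates per file ID, plus a bare insert count.

    tagged_records is a list of (record_key, file_id), with file_id None for inserts.
    """
    updates = Counter(fid for _, fid in tagged_records if fid is not None)
    inserts = sum(1 for _, fid in tagged_records if fid is None)
    return updates, inserts


def execution_workload(tagged_records, small_file_capacity, new_file_id="fg-new"):
    """Stats after inserts are assigned: every file ID that will actually be written."""
    remaining = dict(small_file_capacity)
    stats = {}
    for _, fid in tagged_records:
        if fid is not None:
            kind = "updates"
        else:
            kind = "inserts"
            # Fill existing small files first; fall back to a new file group.
            fid = next((f for f, cap in remaining.items() if cap > 0), new_file_id)
            if fid in remaining:
                remaining[fid] -= 1
        stats.setdefault(fid, {"inserts": 0, "updates": 0})[kind] += 1
    return stats


records = [("k1", "fg-1"), ("k2", None), ("k3", None), ("k4", None)]
updates, inserts = input_workload(records)
exec_stats = execution_workload(records, small_file_capacity={"fg-3": 2})
print(sorted(updates))     # file IDs known from the input
print(sorted(exec_stats))  # file IDs that will be written
```

Here the input workload only knows about `fg-1`, while the execution workload also lists `fg-3` (an existing small file receiving inserts) and `fg-new`. Recording the latter in the inflight metadata is what lets rollback find `fg-3`.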

Verify this pull request

  • Added end-to-end integration tests for the Spark client

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@nsivabalan
Contributor Author

@codope : can you review this patch?

@nsivabalan
Contributor Author

@guanziyue : feel free to review the patch.

@nsivabalan
Contributor Author

@codope : good to review again. Fixed unnecessary updating/populating of the output workload stats when not required.

Member

@codope codope left a comment

LGTM. If the else block is not necessary then it's better to remove it. You can land it after that.

@nsivabalan
Contributor Author

@hudi-bot run azure

@nsivabalan nsivabalan merged commit 4a59876 into apache:master Feb 28, 2022
@nsivabalan nsivabalan added the status:triaged Issue has been reviewed and categorized label Feb 28, 2022
rkkalluri pushed a commit to rkkalluri/hudi that referenced this pull request Mar 6, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022
stayrascal pushed a commit to stayrascal/hudi that referenced this pull request Apr 12, 2022

Labels

priority:critical Production degraded; pipelines stalled status:triaged Issue has been reviewed and categorized
