Conversation

@boneanxs
Contributor

@boneanxs boneanxs commented Dec 5, 2022

Change Logs

  1. Add extraPreCommitFunc to BaseHoodieWriteClient.commitStats so that it can execute a customized function before the commit completes. HoodieStreamingSink uses it to check the checkpoint info and update it.
  2. Add a new param, hoodie.datasource.write.streaming.checkpoint.identifier, to identify each writer's checkpoint info. If it is not set, the sink holds an in-memory latestBatchId to avoid the issue addressed in [HUDI-4389] Make HoodieStreamingSink idempotent #6098.
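
The two changes above can be illustrated with a minimal, language-neutral sketch in Python. It is not the Hudi implementation: the function names, and the assumption that the checkpoint info is stored as a JSON map from writer identifier to last committed batch id under the `_hudi_streaming_sink_checkpoint` metadata key, are hypothetical and chosen only to show the idea of a per-writer checkpoint that is checked and updated just before commit.

```python
import json

# Metadata key named in this PR; the JSON-map layout below is an assumption.
CHECKPOINT_KEY = "_hudi_streaming_sink_checkpoint"


def last_committed_batch(commit_metadata, identifier):
    """Return the last batch id recorded for `identifier`, or None if absent."""
    raw = commit_metadata.get(CHECKPOINT_KEY)
    if raw is None:
        return None
    return json.loads(raw).get(identifier)


def can_skip_batch(commit_metadata, identifier, batch_id):
    """A batch can be skipped if this writer already committed it (idempotency)."""
    last = last_committed_batch(commit_metadata, identifier)
    return last is not None and batch_id <= last


def pre_commit_update(commit_metadata, identifier, batch_id):
    """Hypothetical pre-commit hook: merge this writer's batch id into the shared map.

    Other writers' entries are preserved, which is the point of keying by identifier.
    """
    checkpoints = json.loads(commit_metadata.get(CHECKPOINT_KEY, "{}"))
    checkpoints[identifier] = batch_id
    updated = dict(commit_metadata)  # leave the caller's metadata untouched
    updated[CHECKPOINT_KEY] = json.dumps(checkpoints)
    return updated
```

With two writers sharing one table, each writer only consults and updates its own entry, so a replayed batch from one writer is skipped without affecting the other.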

Impact

Existing jobs that already write _hudi_streaming_sink_checkpoint might lose their old checkpoint info, because the checkpoint is now looked up by a user-provided identifier rather than by ${sqlContext.sparkContext.applicationId}-$queryId.
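
Why the old checkpoint can be lost may be easier to see with a small sketch. It assumes, hypothetically, that the stored checkpoint is a JSON map keyed by writer identifier; the key values shown are made up for illustration.

```python
import json


def lookup_checkpoint(stored_json, key):
    """Return the batch id stored under `key`, or None when the key is absent."""
    return json.loads(stored_json).get(key)


# Hypothetical old-style key of the form <applicationId>-<queryId>.
old_key = "application_0001-query_42"
stored = json.dumps({old_key: 17})

# Hypothetical user-provided identifier configured after the upgrade.
new_identifier = "ingest-writer-1"

found_with_old_key = lookup_checkpoint(stored, old_key)
found_with_new_key = lookup_checkpoint(stored, new_identifier)
```

Under these assumptions the lookup by the new identifier finds nothing, so the first batch after the upgrade cannot be recognized as already committed.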

Risk level (write none, low, medium or high below)

low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default values of configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@nsivabalan nsivabalan requested a review from codope December 5, 2022 23:29
@nsivabalan nsivabalan added the priority:blocker Production down; release blocker label Dec 5, 2022
@nsivabalan
Contributor

@codope: can you review this code? This adds multi-writer checkpoints for Spark Structured Streaming.

@nsivabalan nsivabalan added priority:critical Production degraded; pipelines stalled release-0.12.2 Patches targetted for 0.12.2 and removed priority:blocker Production down; release blocker labels Dec 6, 2022
Member

@codope codope left a comment

Thanks for fixing the multi-writer scenario. Since it is a breaking change (it will break Spark streaming pipelines that use the prior checkpoint method), it is best to take it in the upcoming major release, 0.13.0. Approving with some minor changes.

@codope codope added area:concurrency Concurrency control and multi-writer spark-streaming and removed release-0.12.2 Patches targetted for 0.12.2 labels Dec 13, 2022
@hudi-bot
Collaborator

CI report:

@codope codope merged commit 2b66889 into apache:master Dec 13, 2022
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023
* Checkpoint management for muti-writer scenario

* Remove old logic

* Add sinceVersion to config and minor changes

Co-authored-by: Sagar Sumit <[email protected]>
nsivabalan pushed a commit to nsivabalan/hudi that referenced this pull request Apr 20, 2023
* Checkpoint management for muti-writer scenario

* Remove old logic

* Add sinceVersion to config and minor changes

Co-authored-by: Sagar Sumit <[email protected]>
Labels

area:concurrency Concurrency control and multi-writer priority:critical Production degraded; pipelines stalled

Projects

Archived in project

4 participants