Skip to content

Conversation

@nsivabalan
Copy link
Contributor

Change Logs

  • Fixing checkpoint management for multiple streaming writers. Fix is that, each writer updates the checkpoint in commit metadata with its own batchId info only. When checking to skip current batch, we walk back in the timeline and find current writer's last committed batchId.
  • Also fixed bulk insert row writer path for checkpoint management with streaming writes.

Impact

  • Ensure idempotency for multiple streaming writers using spark structured streaming.

Risk level (write none, low medium or high below)

low.

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

Copy link
Member

@codope codope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good. Minor refactoring comment.

object HoodieStreamingSink {

// This constant serves as the checkpoint key for streaming sink so that each microbatch is processed exactly-once.
val SINK_CHECKPOINT_KEY = "_hudi_streaming_sink_checkpoint"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to remove this object now? It's just holding a constant.

@nsivabalan
Copy link
Contributor Author

nsivabalan commented May 1, 2023

@codope : guess we can't define static variables within scala class and hence it has to go into the object. I would prefer to keep it in HoodieStreamingSink if you were suggesting to move it to some other class. since this is applicable only for streaming writes.
Let me know what do you think.

@codope
Copy link
Member

codope commented May 2, 2023

Sounds good. I wanted to move to write config and unify the constants for deltastreamer and spark streaming. But, now I think it makes sense to keep it separate.

@apache apache deleted a comment from hudi-bot May 2, 2023
@codope
Copy link
Member

codope commented May 2, 2023

@hudi-bot
Copy link
Collaborator

hudi-bot commented May 2, 2023

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@codope codope merged commit b12be0d into apache:master May 2, 2023
yihua pushed a commit to yihua/hudi that referenced this pull request May 15, 2023
…rs (apache#8558)

Each writer updates the checkpoint in commit metadata
with its own batchId info only. When checking to skip the
current batch, we walk back in the timeline and find the
current writer's last committed batchId. Also fixed bulk insert
row writer path for checkpoint management with streaming writes.
yihua pushed a commit to yihua/hudi that referenced this pull request May 15, 2023
…rs (apache#8558)

Each writer updates the checkpoint in commit metadata
with its own batchId info only. When checking to skip the
current batch, we walk back in the timeline and find the
current writer's last committed batchId. Also fixed bulk insert
row writer path for checkpoint management with streaming writes.
yihua pushed a commit to yihua/hudi that referenced this pull request May 17, 2023
…rs (apache#8558)

Each writer updates the checkpoint in commit metadata
with its own batchId info only. When checking to skip the
current batch, we walk back in the timeline and find the
current writer's last committed batchId. Also fixed bulk insert
row writer path for checkpoint management with streaming writes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

4 participants