Conversation

@mxm
Contributor

@mxm mxm commented Sep 25, 2025

Previously, the DynamicCommitter could commit duplicate WriteResults when recovering from failures, leading to incorrect data in tables. This change removes the per-WriteResult commits that caused the problem in the first place, and instead commits all the WriteResults for each Flink checkpoint / table / branch together.

The WriteResults for each checkpoint / table / branch can safely be committed together, even in the presence of delete files. Since we now commit once per Flink checkpoint and table / branch, we can use the max Flink checkpoint id already recorded in the snapshot summary for the table / branch to detect whether we have already committed.
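
As a minimal sketch of that detection (not the PR's actual code; the summary keys and the helper's shape are assumptions based on how the Flink sink already records commit metadata), the committer can walk the branch's snapshot ancestry and read the max committed checkpoint id from the snapshot summaries:

import java.util.Map;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.SnapshotRef;
import org.apache.iceberg.Table;

private static long maxCommittedCheckpointId(
    Table table, String branch, String flinkJobId, String operatorId) {
  SnapshotRef ref = table.refs().get(branch);
  Snapshot snapshot = ref == null ? null : table.snapshot(ref.snapshotId());
  while (snapshot != null) {
    Map<String, String> summary = snapshot.summary();
    // Only trust snapshots written by this job/operator pair.
    if (flinkJobId.equals(summary.get("flink.job-id"))
        && operatorId.equals(summary.get("flink.operator-id"))) {
      String checkpointId = summary.get("flink.max-committed-checkpoint-id");
      if (checkpointId != null) {
        return Long.parseLong(checkpointId);
      }
    }
    Long parentId = snapshot.parentId();
    snapshot = parentId == null ? null : table.snapshot(parentId);
  }
  return -1L; // nothing committed yet for this table/branch
}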

We intentionally chose not to combine WriteResults across Flink checkpoints, which would be possible when a checkpoint only contains append-only data. While technically feasible, it is a premature optimization that complicates the implementation and its maintainability to solve a very rare edge case: committing multiple pending checkpoints at once. Multiple pending checkpoints can only occur with a very low checkpoint interval or when concurrent checkpointing is enabled. Even in such a situation, it is preferable to keep the commits separate, as this makes it easier to reason about the applied changes.

Previously, the DynamicCommitter could commit duplicate WriteResults when
recovering from failures, leading to incorrect data in tables. This change
introduces tracking of the maximum committed WriteResult index per checkpoint
to ensure idempotent behavior during recovery scenarios.

Key changes:
- Added MAX_WRITE_RESULT_INDEX snapshot property to track committed WriteResults
- Modified commit logic to skip already committed WriteResults within a checkpoint
- Optimized atomic commits by batching append-only WriteResults into single transactions
- Updated tests to verify idempotent behavior with simulated failures
@mxm
Contributor Author

mxm commented Sep 25, 2025

I've revised the approach here quite a bit, which is now also reflected in the PR description.

@aiborodin
Contributor

@mxm Even with this change, the data loss will still occur for WriteResults with delete files in the scenario described in #14090. For example, consider the case when the DynamicCommitter fails after the first committed RowDelta in the private void commitDeltaTxn() method:

private void commitDeltaTxn(...) {
  // ...
  RowDelta rowDelta = null;
  long checkpointId = -1;

  for (Map.Entry<Long, List<WriteResult>> e : pendingResults.entrySet()) {
      // ...
      if (rowDelta != null
          && writeResults.stream().anyMatch(writeResult -> writeResult.deleteFiles().length > 0)) {
        // ...
        // aiborodin: The data loss occurs if we fail after the first iteration.
        // Flink will re-attempt all dynamic committables from the last checkpoint and skip the
        // remaining committables on line 146 because we already committed the first committable
        // with the current checkpoint ID
        commitOperation(
            table, branch, rowDelta, summary, "rowDelta", newFlinkJobId, operatorId, checkpointId);
        rowDelta = null;
      }

The complete solution is to aggregate all WriteResults for a (checkpoint, table, branch) triplet, which I implemented in #14092 in DynamicWriteResultAggregator. It is valid to aggregate delete files from WriteResults within a single checkpoint because all changes within a checkpoint are logically concurrent and get the same sequence number when committed. The non-dynamic IcebergSink aggregates WriteResults in the same way within the scope of the same checkpoint.
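
As a rough illustration of that aggregation (the helper name and shape are mine, not taken from #14092; it assumes the WriteResult.Builder API used by the non-dynamic sink), all WriteResults of one (checkpoint, table, branch) triplet can be merged into a single WriteResult before one commit:

import java.util.List;
import org.apache.iceberg.io.WriteResult;

static WriteResult aggregate(List<WriteResult> resultsForCheckpoint) {
  WriteResult.Builder builder = WriteResult.builder();
  for (WriteResult result : resultsForCheckpoint) {
    // All results of one checkpoint are logically concurrent and receive the same
    // sequence number on commit, so merging them into one commit is safe.
    builder.add(result);
  }
  return builder.build();
}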

More generally, I think DynamicCommitter should not be responsible for aggregating WriteResults from a single checkpoint. There is a dedicated class for this - DynamicWriteResultAggregator, which decides how commit requests (DynamicCommittables) are created. DynamicCommitter should commit incoming requests and rely on a contract of a single commit request per (table, branch, checkpointId) triplet. That's why in #14092, I changed NavigableMap<Long, List<WriteResult>> pendingResults to NavigableMap<Long, WriteResult> pendingResults in the DynamicCommitter - there should be one and only one commit request per checkpoint to maintain the idempotence contract.

Combining WriteResults across checkpoints for appends is a different story. It is valid to do this in DynamicCommitter because it is the only logical place in the code that has the context across multiple checkpoints, while DynamicWriteResultAggregator always operates within a single checkpoint.

I'm happy to discuss this online in Slack or over a Zoom call to clarify this. @pvary, would you be interested in joining as well?

@mxm
Contributor Author

mxm commented Sep 26, 2025

@mxm Even with this change, the data loss will still occur for WriteResults with delete files in the scenario described in #14090. For example, consider the case when the DynamicCommitter fails after the first committed RowDelta in the private void commitDeltaTxn() method:

I have to politely disagree with you here. We commit in two cases:

  1. Whenever a checkpoint contains delete files (the only exception is when we haven't yet processed any checkpoints)
  2. At the end of processing all checkpoints and their WriteResults

Since the smallest unit at which we create a table snapshot is a Flink checkpoint, we will always be able to recover the commit state by looking up the highest committed checkpoint id from the snapshot summary, which is kept per table/branch.
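
A hedged sketch of that recovery path (maxCommittedCheckpointId and commitCheckpoint are placeholder helpers, not the PR's code): on restore, every pending checkpoint at or below the recovered id is dropped, and each remaining checkpoint becomes exactly one table snapshot.

long maxCommitted = maxCommittedCheckpointId(table, branch, newFlinkJobId, operatorId);

// Skip checkpoints that are already reflected in the table/branch, then commit the
// rest one checkpoint at a time, so a failure leaves at most one re-committable unit.
NavigableMap<Long, List<WriteResult>> uncommitted =
    pendingResults.tailMap(maxCommitted, /* inclusive= */ false);
for (Map.Entry<Long, List<WriteResult>> entry : uncommitted.entrySet()) {
  commitCheckpoint(table, branch, entry.getKey(), entry.getValue());
}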

The complete solution is to aggregate all WriteResults for a (checkpoint, table, branch) triplet, which I implemented in #14092 in DynamicWriteResultAggregator. It is valid to aggregate delete files from WriteResults within a single checkpoint because all changes within a checkpoint are logically concurrent and get the same sequence number when committed.

This is precisely what this change does. There was an additional optimization to also combine multiple checkpoints into a single commit, but I think it makes the code hard to review. I'm going to revert that part and only commit each checkpoint separately. This makes the code easier to reason about. Also, the situation where we have WriteResults from multiple Flink checkpoints occurs very rarely.

Combining WriteResults across checkpoints for appends is a different story. It is valid to do this in DynamicCommitter because it is the only logical place in the code that has the context across multiple checkpoints, while DynamicWriteResultAggregator always operates within a single checkpoint.

For the sake of simplicity, I will revert the change to combine WriteResults from multiple Flink checkpoints.

… combine append-only WriteResults from multiple checkpoints
@mxm
Contributor Author

mxm commented Sep 26, 2025

@aiborodin I've pushed the simplification. We can also discuss on Slack if there are still open questions.

@mxm mxm force-pushed the idempotent-committer branch from 2cc812c to e67c082 on September 26, 2025 08:51
Comment on lines +281 to +295
for (List<WriteResult> writeResults : pendingResults.values()) {
  for (WriteResult result : writeResults) {
    Arrays.stream(result.dataFiles()).forEach(dynamicOverwrite::addFile);
  }
}

commitOperation(
    table,
    branch,
    dynamicOverwrite,
    summary,
    "dynamic partition overwrite",
    newFlinkJobId,
    operatorId,
    pendingResults.lastKey());
Contributor

Should this be:

for (List<WriteResult> writeResults : pendingResults.values()) {
  ReplacePartitions dynamicOverwrite = table.newReplacePartitions().scanManifestsWith(workerPool);
  for (WriteResult result : writeResults) {
    Arrays.stream(result.dataFiles()).forEach(dynamicOverwrite::addFile);
  }

  commitOperation(
      table,
      branch,
      dynamicOverwrite,
      summary,
      "dynamic partition overwrite",
      newFlinkJobId,
      operatorId,
      pendingResults.lastKey());
}

Contributor

We still need to commit the checkpoints one-by-one. What if the replace happened for the same partition? With the proposed method, we would end up with duplicated data.

Contributor Author

It could be, but IMHO correctness isn't affected. This is a leftover from the previous commit, where we would still combine as many WriteResults as possible into a single table snapshot. Since replace partitions is append-only, I figured we could keep this optimization. However, for the sake of consistency with non-replacing writes, we could also go with your suggestion.

Contributor

If I understand correctly, the overwrite mode should only be enabled in batch jobs, as it is very hard to make any claims about the results in a streaming job.

Also, the current implementation is problematic: we could have multiple data files for a given partition, and then they will replace each other, and only the last one wins 😢

Also, if checkpointing is turned on, we will have the same issue as mentioned above, just with a bit more data. The 2nd checkpoint might delete data from the 1st checkpoint, because it is replacing the same partition.

So this means that replace partitions only works if checkpointing is turned off (or you are lucky 😄)

So essentially, it doesn't matter which solution we choose 😄

Contributor Author

That's probably a topic for another day, as both the IcebergSink and the older FlinkSink have this issue. The implementation in this PR is, however, consistent with how the IcebergSink works, see for (WriteResult result : pendingResults.values()) in its commit path. Either way, as you said, this feature is questionable, to phrase it mildly.

Contributor

As discussed offline. We can keep your proposed solution here, and we could file another PR to throw an exception or log an error if checkpointing is on and the overwrite flag is used.
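
For reference, such a guard could look roughly like this (a sketch only; the overwriteMode flag and the place where the check runs are assumptions, while CheckpointConfig#isCheckpointingEnabled is the standard Flink API for detecting checkpointing):

// Fail fast when dynamic partition overwrite is combined with checkpointing, since a
// later checkpoint may replace partitions committed by an earlier one.
if (overwriteMode && env.getCheckpointConfig().isCheckpointingEnabled()) {
  throw new IllegalArgumentException(
      "Overwrite mode should not be used with checkpointing enabled: "
          + "each checkpoint may overwrite partitions written by the previous one.");
}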

@pvary
Contributor

pvary commented Sep 26, 2025

More generally, I think DynamicCommitter should not be responsible for aggregating WriteResults from a single checkpoint. There is a dedicated class for this - DynamicWriteResultAggregator, which decides how commit requests (DynamicCommittables) are created.

@aiborodin: This seems like a reasonable optimization to me. OTOH the correctness issue could be solved with a lightweight change (don't commit every WriteResult one-by-one, but first aggregate on the Committer side). This could go into 1.10.1, and then the optimization you have suggested could go into 1.11.0.

WDYT?

@aiborodin
Contributor

This could go into 1.10.1, and then the optimization you have suggested could go into 1.11.0.

@pvary I am okay with that. I also considered aggregating WriteResults in the DynamicCommitter, which is what this PR ultimately achieves, but decided in favour of the approach in #14092, which makes more sense architecturally. I would appreciate it if we could continue reviewing #14092 to get it in as well.

As discussed offline. ...

@pvary @mxm It seems there is some communication (and decisions being made) outside of GitHub. I want to be part of those discussions and contribute to them, especially given that apache/iceberg is a community-owned project. Is there some channel where this communication happens? Could I please be added to it? Thanks.

@pvary pvary merged commit 3860284 into apache:main Sep 29, 2025
18 checks passed
@pvary
Contributor

pvary commented Sep 29, 2025

Merged to main.
Thanks to @aiborodin for identifying the issue and to @mxm for the quick fix, while we work on a better solution.
Thanks @shangxinli for the review!

@pvary
Contributor

pvary commented Sep 29, 2025

@pvary @mxm It seems there is some communication (and decisions being made) outside of GitHub. I want to be part of those discussions and contribute to them, especially given that apache/iceberg is a community-owned project. Is there some channel where this communication happens? Could I please be added to it? Thanks.

We don't decide things offline. We always try to bring the result of the discussions back to the dev list or to the GitHub PR so others can raise their voice too.
If you think we decided on something you don't agree with, feel free to raise the question again.

We do the "offline" discussions on Slack. I will try to reach out to you there too.

@pvary
Contributor

pvary commented Sep 29, 2025

@aiborodin: Do you have a Slack user? Are you part of the Iceberg Slack community?

mxm added a commit to mxm/iceberg that referenced this pull request Sep 29, 2025
huaxingao pushed a commit that referenced this pull request Sep 29, 2025
@pvary pvary added this to the Iceberg 1.10.1 milestone Sep 30, 2025
gabeiglio pushed a commit to gabeiglio/iceberg that referenced this pull request Oct 1, 2025
gabeiglio pushed a commit to gabeiglio/iceberg that referenced this pull request Oct 1, 2025
@aiborodin
Contributor

Are you part of the Iceberg Slack community?

@pvary I am not, how can I join?

@pvary
Contributor

pvary commented Oct 2, 2025

Are you part of the Iceberg Slack community?

@pvary I am not, how can I join?

https://iceberg.apache.org/community/?h=slack#slack

adawrapub pushed a commit to adawrapub/iceberg that referenced this pull request Oct 16, 2025
adawrapub pushed a commit to adawrapub/iceberg that referenced this pull request Oct 16, 2025
huaxingao pushed a commit to huaxingao/iceberg that referenced this pull request Nov 1, 2025
huaxingao pushed a commit to huaxingao/iceberg that referenced this pull request Nov 1, 2025
…the presence of failures (apache#14213)

(cherry picked from commit 441597e)
huaxingao added a commit that referenced this pull request Nov 1, 2025
…e presence of failures (#14461)

* Flink: Ensure DynamicCommitter Idempotence in the presence of failures (#14182)

(cherry picked from commit 3860284)

* Flink: Backport #14182: Ensure DynamicCommitter Idempotence in the presence of failures (#14213)

(cherry picked from commit 441597e)

---------

Co-authored-by: Maximilian Michels <[email protected]>