Conversation

@mxm
Contributor

@mxm mxm commented Sep 25, 2025

Previously, the DynamicCommitter could commit duplicate WriteResults when recovering from failures, leading to incorrect data in tables. This change removes the per-WriteResult commits that caused the problem in the first place, and instead commits all the WriteResults for each Flink checkpoint / table / branch together.

The WriteResults for each checkpoint / table / branch can safely be committed together, even in the presence of delete files. Since we now commit once per Flink checkpoint and table / branch, we can use the max Flink checkpoint id already recorded in the snapshot summary for the table / branch to detect whether we have already committed.
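
As a minimal sketch of that detection (not the PR's actual code; the summary keys and the helper's shape are assumptions based on how the Flink sink already records commit metadata), the committer can walk the branch's snapshot ancestry and read the max committed checkpoint id from the snapshot summaries:

import java.util.Map;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.SnapshotRef;
import org.apache.iceberg.Table;

private static long maxCommittedCheckpointId(
    Table table, String branch, String flinkJobId, String operatorId) {
  SnapshotRef ref = table.refs().get(branch);
  Snapshot snapshot = ref == null ? null : table.snapshot(ref.snapshotId());
  while (snapshot != null) {
    Map<String, String> summary = snapshot.summary();
    // Only trust snapshots written by this job/operator pair.
    if (flinkJobId.equals(summary.get("flink.job-id"))
        && operatorId.equals(summary.get("flink.operator-id"))) {
      String checkpointId = summary.get("flink.max-committed-checkpoint-id");
      if (checkpointId != null) {
        return Long.parseLong(checkpointId);
      }
    }
    Long parentId = snapshot.parentId();
    snapshot = parentId == null ? null : table.snapshot(parentId);
  }
  return -1L; // nothing committed yet for this table/branch
}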

We intentionally chose not to combine WriteResults across Flink checkpoints, which would be possible when a checkpoint only contains append-only data. While technically feasible, it is a premature optimization that complicates the implementation and its maintainability to solve a very rare edge case: committing multiple pending checkpoints at once. Multiple pending checkpoints can only occur with a very low checkpoint interval or when concurrent checkpointing is enabled. Even in such a situation, it is preferable to keep the commits separate, as this makes it easier to reason about the applied changes.

Previously, the DynamicCommitter could commit duplicate WriteResults when
recovering from failures, leading to incorrect data in tables. This change
introduces tracking of the maximum committed WriteResult index per checkpoint
to ensure idempotent behavior during recovery scenarios.

Key changes:
- Added MAX_WRITE_RESULT_INDEX snapshot property to track committed WriteResults
- Modified commit logic to skip already committed WriteResults within a checkpoint
- Optimized atomic commits by batching append-only WriteResults into single transactions
- Updated tests to verify idempotent behavior with simulated failures
@mxm
Contributor Author

mxm commented Sep 25, 2025

I've revised the approach here quite a bit, which is now also reflected in the PR description.

@aiborodin
Contributor

@mxm Even with this change, the data loss will still occur for WriteResults with delete files in the scenario described in #14090. For example, consider the case when the DynamicCommitter fails after the first committed RowDelta in the private void commitDeltaTxn() method:

private void commitDeltaTxn(...) {
  // ...
  RowDelta rowDelta = null;
  long checkpointId = -1;

  for (Map.Entry<Long, List<WriteResult>> e : pendingResults.entrySet()) {
      // ...
      if (rowDelta != null
          && writeResults.stream().anyMatch(writeResult -> writeResult.deleteFiles().length > 0)) {
        // ...
        // aiborodin: The data loss occurs if we fail after the first iteration.
        // Flink will re-attempt all dynamic committables from the last checkpoint and skip the
        // remaining committables on line 146 because we already committed the first committable
        // with the current checkpoint ID
        commitOperation(
            table, branch, rowDelta, summary, "rowDelta", newFlinkJobId, operatorId, checkpointId);
        rowDelta = null;
      }

The complete solution is to aggregate all WriteResults for a (checkpoint, table, branch) triplet, which I implemented in #14092 in DynamicWriteResultAggregator. It is valid to aggregate delete files from WriteResults within a single checkpoint because all changes within a checkpoint are logically concurrent and get the same sequence number when committed. The non-dynamic IcebergSink aggregates WriteResults in the same way within the scope of the same checkpoint.
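
As a rough illustration of that aggregation (the helper name and shape are mine, not taken from #14092; it assumes the WriteResult.Builder API used by the non-dynamic sink), all WriteResults of one (checkpoint, table, branch) triplet can be merged into a single WriteResult before one commit:

import java.util.List;
import org.apache.iceberg.io.WriteResult;

static WriteResult aggregate(List<WriteResult> resultsForCheckpoint) {
  WriteResult.Builder builder = WriteResult.builder();
  for (WriteResult result : resultsForCheckpoint) {
    // All results of one checkpoint are logically concurrent and receive the same
    // sequence number on commit, so merging them into one commit is safe.
    builder.add(result);
  }
  return builder.build();
}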

More generally, I think DynamicCommitter should not be responsible for aggregating WriteResults from a single checkpoint. There is a dedicated class for this - DynamicWriteResultAggregator, which decides how commit requests (DynamicCommittables) are created. DynamicCommitter should commit incoming requests and rely on a contract of a single commit request per (table, branch, checkpointId) triplet. That's why in #14092, I changed NavigableMap<Long, List<WriteResult>> pendingResults to NavigableMap<Long, WriteResult> pendingResults in the DynamicCommitter - there should be one and only one commit request per checkpoint to maintain the idempotence contract.

Combining WriteResults across checkpoints for appends is a different story. It is valid to do this in DynamicCommitter because it is the only logical place in the code that has the context across multiple checkpoints, while DynamicWriteResultAggregator always operates within a single checkpoint.

I'm happy to discuss this online in Slack or over a Zoom call to clarify this. @pvary, would you be interested in joining as well?

@mxm
Contributor Author

mxm commented Sep 26, 2025

@mxm Even with this change, the data loss will still occur for WriteResults with delete files in the scenario described in #14090. For example, consider the case when the DynamicCommitter fails after the first committed RowDelta in the private void commitDeltaTxn() method:

I have to politely disagree with you here. We commit in two cases:

  1. Whenever a checkpoint contains delete files (the only exception is when we haven't yet processed any checkpoints)
  2. At the end of processing all checkpoints and their WriteResults

Since the smallest unit at which we create a table snapshot is a Flink checkpoint, we will always be able to recover the commit state by looking up the highest committed checkpoint id from the snapshot summary, which is kept per table/branch.
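
A hedged sketch of that recovery path (maxCommittedCheckpointId and commitCheckpoint are placeholder helpers, not the PR's code): on restore, every pending checkpoint at or below the recovered id is dropped, and each remaining checkpoint becomes exactly one table snapshot.

long maxCommitted = maxCommittedCheckpointId(table, branch, newFlinkJobId, operatorId);

// Skip checkpoints that are already reflected in the table/branch, then commit the
// rest one checkpoint at a time, so a failure leaves at most one re-committable unit.
NavigableMap<Long, List<WriteResult>> uncommitted =
    pendingResults.tailMap(maxCommitted, /* inclusive= */ false);
for (Map.Entry<Long, List<WriteResult>> entry : uncommitted.entrySet()) {
  commitCheckpoint(table, branch, entry.getKey(), entry.getValue());
}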

The complete solution is to aggregate all WriteResults for a (checkpoint, table, branch) triplet, which I implemented in #14092 in DynamicWriteResultAggregator. It is valid to aggregate delete files from WriteResults within a single checkpoint because all changes within a checkpoint are logically concurrent and get the same sequence number when committed.

This is precisely what this change does. There was an additional optimization to also combine multiple checkpoints into a single commit, but I think it makes the code hard to review. I'm going to revert that part and only commit each checkpoint separately. This makes the code easier to reason about. Also, the situation where we have WriteResults from multiple Flink checkpoints occurs very rarely.

Combining WriteResults across checkpoints for appends is a different story. It is valid to do this in DynamicCommitter because it is the only logical place in the code that has the context across multiple checkpoints, while DynamicWriteResultAggregator always operates within a single checkpoint.

For the sake of simplicity, I will revert the change to combine WriteResults from multiple Flink checkpoints.

… combine append-only WriteResults from multiple checkpoints
@mxm
Contributor Author

mxm commented Sep 26, 2025

@aiborodin I've pushed the simplification. We can also discuss on Slack if there are still open questions.

@mxm mxm force-pushed the idempotent-committer branch from 2cc812c to e67c082 on September 26, 2025 08:51
Comment on lines +281 to +295
for (List<WriteResult> writeResults : pendingResults.values()) {
  for (WriteResult result : writeResults) {
    Arrays.stream(result.dataFiles()).forEach(dynamicOverwrite::addFile);
  }
}

commitOperation(
    table,
    branch,
    dynamicOverwrite,
    summary,
    "dynamic partition overwrite",
    newFlinkJobId,
    operatorId,
    pendingResults.lastKey());
Contributor

Should this be:

for (List<WriteResult> writeResults : pendingResults.values()) {
  ReplacePartitions dynamicOverwrite = table.newReplacePartitions().scanManifestsWith(workerPool);
  for (WriteResult result : writeResults) {
    Arrays.stream(result.dataFiles()).forEach(dynamicOverwrite::addFile);
  }

  commitOperation(
      table,
      branch,
      dynamicOverwrite,
      summary,
      "dynamic partition overwrite",
      newFlinkJobId,
      operatorId,
      pendingResults.lastKey());
}

Contributor

We still need to commit the checkpoints one-by-one. What if the replace happened for the same partition? With the proposed method, we would end up with duplicated data.

Contributor Author

It could be, but IMHO correctness isn't affected. This is a leftover from the previous commit, where we would still combine as many WriteResults as possible into a single table snapshot. Since replace partitions is append-only, I figured we could keep this optimization. However, for the sake of consistency with non-replacing writes, we could also go with your suggestion.

Contributor

If I understand correctly, the overwrite mode should only be enabled in batch jobs, as it is very hard to make any claims about the results in a streaming job.

Also, the current implementation is problematic: we could have multiple data files for a given partition, and then they will replace each other, and only the last one wins 😢

Also, if checkpointing is turned on, we will have the same issue as mentioned above, just with a bit more data. The 2nd checkpoint might delete data from the 1st checkpoint, because it is replacing the same partition.

So this means that replace partitions only works if checkpointing is turned off (or you are lucky 😄)

So essentially, it doesn't matter which solution we choose 😄

Contributor Author

That's probably a topic for another day, as both the IcebergSink and the older FlinkSink have this issue. The implementation in this PR is, however, consistent with how the IcebergSink works, see for (WriteResult result : pendingResults.values()) in its commit path. Either way, as you said, this feature is questionable, to phrase it mildly.

Contributor

As discussed offline. We can keep your proposed solution here, and we could file another PR to throw an exception or log an error if checkpointing is on and the overwrite flag is used.
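
For reference, such a guard could look roughly like this (a sketch only; the overwriteMode flag and the place where the check runs are assumptions, while CheckpointConfig#isCheckpointingEnabled is the standard Flink API for detecting checkpointing):

// Fail fast when dynamic partition overwrite is combined with checkpointing, since a
// later checkpoint may replace partitions committed by an earlier one.
if (overwriteMode && env.getCheckpointConfig().isCheckpointingEnabled()) {
  throw new IllegalArgumentException(
      "Overwrite mode should not be used with checkpointing enabled: "
          + "each checkpoint may overwrite partitions written by the previous one.");
}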

@pvary
Contributor

pvary commented Sep 26, 2025

More generally, I think DynamicCommitter should not be responsible for aggregating WriteResults from a single checkpoint. There is a dedicated class for this - DynamicWriteResultAggregator, which decides how commit requests (DynamicCommittables) are created.

@aiborodin: This seems like a reasonable optimization to me. OTOH the correctness issue could be solved with a lightweight change (don't commit every WriteResult one-by-one, but first aggregate on the Committer side). This could go into 1.10.1, and then the optimization you have suggested could go into 1.11.0.

WDYT?

@aiborodin
Contributor

This could go into 1.10.1, and then the optimization you have suggested could go into 1.11.0.

@pvary I am okay with that. I also considered aggregating WriteResults in the DynamicCommitter, which is what this PR ultimately achieves, but decided in favour of the approach in #14092, which makes more sense architecturally. I would appreciate it if we could continue reviewing #14092 to get it in as well.

As discussed offline. ...

@pvary @mxm It seems there is some communication (and decisions being made) outside of GitHub. I want to be part of those discussions and contribute to them, especially given that apache/iceberg is a community-owned project. Is there some channel where this communication happens? Could I please be added to it? Thanks.

@pvary pvary merged commit 3860284 into apache:main Sep 29, 2025
18 checks passed
@pvary
Contributor

pvary commented Sep 29, 2025

Merged to main.
Thanks to @aiborodin for identifying the issue and to @mxm for the quick fix, while we work on a better solution.
Thanks @shangxinli for the review!

@pvary
Contributor

pvary commented Sep 29, 2025

@pvary @mxm It seems there is some communication (and decisions being made) outside of GitHub. I want to be part of those discussions and contribute to them, especially given that apache/iceberg is a community-owned project. Is there some channel where this communication happens? Could I please be added to it? Thanks.

We don't decide things offline. We always try to bring the result of the discussions back to the dev list or to the GitHub PR so others can raise their voice too.
If you think we decided on something you don't agree with, feel free to raise the question again.

We do the "offline" discussions on Slack. I will try to reach out to you there too.

@pvary
Contributor

pvary commented Sep 29, 2025

@aiborodin: Do you have a Slack user? Are you part of the Iceberg Slack community?

mxm added a commit to mxm/iceberg that referenced this pull request Sep 29, 2025
huaxingao pushed a commit that referenced this pull request Sep 29, 2025
@pvary pvary added this to the Iceberg 1.10.1 milestone Sep 30, 2025
gabeiglio pushed a commit to gabeiglio/iceberg that referenced this pull request Oct 1, 2025
gabeiglio pushed a commit to gabeiglio/iceberg that referenced this pull request Oct 1, 2025
@aiborodin
Contributor

Are you part of the Iceberg Slack community?

@pvary I am not, how can I join?

@pvary
Contributor

pvary commented Oct 2, 2025

Are you part of the Iceberg Slack community?

@pvary I am not, how can I join?

https://iceberg.apache.org/community/?h=slack#slack

adawrapub pushed a commit to adawrapub/iceberg that referenced this pull request Oct 16, 2025
adawrapub pushed a commit to adawrapub/iceberg that referenced this pull request Oct 16, 2025
huaxingao pushed a commit to huaxingao/iceberg that referenced this pull request Nov 1, 2025
huaxingao pushed a commit to huaxingao/iceberg that referenced this pull request Nov 1, 2025
…the presence of failures (apache#14213)

(cherry picked from commit 441597e)
huaxingao added a commit that referenced this pull request Nov 1, 2025
…e presence of failures (#14461)

* Flink: Ensure DynamicCommitter Idempotence in the presence of failures (#14182)

(cherry picked from commit 3860284)

* Flink: Backport #14182: Ensure DynamicCommitter Idempotence in the presence of failures (#14213)

(cherry picked from commit 441597e)

---------

Co-authored-by: Maximilian Michels <[email protected]>