partial-progress.max-failed-commits incorrectly compares the failureCommit value #12076

Open
1 of 3 tasks
ruotianwang opened this issue Jan 23, 2025 · 9 comments · May be fixed by #12120
Assignees
Labels
bug Something isn't working

Comments

@ruotianwang

Apache Iceberg version

1.7.1 (latest release)

Query engine

Spark

Please describe the bug 🐞

During our usage of partial-progress.max-failed-commits, we've found that the threshold check's false-positive rate is too high. After taking a deeper look, the problem traces back to this PR: #9611

It first gets succeededCommits whenever a commit succeeds, then calculates int failedCommits = maxCommits - commitService.succeededCommits();

However, I've found a couple of cases where, even though we set the partial-progress.max-commits value, Iceberg internally packs the file groups into fewer commits than max-commits; that is, the actual number of commits can be smaller than maxCommits. In this case, the threshold check above is wrong.

The suggested fix is to collect the failed-commit count directly and compare it against the threshold, instead of inferring it from the succeeded-commit count.
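A minimal sketch of the miscalculation (the names maxCommits and succeededCommits follow the PR; the values are illustrative, not taken from Iceberg's code):

```java
// Sketch of the current threshold check, which infers failures from the
// configured maximum instead of the number of commits actually attempted.
public class FailedCommitsSketch {
    static int inferredFailures(int maxCommits, int succeededCommits) {
        return maxCommits - succeededCommits;
    }

    public static void main(String[] args) {
        // Example: partial-progress.max-commits = 10, but the rewrite produced
        // only one file group, so a single commit was attempted and it succeeded.
        int maxCommits = 10;
        int succeeded = 1;

        // Inferred "failures": 10 - 1 = 9, while the actual failure count is 0.
        System.out.println(inferredFailures(maxCommits, succeeded));
    }
}
```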

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
@ruotianwang ruotianwang added the bug Something isn't working label Jan 23, 2025
@ruotianwang
Author

ruotianwang commented Jan 23, 2025

cc @manuzhang, the commit owner, for visibility

@manuzhang manuzhang self-assigned this Jan 24, 2025
@manuzhang
Collaborator

manuzhang commented Jan 24, 2025

Thanks for reporting this issue, I will submit a fix shortly.

@RussellSpitzer
Member

I'm not sure I understand this, can you elaborate?

However, I've found a couple of cases that even though we defined the partial-progress.max-commits value, internally iceberg would optimize the group file into a lower number of this max-commits. eg: the actual group file can be smaller than maxCommits definition. In this case, the threshold check above will be wrong.

@ruotianwang
Author

@manuzhang Thanks a lot!

@ruotianwang
Author

@RussellSpitzer Basically this line: https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteDataFilesSparkAction.java#L380

maxCommits comes from the configuration, not from the actual total commit count, which makes failedCommits higher than expected. We could either use the actual total commit count, or directly collect the failed-commit count inside the error-handling block: https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/actions/BaseCommitService.java#L232-L234

@RussellSpitzer
Member

Got it! That makes sense

@manuzhang
Collaborator

manuzhang commented Jan 26, 2025

@ruotianwang Are you referring to the case where there are fewer commits than maxCommits due to file group rewrite failure?

int groupsPerCommit = IntMath.divide(ctx.totalGroupCount(), maxCommits, RoundingMode.CEILING);

Given the above algorithm, I can't think of an example where the number of commits is smaller than maxCommits without a file group failure.

@ruotianwang
Author

ruotianwang commented Jan 27, 2025

@manuzhang This maxCommits comes directly from the configuration: https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteDataFilesSparkAction.java#L458-L460

About here: https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteDataFilesSparkAction.java#L354-L375

I think the edge case is when ctx.totalGroupCount() < maxCommits.

Here is what I saw in our application log:

25/01/23 20:24:18 INFO RewriteDataFilesSparkAction: Rewrite Files Ready to be Committed - Rewriting 17929 files (BIN-PACK, file group 1/1, PartitionData{*********} (1/1)) in ******** 
25/01/23 20:24:18 INFO BaseMetastoreTableOperations: Refreshing table metadata from new version: s3://*******.metadata.json

The error I saw:

py4j.protocol.Py4JJavaError: An error occurred while calling o325.sql. : java.lang.RuntimeException: partial-progress.enabled is true but 9 rewrite commits failed. This is more than the maximum allowed failures of 3. Check the logs to determine why the individual commits failed. If this is persistent it may help to increase partial-progress.max-commits which will split the rewrite operation into smaller commits.

And another table's job:

25/01/23 03:49:52 INFO RewriteDataFilesSparkAction: Rewrite Files Ready to be Committed - Rewriting 1584 files (BIN-PACK, file group 2/3, PartitionData{********} (2/2)) in ********* 
25/01/23 03:49:52 INFO BaseMetastoreTableOperations: Refreshing table metadata from new version: *************.metadata.json 
25/01/23 03:49:57 INFO BaseMetastoreTableOperations: Successfully committed to table ********** in 957 ms 
25/01/23 03:49:57 INFO SnapshotProducer: Committed snapshot 1388669599076500919 (BaseRewriteFiles) 
25/01/23 03:49:57 INFO BaseMetastoreTableOperations: Refreshing table metadata from new version: s3://*******.metadata.json

The error I saw:

py4j.protocol.Py4JJavaError: An error occurred while calling o325.sql. : java.lang.RuntimeException: partial-progress.enabled is true but 7 rewrite commits failed. This is more than the maximum allowed failures of 3. Check the logs to determine why the individual commits failed. If this is persistent it may help to increase partial-progress.max-commits which will split the rewrite operation into smaller commits.

You can see the actual rewrite file group counts are 1 and 3 respectively, even though partial-progress.max-commits is configured as 10.
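The grouping math from the logs above can be checked with a quick sketch. Here ceilDiv stands in for Guava's IntMath.divide(..., RoundingMode.CEILING), and the group counts (1 and 3) and max-commits value (10) come from the two jobs described:

```java
// When totalGroupCount < maxCommits, each commit holds one group, so at most
// totalGroupCount commits are ever attempted -- fewer than maxCommits.
public class CommitCountSketch {
    // Ceiling integer division, mimicking IntMath.divide with RoundingMode.CEILING.
    static int ceilDiv(int a, int b) {
        return (a + b - 1) / b;
    }

    // Mirrors: groupsPerCommit = ceil(totalGroupCount / maxCommits),
    // then the number of commits actually attempted is
    // ceil(totalGroupCount / groupsPerCommit).
    static int actualCommits(int totalGroupCount, int maxCommits) {
        int groupsPerCommit = ceilDiv(totalGroupCount, maxCommits);
        return ceilDiv(totalGroupCount, groupsPerCommit);
    }

    public static void main(String[] args) {
        int maxCommits = 10; // partial-progress.max-commits from the config
        System.out.println(actualCommits(1, maxCommits)); // 1 commit, not 10
        System.out.println(actualCommits(3, maxCommits)); // 3 commits, not 10
    }
}
```

With only 1 or 3 commits ever attempted, maxCommits - succeededCommits evaluates to 9 or 7 even when every attempted commit succeeds, matching the "9 rewrite commits failed" and "7 rewrite commits failed" errors quoted above.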

@manuzhang
Collaborator

I've submitted #12120 to fix this. @ruotianwang @RussellSpitzer please help review, thanks!
