Skip to content

Conversation

@viirya
Copy link
Member

@viirya viirya commented Oct 18, 2022

What changes were proposed in this pull request?

This adds additional checkpoint rename file check.

Why are the changes needed?

We encountered an issue recently that one customer's structured streaming job failed to read delta file.

The temporary file exists but it was not successfully renamed to final delta file path.

We currently don't check if renamed file exists but assume it successful. As the result, failing to read delta file assumed to be committed in last batch makes re-triggering the job impossible.

We should be able to do a check against checkpoint renamed file to prevent such difficulty in advance.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests.

@viirya
Copy link
Member Author

viirya commented Oct 18, 2022

cc @dongjoon-hyun

Copy link
Contributor

@itholic itholic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good otherwise

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM.

@dongjoon-hyun
Copy link
Member

Are the failures related?

[info] *** 12 TESTS FAILED ***
[error] Failed: Total 6746, Failed 12, Errors 0, Passed 6734, Ignored 5
[error] Failed tests:
[error] 	org.apache.spark.sql.catalyst.expressions.CastWithAnsiOffSuite
[error] 	org.apache.spark.sql.catalyst.util.TimestampFormatterSuite
[error] 	org.apache.spark.sql.catalyst.expressions.CastWithAnsiOnSuite
[error] 	org.apache.spark.sql.catalyst.util.RebaseDateTimeSuite
[error] 	org.apache.spark.sql.catalyst.expressions.TryCastSuite

@dongjoon-hyun
Copy link
Member

Since the first commit passed the UT, I guess it's irrelevant.

@viirya
Copy link
Member Author

viirya commented Oct 19, 2022

It is not. I only add version in the second commit.

@viirya
Copy link
Member Author

viirya commented Oct 19, 2022

Verified locally these tests are passed.

@viirya
Copy link
Member Author

viirya commented Oct 19, 2022

Synced up with master and try again.

@dongjoon-hyun
Copy link
Member

Ya, master branch seems to be broken with the same irrelevant failures, too.

@dongjoon-hyun
Copy link
Member

Merged to master for Apache Spark 3.4.
Thank you, @viirya , @itholic , @HyukjinKwon .

@viirya
Copy link
Member Author

viirya commented Oct 19, 2022

Thanks @dongjoon-hyun @HyukjinKwon @itholic

@HeartSaVioR
Copy link
Contributor

(Just curious about how you came through the case. Was the issue from underlying storage, or inconsistency of file system API? I can't imagine the scenario if the API works as contract and the underlying file system is stable, so...)

SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
### What changes were proposed in this pull request?

This adds additional checkpoint rename file check.

### Why are the changes needed?

We encountered an issue recently that one customer's structured streaming job failed to read delta file.

The temporary file exists but it was not successfully renamed to final delta file path.

We currently don't check if renamed file exists but assume it successful. As the result, failing to read delta file assumed to be committed in last batch makes re-triggering the job impossible.

We should be able to do a check against checkpoint renamed file to prevent such difficulty in advance.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.

Closes apache#38291 from viirya/add_file_check.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants