Skip to content

Conversation

@anishshri-db
Copy link
Contributor

@anishshri-db anishshri-db commented Jan 14, 2024

What changes were proposed in this pull request?

Fix RocksDB state provider race condition during rollback

Why are the changes needed?

The rollback() method in RocksDB is not properly synchronized, thus a race condition can be introduced during rollback when there are tasks trying to commit.

The symptom of the race condition is the following exception being thrown:

`Caused by: java.io.FileNotFoundException: No such file or directory: ...state/0/54/10369.changelog
	at shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:4069)
	at shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3907)
	at shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:3801)
	at com.databricks.common.filesystem.LokiS3FS.getFileStatusNoCache(LokiS3FS.scala:91)
	at com.databricks.common.filesystem.LokiS3FS.getFileStatus(LokiS3FS.scala:86)
	at shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem.open(S3AFileSystem.java:1525)`

This race condition can happen for the following sequence of events

  1. task A gets cancelled after releasing lock for rocksdb
  2. task B starts and loads 10368
  3. task A performs rocksdb rollback to -1
  4. task B reads data from rocksdb and commits

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing unit tests

Was this patch authored or co-authored using generative AI tooling?

No

@anishshri-db anishshri-db changed the title [SPARK-46711] Fix RocksDB state provider race condition during rollback [SPARK-46711][SS] Fix RocksDB state provider race condition during rollback Jan 14, 2024
@anishshri-db
Copy link
Contributor Author

cc - @HeartSaVioR - PTAL. thx !

@anishshri-db
Copy link
Contributor Author

Linter job failure seems unrelated. Seems to be failing for other PRs too

Copy link
Contributor

@HeartSaVioR HeartSaVioR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@HeartSaVioR
Copy link
Contributor

Linter failure is not related to this change. Thanks! Merging to master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants