-
Notifications
You must be signed in to change notification settings - Fork 2.5k
[HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server #8079
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
103f3ef to
ed1183d
Compare
…e timeline server
ed1183d to
8a753b0
Compare
| * {@link AbstractTableFileSystemView#reset} directly, which may cause stale file system view | ||
| * to be served. | ||
| */ | ||
| protected void runReset() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guess the method should be guarded by reset writeLock ? #reset should invoke #runReset instead I guess.
And can we give these methods more straight-forward names?
Like sync -> syncThreadSafely, runSync -> sync, reset->resetThreadSafely, runReset -> reset
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is guarded Danny.
As per master.
Non metadata :
sync() {
lock
runSync {
clean and init timeline.
}
release lock
}
w/ Metadata file system view
sync() in MFSV {
super.sync {
lock
runSync {
clean and init timeline.
}
release lock
}
tableMetadata.reset();
}
With this patch, here is how it is changing:
w/o metadata.
sync() {
lock
runSync {
clean and init timeline.
}
release lock
}
no changes
w/ MFSV
sync() in ATFSV (not overridden in MFSV) {
lock
runSync { // overridden in MFSV {
super.runSync {
clean and init timeline.
}
tableMetadata.reset();
}
release lock
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but agree w/ naming. we can call runSyncThreadSafely or syncThreadSafely instead of runSync.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, reset() should call runReset(). Not sure why the change is reverted somehow.
I don't think we should change the naming of sync and reset in the interface SyncableFileSystemView as they can also have a different implementation. I've added documentation to make sure how the custom logic should be overridden.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don’t think the doc makes any sense, let’s give them better name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed runSync and runReset methods to avoid confusion and make every implementation explicitly use write lock except remote FSV. If new file system view needs to be added, the author should look at existing implementation for reference. Renaming won't prevent the author doing the wrong thing.
| } | ||
| try { | ||
| // This read lock makes sure that if the local view of the table is being synced, | ||
| // no timeline server requests should be processed or handled until the sync process |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think we need a explicit lock obj again, the whole code block is guarded by a synchronized object lock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we call this method twice. only one caller is within synchronized
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The synchronized won't be executed if the timeline is already refreshed and this method returns false to skip the synchronized block, but in such a case the metadata file system view might not have been fully updated, causing the issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@danny0405 This is simplified now. You can also check my updated PR description for how the race condition can happen.
hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java
Outdated
Show resolved
Hide resolved
hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java
Outdated
Show resolved
Hide resolved
| * {@link AbstractTableFileSystemView#reset} directly, which may cause stale file system view | ||
| * to be served. | ||
| */ | ||
| protected void runReset() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is guarded Danny.
As per master.
Non metadata :
sync() {
lock
runSync {
clean and init timeline.
}
release lock
}
w/ Metadata file system view
sync() in MFSV {
super.sync {
lock
runSync {
clean and init timeline.
}
release lock
}
tableMetadata.reset();
}
With this patch, here is how it is changing:
w/o metadata.
sync() {
lock
runSync {
clean and init timeline.
}
release lock
}
no changes
w/ MFSV
sync() in ATFSV (not overridden in MFSV) {
lock
runSync { // overridden in MFSV {
super.runSync {
clean and init timeline.
}
tableMetadata.reset();
}
release lock
}
}
| * {@link AbstractTableFileSystemView#reset} directly, which may cause stale file system view | ||
| * to be served. | ||
| */ | ||
| protected void runReset() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but agree w/ naming. we can call runSyncThreadSafely or syncThreadSafely instead of runSync.
| * Table metadata provided by an internal DFS backed Hudi metadata table, | ||
| * with an intentional delay in `reset()` to test concurrent reads and writes. | ||
| */ | ||
| public class HoodieBackedTestDelayedTableMetadata extends HoodieBackedTableMetadata { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this really have to be in a separate file of its own. can we not embed within another test class where its used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally a good idea to have a separate class for a notable test logic.
| } | ||
| try { | ||
| // This read lock makes sure that if the local view of the table is being synced, | ||
| // no timeline server requests should be processed or handled until the sync process |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we call this method twice. only one caller is within synchronized
|
Is this a regression? what version? Can I look at the offending commit to understand how it was before. |
danny0405
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's be conservative about the fs view change, sill kind of frightened for the fs view change in release 0.11.x that causes the data loss.
vinothchandar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
me and @bvaradar took a pass at it. Wondering if we can simplify all this more.
hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java
Outdated
Show resolved
Hide resolved
hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java
Outdated
Show resolved
Hide resolved
hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataFileSystemView.java
Show resolved
Hide resolved
I updated the PR description to provide more detailed information. |
hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java
Outdated
Show resolved
Hide resolved
…e timeline server (apache#8079) We observe that for MOR table, occasionally (<10% for large tables with frequent updates and compactions), the deltacommit after the compaction commit may add a new log file to the old file slice, not the latest file slice, in the corresponding file group. This happens when both the metadata table and timeline server are enabled. If either is disabled, the problem does not show up. This means that any data written to the log file is lost, i.e., data loss issue. Deeper analysis of the code surfaces that the file system view at the timeline server may serve the stale view, causing the issue. This is because the sync of `HoodieMetadataFileSystemView` is not atomic when the metadata table is enabled. This commit makes the following fixes: - Makes sure all logic in `sync()` and `reset()` are guarded by the write lock. This is done by implementing the logic with the write lock in each file system view, instead of overriding indect method (e.g., `runSync()`). - Simplifies the implementation of `syncIfLocalViewBehind()` to be more readable.
…e timeline server (apache#8079) We observe that for MOR table, occasionally (<10% for large tables with frequent updates and compactions), the deltacommit after the compaction commit may add a new log file to the old file slice, not the latest file slice, in the corresponding file group. This happens when both the metadata table and timeline server are enabled. If either is disabled, the problem does not show up. This means that any data written to the log file is lost, i.e., data loss issue. Deeper analysis of the code surfaces that the file system view at the timeline server may serve the stale view, causing the issue. This is because the sync of `HoodieMetadataFileSystemView` is not atomic when the metadata table is enabled. This commit makes the following fixes: - Makes sure all logic in `sync()` and `reset()` are guarded by the write lock. This is done by implementing the logic with the write lock in each file system view, instead of overriding indect method (e.g., `runSync()`). - Simplifies the implementation of `syncIfLocalViewBehind()` to be more readable.
…e timeline server (apache#8079) We observe that for MOR table, occasionally (<10% for large tables with frequent updates and compactions), the deltacommit after the compaction commit may add a new log file to the old file slice, not the latest file slice, in the corresponding file group. This happens when both the metadata table and timeline server are enabled. If either is disabled, the problem does not show up. This means that any data written to the log file is lost, i.e., data loss issue. Deeper analysis of the code surfaces that the file system view at the timeline server may serve the stale view, causing the issue. This is because the sync of `HoodieMetadataFileSystemView` is not atomic when the metadata table is enabled. This commit makes the following fixes: - Makes sure all logic in `sync()` and `reset()` are guarded by the write lock. This is done by implementing the logic with the write lock in each file system view, instead of overriding indect method (e.g., `runSync()`). - Simplifies the implementation of `syncIfLocalViewBehind()` to be more readable.
…e timeline server (apache#8079) We observe that for MOR table, occasionally (<10% for large tables with frequent updates and compactions), the deltacommit after the compaction commit may add a new log file to the old file slice, not the latest file slice, in the corresponding file group. This happens when both the metadata table and timeline server are enabled. If either is disabled, the problem does not show up. This means that any data written to the log file is lost, i.e., data loss issue. Deeper analysis of the code surfaces that the file system view at the timeline server may serve the stale view, causing the issue. This is because the sync of `HoodieMetadataFileSystemView` is not atomic when the metadata table is enabled. This commit makes the following fixes: - Makes sure all logic in `sync()` and `reset()` are guarded by the write lock. This is done by implementing the logic with the write lock in each file system view, instead of overriding indect method (e.g., `runSync()`). - Simplifies the implementation of `syncIfLocalViewBehind()` to be more readable.
…e timeline server (apache#8079) We observe that for MOR table, occasionally (<10% for large tables with frequent updates and compactions), the deltacommit after the compaction commit may add a new log file to the old file slice, not the latest file slice, in the corresponding file group. This happens when both the metadata table and timeline server are enabled. If either is disabled, the problem does not show up. This means that any data written to the log file is lost, i.e., data loss issue. Deeper analysis of the code surfaces that the file system view at the timeline server may serve the stale view, causing the issue. This is because the sync of `HoodieMetadataFileSystemView` is not atomic when the metadata table is enabled. This commit makes the following fixes: - Makes sure all logic in `sync()` and `reset()` are guarded by the write lock. This is done by implementing the logic with the write lock in each file system view, instead of overriding indect method (e.g., `runSync()`). - Simplifies the implementation of `syncIfLocalViewBehind()` to be more readable. (cherry picked from commit 2ddcf96)
…e timeline server (apache#8079) We observe that for MOR table, occasionally (<10% for large tables with frequent updates and compactions), the deltacommit after the compaction commit may add a new log file to the old file slice, not the latest file slice, in the corresponding file group. This happens when both the metadata table and timeline server are enabled. If either is disabled, the problem does not show up. This means that any data written to the log file is lost, i.e., data loss issue. Deeper analysis of the code surfaces that the file system view at the timeline server may serve the stale view, causing the issue. This is because the sync of `HoodieMetadataFileSystemView` is not atomic when the metadata table is enabled. This commit makes the following fixes: - Makes sure all logic in `sync()` and `reset()` are guarded by the write lock. This is done by implementing the logic with the write lock in each file system view, instead of overriding indect method (e.g., `runSync()`). - Simplifies the implementation of `syncIfLocalViewBehind()` to be more readable.
|
👍 |
…imeline server (apache#224) This PR cherry-picks the critical fix apache#8079. More details can be found there. We observe that for MOR table, occasionally (<10% for large tables with frequent updates and compactions), the deltacommit after the compaction commit may add a new log file to the old file slice, not the latest file slice, in the corresponding file group. This happens when both the metadata table and timeline server are enabled. If either is disabled, the problem does not show up. This means that any data written to the log file is lost, i.e., data loss issue. Deeper analysis of the code surfaces that the file system view at the timeline server may serve the stale view, causing the issue. This is because the sync of `HoodieMetadataFileSystemView` is not atomic when the metadata table is enabled. This commit makes the following fixes: - Makes sure all logic in `sync()` and `reset()` are guarded by the write lock. This is done by implementing the logic with the write lock in each file system view, instead of overriding indect method (e.g., `runSync()`). - Simplifies the implementation of `syncIfLocalViewBehind()` to be more readable.
Change Logs
We observe that for MOR table, occasionally (<10% for large tables with frequent updates and compactions, <1% of updates affected), the deltacommit after the compaction commit may add a new log file to the old file slice, not the latest file slice, in the corresponding file group. This happens when both the metadata table and timeline server are enabled. If either is disabled, the problem does not show up. This means that any data written to the log file is lost, i.e., data loss issue.
Deeper analysis of the code surfaces that the file system view at the timeline server may serve the stale view, causing the issue. This is because the sync of
HoodieMetadataFileSystemViewis not atomic when the metadata table is enabled.Here's a concrete walkthrough of how this problem can happen. When the timeline server is long-running and passed to the write client for reusing (like Deltastreamer), the timeline server internally needs to refresh and sync the outdated file system view if it is behind what the client (which sends the request to the timeline server) sees. This is based on whether the latest instant seen by the client matches what timeline server's file system view has. The sync logic is implemented in
RequestHandler:Here's the implementation of
sync()of the file system view based on the metadata table (HoodieMetadataFileSystemView):The
super.sync()calls thesync()inAbstractTableFileSystemView:Note that, the file system view logic should be guarded by the read-write lock, i.e., any read of the cached file system info should be blocked before updates to the cache in the file system are fully done, which is guarded by the write lock. However, we can see that for
HoodieMetadataFileSystemView, the write lock covers the sync logic partially, andtableMetadata.reset()is not guarded by the write lock, meaning that it can create race condition that the metadata table content can be stale when being read.HoodieBackedTableMetadata,tableMetadata.reset()has logic to execute.FileSystemBackedTableMetadata,tableMetadata.reset()is a no-op, so there is no issue.Let's now look at concurrent requests are handled in
RequestHandlerto cause the stale view to be read. Consider two concurrent requests to the timeline server, REQ1 and REQ2:syncIfLocalViewBehind()returnstrue. Then enters synchronized block from L6 to sync the file system view. In the middle, the timeline has already been updated, yet the metadata and file system view has not been refreshed.syncIfLocalViewBehind()returnsfalse, skipping the synchronized block and continue for getting the information from the file system view. As mentioned above, given that the sync logic is not fully guarded by the write lock, the read lock thinks that the info is fully updated, thus allowing the file system view to return the info, which is stale.To easily reproduce this race condition problem consistently, a new test
TestRemoteFileSystemViewWithMetadataTableis added withHoodieBackedTestDelayedTableMetadatato delay thereset()process. By applying the same change to theHoodieBackedTableMetadata, we can also reproduce the same problem with the Deltastreamer.The fix is the following:
sync()andreset()are guarded by the write lock. This is done by implementing the logic with the write lock in each file system view, instead of overriding indect method (e.g.,runSync()).syncIfLocalViewBehind()to be more readable.History and Impact of the problem:
The problematic implementation of
sync()andreset()inHoodieMetadataFileSystemViewclass is introduced by #4307 to properly sync and reset the file system view based on the metadata table. This did not have an impact at that moment, as Hudi refreshes/synces the file system view explicitly while initializing theHoodieTableinstance for each write, masking the issue. Only when the file system view based on the metadata table is synced because of new requests at the timeline server, there's a chance of theHoodieMetadataFileSystemViewserving a stale view, which is not hit when #4307 is landed. Subsequently, two more optimization PRs, #5617 and #5716, are landed to remove "unnecessary" refreshes/syncs of the file system view when initializing theHoodieTableinstance. It turns out that without this PR properly fixing thesync()andreset()inHoodieMetadataFileSystemView, such refreshes and syncs are necessary to avoid potential stale and inconsistent views. As the probability of hitting this issue is very low, it is not uncovered during manual testing or CI runs.After running the same tests added in this PR, it is found that the following Hudi releases are affected:
0.11.1,0.12.0,0.12.1,0.12.2,0.13.0. This also aligns with the findings around the offending PRs mentioned above. Although the tests pass for0.13.0, because of another bug #8080 introduced in0.13.0release (timeline server instance passed to the write client is ignored due to this bug), there's still an impact on0.13.0. #8134 reproduces the issue for0.13.0.Impact
This PR fixes the data loss issue and makes sure that the timeline server with the metadata table enabled (using
HoodieMetadataFileSystemView) always serves the correct information.This has been tested with deltastreamer writing MOR table with inline compaction in continuous mode, with intentional delay in
HoodieBackedTableMetadatajust likeHoodieBackedTestDelayedTableMetadatato ensure the problem happens consistently before the fix. After the fix, the problem goes away.Risk level
medium
Documentation Update
We need to update the release notes regarding this severe regression that can cause data loss: HUDI-5864.
Contributor's checklist