[HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server #8079

yihua · 2023-03-01T01:46:43Z

Change Logs

We observe that for MOR tables, occasionally (<10% for large tables with frequent updates and compactions, <1% of updates affected), the deltacommit after a compaction commit adds a new log file to the old file slice instead of the latest file slice in the corresponding file group. This happens only when both the metadata table and the timeline server are enabled; if either is disabled, the problem does not show up. Any data written to such a log file is lost, i.e., this is a data loss issue.

Deeper analysis of the code shows that the file system view at the timeline server may serve a stale view, causing the issue. This is because the sync of HoodieMetadataFileSystemView is not atomic when the metadata table is enabled.

Here's a concrete walkthrough of how this problem can happen. When the timeline server is long-running and passed to the write client for reuse (as in Deltastreamer), the timeline server needs to refresh and sync its file system view whenever that view is behind what the client (which sends the request to the timeline server) sees. The decision is based on whether the latest instant seen by the client matches what the timeline server's file system view has. The sync logic is implemented in RequestHandler:

[L1]  private boolean syncIfLocalViewBehind(Context ctx) {
[L2]    if (isLocalViewBehind(ctx)) {
[L3]      String basePath = ctx.queryParam(RemoteHoodieTableFileSystemView.BASEPATH_PARAM);
[L4]      String lastKnownInstantFromClient = ctx.queryParamAsClass(RemoteHoodieTableFileSystemView.LAST_INSTANT_TS, String.class).getOrDefault(HoodieTimeline.INVALID_INSTANT_TS);
[L5]      SyncableFileSystemView view = viewManager.getFileSystemView(basePath);
[L6]      synchronized (view) {
[L7]        if (isLocalViewBehind(ctx)) {
[L8]          HoodieTimeline localTimeline = viewManager.getFileSystemView(basePath).getTimeline();
[L9]          LOG.info("Syncing view as client passed last known instant " + lastKnownInstantFromClient
[L10]              + " as last known instant but server has the following last instant on timeline :"
[L11]              + localTimeline.lastInstant());
[L12]          view.sync();
[L13]          return true;
[L14]        }
[L15]      }
[L16]    }
[L17]    return false;
[L18]  }

  private boolean isLocalViewBehind(Context ctx) {
    String basePath = ctx.queryParam(RemoteHoodieTableFileSystemView.BASEPATH_PARAM);
    String lastKnownInstantFromClient =
        ctx.queryParamAsClass(RemoteHoodieTableFileSystemView.LAST_INSTANT_TS, String.class).getOrDefault(HoodieTimeline.INVALID_INSTANT_TS);
    String timelineHashFromClient = ctx.queryParamAsClass(RemoteHoodieTableFileSystemView.TIMELINE_HASH, String.class).getOrDefault("");
    HoodieTimeline localTimeline =
        viewManager.getFileSystemView(basePath).getTimeline().filterCompletedOrMajorOrMinorCompactionInstants();
    if (LOG.isDebugEnabled()) {
      LOG.debug("Client [ LastTs=" + lastKnownInstantFromClient + ", TimelineHash=" + timelineHashFromClient
          + "], localTimeline=" + localTimeline.getInstants());
    }

    if ((!localTimeline.getInstantsAsStream().findAny().isPresent())
        && HoodieTimeline.INVALID_INSTANT_TS.equals(lastKnownInstantFromClient)) {
      return false;
    }

    String localTimelineHash = localTimeline.getTimelineHash();
    // refresh if timeline hash mismatches
    if (!localTimelineHash.equals(timelineHashFromClient)) {
      return true;
    }

    // As a safety check, even if hash is same, ensure instant is present
    return !localTimeline.containsOrBeforeTimelineStarts(lastKnownInstantFromClient);
  }

Here's the implementation of sync() of the file system view based on the metadata table (HoodieMetadataFileSystemView):

  @Override
  public void sync() {
    super.sync();
    tableMetadata.reset();
  }

super.sync() calls sync() in AbstractTableFileSystemView:

  @Override
  public void sync() {
    HoodieTimeline oldTimeline = getTimeline();
    HoodieTimeline newTimeline = metaClient.reloadActiveTimeline().filterCompletedOrMajorOrMinorCompactionInstants();
    try {
      writeLock.lock();
      runSync(oldTimeline, newTimeline);
    } finally {
      writeLock.unlock();
    }
  }

Note that the file system view logic should be guarded by the read-write lock: any read of the cached file system info should block until updates to the cache are fully done, and those updates are guarded by the write lock. However, for HoodieMetadataFileSystemView the write lock covers the sync logic only partially. tableMetadata.reset() is not guarded by the write lock, which creates a race condition in which the metadata table content can be stale when it is read.

When the metadata table is enabled (HoodieBackedTableMetadata), tableMetadata.reset() has real logic to execute.
When the metadata table is disabled (FileSystemBackedTableMetadata), tableMetadata.reset() is a no-op, so there is no issue.
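
To make the lock-scope gap concrete, here is a minimal sketch that does not use any Hudi classes. `PartiallyGuardedView`, `timelineVersion`, `metadataVersion`, and `syncAndReadInGap()` are hypothetical names chosen for illustration; the only assumption taken from the description above is the shape of the sync: the timeline is refreshed under the write lock, and the metadata refresh runs after the lock is released.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a view whose sync only partially holds the write lock.
class PartiallyGuardedView {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private int timelineVersion = 1;  // stands in for the cached timeline
  private int metadataVersion = 1;  // stands in for the metadata table state

  // A lookup that holds the read lock, as reads of the cached view do.
  int[] read() {
    lock.readLock().lock();
    try {
      return new int[] {timelineVersion, metadataVersion};
    } finally {
      lock.readLock().unlock();
    }
  }

  // Timeline refresh under the write lock, metadata refresh outside it.
  // A reader thread is run in the gap so the interleaving is deterministic.
  int[] syncAndReadInGap(int newVersion) {
    int[][] seen = new int[1][];
    Thread reader = new Thread(() -> seen[0] = read());
    lock.writeLock().lock();
    try {
      timelineVersion = newVersion;
    } finally {
      lock.writeLock().unlock();
    }
    reader.start();
    join(reader);                  // completes: nothing holds the write lock now
    metadataVersion = newVersion;  // the unguarded step
    return seen[0];
  }

  private static void join(Thread t) {
    try {
      t.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    int[] seen = new PartiallyGuardedView().syncAndReadInGap(2);
    System.out.println("timeline=" + seen[0] + ", metadata=" + seen[1]);
  }
}
```

The reader acquires the read lock without blocking and observes the new timeline version together with the old metadata version, which is the inconsistent combination described above.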

Let's now look at how concurrent requests are handled in RequestHandler such that a stale view is read. Consider two concurrent requests to the timeline server, REQ1 and REQ2:

Thread1 for REQ1: isLocalViewBehind() at L2 returns true, so the thread enters the synchronized block at L6 to sync the file system view. Partway through the sync, the timeline has already been updated, yet the metadata and the file system view have not been refreshed.
Thread2 for REQ2: This request arrives after REQ1 has updated the timeline at the timeline server, while the refresh of the metadata and the file system view is still in progress. At this moment, isLocalViewBehind() at L2 returns false, so the thread skips the synchronized block and proceeds to get the information from the file system view. As mentioned above, because the sync logic is not fully guarded by the write lock, the read lock can be acquired as if the info were fully updated, and the file system view returns stale info.
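
The REQ1/REQ2 interleaving can be replayed deterministically with a small model. This is only a sketch under stated assumptions: `StaleViewRaceSketch`, its two instant fields, and the latches are hypothetical, and the latch wait inside `handleReq1()` stands in for a slow metadata reset that runs after the timeline has been refreshed.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical model of the request handling described above; not Hudi code.
class StaleViewRaceSketch {
  // volatile keeps the cross-thread reads in this sketch well-defined
  private volatile String timelineInstant = "001";
  private volatile String metadataInstant = "001";
  private final CountDownLatch timelineUpdated = new CountDownLatch(1);
  private final CountDownLatch req2Served = new CountDownLatch(1);

  private boolean isLocalViewBehind(String clientInstant) {
    return !timelineInstant.equals(clientInstant);
  }

  // REQ1: finds the view behind and syncs it. The wait models a slow
  // metadata reset that happens after the timeline was already updated.
  private void handleReq1(String clientInstant) throws InterruptedException {
    if (isLocalViewBehind(clientInstant)) {
      synchronized (this) {
        timelineInstant = clientInstant;
        timelineUpdated.countDown();
        req2Served.await();
        metadataInstant = clientInstant;
      }
    }
  }

  // REQ2: arrives in the gap, sees an up-to-date timeline, skips the
  // synchronized block, and is served whatever the metadata holds.
  private String handleReq2(String clientInstant) throws InterruptedException {
    timelineUpdated.await();
    boolean behind = isLocalViewBehind(clientInstant);
    String served = metadataInstant;
    req2Served.countDown();
    return "behind=" + behind + ", served=" + served;
  }

  static String simulate() {
    StaleViewRaceSketch handler = new StaleViewRaceSketch();
    Thread req1 = new Thread(() -> {
      try {
        handler.handleReq1("002");
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    req1.start();
    try {
      String result = handler.handleReq2("002");
      req1.join();
      return result;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(simulate()); // prints "behind=false, served=001"
  }
}
```

REQ2 is told the view is not behind, yet it is served the old metadata instant. The synchronized block does not help because REQ2 never enters it.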

To reproduce this race condition consistently, a new test, TestRemoteFileSystemViewWithMetadataTable, is added with HoodieBackedTestDelayedTableMetadata to delay the reset() process. By applying the same delay to HoodieBackedTableMetadata, we can also reproduce the problem with the Deltastreamer.
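
The delayed-reset idea can be sketched in isolation as follows. This is not the code of HoodieBackedTestDelayedTableMetadata: `MetadataSketch` and `DelayedMetadataSketch` are hypothetical classes, and sleeping before delegating to the real reset is an assumption about how such a test double might widen the race window.

```java
// Hypothetical stand-in for a metadata reader that reloads state on reset().
class MetadataSketch {
  private final int latestVersion;
  private int loadedVersion = 1;

  MetadataSketch(int latestVersion) {
    this.latestVersion = latestVersion;
  }

  public void reset() {
    loadedVersion = latestVersion;
  }

  public int version() {
    return loadedVersion;
  }
}

// Test double: same behavior, but reset() is slowed down so that readers
// have a long window in which the old state is still visible.
class DelayedMetadataSketch extends MetadataSketch {
  private final long delayMs;

  DelayedMetadataSketch(int latestVersion, long delayMs) {
    super(latestVersion);
    this.delayMs = delayMs;
  }

  @Override
  public void reset() {
    try {
      Thread.sleep(delayMs);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while delaying reset", e);
    }
    super.reset();
  }

  public static void main(String[] args) {
    DelayedMetadataSketch metadata = new DelayedMetadataSketch(2, 50);
    System.out.println("before reset: " + metadata.version());
    metadata.reset();
    System.out.println("after reset: " + metadata.version());
  }
}
```

A delay like this does not change the outcome of a correctly locked sync; it only makes an unguarded reset observable on nearly every run instead of rarely.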

The fix is the following:

- Makes sure all logic in sync() and reset() is guarded by the write lock. This is done by implementing the logic with the write lock in each file system view, instead of overriding an indirect method (e.g., runSync()).
- Simplifies the implementation of syncIfLocalViewBehind() to be more readable.
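
The first item can be illustrated with a minimal sketch of the locking shape, assuming a base view that owns the read-write lock and a metadata-backed subclass that takes the write lock itself. `BaseViewSketch`, `MetadataViewSketch`, and the `duringSync` hook are hypothetical and exist only for the demo; the real classes carry far more state.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical base view: owns the lock and the cached timeline state.
class BaseViewSketch {
  protected final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  protected int timelineVersion = 1;

  public void sync(int newVersion) {
    lock.writeLock().lock();
    try {
      timelineVersion = newVersion;
    } finally {
      lock.writeLock().unlock();
    }
  }
}

// Hypothetical metadata-backed view: implements sync() with the write lock
// so the timeline refresh and the metadata refresh are one critical section.
class MetadataViewSketch extends BaseViewSketch {
  private int metadataVersion = 1;
  // Demo-only hook that runs while sync() holds the write lock.
  Runnable duringSync = () -> { };

  @Override
  public void sync(int newVersion) {
    lock.writeLock().lock();
    try {
      timelineVersion = newVersion;
      duringSync.run();
      metadataVersion = newVersion;
    } finally {
      lock.writeLock().unlock();
    }
  }

  int[] read() {
    lock.readLock().lock();
    try {
      return new int[] {timelineVersion, metadataVersion};
    } finally {
      lock.readLock().unlock();
    }
  }

  // Starts a reader in the middle of sync() and returns what it observed.
  static int[] readDuringSync(int newVersion) {
    MetadataViewSketch view = new MetadataViewSketch();
    int[][] seen = new int[1][];
    Thread reader = new Thread(() -> seen[0] = view.read());
    view.duringSync = reader::start;
    view.sync(newVersion);
    try {
      reader.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return seen[0];
  }

  public static void main(String[] args) {
    int[] seen = readDuringSync(2);
    System.out.println("timeline=" + seen[0] + ", metadata=" + seen[1]);
  }
}
```

A reader that arrives mid-sync has to wait for the write lock to be released, so it can only observe the timeline and the metadata at the same version.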

History and Impact of the problem:

The problematic implementation of sync() and reset() in the HoodieMetadataFileSystemView class was introduced by #4307 to properly sync and reset the file system view based on the metadata table. It had no impact at that time, because Hudi refreshed/synced the file system view explicitly while initializing the HoodieTable instance for each write, masking the issue. Only when the metadata-table-based file system view is synced because of new requests at the timeline server is there a chance of HoodieMetadataFileSystemView serving a stale view, a path that was not hit when #4307 landed. Subsequently, two optimization PRs, #5617 and #5716, landed to remove "unnecessary" refreshes/syncs of the file system view when initializing the HoodieTable instance. It turns out that, without this PR properly fixing sync() and reset() in HoodieMetadataFileSystemView, those refreshes and syncs were necessary to avoid stale and inconsistent views. As the probability of hitting this issue is very low, it was not uncovered during manual testing or CI runs.

After running the same tests added in this PR, we found that the following Hudi releases are affected: 0.11.1, 0.12.0, 0.12.1, 0.12.2, 0.13.0. This also aligns with the findings around the offending PRs mentioned above. Although the tests pass for 0.13.0, because of another bug #8080 introduced in the 0.13.0 release (the timeline server instance passed to the write client is ignored due to this bug), there is still an impact on 0.13.0. #8134 reproduces the issue for 0.13.0.

Impact

This PR fixes the data loss issue and makes sure that the timeline server with the metadata table enabled (using HoodieMetadataFileSystemView) always serves the correct information.

This has been tested with Deltastreamer writing a MOR table with inline compaction in continuous mode, with an intentional delay in HoodieBackedTableMetadata, just like HoodieBackedTestDelayedTableMetadata, to ensure the problem happens consistently before the fix. After the fix, the problem goes away.

Risk level

medium

Documentation Update

We need to update the release notes regarding this severe regression that can cause data loss: HUDI-5864.

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

danny0405 · 2023-03-01T09:39:19Z

hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java

+   * {@link AbstractTableFileSystemView#reset} directly, which may cause stale file system view
+   * to be served.
+   */
+  protected void runReset() {


Guess the method should be guarded by reset writeLock ? #reset should invoke #runReset instead I guess.

And can we give these methods more straight-forward names?

Like sync -> syncThreadSafely, runSync -> sync, reset->resetThreadSafely, runReset -> reset

it is guarded Danny.

As per master.

Non metadata :

sync() { lock runSync { clean and init timeline. } release lock }

w/ Metadata file system view

sync() in MFSV { super.sync { lock runSync { clean and init timeline. } release lock } tableMetadata.reset(); }

With this patch, here is how it is changing:
w/o metadata.

sync() { lock runSync { clean and init timeline. } release lock }

no changes

w/ MFSV

sync() in ATFSV (not overridden in MFSV) { lock runSync { // overridden in MFSV { super.runSync { clean and init timeline. } tableMetadata.reset(); } release lock } }

but agree w/ naming. we can call runSyncThreadSafely or syncThreadSafely instead of runSync.

Yes, reset() should call runReset(). Not sure why the change is reverted somehow.

I don't think we should change the naming of sync and reset in the interface SyncableFileSystemView as they can also have a different implementation. I've added documentation to make sure how the custom logic should be overridden.

Don’t think the doc makes any sense, let’s give them better name.

Removed runSync and runReset methods to avoid confusion and make every implementation explicitly use write lock except remote FSV. If new file system view needs to be added, the author should look at existing implementation for reference. Renaming won't prevent the author doing the wrong thing.

danny0405 · 2023-03-01T09:40:46Z

hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java

-    }
+    try {
+      // This read lock makes sure that if the local view of the table is being synced,
+      // no timeline server requests should be processed or handled until the sync process


Don't think we need an explicit lock obj again, the whole code block is guarded by a synchronized object lock.

we call this method twice. only one caller is within synchronized

The synchronized won't be executed if the timeline is already refreshed and this method returns false to skip the synchronized block, but in such a case the metadata file system view might not have been fully updated, causing the issue.

@danny0405 This is simplified now. You can also check my updated PR description for how the race condition can happen.

hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java

nsivabalan · 2023-03-01T16:28:03Z

hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java

+   * {@link AbstractTableFileSystemView#reset} directly, which may cause stale file system view
+   * to be served.
+   */
+  protected void runReset() {


it is guarded Danny.

As per master.

Non metadata :

sync() { lock runSync { clean and init timeline. } release lock }

w/ Metadata file system view

sync() in MFSV { super.sync { lock runSync { clean and init timeline. } release lock } tableMetadata.reset(); }

With this patch, here is how it is changing:
w/o metadata.

sync() { lock runSync { clean and init timeline. } release lock }

no changes

w/ MFSV

sync() in ATFSV (not overridden in MFSV) { lock runSync { // overridden in MFSV { super.runSync { clean and init timeline. } tableMetadata.reset(); } release lock } }

nsivabalan · 2023-03-01T16:28:37Z

hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java

+   * {@link AbstractTableFileSystemView#reset} directly, which may cause stale file system view
+   * to be served.
+   */
+  protected void runReset() {


but agree w/ naming. we can call runSyncThreadSafely or syncThreadSafely instead of runSync.

nsivabalan · 2023-03-01T16:29:03Z

hudi-common/src/test/java/org/apache/hudi/metadata/HoodieBackedTestDelayedTableMetadata.java

+ * Table metadata provided by an internal DFS backed Hudi metadata table,
+ * with an intentional delay in `reset()` to test concurrent reads and writes.
+ */
+public class HoodieBackedTestDelayedTableMetadata extends HoodieBackedTableMetadata {


does this really have to be in a separate file of its own. can we not embed within another test class where its used.

Generally a good idea to have a separate class for a notable test logic.

nsivabalan · 2023-03-01T16:29:53Z

hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java

-    }
+    try {
+      // This read lock makes sure that if the local view of the table is being synced,
+      // no timeline server requests should be processed or handled until the sync process


we call this method twice. only one caller is within synchronized

vinothchandar · 2023-03-02T01:37:58Z

Is this a regression? what version? Can I look at the offending commit to understand how it was before.

danny0405

Let's be conservative about the fs view change, still kind of frightened by the fs view change in release 0.11.x that caused the data loss.

vinothchandar

me and @bvaradar took a pass at it. Wondering if we can simplify all this more.

hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java

hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java

hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataFileSystemView.java

yihua · 2023-03-04T06:40:46Z

Is this a regression? what version? Can I look at the offending commit to understand how it was before.

I updated the PR description to provide more detailed information.

hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java

hudi-bot · 2023-03-04T13:00:08Z

CI report:

103f3ef UNKNOWN
d2b11c5 Azure: SUCCESS

…e timeline server (apache#8079) We observe that for MOR table, occasionally (<10% for large tables with frequent updates and compactions), the deltacommit after the compaction commit may add a new log file to the old file slice, not the latest file slice, in the corresponding file group. This happens when both the metadata table and timeline server are enabled. If either is disabled, the problem does not show up. This means that any data written to the log file is lost, i.e., data loss issue. Deeper analysis of the code surfaces that the file system view at the timeline server may serve the stale view, causing the issue. This is because the sync of `HoodieMetadataFileSystemView` is not atomic when the metadata table is enabled. This commit makes the following fixes: - Makes sure all logic in `sync()` and `reset()` is guarded by the write lock. This is done by implementing the logic with the write lock in each file system view, instead of overriding an indirect method (e.g., `runSync()`). - Simplifies the implementation of `syncIfLocalViewBehind()` to be more readable.

…e timeline server (apache#8079) We observe that for MOR table, occasionally (<10% for large tables with frequent updates and compactions), the deltacommit after the compaction commit may add a new log file to the old file slice, not the latest file slice, in the corresponding file group. This happens when both the metadata table and timeline server are enabled. If either is disabled, the problem does not show up. This means that any data written to the log file is lost, i.e., data loss issue. Deeper analysis of the code surfaces that the file system view at the timeline server may serve the stale view, causing the issue. This is because the sync of `HoodieMetadataFileSystemView` is not atomic when the metadata table is enabled. This commit makes the following fixes: - Makes sure all logic in `sync()` and `reset()` is guarded by the write lock. This is done by implementing the logic with the write lock in each file system view, instead of overriding an indirect method (e.g., `runSync()`). - Simplifies the implementation of `syncIfLocalViewBehind()` to be more readable. (cherry picked from commit 2ddcf96)

Gatsby-Lee · 2024-05-30T19:09:16Z

👍

…imeline server (apache#224) This PR cherry-picks the critical fix apache#8079. More details can be found there. We observe that for MOR table, occasionally (<10% for large tables with frequent updates and compactions), the deltacommit after the compaction commit may add a new log file to the old file slice, not the latest file slice, in the corresponding file group. This happens when both the metadata table and timeline server are enabled. If either is disabled, the problem does not show up. This means that any data written to the log file is lost, i.e., data loss issue. Deeper analysis of the code surfaces that the file system view at the timeline server may serve the stale view, causing the issue. This is because the sync of `HoodieMetadataFileSystemView` is not atomic when the metadata table is enabled. This commit makes the following fixes: - Makes sure all logic in `sync()` and `reset()` is guarded by the write lock. This is done by implementing the logic with the write lock in each file system view, instead of overriding an indirect method (e.g., `runSync()`). - Simplifies the implementation of `syncIfLocalViewBehind()` to be more readable.

yihua added priority:blocker Production down; release blocker writer-core labels Mar 1, 2023

yihua assigned nsivabalan, xushiyan and danny0405 Mar 1, 2023

yihua force-pushed the HUDI-5863-fix-fsv-timeline-server branch from 103f3ef to ed1183d Compare March 1, 2023 02:22

[HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at th…

8a753b0

…e timeline server

yihua force-pushed the HUDI-5863-fix-fsv-timeline-server branch from ed1183d to 8a753b0 Compare March 1, 2023 08:57

yihua assigned vinothchandar Mar 1, 2023

danny0405 reviewed Mar 1, 2023

hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java Outdated

nsivabalan reviewed Mar 1, 2023

Fix reset

a3062bb

danny0405 requested changes Mar 2, 2023

vinothchandar reviewed Mar 2, 2023

yihua added 6 commits March 2, 2023 15:32

Add one more test

6f2b9ba

Address review comments

d2b11c5

Remove redundant code and comments

fc4924f

Cleanup

9c8cfc0

Fix sync

c162956

Add docs

7fff406

danny0405 reviewed Mar 4, 2023

hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java Outdated

danny0405 approved these changes Mar 4, 2023

Address review comment

d660deb

yihua merged commit 2ddcf96 into apache:master Mar 4, 2023

yihua mentioned this pull request Mar 8, 2023

[DO NOT MERGE] Showcase HoodieMetadataFileSystemView issue causing inconsistent view and data loss in 0.13.0 #8134

Closed

4 tasks

danny0405 mentioned this pull request Mar 9, 2023

[SUPPORT] data loss in new base file after compaction #8132

Open

nsivabalan mentioned this pull request Mar 12, 2023

[SUPPORT] MOR Table Duplicated Records Found #8121

Closed

codope mentioned this pull request Mar 29, 2023

[SUPPORT]Duplicate data in MOR table Hudi #8236

Closed

yihua mentioned this pull request Mar 30, 2023

[HUDI-1982] Remove unnecessary synchronization #3041

Closed

5 tasks

danny0405 mentioned this pull request Apr 28, 2023

[SUPPORT]Missing data problem，exigency！！！ #6102

Closed

7 participants
