@yihua
Contributor

@yihua yihua commented Mar 1, 2023

Change Logs

We observe that for MOR tables, occasionally (<10% for large tables with frequent updates and compactions, <1% of updates affected), the deltacommit after a compaction commit may add a new log file to the old file slice, not the latest file slice, in the corresponding file group. Any data written to that log file is lost, i.e., this is a data loss issue. This happens only when both the metadata table and the timeline server are enabled. If either is disabled, the problem does not show up.

Deeper analysis of the code reveals that the file system view at the timeline server may serve a stale view, causing the issue. This is because the sync of HoodieMetadataFileSystemView is not atomic when the metadata table is enabled.

Here's a concrete walkthrough of how this problem can happen. When the timeline server is long-running and is passed to the write client for reuse (as in Deltastreamer), the timeline server needs to refresh and sync its file system view whenever that view is behind what the client (which sends the request to the timeline server) sees. The check is based on whether the latest instant seen by the client matches what the timeline server's file system view has. The sync logic is implemented in RequestHandler:

[L1]  private boolean syncIfLocalViewBehind(Context ctx) {
[L2]    if (isLocalViewBehind(ctx)) {
[L3]      String basePath = ctx.queryParam(RemoteHoodieTableFileSystemView.BASEPATH_PARAM);
[L4]      String lastKnownInstantFromClient = ctx.queryParamAsClass(RemoteHoodieTableFileSystemView.LAST_INSTANT_TS, String.class).getOrDefault(HoodieTimeline.INVALID_INSTANT_TS);
[L5]      SyncableFileSystemView view = viewManager.getFileSystemView(basePath);
[L6]      synchronized (view) {
[L7]        if (isLocalViewBehind(ctx)) {
[L8]          HoodieTimeline localTimeline = viewManager.getFileSystemView(basePath).getTimeline();
[L9]          LOG.info("Syncing view as client passed last known instant " + lastKnownInstantFromClient
[L10]              + " as last known instant but server has the following last instant on timeline :"
[L11]              + localTimeline.lastInstant());
[L12]          view.sync();
[L13]          return true;
[L14]        }
[L15]      }
[L16]    }
[L17]    return false;
[L18]  }

  private boolean isLocalViewBehind(Context ctx) {
    String basePath = ctx.queryParam(RemoteHoodieTableFileSystemView.BASEPATH_PARAM);
    String lastKnownInstantFromClient =
        ctx.queryParamAsClass(RemoteHoodieTableFileSystemView.LAST_INSTANT_TS, String.class).getOrDefault(HoodieTimeline.INVALID_INSTANT_TS);
    String timelineHashFromClient = ctx.queryParamAsClass(RemoteHoodieTableFileSystemView.TIMELINE_HASH, String.class).getOrDefault("");
    HoodieTimeline localTimeline =
        viewManager.getFileSystemView(basePath).getTimeline().filterCompletedOrMajorOrMinorCompactionInstants();
    if (LOG.isDebugEnabled()) {
      LOG.debug("Client [ LastTs=" + lastKnownInstantFromClient + ", TimelineHash=" + timelineHashFromClient
          + "], localTimeline=" + localTimeline.getInstants());
    }

    if ((!localTimeline.getInstantsAsStream().findAny().isPresent())
        && HoodieTimeline.INVALID_INSTANT_TS.equals(lastKnownInstantFromClient)) {
      return false;
    }

    String localTimelineHash = localTimeline.getTimelineHash();
    // refresh if timeline hash mismatches
    if (!localTimelineHash.equals(timelineHashFromClient)) {
      return true;
    }

    // As a safety check, even if hash is same, ensure instant is present
    return !localTimeline.containsOrBeforeTimelineStarts(lastKnownInstantFromClient);
  }

Here's the implementation of sync() of the file system view based on the metadata table (HoodieMetadataFileSystemView):

  @Override
  public void sync() {
    super.sync();
    tableMetadata.reset();
  }

The super.sync() calls the sync() in AbstractTableFileSystemView:

  @Override
  public void sync() {
    HoodieTimeline oldTimeline = getTimeline();
    HoodieTimeline newTimeline = metaClient.reloadActiveTimeline().filterCompletedOrMajorOrMinorCompactionInstants();
    try {
      writeLock.lock();
      runSync(oldTimeline, newTimeline);
    } finally {
      writeLock.unlock();
    }
  }

Note that the file system view logic should be guarded by the read-write lock: any read of the cached file system info should be blocked until updates to the cache are fully done, and those updates are guarded by the write lock. However, for HoodieMetadataFileSystemView, the write lock covers the sync logic only partially. tableMetadata.reset() runs outside the write lock, which creates a race condition in which the metadata table content can be stale when it is read.

  • When metadata table is enabled using HoodieBackedTableMetadata, tableMetadata.reset() has logic to execute.
  • When metadata table is disabled, with FileSystemBackedTableMetadata, tableMetadata.reset() is a no-op, so there is no issue.
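
The partial guard described above can be illustrated with a minimal sketch. It assumes a hypothetical stand-in class (PartiallyGuardedViewSketch, not Hudi code) in which two integer versions represent the timeline and the metadata table content, and a test hook (betweenSteps) that runs in the unguarded window. A reader thread that takes the read lock in that window is not blocked and observes a refreshed timeline together with old metadata.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a metadata-backed file system view; not Hudi code.
class PartiallyGuardedViewSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private int timelineVersion = 0;   // refreshed under the write lock
  private int metadataVersion = 0;   // refreshed after the write lock is released
  // Test hook (an assumption of this sketch) that runs between the two refresh steps.
  Runnable betweenSteps = () -> { };

  void sync() {
    lock.writeLock().lock();
    try {
      timelineVersion++;             // stands in for runSync(oldTimeline, newTimeline)
    } finally {
      lock.writeLock().unlock();
    }
    betweenSteps.run();              // window in which readers are not blocked
    metadataVersion++;               // stands in for tableMetadata.reset()
  }

  // Reads both versions under the read lock, as a request handler thread would.
  String read() {
    lock.readLock().lock();
    try {
      return timelineVersion + "/" + metadataVersion;
    } finally {
      lock.readLock().unlock();
    }
  }

  // Runs one sync and reports what a concurrent reader sees inside the window.
  static String observeMidSync() {
    PartiallyGuardedViewSketch view = new PartiallyGuardedViewSketch();
    String[] seen = new String[1];
    view.betweenSteps = () -> {
      Thread reader = new Thread(() -> seen[0] = view.read());
      reader.start();
      try {
        reader.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    };
    view.sync();
    return seen[0];
  }

  public static void main(String[] args) {
    // The reader sees timeline version 1 with metadata version 0.
    System.out.println(observeMidSync());
  }
}
```

The read lock is acquired without waiting because the write lock has already been released, which is the window in which a stale answer can be served.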

Let's now look at how concurrent requests are handled in RequestHandler such that a stale view is read. Consider two concurrent requests to the timeline server, REQ1 and REQ2:

  • Thread1 for REQ1: At L2 in syncIfLocalViewBehind(), isLocalViewBehind() returns true. The thread then enters the synchronized block at L6 to sync the file system view. Partway through the sync, the timeline has already been updated, yet the metadata and the file system view have not been refreshed.
  • Thread2 for REQ2: This request arrives after the timeline has been updated by REQ1 at the timeline server, while the refresh of the metadata and the file system view is still in progress. At this moment, the check at L2 in syncIfLocalViewBehind() returns false, so the thread skips the synchronized block and goes on to get the information from the file system view. As mentioned above, because the sync logic is not fully guarded by the write lock, the read lock is granted as if the info were fully updated, and the file system view returns stale info.
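
The interleaving of REQ1 and REQ2 can be modeled with a small, deterministic sketch. Everything in it is an assumption made for illustration: StaleViewRaceSketch, its fields, and the instant strings are hypothetical and heavily simplified compared to RequestHandler. Two latches force REQ2 to arrive after the timeline has been refreshed but before the cached view has been.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical, simplified model of the request flow; names are illustrative, not Hudi APIs.
class StaleViewRaceSketch {
  private final Object viewMonitor = new Object();
  private volatile String serverInstant = "001";          // latest instant on the server timeline
  private volatile String latestFileSlice = "slice@001";  // answer served from the cached view
  private final CountDownLatch timelineUpdated = new CountDownLatch(1);
  private final CountDownLatch req2Done = new CountDownLatch(1);

  private boolean isLocalViewBehind(String clientInstant) {
    return !serverInstant.equals(clientInstant);
  }

  String handleRequest(String clientInstant) {
    if (isLocalViewBehind(clientInstant)) {
      synchronized (viewMonitor) {
        if (isLocalViewBehind(clientInstant)) {
          serverInstant = clientInstant;               // timeline is refreshed first
          timelineUpdated.countDown();
          await(req2Done);                             // hold the window open for REQ2
          latestFileSlice = "slice@" + clientInstant;  // cached view is refreshed last
        }
      }
    }
    return latestFileSlice;
  }

  private static void await(CountDownLatch latch) {
    try {
      latch.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  // Returns the answers given to REQ1 (index 0) and REQ2 (index 1).
  static String[] run() {
    StaleViewRaceSketch server = new StaleViewRaceSketch();
    String[] answers = new String[2];
    Thread req1 = new Thread(() -> answers[0] = server.handleRequest("002"));
    req1.start();
    await(server.timelineUpdated);
    // REQ2: the timeline already matches, so the synchronized block is skipped.
    answers[1] = server.handleRequest("002");
    server.req2Done.countDown();
    try {
      req1.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return answers;
  }

  public static void main(String[] args) {
    String[] answers = run();
    System.out.println("REQ1 -> " + answers[0] + ", REQ2 -> " + answers[1]);
  }
}
```

In this model REQ2 is answered with the file slice of the old instant even though both requests carry the same latest instant, which mirrors how a deltacommit could be directed to the old file slice.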

To reproduce this race condition consistently, a new test TestRemoteFileSystemViewWithMetadataTable is added with HoodieBackedTestDelayedTableMetadata to delay the reset() process. By applying the same change to HoodieBackedTableMetadata, we can also reproduce the same problem with Deltastreamer.
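
The delay technique can be sketched as follows. This is not the code of HoodieBackedTestDelayedTableMetadata. TableMetadataSketch and DelayedTableMetadataSketch are hypothetical stand-ins that only show the pattern: a test subclass sleeps inside reset() before delegating, which widens the window between the timeline refresh and the metadata refresh.

```java
// Hypothetical stand-in for the production metadata class; not Hudi code.
class TableMetadataSketch {
  private volatile int generation = 0;

  void reset() {
    generation++;  // stands in for reloading the metadata table readers
  }

  int generation() {
    return generation;
  }
}

// Test-only subclass that delays reset() so that a racing reader is far
// more likely to observe the state from before the reset.
class DelayedTableMetadataSketch extends TableMetadataSketch {
  private final long delayMillis;

  DelayedTableMetadataSketch(long delayMillis) {
    this.delayMillis = delayMillis;
  }

  @Override
  void reset() {
    try {
      Thread.sleep(delayMillis);  // intentional delay before the real reset
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    super.reset();
  }
}

class DelayedResetDemo {
  public static void main(String[] args) {
    DelayedTableMetadataSketch metadata = new DelayedTableMetadataSketch(50);
    metadata.reset();
    System.out.println("generation after reset: " + metadata.generation());
  }
}
```

The delay value is arbitrary here. In a real test it would need to be long enough for a second request to reach the server while reset() is still sleeping.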

The fix is the following:

  • Makes sure all logic in sync() and reset() is guarded by the write lock. This is done by implementing the logic with the write lock in each file system view, instead of overriding an indirect method (e.g., runSync()).
  • Simplifies the implementation of syncIfLocalViewBehind() to be more readable.
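
The shape of the first fix can be sketched as below, under stated assumptions: ViewSketch and MetadataViewSketch are hypothetical stand-ins, not the actual Hudi classes, and the betweenSteps hook exists only so the behavior can be observed. The concrete view takes the write lock itself and keeps it for every refresh step, so a reader on another thread cannot get the read lock until both the timeline and the metadata are refreshed.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-ins for the view classes; not the actual Hudi implementation.
abstract class ViewSketch {
  protected final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  protected int timelineVersion = 0;

  // Each concrete view takes the write lock around all of its own refresh steps.
  abstract void sync();

  protected void refreshTimeline() {
    timelineVersion++;
  }
}

class MetadataViewSketch extends ViewSketch {
  private int metadataVersion = 0;
  // Test hook (an assumption of this sketch) that runs between the two refresh steps.
  Runnable betweenSteps = () -> { };

  @Override
  void sync() {
    lock.writeLock().lock();
    try {
      refreshTimeline();
      betweenSteps.run();
      metadataVersion++;  // stands in for tableMetadata.reset(), now under the write lock
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Returns "timeline/metadata", or "blocked" if another thread holds the write lock.
  // A real reader would wait with lock() instead; tryLock() keeps the sketch deterministic.
  String tryRead() {
    if (!lock.readLock().tryLock()) {
      return "blocked";
    }
    try {
      return timelineVersion + "/" + metadataVersion;
    } finally {
      lock.readLock().unlock();
    }
  }
}

class FixedSyncDemo {
  // Index 0: what a concurrent reader sees mid-sync. Index 1: what is read after the sync.
  static String[] run() {
    MetadataViewSketch view = new MetadataViewSketch();
    String[] seen = new String[2];
    view.betweenSteps = () -> {
      Thread reader = new Thread(() -> seen[0] = view.tryRead());
      reader.start();
      try {
        reader.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    };
    view.sync();
    seen[1] = view.tryRead();
    return seen;
  }

  public static void main(String[] args) {
    String[] seen = run();
    System.out.println("mid-sync: " + seen[0] + ", after sync: " + seen[1]);
  }
}
```

With the lock held across both steps, the mid-sync reader is turned away and the first successful read sees matching versions.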

History and Impact of the problem:

The problematic implementation of sync() and reset() in the HoodieMetadataFileSystemView class was introduced by #4307 to properly sync and reset the file system view based on the metadata table. This had no impact at the time, as Hudi refreshed/synced the file system view explicitly while initializing the HoodieTable instance for each write, masking the issue. Only when the file system view based on the metadata table is synced because of new requests at the timeline server is there a chance of HoodieMetadataFileSystemView serving a stale view, and that path was not hit when #4307 landed. Subsequently, two optimization PRs, #5617 and #5716, landed to remove "unnecessary" refreshes/syncs of the file system view when initializing the HoodieTable instance. It turns out that, without this PR properly fixing sync() and reset() in HoodieMetadataFileSystemView, such refreshes and syncs are necessary to avoid stale and inconsistent views. As the probability of hitting this issue is very low, it was not uncovered during manual testing or CI runs.

After running the same tests added in this PR, we found that the following Hudi releases are affected: 0.11.1, 0.12.0, 0.12.1, 0.12.2, 0.13.0. This also aligns with the findings around the offending PRs mentioned above. Although the tests pass for 0.13.0, because of another bug #8080 introduced in the 0.13.0 release (the timeline server instance passed to the write client is ignored due to this bug), there's still an impact on 0.13.0. #8134 reproduces the issue for 0.13.0.

Impact

This PR fixes the data loss issue and makes sure that the timeline server with the metadata table enabled (using HoodieMetadataFileSystemView) always serves the correct information.

This has been tested with Deltastreamer writing a MOR table with inline compaction in continuous mode, with an intentional delay in HoodieBackedTableMetadata, just like HoodieBackedTestDelayedTableMetadata, to ensure the problem happens consistently before the fix. After the fix, the problem goes away.

Risk level

medium

Documentation Update

We need to update the release notes regarding this severe regression that can cause data loss: HUDI-5864.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@yihua yihua added priority:blocker Production down; release blocker writer-core labels Mar 1, 2023
@yihua yihua force-pushed the HUDI-5863-fix-fsv-timeline-server branch from 103f3ef to ed1183d Compare March 1, 2023 02:22
@yihua yihua force-pushed the HUDI-5863-fix-fsv-timeline-server branch from ed1183d to 8a753b0 Compare March 1, 2023 08:57
* {@link AbstractTableFileSystemView#reset} directly, which may cause stale file system view
* to be served.
*/
protected void runReset() {
Contributor

Guess the method should be guarded by the reset writeLock? #reset should invoke #runReset instead, I guess.

And can we give these methods more straight-forward names?

Like sync -> syncThreadSafely, runSync -> sync, reset->resetThreadSafely, runReset -> reset

Contributor

@nsivabalan nsivabalan Mar 1, 2023

It is guarded, Danny.

As per master:

Non-metadata file system view:

sync() {
  lock
    runSync {
      clean and init timeline.
    }
  release lock
}

With metadata file system view (MFSV):

sync() in MFSV {
  super.sync {
    lock
      runSync {
        clean and init timeline.
      }
    release lock
  }
  tableMetadata.reset();
}

With this patch, here is how it changes.

Without metadata (no changes):

sync() {
  lock
    runSync {
      clean and init timeline.
    }
  release lock
}

With MFSV:

sync() in ATFSV (not overridden in MFSV) {
  lock
    runSync {   // overridden in MFSV
      super.runSync {
        clean and init timeline.
      }
      tableMetadata.reset();
    }
  release lock
}

Contributor

but agree w/ naming. we can call runSyncThreadSafely or syncThreadSafely instead of runSync.

Contributor Author

Yes, reset() should call runReset(). Not sure why that change got reverted.

I don't think we should change the naming of sync and reset in the interface SyncableFileSystemView, as they can also have a different implementation. I've added documentation to clarify how the custom logic should be overridden.

Contributor

Don’t think the doc makes any sense, let’s give them better name.

Contributor Author

@yihua yihua Mar 4, 2023

Removed the runSync and runReset methods to avoid confusion and made every implementation explicitly use the write lock, except the remote FSV. If a new file system view needs to be added, the author should look at the existing implementations for reference. Renaming won't prevent the author from doing the wrong thing.

}
try {
// This read lock makes sure that if the local view of the table is being synced,
// no timeline server requests should be processed or handled until the sync process
Contributor

Don't think we need an explicit lock obj again, the whole code block is guarded by a synchronized object lock.

Contributor

we call this method twice. only one caller is within synchronized

Contributor Author

The synchronized block won't be executed if the timeline is already refreshed and this method returns false, but in that case the metadata file system view might not have been fully updated yet, causing the issue.

Contributor Author

@danny0405 This is simplified now. You can also check my updated PR description for how the race condition can happen.

* Table metadata provided by an internal DFS backed Hudi metadata table,
* with an intentional delay in `reset()` to test concurrent reads and writes.
*/
public class HoodieBackedTestDelayedTableMetadata extends HoodieBackedTableMetadata {
Contributor

does this really have to be in a separate file of its own? can we not embed it within another test class where it's used?

Contributor Author

Generally a good idea to have a separate class for notable test logic.

@vinothchandar
Member

Is this a regression? what version? Can I look at the offending commit to understand how it was before.

Contributor

@danny0405 danny0405 left a comment

Let's be conservative about the fs view change, still kind of frightened by the fs view change in release 0.11.x that causes the data loss.

Member

@vinothchandar vinothchandar left a comment

me and @bvaradar took a pass at it. Wondering if we can simplify all this more.

@yihua
Contributor Author

yihua commented Mar 4, 2023

Is this a regression? what version? Can I look at the offending commit to understand how it was before.

I updated the PR description to provide more detailed information.

@yihua yihua merged commit 2ddcf96 into apache:master Mar 4, 2023
nsivabalan pushed a commit to nsivabalan/hudi that referenced this pull request Mar 18, 2023
…e timeline server (apache#8079)

We observe that for MOR table, occasionally (<10% for large tables with frequent updates and compactions), the deltacommit after the compaction commit may add a new log file to the old file slice, not the latest file slice, in the corresponding file group.  This happens when both the metadata table and timeline server are enabled.  If either is disabled, the problem does not show up. This means that any data written to the log file is lost, i.e., data loss issue.

Deeper analysis of the code surfaces that the file system view at the timeline server may serve the stale view, causing the issue.  This is because the sync of `HoodieMetadataFileSystemView` is not atomic when the metadata table is enabled.

This commit makes the following fixes:
- Makes sure all logic in `sync()` and `reset()` is guarded by the write lock.  This is done by implementing the logic with the write lock in each file system view, instead of overriding an indirect method (e.g., `runSync()`).
- Simplifies the implementation of `syncIfLocalViewBehind()` to be more readable.
nsivabalan pushed a commit to nsivabalan/hudi that referenced this pull request Mar 22, 2023
…e timeline server (apache#8079)

fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023
…e timeline server (apache#8079)

stayrascal pushed a commit to stayrascal/hudi that referenced this pull request Apr 20, 2023
…e timeline server (apache#8079)

flashJd added a commit to flashJd/hudi that referenced this pull request May 5, 2023
…e timeline server (apache#8079)


(cherry picked from commit 2ddcf96)
KnightChess pushed a commit to KnightChess/hudi that referenced this pull request Jan 2, 2024
…e timeline server (apache#8079)

@Gatsby-Lee
Contributor

👍

alexr17 pushed a commit to alexr17/hudi that referenced this pull request Aug 13, 2025
…imeline server (apache#224)

This PR cherry-picks the critical fix apache#8079. More details can be found there.
