HDDS-13770. SstBackup Directory can have orphan files after bootstrap not present in compaction log table #9169

Merged

jojochuang merged 11 commits into apache:master from jojochuang:HDDS-13770 on Oct 23, 2025

Conversation

@jojochuang (Contributor)

What changes were proposed in this pull request?

HDDS-13770. SstBackup Directory can have orphan files after bootstrap not present in compaction log table

Please describe your PR in detail:

  • Open the snapshot RocksDB in read-only mode to obtain the correct list of SST files, and use that list as the reference when creating hard links.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-13770

How was this patch tested?

Added unit tests.

The previous implementation had a race condition where a compaction
could occur after a checkpoint was created but before SST files were
linked for backup. This could result in orphan SST files in the
backup directory that were not tracked in the compaction log.

This commit fixes the issue by reading the list of SST files to be
backed up directly from the `compactionLogTable` within the database
checkpoint. This ensures that only the SST files that are part of the
consistent checkpoint are included in the backup.

Changes:
- Added a new public method `getCompactionLogSstFiles` to
  `RocksDBCheckpointDiffer` to read the compaction log from a checkpoint.
- Modified `OMDBCheckpointServletInodeBasedXfer.createAndPrepareCheckpoint`
  to use the new method and link only the necessary SST files.
- This also eliminates the redundant copy of the compaction log directory.
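The selective-linking step described above can be sketched as follows. This is a minimal stdlib-only sketch, not the servlet's actual code: the class name `SstLinkSketch` and the parameter names are hypothetical, and the real implementation obtains `sstFiles` from the checkpoint's `compactionLogTable`.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;

// Hypothetical sketch of the selective-linking step: given the SST file
// names read from the checkpoint's compaction log, hard-link only those
// files from the live SST backup dir into the checkpoint staging dir.
public final class SstLinkSketch {
  private SstLinkSketch() { }

  public static int linkReferencedSstFiles(Set<String> sstFiles,
      Path sstBackupDir, Path tmpSstBackupDir) throws IOException {
    int linked = 0;
    for (String sstFile : sstFiles) {
      Path source = sstBackupDir.resolve(sstFile);
      // A file may already have been removed by compaction cleanup;
      // skip it rather than fail the whole checkpoint.
      if (Files.exists(source)) {
        Files.createLink(tmpSstBackupDir.resolve(sstFile), source);
        linked++;
      }
    }
    return linked;
  }
}
```

Because the file set comes from the checkpoint itself, any SST file created by a compaction that races with checkpoint creation is simply absent from the set and never linked.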

@jojochuang jojochuang requested review from Copilot, sadanand48 and swamirishi and removed request for Copilot, sadanand48 and swamirishi October 17, 2025 18:11
Copilot AI left a comment
Pull Request Overview

This PR fixes an issue where the SstBackup directory could contain orphan files after bootstrap that are not present in the compaction log table. The solution opens snapshot RocksDB in read-only mode to get the accurate list of SST files from the compaction log table and uses that as the reference to create hard links.

Key changes:

  • Modified checkpoint creation to read SST file list directly from the compaction log table in the checkpoint
  • Added a new method to extract SST files from a checkpoint's compaction log in read-only mode
  • Replaced directory-wide file copying with selective hard link creation based on compaction log entries

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
OMDBCheckpointServletInodeBasedXfer.java Changed checkpoint preparation to selectively link only SST files present in the compaction log instead of linking all files from backup directory
TestOMDbCheckpointServlet.java Added blank line (formatting change only)
RocksDBCheckpointDiffer.java Added new method getCompactionLogSstFiles to read SST file list from checkpoint's compaction log table in read-only mode


"directory {}", sstFile, getSstBackupDir());
}
}
} catch (Exception e) {
Copilot AI (Oct 17, 2025):
Catching generic Exception is too broad. Catch specific exceptions like RocksDBException or IOException to handle different failure scenarios appropriately and avoid masking unexpected errors.

Comment on lines 1478 to 1481
logEntry.getInputFileInfoList().forEach(f ->
sstFiles.add(f.getFileName()));
logEntry.getOutputFileInfoList().forEach(f ->
sstFiles.add(f.getFileName()));
Copilot AI (Oct 17, 2025):
Using List.add() in a loop can be inefficient if the list grows large. Consider using a Set for sstFiles to automatically handle duplicates and improve lookup performance, or use addAll() if the file info lists can be converted to filename collections.
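The Set-based collection suggested here can be sketched as follows. `LogEntryStub` is a hypothetical stand-in for the generated `CompactionLogEntry` class, which exposes `getInputFileInfoList()` and `getOutputFileInfoList()` in the real code:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for CompactionLogEntry, carrying only the
// input/output SST file names relevant to this sketch.
final class LogEntryStub {
  private final List<String> inputs;
  private final List<String> outputs;

  LogEntryStub(List<String> inputs, List<String> outputs) {
    this.inputs = inputs;
    this.outputs = outputs;
  }

  // A Set deduplicates file names shared across entries (the output of
  // one compaction is often the input of a later one) and gives O(1)
  // membership checks when later deciding which files to link.
  static Set<String> collectSstFileNames(List<LogEntryStub> entries) {
    Set<String> sstFiles = new LinkedHashSet<>();
    for (LogEntryStub e : entries) {
      sstFiles.addAll(e.inputs);
      sstFiles.addAll(e.outputs);
    }
    return sstFiles;
  }
}
```

A `LinkedHashSet` keeps insertion order, which makes logs and tests deterministic while still deduplicating.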

@swamirishi (Contributor) left a comment
Thanks for the patch @jojochuang. I have posted comments inline.

for (String sstFile : sstFiles) {
Path sstFileToLink = getSstBackupDir().resolve(sstFile);
if (Files.exists(sstFileToLink)) {
Files.createLink(tmpSstBackupDir.resolve(sstFile), sstFileToLink);
We don't need to create SST file links; we can directly read the file and write it to the tar archive stream.
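The streaming approach suggested here can be sketched as follows. Since the commons-compress `TarArchiveOutputStream` used by the real servlet is not in the JDK, this sketch substitutes the stdlib `java.util.zip.ZipOutputStream` to illustrate the same pattern; the class and method names are hypothetical:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Sketch of the reviewer's suggestion: rather than hard-linking an SST
// file into a staging dir first, stream its bytes directly into the
// archive. ZipOutputStream stands in for TarArchiveOutputStream.
public final class ArchiveSketch {
  private ArchiveSketch() { }

  public static void includeFile(Path file, String entryName,
      ZipOutputStream archiveOutput) throws IOException {
    archiveOutput.putNextEntry(new ZipEntry(entryName));
    Files.copy(file, archiveOutput);  // copies the file's bytes to the stream
    archiveOutput.closeEntry();
  }
}
```

This avoids creating and cleaning up a temporary hard-link tree, at the cost of holding the archive stream open while each file is read.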

@jojochuang jojochuang added the snapshot https://issues.apache.org/jira/browse/HDDS-6517 label Oct 20, 2025
@swamirishi (Contributor) left a comment
@jojochuang Thanks for addressing the review comments. I have just a few more; once those are addressed, it should be good to merge.

@swamirishi (Contributor)
@jojochuang there are a couple more review comments.

@jojochuang jojochuang marked this pull request as ready for review October 21, 2025 18:26
while (iterator.hasNext()) {
CompactionLogEntry logEntry = iterator.next().getValue();
logEntry.getInputFileInfoList().forEach(f ->
sstFiles.add(Paths.get(f.getFileName())));
@sadanand48 (Contributor):
The path would only have the filename here, right? i.e. something like "x.sst". So when the file stream is passed to the writeDBToArchive method, I believe the absolute path wouldn't be available, and the file won't be copied to the tarball.

@jojochuang (Author):
Thanks @sadanand48 good catch. Can you check again now?

@sadanand48 (Contributor):
Yes, looks good now. As Swami suggested, we will test this part in #9132; I need to make changes there.
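The fix discussed in this thread, recovering a usable path from the bare file name stored in the compaction log, can be sketched as follows. The class and method names are hypothetical; only the resolve-against-the-backup-directory idea is taken from the discussion:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// The compaction log stores bare names like "000123.sst". A bare
// Paths.get(name) is relative, so code that needs to open the file must
// first resolve it against the SST backup directory.
public final class ResolveSketch {
  private ResolveSketch() { }

  public static Path resolveSstFile(Path sstBackupDir, String fileName) {
    Path bare = Paths.get(fileName);
    // getFileName() guards against entries that carry a path component.
    return sstBackupDir.resolve(bare.getFileName());
  }
}
```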

@jojochuang jojochuang requested a review from swamirishi October 23, 2025 01:19
@swamirishi (Contributor) left a comment
LGTM @jojochuang @sadanand48. We should add some unit test cases, which are clearly missing for this race condition. I was thinking we could create a mocked RDBStore, create dummy files, and track all the files being written through a mocked static function linkAndIncludeFile(File file, String entryName, ArchiveOutputStream archiveOutput, Path tmpDir).
Let us do this as part of #9132
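One way to realize the test idea above without mocking a static method is to route the call through an interface and record invocations with a test double; this is a hypothetical stdlib-only sketch, not the approach #9132 necessarily takes:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: route every "link and include" call through an
// interface so a test double can record which archive entries would be
// written, without needing a real archive stream or a mocking library.
public final class RecordingSinkSketch {
  interface FileSink {
    void linkAndIncludeFile(String fileName, String entryName);
  }

  static final class RecordingSink implements FileSink {
    final List<String> entries = new ArrayList<>();

    @Override
    public void linkAndIncludeFile(String fileName, String entryName) {
      entries.add(entryName);  // record the entry instead of writing bytes
    }
  }

  private RecordingSinkSketch() { }
}
```

A test can then assert that exactly the SST files named in the compaction log were passed to the sink, and nothing else.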

…ne/om/OMDBCheckpointServletInodeBasedXfer.java

Co-authored-by: Swaminathan Balachandran <swamirishi.sb@gmail.com>
@sadanand48 (Contributor) left a comment
Thanks for helping with this patch @jojochuang , LGTM.

@jojochuang jojochuang merged commit 1b595f3 into apache:master Oct 23, 2025
43 checks passed
@jojochuang (Author):
Merged. Thanks @swamirishi @sadanand48


Labels

snapshot https://issues.apache.org/jira/browse/HDDS-6517

4 participants