Skip to content

fix: restore decrease max_fragment_id in manifest#5554

Merged
majin1102 merged 5 commits intolance-format:mainfrom
majin1102:fragment_id_travel
Dec 22, 2025
Merged

fix: restore decrease max_fragment_id in manifest#5554
majin1102 merged 5 commits intolance-format:mainfrom
majin1102:fragment_id_travel

Conversation

@majin1102
Copy link
Copy Markdown
Contributor

Related: #5540

@github-actions github-actions Bot added the bug Something isn't working label Dec 19, 2025
@lance-format lance-format deleted a comment from github-actions Bot Dec 19, 2025
@github-actions
Copy link
Copy Markdown
Contributor

Code Review

Summary

This PR fixes a bug where max_fragment_id can decrease when restoring an older dataset version. The fix ensures that after a restore, the max_fragment_id is the maximum of the restored manifest's value and the current manifest's value.

Analysis

The fix is correct and important. The max_fragment_id is used as a monotonically increasing counter to assign unique fragment IDs. When it decreases, newly created fragments could get IDs that were previously used, causing conflicts with:

  • Index fragment bitmaps (see index/prefilter.rs:64)
  • Fragment-based filtering logic (see index/append.rs:542)
  • Deletion masks

The fix approach is sound: Taking the max of the restored and current manifest's max_fragment_id preserves the monotonicity invariant.

P1 Issues

  1. Test assertion may be flaky: The test calls restore() but then checks dataset_v1.manifest.max_fragment_id(). However, restore() doesn't update the in-memory manifest of the dataset object being used. You should reload the dataset after restore:
// Instead of:
dataset_v1.restore().await.unwrap();
let restored_max = dataset_v1.manifest.max_fragment_id().unwrap_or(0);

// Consider:
dataset_v1.restore().await.unwrap();
let restored_dataset = Dataset::open(test_uri).await.unwrap();
let restored_max = restored_dataset.manifest.max_fragment_id().unwrap_or(0);

Otherwise the test may pass even if the fix doesn't work correctly, since you're checking the old version 1 manifest, not the newly created restored manifest.


Overall this is a good bug fix for an important invariant. Please verify the test is checking the correct manifest.

@github-actions
Copy link
Copy Markdown
Contributor

Code Review

Summary

This PR fixes a bug where restoring an older version of a dataset could decrease max_fragment_id, which could lead to fragment ID collisions when new fragments are created after a restore operation.

P0 Issue - next_row_id has the same vulnerability

The fix for max_fragment_id is correct, but next_row_id has the same issue. When restoring an older version, next_row_id from the old manifest will also be restored, which could cause row ID collisions if the dataset uses stable row IDs.

The fix should also include:

manifest.next_row_id = manifest.next_row_id.max(current_manifest.next_row_id);

This is a critical bug for datasets with stable row IDs enabled, as it could cause duplicate row IDs after a restore operation.

Minor: Test should verify newly restored manifest from disk

The test checks dataset_v1.manifest after calling restore(). While this works because apply_commit updates the in-memory manifest, it would be more robust to also re-open the dataset from disk and verify the persisted max_fragment_id:

// Also verify persisted value
let reopened = Dataset::open(test_uri).await.unwrap();
assert!(reopened.manifest.max_fragment_id().unwrap_or(0) >= latest_max);

@codecov
Copy link
Copy Markdown

codecov Bot commented Dec 19, 2025

Codecov Report

❌ Patch coverage is 96.29630% with 1 line in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance/src/io/commit.rs 95.65% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this. Since we ruled out issues with branches, I agree we just need to fix restore. This solution looks good to me.

Just have some minor suggestions for the tests.

Comment thread rust/lance/src/io/commit.rs Outdated
Comment thread rust/lance/src/io/commit.rs Outdated
majin1102 and others added 2 commits December 22, 2025 13:59
Co-authored-by: Will Jones <willjones127@gmail.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
@github-actions
Copy link
Copy Markdown
Contributor

Code Review

Summary: This PR fixes a bug where restore() could decrease max_fragment_id, potentially causing fragment ID collisions when new fragments are created after restoring to an earlier version.

Assessment: LGTM

The fix is correct and minimal. The approach of taking the max of the restored manifest's max_fragment_id and the current manifest's max_fragment_id ensures monotonicity.

Minor suggestions (non-blocking):

  1. Test assertion could be stronger: The test verifies restored_max >= latest_max, but since max_fragment_id should be strictly monotonic and version 1 had a lower value, you could assert restored_max == latest_max to be more precise about the expected behavior.

  2. Consider edge case: When current_manifest.max_fragment_id is None (legacy datasets without this field), the .max() will work correctly because None is compared as less than Some(_). This is fine, just noting for documentation.

The test coverage is appropriate for this bug fix.

@github-actions
Copy link
Copy Markdown
Contributor

Code Review

Summary: This PR fixes a bug where restoring an older version could decrease max_fragment_id, which would cause fragment ID collisions on subsequent writes.

Assessment

The fix is correct and well-tested. The approach ensures that when restoring from an old manifest, the max_fragment_id from the current manifest is preserved if it is higher than the restored version.

No P0 or P1 issues found.

Minor observation (not blocking):

  • The test uses memory:// as recommended in CLAUDE.md - good practice.
  • Both call sites (do_commit_detached_transaction and commit_transaction) are updated consistently.

@github-actions
Copy link
Copy Markdown
Contributor

Code Review

P0: Bug - Type mismatch between Option<u32> and Option<u32>.max()

The fix at transaction.rs:1541-1543 has a type issue:

manifest.max_fragment_id = manifest
    .max_fragment_id
    .max(current_manifest.max_fragment_id);

Here max_fragment_id is Option<u32> (see manifest.rs:74). The Option::max() method compares options where None < Some(_), so this mostly works. However, this creates a subtle behavior issue:

  • If manifest.max_fragment_id = None and current_manifest.max_fragment_id = Some(5), result is Some(5)
  • If manifest.max_fragment_id = Some(3) and current_manifest.max_fragment_id = None, result is Some(3)
  • If both are Some, takes the max ✓

The logic is actually correct due to how Option::max() works (None is considered less than Some). After further analysis, this is not a bug - the implementation is correct.

Suggestion: Consider using the accessor method

For consistency with the rest of the codebase, consider using the max_fragment_id() accessor method instead of directly accessing the field, though the current approach is also valid:

manifest.max_fragment_id = manifest
    .max_fragment_id
    .max(current_manifest.max_fragment_id);

The accessor max_fragment_id() returns Option<u64> and computes from fragments if not set, but directly accessing the Option<u32> field is appropriate here since we want to preserve the stored high water mark.

Test Coverage ✓

The test test_restore_does_not_decrease_max_fragment_id properly validates the fix by:

  1. Creating a dataset and appending data to advance max_fragment_id
  2. Restoring an earlier version
  3. Asserting max_fragment_id doesn't decrease

LGTM - the fix is correct and the test coverage is appropriate.

@majin1102 majin1102 merged commit 42b5938 into lance-format:main Dec 22, 2025
39 of 40 checks passed
wjones127 added a commit to wjones127/lance that referenced this pull request Dec 30, 2025
Related: lance-format#5540

---------

Co-authored-by: majin.nathan <majin.nathan@bytedance.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026
Related: lance-format#5540

---------

Co-authored-by: majin.nathan <majin.nathan@bytedance.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants