Skip to content

Conversation

@qinsoon
Copy link
Member

@qinsoon qinsoon commented Apr 10, 2024

This PR resolves the issue in #824 (comment). It closes #824.

  • Lock block lists in AbandonedBlockLists when they are accessed. This resolves the data race between AbandonedBlockLists and SweepChunk -- they both access blocks.
  • Add a feature ms_block_list_sanity. This embeds a sanity block list Vec<Block> in BlockList, and checks if the actual block list with side metadata and linked list matches the sanity list. ms_block_list_sanity is enabled for extreme_assertions.

@qinsoon qinsoon added the PR-extended-testing Run extended tests for the pull request label Apr 10, 2024
@qinsoon qinsoon requested a review from wks April 10, 2024 07:17
@wks
Copy link
Collaborator

wks commented Apr 10, 2024

We can add the assertions I mentioned in #1103 (not the early exit(0) of course) into this PR so that we can check if that problem persists after this PR.

@qinsoon
Copy link
Member Author

qinsoon commented Apr 10, 2024

We can add the assertions I mentioned in #1103 (not the early exit(0) of course) into this PR so that we can check if that problem persists after this PR.

I tried. The issue can still be reproduced. This PR does not fix #1103.


#[cfg(feature = "ms_block_list_sanity")]
{
let mut sanity_list = self.sanity_list.lock().unwrap();
Copy link
Collaborator

@wks wks Apr 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this method is not thread-safe, we can use try_lock() instead of lock(), assert that try_lock() always succeeds. If the assertion fails, it means there is a contention on the lock, and the only case that would happen is that two threads attempted to call thread-unsafe methods concurrently.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the method is_empty itself can be called from multiple threads concurrently as long as there isn't another thread calling other methods that mutate this BlockList. In theory, Rust's ownership model and borrow checker can prevent that from happening because it disallows a & to coexist with &mut.

Copy link
Collaborator

@wks wks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this PR fixes the issue, I am OK with it because the main part is just adding a lock.

But I think there is something fundamentally wrong with our implementation. We can discuss more about it on Zulip. My hypothesis is that SweepChunk doesn't really need to sweep the blocks, but only needs to count marked blocks. Once we remove that, there should be no data race anymore. That'll fundamentally address the race this PR is addressing.

Using try_lock on the sanity check should make it fail faster if a race occurs.


#[cfg(feature = "ms_block_list_sanity")]
{
let mut sanity_list = self.sanity_list.lock().unwrap();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the method is_empty itself can be called from multiple threads concurrently as long as there isn't another thread calling other methods that mutate this BlockList. In theory, Rust's ownership model and borrow checker can prevent that from happening because it disallows a & to coexist with &mut.


#[cfg(feature = "ms_block_list_sanity")]
{
let mut sanity_list = self.sanity_list.lock().unwrap();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if there is a contention on this lock, there must be a race because no two threads should call BlockList::remove at the same time. We can use try_lock() here.


#[cfg(feature = "ms_block_list_sanity")]
{
let mut sanity_list = self.sanity_list.lock().unwrap();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto


#[cfg(feature = "ms_block_list_sanity")]
{
let mut sanity_list = self.sanity_list.lock().unwrap();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same.

@qinsoon
Copy link
Member Author

qinsoon commented Apr 12, 2024

I am closing this PR. See discussion in #1103, and the new PR #1112.

@qinsoon qinsoon closed this Apr 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

PR-extended-testing Run extended tests for the pull request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Native Mark-Sweep: block sweeping anomaly

2 participants