Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[LibOS] Make get_new_id immediately move ID ownership if needed #109

Merged
merged 1 commit into from
Oct 2, 2021

Conversation

boryspoplawski
Copy link
Contributor

@boryspoplawski boryspoplawski commented Sep 30, 2021

Description of the changes

Previously allocating ID for a new process did not immediately change the ownership of this ID (it happened later in the child process). This could cause a issues when one thread issues a fork syscall and another exits, releasing its own thread ID - IPC leader might get notified to release an ID range, which it thinks is a part of a bigger one, since it was not yet notified about the ID ownership change.
This commit causes allocating a new ID to also immediately notify IPC leader about ownership change, atomically w.r.t. other threads in the current process.


This change is Reviewable

Copy link

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 6 files at r1, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: ITL) (waiting on @boryspoplawski)

a discussion (no related file):
For my own understanding, is this a scenario for the previous bug:

[P1 is the IPC leader]
P1 forks P2
P2 spawns new thread T3

P2 forks P4    simulteneously     T3 exits
[P2 now removes from owned P4 and T3, which triggers "release range 3-4" IPC msg]
P1 receives "release range 3-4" and marks these two as unused

P4 starts and performs `ipc_change_id_owner(4, <vmid>)`
P1 receives ipc_change_id_owner(4) and panics -- ID 4 unknown!

And now your PR fixes it like this:

[P1 is the IPC leader]
P1 forks P2
P2 spawns new thread T3

P2 forks P4    simulteneously     T3 exits
P2 removes from owned T3
P2 performs `ipc_change_id_owner(4, <child-vmid>)`

P1 never receives "release range 4" with this new PR, and ID 4 is always marked as used

Is my understanding correct?



LibOS/shim/src/ipc/shim_ipc_pid.c, line 207 at r1 (raw file):

    if (range->start != start || range->end != end) {
        BUG();
    }

Looks like a remnant from your gdb session? Not blocking though.

Copy link
Contributor Author

@boryspoplawski boryspoplawski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: ITL) (waiting on @dimakuv)

a discussion (no related file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

For my own understanding, is this a scenario for the previous bug:

[P1 is the IPC leader]
P1 forks P2
P2 spawns new thread T3

P2 forks P4    simulteneously     T3 exits
[P2 now removes from owned P4 and T3, which triggers "release range 3-4" IPC msg]
P1 receives "release range 3-4" and marks these two as unused

P4 starts and performs `ipc_change_id_owner(4, <vmid>)`
P1 receives ipc_change_id_owner(4) and panics -- ID 4 unknown!

And now your PR fixes it like this:

[P1 is the IPC leader]
P1 forks P2
P2 spawns new thread T3

P2 forks P4    simulteneously     T3 exits
P2 removes from owned T3
P2 performs `ipc_change_id_owner(4, <child-vmid>)`

P1 never receives "release range 4" with this new PR, and ID 4 is always marked as used

Is my understanding correct?

Almost. There is a difference between releasing and removing from owned IDs.
[P2 now removes from owned P4 and T3, which triggers "release range 3-4" IPC msg] - P4 was only removed from owned, while T3 was marked as no longer used (hence releasable). This caused the range to be split locally into three ranges: 3-3, 4-4 and 5-X The message that went to IPC leader was release range 3-3, but the leader only knew about range 3-X, and due to how range lookups currently work, it found this range and deleted it anyway.
Also note these steps happen under a lock:

P2 removes from owned T3
P2 performs `ipc_change_id_owner(4, <child-vmid>)`


LibOS/shim/src/ipc/shim_ipc_pid.c, line 207 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Looks like a remnant from your gdb session? Not blocking though.

I've actually debugged this purely by reading code :)

No, this change is intended. It's just like a release-build assert. I wanted to add this, since the culprit of the original issue was desynchronized state and this was the only place it could be detected (we had an assert, but seems bug reporters ran this in release build).

Copy link

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, all discussions resolved, not enough approvals from maintainers (1 more required)

a discussion (no related file):

but the leader only knew about range 3-X, and due to how range lookups currently work, it found this range and deleted it anyway.

Oooh! Ok. That's not a bug? Anyway, resolving.


Copy link
Contributor Author

@boryspoplawski boryspoplawski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, all discussions resolved, not enough approvals from maintainers (1 more required)

a discussion (no related file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

but the leader only knew about range 3-X, and due to how range lookups currently work, it found this range and deleted it anyway.

Oooh! Ok. That's not a bug? Anyway, resolving.

What do you mean? There is nothing sensible we can do in case of a desync, so I've opted for a loud BUG().


dimakuv
dimakuv previously approved these changes Oct 1, 2021
Copy link

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, all discussions resolved, not enough approvals from maintainers (1 more required)

a discussion (no related file):

Previously, boryspoplawski (Borys Popławski) wrote…

What do you mean? There is nothing sensible we can do in case of a desync, so I've opted for a loud BUG().

Well, yes, true. Nevermind.


Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 6 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from maintainers (1 more required) (waiting on @boryspoplawski)


-- commits, line 6 at r1:

a issues

Copy link
Contributor Author

@boryspoplawski boryspoplawski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from maintainers (1 more required) (waiting on @mkow)


-- commits, line 6 at r1:

Previously, mkow (Michał Kowalczyk) wrote…
a issues

TODO: at rebase

Copy link
Contributor Author

@boryspoplawski boryspoplawski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: ITL) (waiting on @mkow)


-- commits, line 6 at r1:

Previously, boryspoplawski (Borys Popławski) wrote…

TODO: at rebase

Done.

Previously allocating ID for a new process did not immediately change
the ownership of this ID (it happened later in the child process). This
could cause issues when one thread does a `fork` syscall and another
exits, releasing its own thread ID - IPC leader might get notified to
release an ID range, which it thinks is a part of a bigger one, since it
was not yet notified about the ID ownership change.
This commit causes allocating a new ID to also immediately notify IPC
leader about ownership change, atomically w.r.t. other threads in the
current process.

Signed-off-by: Borys Popławski <[email protected]>
Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: all files reviewed, all discussions resolved, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: ITL)

@boryspoplawski
Copy link
Contributor Author

Jenkins, retest Jenkins-Direct-Sanitizers please (sendfile09 from LTP timed out, known issue)

Copy link

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved

@dimakuv dimakuv merged commit 91ac1a8 into master Oct 2, 2021
@dimakuv dimakuv deleted the borys/ipc_ipc_ipc branch October 2, 2021 16:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants