Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[LibOS] Convert pseudo filesystem to use inodes #402

Merged
merged 1 commit into from
Feb 15, 2022
Merged

Conversation

pwmarcz
Copy link
Contributor

@pwmarcz pwmarcz commented Feb 14, 2022

Description of the changes

In addition, pseudo_node objects are now tracked by number and checkpointed.

How to test this PR?

This should have a pretty good coverage on Jenkins.


This change is Reviewable

Copy link
Contributor Author

@pwmarcz pwmarcz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 2 files reviewed, all discussions resolved, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: ITL)


LibOS/shim/include/shim_fs_pseudo.h, line 69 at r1 (raw file):

#define PSEUDO_PERM_FILE_RW PERM_rw_rw_rw_

/* Maximum count of created nodes (`pseudo_add_*` calls) */

FYI: The current number is something around 70.


LibOS/shim/include/shim_fs_pseudo.h, line 79 at r1 (raw file):

 * The node can then be further customized by directly modifying other fields.
 *
 * NOTE: We assume that all Gramine processes within a single run create exactly the same set of

FYI: The exact pseudo nodes created depends on filesystems that we initialize. Currently, /sys is initialized depending on manifest, and /dev/attestation depending on whether we're running under SGX.

A safer way would be to use paths for checkpointing, e.g. /dev/null or /proc/!pid/cmdline; note that some pseudo-nodes like !pid would need an artificial name). But it's more complicated, and IMO not worth it.

Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 2 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: ITL), "fixup! " found in commit messages' one-liners (waiting on @pwmarcz)


LibOS/shim/include/shim_fs_pseudo.h, line 79 at r1 (raw file):

Previously, pwmarcz (Paweł Marczewski) wrote…

FYI: The exact pseudo nodes created depends on filesystems that we initialize. Currently, /sys is initialized depending on manifest, and /dev/attestation depending on whether we're running under SGX.

A safer way would be to use paths for checkpointing, e.g. /dev/null or /proc/!pid/cmdline; note that some pseudo-nodes like !pid would need an artificial name). But it's more complicated, and IMO not worth it.

Hmm, how does this work with pseudo nodes created dynamically, like e.g. /proc/<pid>/fd/*? I think we don't create them right now, but this implementation seems to disallow such a thing to exist?


LibOS/shim/src/fs/shim_fs_pseudo.c, line 213 at r2 (raw file):

Note that we might not be holding `g_dcache_lock` here, so we cannot call
`pseudo_readdir`.

Is it still safe to traverse the children list without any locks? I guess you're assuming here that the pseudo directories can't change in runtime?


LibOS/shim/src/fs/shim_fs_pseudo.c, line 503 at r2 (raw file):

g_pseudo_node_count++

Shouldn't this be atomic? Or you're also assuming here that we create pseudo nodes only during initialization?

Copy link
Contributor Author

@pwmarcz pwmarcz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 3 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: ITL), "fixup! " found in commit messages' one-liners (waiting on @mkow)


LibOS/shim/include/shim_fs_pseudo.h, line 79 at r1 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Hmm, how does this work with pseudo nodes created dynamically, like e.g. /proc/<pid>/fd/*? I think we don't create them right now, but this implementation seems to disallow such a thing to exist?

That's right. The list of pseudo-nodes is static. This checkpointing scheme also assumes that: otherwise you couldn't easily send a handle to a child process and expect it to find the same node.

However, these nodes can generate a dynamic list of files. We do have /proc/<pid>/fd/<n>, and it is a single node. The name_exists / list_names callbacks define the file names (e.g. 0, 1, 2...).


LibOS/shim/src/fs/shim_fs_pseudo.c, line 213 at r2 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…
Note that we might not be holding `g_dcache_lock` here, so we cannot call
`pseudo_readdir`.

Is it still safe to traverse the children list without any locks? I guess you're assuming here that the pseudo directories can't change in runtime?

Yes, it cannot change in runtime.


LibOS/shim/src/fs/shim_fs_pseudo.c, line 503 at r2 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…
g_pseudo_node_count++

Shouldn't this be atomic? Or you're also assuming here that we create pseudo nodes only during initialization?

Yes. This is an assumption this module always made. Notice that there are no locks taken for adding the new node to linked lists below.

Copy link

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 2 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: ITL), "fixup! " found in commit messages' one-liners (waiting on @mkow and @pwmarcz)

a discussion (no related file):
I'm confused as to what happens in e.g. this case:

  1. Parent process creates a child process and opens /proc/<child-pid>/exe.
  2. Child process terminates.
  3. Parent process creates another child process.
  4. New child process gets a checkpoint with the inode for /proc/<child-pid>/exe and the opened shim handle.
  5. Child process reads from the opened shim handle:
    • In the old code, pseudo_read() would return -ENOENT because it would try to pseudo_find and fail to find that <child-pid>
    • In your new code, pseudo_read() would return the string that was memorized by the parent process in hdl->info.str.mem.

Is my understanding correct? Also, is this how the Linux kernel does it?

In other words, this PR fixes the problem of "dangling shim handles" -- the handles' inodes (and corresponding pseudo_nodes) are now correctly checkpointed and "survive" in the child process.



LibOS/shim/include/shim_fs_pseudo.h, line 79 at r1 (raw file):

Previously, pwmarcz (Paweł Marczewski) wrote…

That's right. The list of pseudo-nodes is static. This checkpointing scheme also assumes that: otherwise you couldn't easily send a handle to a child process and expect it to find the same node.

However, these nodes can generate a dynamic list of files. We do have /proc/<pid>/fd/<n>, and it is a single node. The name_exists / list_names callbacks define the file names (e.g. 0, 1, 2...).

The assumption sounds reasonable. I cannot come up with a hypothetical case when we would diverge a list of pseudo nodes we create in the child process...


LibOS/shim/src/fs/shim_fs_pseudo.c, line 22 at r2 (raw file):

static struct pseudo_node* g_pseudo_nodes[PSEUDO_MAX_NODES];
static unsigned int g_pseudo_node_count;

Maybe add a comment here that g_pseudo_nodes is used only in two cases:

  1. Filled during each-process initialization (mounting of pseudo-FSes), assumed to be deterministic in all processes of the same Gramine instance
  2. Queried during checkpoint-restore in the child process, to bind inodes to pseudo_nodes created in step (1)

Maybe also add a node that in the parent process, this is basically not used at all.


LibOS/shim/src/fs/shim_fs_pseudo.c, line 46 at r2 (raw file):

 * Note that `pseudo_find` might fail for such a checkpointed dentry: for instance, we might have a
 * dentry for `/proc/<pid>` where the process does not exist anymore. Ideally, we would invalidate
 * such dentries; for now, all operations on them will return -ENOENT.

"Ideally, we would invalidate such dentries" -- we don't do it even in the new PR, right?


LibOS/shim/src/fs/shim_fs_pseudo.c, line 64 at r2 (raw file):

static int pseudo_icheckpoint(struct shim_inode* inode, void** out_data, size_t* out_size) {
    unsigned int* id = malloc(sizeof(*id));

Do we just memory leak this object? How is it supposed to be freed / garbage collected?

Ah, I guess it will be freed in the generic checkpoint code (or I guess in the inode-checkpoint code), after it sends all out_data to the child process. Is it so?

Copy link
Contributor Author

@pwmarcz pwmarcz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 2 files reviewed, 7 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: ITL), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)

a discussion (no related file):
I think that's correct.

Also, is this how the Linux kernel does it?

I haven't looked too closely. I think there is a similar layer of abstraction to our "str" files, so I would expect the value to be memoized.



LibOS/shim/include/shim_fs_pseudo.h, line 79 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

The assumption sounds reasonable. I cannot come up with a hypothetical case when we would diverge a list of pseudo nodes we create in the child process...

Mostly I'd be afraid of doing this by mistake - so far, the initialization has some conditions, but they at most depend on the manifest.


LibOS/shim/src/fs/shim_fs_pseudo.c, line 22 at r2 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Maybe add a comment here that g_pseudo_nodes is used only in two cases:

  1. Filled during each-process initialization (mounting of pseudo-FSes), assumed to be deterministic in all processes of the same Gramine instance
  2. Queried during checkpoint-restore in the child process, to bind inodes to pseudo_nodes created in step (1)

Maybe also add a node that in the parent process, this is basically not used at all.

I added a comment, explaining what this array contains and how we use it. The assumption is explained in the header file as well, but maybe it doesn't hurt to repeat it here.


LibOS/shim/src/fs/shim_fs_pseudo.c, line 46 at r2 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

"Ideally, we would invalidate such dentries" -- we don't do it even in the new PR, right?

Right. However, the rest of the paragraph does not apply to pseudo_find (we don't even call it for dentries restored from a checkpoint). And the fact that we don't invalidate no-longer-existing files is documented in the header file already.

By the way, this is similar to the problem we discussed recently: files disappearing on the host, or modified by another process. The solution is also to mark these files (dentries) as invalid once we notice that. But this should be less problematic after inodes are introduced everywhere, and dentries are just a name-to-inode association.


LibOS/shim/src/fs/shim_fs_pseudo.c, line 64 at r2 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Do we just memory leak this object? How is it supposed to be freed / garbage collected?

Ah, I guess it will be freed in the generic checkpoint code (or I guess in the inode-checkpoint code), after it sends all out_data to the child process. Is it so?

Yes, it will be freed by the caller, and this is explained in the documentation for icheckpoint.

This interface is a bit awkward - as explained in the previous PR (#297), it is done this way so that it's self-contained. If/when we rewrite the checkpoint system, it will probably make more sense for icheckpoint/irestore to send/receive data directly "over the wire".

Copy link
Contributor Author

@pwmarcz pwmarcz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 2 files reviewed, 7 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: ITL), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)

a discussion (no related file):

Previously, pwmarcz (Paweł Marczewski) wrote…

I think that's correct.

Also, is this how the Linux kernel does it?

I haven't looked too closely. I think there is a similar layer of abstraction to our "str" files, so I would expect the value to be memoized.

And, this is speculation, but I would expect Linux to do what I'm planning to do later (see the comment about invalidating files): keep a handle associated with an inode object, even though that inode object is no longer reachable by other means.


dimakuv
dimakuv previously approved these changes Feb 15, 2022
Copy link

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r3, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions, not enough approvals from maintainers (1 more required), "fixup! " found in commit messages' one-liners (waiting on @mkow)


LibOS/shim/src/fs/shim_fs_pseudo.c, line 46 at r2 (raw file):

Previously, pwmarcz (Paweł Marczewski) wrote…

Right. However, the rest of the paragraph does not apply to pseudo_find (we don't even call it for dentries restored from a checkpoint). And the fact that we don't invalidate no-longer-existing files is documented in the header file already.

By the way, this is similar to the problem we discussed recently: files disappearing on the host, or modified by another process. The solution is also to mark these files (dentries) as invalid once we notice that. But this should be less problematic after inodes are introduced everywhere, and dentries are just a name-to-inode association.

Agreed, thanks for the explanation.

mkow
mkow previously approved these changes Feb 15, 2022
Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r3, all commit messages.
Reviewable status: all files reviewed, all discussions resolved, "fixup! " found in commit messages' one-liners


LibOS/shim/include/shim_fs_pseudo.h, line 79 at r1 (raw file):

Previously, pwmarcz (Paweł Marczewski) wrote…

Mostly I'd be afraid of doing this by mistake - so far, the initialization has some conditions, but they at most depend on the manifest.

Fair. I feel a bit uneasy about the fact that we can't assert that the trees are exactly the same in every process, but I don't have an idea for a simple solution to this.

In addition, `pseudo_node` objects are now tracked by number and
checkpointed.

Signed-off-by: Paweł Marczewski <[email protected]>
Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r4, all commit messages.
Reviewable status: all files reviewed, all discussions resolved, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: ITL)

Copy link

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved

@mkow mkow merged commit b571f2a into master Feb 15, 2022
@mkow mkow deleted the pawel/pseudo-inode branch February 15, 2022 17:26
@pwmarcz pwmarcz mentioned this pull request Feb 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants