Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PVF: Add Secure Validator Mode #2486

Merged
merged 28 commits into from
Dec 5, 2023
Merged

Conversation

mrcnski
Copy link
Contributor

@mrcnski mrcnski commented Nov 24, 2023

Summary

Running --validator without all security features available (with some exceptions) requires the --insecure-validator-i-know-what-i-do flag.

Overview

We've recently added Linux-only security features. This PR makes them "required" to enforce protection on validators (and thus the network), though there is a CLI flag to bypass the restrictions.

Screenshots

If all capabilities are present, or the insecure flag is passed

Screenshot 2023-11-24 at 14 44 30

If some optional capabilities are missing

Screenshot 2023-11-24 at 14 34 29

If some required capabilities are missing

Screenshot 2023-11-21 at 19 52 44

Screenshot 2023-11-09 at 11 20 59

Yeah, this is really ugly, because we return SubsystemError from candidate-validation. We could move the check to before subsystem initialization, like we do for the worker version check. But I believe we want subsystems to be in charge of their own initialization and requirements (see #586.)

Exceptions

Either one of the landlock or the unshare-and-change-root capability are allowed to be missing if the other is present, because either provides FS security. We decided to be lax about this as Landlock requires Linux 5.13+ and many validators are not yet on this version. See #1444 (comment).

Other

In this PR I also did the following. (It might be good to review commit-by-commit.)

  • Shutdown the worker when we can’t enable security features that we were able to enable in the host.
  • Add secure_validator_mode to SecurityStatus, pass it to workers this way.
  • Pass security options to workers over the socket instead of CLI args. (Looks much cleaner.)
  • Cleaned up a couple of old TODOs.
  • Fix check_seccomp and check_landlock errors not being logged to stderr
    (involved a bit of a refactor)

Some refactors:

  • Refactored spawn_with_program_path
  • Refactored worker data (see “Worker data” at PVF: Refactor argument passing #1576). This was driving me crazy.
  • Move change_root to its own file
  • Refactor change_root errors

TODO

  • Should we merge this for 1.5? I vote yes, but objections are welcome:

There was an informative warning that was shown to validators so that they could prepare to meet the new security requirements. However, due to #2426, we announced to validators that they need to ignore the warning for one release. We could show the fixed warning in 1.5 and merge this in 1.6, but we may not have time with on-demand coming. See release calendar.

However, we recently decided that Landlock is optional and most validators are on x86-64. Almost no validators should actually need the insecure flag, so I believe that more warning time is not needed.

  • How to abort subsystem initialization?

Currently it is pretty ugly (see screenshots above under "If some required capabilities are missing"). But I believe we want subsystems to be in charge of their own initialization and requirements (see #586.) We can move the check out of the subsystem, or make that shutdown less ugly maybe...

Related

Closes #1444
Closes #881
Closes #2495

Previous attempt: paritytech/polkadot#7073

- Shutdown the worker when we can’t enable security features that we were able to enable in the host.

- Add `secure_validator_mode` to `SecurityStatus`, pass it to workers this way.

- Pass security options to worker over the socket instead of CLI args. (Looks much cleaner.)

- Refactor `spawn_with_program_path` to only log once

- Refactored worker data (see “Worker data” at #1576). This was driving me crazy.

- Cleaned up a couple of old TODOs.
- (involved a bit of a refactor)
- Move change_root to its own file
- Refactor change_root errors
@mrcnski mrcnski added the T0-node This PR/Issue is related to the topic “node”. label Nov 24, 2023
@mrcnski mrcnski self-assigned this Nov 24, 2023
polkadot/node/core/pvf/src/security.rs Show resolved Hide resolved
polkadot/node/core/pvf/tests/it/main.rs Outdated Show resolved Hide resolved
Comment on lines +133 to +140
// Landlock is present on relatively recent Linuxes. This is optional if the unshare
// capability is present, providing FS sandboxing a different way.
CannotEnableLandlock(_) => security_status.can_unshare_user_namespace_and_change_root,
// seccomp should be present on all modern Linuxes unless it's been disabled.
CannotEnableSeccomp(_) => false,
CannotUnshareUserNamespaceAndChangeRoot(_) => false,
// Should always be present on modern Linuxes. If not, Landlock also provides FS
// sandboxing, so don't enforce this.
CannotUnshareUserNamespaceAndChangeRoot(_) => security_status.can_enable_landlock,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why we allow any security features to not be enabled in secure mode.
I view secure mode as an all-or-nothing type thing, with our safest options and recommendations. Otherwise, why bother having a secure mode when we could instead do a best-effort security and just log the errors, without killing the process.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean. Our approach has changed a bit over time, but we settled on this for mostly practical reasons. Since we have two FS sandboxing features, a validator can be missing one and still be reasonably safe. On the other hand, we "require" seccomp because it's our only networking sandboxing feature.

You can see in telemetry that a large amount of validators (like at least 1/3) is on Ubuntu 20.04.6 LTS. (A bit better than when I last checked in May, it was more like 50%.) Enforcing 5.13+ would "require" a lot of people to upgrade on a timeline that may not be convenient for them.

Remember, if a validator can't run securely he has to either upgrade his setup (which may be painful), use the insecure flag (which is not ideal), or stay on an old node version (not great either). So, we tried to minimize "breakage". We've also communicated already that only one of the FS features will be "required". We could always switch to enforcing everything, and communicate it properly in advance, but with Godzilla PolkaVM coming it may not be worth it.

A final note. Some security capabilities being present on some machines and missing on others may lead to indeterminism that can be exploited to cause disagreements/disputes/stalls. However, we are so far from determinism anyway that this was, in the end, a low priority of the security work. If we had more time, we could have done that virtualization work while iteratively making the security "requirements" more strict. And I have been putting "require" in quotes because, as Basti has said multiple times, we should not require anything. None of this is in the spec, and validators can always build from source with the barriers removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are so far from determinism anyway

To clarify, if an attacker can execute arbitrary code, there are already many other means by which he can obtain and exploit randomness. We decided that protecting the chain was not feasible with this work, and we have governance to deal with that sort of thing, so we prioritized securing validator machines.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can see in telemetry that a large amount of validators (like at least 1/3) is on Ubuntu 20.04.6 LTS

The telemetry page unfortunately show 70% of the kernel versions as "unknown" :(

According to this page, Ubuntu 20.04.5 should have at least kernel version 5.15.

use the insecure flag (which is not ideal)

Well then, what's the purpose of having the insecure flag? The insecure flag should still apply security sandboxing on a best-effort basis, is that right?

Anyway, those were my 2 cents, I won't be against merging it as it is either :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Alin has a point. Even in insecure mode, still enabling in best-effort makes senses. For secure mode: As @mrcnski said, we have multiple layers of restricting filesystem access. If we assume each and every one of them is sufficient by itself, it makes sense to say the validator is secure as long as at least one of them can be enabled.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got it from here, but I guess that is wrong. 🤷 Someone in the comments there even proved it is wrong. I guess Ubuntu "patch" versions can bump the Linux kernel version? Anyway, that is good news. I can raise a follow-up for enforcing it.

Well that's older I would assume. I assume ubuntu LTS should get updates that include kernel upgrades.

We want validators to be secure because a validator getting compromised is a big deal, but we have the constraint that we can't actually require it. So we just make it inconvenient and scary to run insecurely.

Of course, I was suggesting to encourage them to run with all layers.

If we assume each and every one of them is sufficient by itself, it makes sense to say the validator is secure as long as at least one of them can be enabled.

Agreed, but then why do we have landlock is we think chroot is enough 😛

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, but then why do we have landlock is we think chroot is enough 😛

Indeed, birdcage removed landlock because they deemed it was redundant on top of mount namespaces.

There are a few reasons we have it:

  1. A lot of this sandboxing work was a research project for me, and I found out about namespaces/pivot_root after landlock. (Still don't fully understand the former tbh.) As I understand from the above discussion, namespaces alone are merely "good enough".
  2. Landlock provides finer-grained control (didn't end up being useful for us, but gives us more flexibility for the design space in the future).
  3. Landlock is still expanding its capabilities and will only get more powerful with time.
  4. I believe it's possible that unshare/pivot_root may not be available on some systems, so it's nice to have a fallback.
  5. I don't see any downsides to having both, and defense-in-depth has been our philosophy since the beginning stages. It's also the approach advocated by the landlock dev in the above PR:

layering of mount/Landlock definitely works. I'm suggesting to use namespaces, then seccomp (locking down the availability of user namespaces and uncommon syscalls), then Landlock (partial restrictions but could still help with fine-grained access control, if supported by the running system).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it's possible that unshare/pivot_root may not be available on some systems, so it's nice to have a fallback.

I think that's highly unlikely

That being said, sounds fair 👍🏻

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @alindima!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, birdcage removed landlock because they deemed it was redundant on top of mount namespaces.

I found out that a different project, Emilua, has taken the opposite approach:

Linux namespaces are too convoluted and it became clear how inappropriate they really are to just create sandboxes. Starting from the just newly released version, Landlock and Capsicum were added as alternatives to Linux namespaces.

Their sandboxing docs are really good. It's a bit too late for my work to benefit from them, oh well.

Copy link
Contributor

@s0me0ne-unkn0wn s0me0ne-unkn0wn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good!

My only doubt is why we don't enforce the requirement for development mode (running --dev --alice on an insecure host warns about security but doesn't terminate). I understand it's for a better developer experience. But I ask myself again, how good is it that the DX diverges so much from the UX? We almost never test things in a production context during development. That may result in some undiscovered pitfalls and the wider the gap between DX and UX, the more pitfalls may fit into it. Just a general thought to think about.

@eskimor
Copy link
Member

eskimor commented Dec 1, 2023

Looks really good!

My only doubt is why we don't enforce the requirement for development mode (running --dev --alice on an insecure host warns about security but doesn't terminate). I understand it's for a better developer experience. But I ask myself again, how good is it that the DX diverges so much from the UX? We almost never test things in a production context during development. That may result in some undiscovered pitfalls and the wider the gap between DX and UX, the more pitfalls may fit into it. Just a general thought to think about.

Good remark! Especially we should notice when the UX sucks ... The problem is, some developers us MAC and there secure mode does not work - right?

@mrcnski
Copy link
Contributor Author

mrcnski commented Dec 1, 2023

We almost never test things in a production context during development.

It's a really good point, but it's not realistic for devs to test everything. I think between the merge window closing and the actual release, a dedicated person/process should be doing actual testing of the node and we should be open to backporting fixes. I didn't mention it before because I didn't want to sound like I was making lame excuses for things like this and this. But like eskimor says, I and many other devs work on a Mac. I do my due diligence by testing on a Linux VPS, but we can't expect all devs to do that (and my VPS account is root so it didn't catch some things.)

Thanks for the review! 😊

@mrcnski
Copy link
Contributor Author

mrcnski commented Dec 1, 2023

@pepoviola zombienet tests are failing. I don't know how to check it but i guess it's because the new argument is required in CI. Can you help me update zombienet in CI to include the fix?

@pepoviola
Copy link
Contributor

pepoviola commented Dec 1, 2023

@pepoviola zombienet tests are failing. I don't know how to check it but i guess it's because the new argument is required in CI. Can you help me update zombienet in CI to include the fix?

yes, we need a new version with the fix we merge in zombienet. Let me create a new one and update the version here.
(paritytech/zombienet#1571)
Thx!

@pepoviola pepoviola requested a review from a team as a code owner December 1, 2023 14:04
@pepoviola
Copy link
Contributor

@mrcnski this test fails because we are first spawning a node using the latest version from parity registry and then replacing the binary with the one built on the pr, since the former doesn't have the --insecure-validator-i-know-what-i-do the cmd generated to run the node doesn't includes this flag and then when we replace the binary the execution fails (logs). Let me check how we can modify the test to cover this case.

@mrcnski
Copy link
Contributor Author

mrcnski commented Dec 3, 2023

Thanks @pepoviola! Can we re-generate the cmd? And not to rush you, but just note that I have to press the merge button by Wednesday to get this into the next release.

@pepoviola
Copy link
Contributor

Thanks @pepoviola! Can we re-generate the cmd? And not to rush you, but just note that I have to press the merge button by Wednesday to get this into the next release.

This require some changes in zombienet, let me try to add the posibility to add/remove flags in the DSL and we can touch base tomorrow.
Thx!!

@pepoviola
Copy link
Contributor

Hi @mrcnski, last commit fix the upgrade test.
Thx!

@mrcnski mrcnski enabled auto-merge (squash) December 5, 2023 11:16
@mrcnski mrcnski merged commit c046a9d into master Dec 5, 2023
114 of 115 checks passed
@mrcnski mrcnski deleted the mrcnski/pvf-add-secure-validator-mode branch December 5, 2023 12:32
ordian added a commit that referenced this pull request Dec 11, 2023
* tsv-disabling: (155 commits)
  Fix failing rc-automation GHA (#2648)
  [ci] Return CI_IMAGE variable (#2647)
  Support querying peer reputation (#2392)
  [ci] Update rust to 1.74 (#2545)
  Relax approval requirements on CI files (#2564)
  Added AllSiblingSystemParachains matcher to be used at a parachain level (#2422)
  Improve polkadot sdk docs (#2631)
  Bridges subtree update (#2602)
  pallet-xcm: add new flexible `transfer_assets()` call/extrinsic (#2388)
  [ci] Move rust-features.sh script to .gitlab folder (#2630)
  Bump parity-db from 0.4.10 to 0.4.12 (#2635)
  sp-core: Rename VrfOutput to VrfPreOutput (#2534)
  chore: fix typo (#2596)
  Bump tracing-core from 0.1.31 to 0.1.32 (#2618)
  chore: fixed std wasm build of xcm (#2535)
  Fix PRdoc that have been previously drafted with older schema (#2623)
  Github Workflow migrations (#1574)
  Bridges update subtree (#2625)
  PVF: Add Secure Validator Mode (#2486)
  statement-distribution: Add tests for incoming acknowledgements (#2498)
  ...
ordian added a commit that referenced this pull request Dec 15, 2023
* tsv-disabling: (155 commits)
  Fix failing rc-automation GHA (#2648)
  [ci] Return CI_IMAGE variable (#2647)
  Support querying peer reputation (#2392)
  [ci] Update rust to 1.74 (#2545)
  Relax approval requirements on CI files (#2564)
  Added AllSiblingSystemParachains matcher to be used at a parachain level (#2422)
  Improve polkadot sdk docs (#2631)
  Bridges subtree update (#2602)
  pallet-xcm: add new flexible `transfer_assets()` call/extrinsic (#2388)
  [ci] Move rust-features.sh script to .gitlab folder (#2630)
  Bump parity-db from 0.4.10 to 0.4.12 (#2635)
  sp-core: Rename VrfOutput to VrfPreOutput (#2534)
  chore: fix typo (#2596)
  Bump tracing-core from 0.1.31 to 0.1.32 (#2618)
  chore: fixed std wasm build of xcm (#2535)
  Fix PRdoc that have been previously drafted with older schema (#2623)
  Github Workflow migrations (#1574)
  Bridges update subtree (#2625)
  PVF: Add Secure Validator Mode (#2486)
  statement-distribution: Add tests for incoming acknowledgements (#2498)
  ...
@matthewmarcus
Copy link

matthewmarcus commented Feb 20, 2024

Hey @mrcnski. Not sure if this is the right place to share this and thus ask for some guidance. So please pardon me if this is out of line (I'm not a regular github user).

After updating our Kusama validator to the latest version (1.7.0 - and now 1.7.1), we're getting errors (see below) similar to those you shared above in the original post. We're running our validator on a bare metal box running Ubuntu 20.04 with the latest kernel (6.7.5). Additionally, we have SECURE BOOT disabled in the BIOS.

We thought b/c we were running an older kernel version (5.15.0-88-generic) that updating to the latest version would solve our problem. Alas, it did not. We are able to run our node using the "insecure" flag, but obviously that isn't preferable and we'd like to keep our validator secure and not put the network at risk in any way.

Might you (or anyone else) be able to give us some guidance on what we need to do to alleviate our issue?

Errors:

Running validation of malicious PVF code has a higher risk of compromising this machine.
- Cannot enable landlock (ABI 1), a Linux 5.13+ kernel security feature: not available: Could not fully enable: NotEnforced
- Cannot unshare user namespace and change root, which are Linux-specific kernel security features: not available: unshare user and mount namespaces: Operation not permitted (os error 1)
- Optional: Cannot call clone with all sandboxing flags, a Linux-specific kernel security features: not available: could not clone, errno: EPERM: Operation not permitted

@mrcnski
Copy link
Contributor Author

mrcnski commented Feb 20, 2024

Off the top of my head I'm not sure what the problem could be. What architecture are you on? Only x86-64 is supported.

@s0me0ne-unkn0wn did the security code change in the latest release? Maybe @maksimryndin's refactor broke something?

@maksimryndin
Copy link
Contributor

Hey @mrcnski. Not sure if this is the right place to share this and thus ask for some guidance. So please pardon me if this is out of line (I'm not a regular github user).

After updating our Kusama validator to the latest version (1.7.0 - and now 1.7.1), we're getting errors (see below) similar to those you shared above in the original post. We're running our validator on a bare metal box running Ubuntu 20.04 with the latest kernel (6.7.5). Additionally, we have SECURE BOOT disabled in the BIOS.

We thought b/c we were running an older kernel version (5.15.0-88-generic) that updating to the latest version would solve our problem. Alas, it did not. We are able to run our node using the "insecure" flag, but obviously that isn't preferable and we'd like to keep our validator secure and not put the network at risk in any way.

Might you (or anyone else) be able to give us some guidance on what we need to do to alleviate our issue?

Errors:

Running validation of malicious PVF code has a higher risk of compromising this machine. - Cannot enable landlock (ABI 1), a Linux 5.13+ kernel security feature: not available: Could not fully enable: NotEnforced - Cannot unshare user namespace and change root, which are Linux-specific kernel security features: not available: unshare user and mount namespaces: Operation not permitted (os error 1) - Optional: Cannot call clone with all sandboxing flags, a Linux-specific kernel security features: not available: could not clone, errno: EPERM: Operation not permitted

@matthewmarcus moved the conversation to #2728 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T0-node This PR/Issue is related to the topic “node”.
Projects
Status: Completed
7 participants