PVF: Add Secure Validator Mode #2486

mrcnski · 2023-11-24T13:56:31Z

Summary

Running --validator without all security features available (with some exceptions) requires the --insecure-validator-i-know-what-i-do flag.

Overview

We've recently added Linux-only security features. This PR makes them "required" to enforce protection on validators (and thus the network), though there is a CLI flag to bypass the restrictions.

Screenshots

If all capabilities are present, or the insecure flag is passed

If some optional capabilities are missing

If some required capabilities are missing

Yeah, this is really ugly, because we return SubsystemError from candidate-validation. We could move the check to before subsystem initialization, like we do for the worker version check. But I believe we want subsystems to be in charge of their own initialization and requirements (see #586.)

Exceptions

Either one of the landlock or the unshare-and-change-root capability are allowed to be missing if the other is present, because either provides FS security. We decided to be lax about this as Landlock requires Linux 5.13+ and many validators are not yet on this version. See #1444 (comment).

Other

In this PR I also did the following. (It might be good to review commit-by-commit.)

Shutdown the worker when we can’t enable security features that we were able to enable in the host.
Add secure_validator_mode to SecurityStatus, pass it to workers this way.
Pass security options to workers over the socket instead of CLI args. (Looks much cleaner.)
Cleaned up a couple of old TODOs.
Fix check_seccomp and check_landlock errors not being logged to stderr
(involved a bit of a refactor)

Some refactors:

Refactored spawn_with_program_path
Refactored worker data (see “Worker data” at PVF: Refactor argument passing #1576). This was driving me crazy.
Move change_root to its own file
Refactor change_root errors

TODO

Should we merge this for 1.5? I vote yes, but objections are welcome:

There was an informative warning that was shown to validators so that they could prepare to meet the new security requirements. However, due to #2426, we announced to validators that they need to ignore the warning for one release. We could show the fixed warning in 1.5 and merge this in 1.6, but we may not have time with on-demand coming. See release calendar.

However, we recently decided that Landlock is optional and most validators are on x86-64. Almost no validators should actually need the insecure flag, so I believe that more warning time is not needed.

How to abort subsystem initialization?

Currently it is pretty ugly (see screenshots above under "If some required capabilities are missing"). But I believe we want subsystems to be in charge of their own initialization and requirements (see #586.) We can move the check out of the subsystem, or make that shutdown less ugly maybe...

- Shutdown the worker when we can’t enable security features that we were able to enable in the host. - Add `secure_validator_mode` to `SecurityStatus`, pass it to workers this way. - Pass security options to worker over the socket instead of CLI args. (Looks much cleaner.) - Refactor `spawn_with_program_path` to only log once - Refactored worker data (see “Worker data” at #1576). This was driving me crazy. - Cleaned up a couple of old TODOs.

- (involved a bit of a refactor) - Move change_root to its own file - Refactor change_root errors

polkadot/node/core/pvf/common/src/worker/security/mod.rs

…-mode' into mrcnski/pvf-add-secure-validator-mode

polkadot/node/core/pvf/src/security.rs

polkadot/node/core/pvf/common/src/worker/security/change_root.rs

polkadot/node/core/pvf/tests/it/main.rs

alindima · 2023-11-28T16:17:30Z

polkadot/node/core/pvf/src/security.rs

+			// Landlock is present on relatively recent Linuxes. This is optional if the unshare
+			// capability is present, providing FS sandboxing a different way.
+			CannotEnableLandlock(_) => security_status.can_unshare_user_namespace_and_change_root,
+			// seccomp should be present on all modern Linuxes unless it's been disabled.
 			CannotEnableSeccomp(_) => false,
-			CannotUnshareUserNamespaceAndChangeRoot(_) => false,
+			// Should always be present on modern Linuxes. If not, Landlock also provides FS
+			// sandboxing, so don't enforce this.
+			CannotUnshareUserNamespaceAndChangeRoot(_) => security_status.can_enable_landlock,


I don't see why we allow any security features to not be enabled in secure mode.
I view secure mode as an all-or-nothing type thing, with our safest options and recommendations. Otherwise, why bother having a secure mode when we could instead do a best-effort security and just log the errors, without killing the process.

I see what you mean. Our approach has changed a bit over time, but we settled on this for mostly practical reasons. Since we have two FS sandboxing features, a validator can be missing one and still be reasonably safe. On the other hand, we "require" seccomp because it's our only networking sandboxing feature.

You can see in telemetry that a large amount of validators (like at least 1/3) is on Ubuntu 20.04.6 LTS. (A bit better than when I last checked in May, it was more like 50%.) Enforcing 5.13+ would "require" a lot of people to upgrade on a timeline that may not be convenient for them.

Remember, if a validator can't run securely he has to either upgrade his setup (which may be painful), use the insecure flag (which is not ideal), or stay on an old node version (not great either). So, we tried to minimize "breakage". We've also communicated already that only one of the FS features will be "required". We could always switch to enforcing everything, and communicate it properly in advance, but with ~~Godzilla~~ PolkaVM coming it may not be worth it.

A final note. Some security capabilities being present on some machines and missing on others may lead to indeterminism that can be exploited to cause disagreements/disputes/stalls. However, we are so far from determinism anyway that this was, in the end, a low priority of the security work. If we had more time, we could have done that virtualization work while iteratively making the security "requirements" more strict. And I have been putting "require" in quotes because, as Basti has said multiple times, we should not require anything. None of this is in the spec, and validators can always build from source with the barriers removed.

we are so far from determinism anyway

To clarify, if an attacker can execute arbitrary code, there are already many other means by which he can obtain and exploit randomness. We decided that protecting the chain was not feasible with this work, and we have governance to deal with that sort of thing, so we prioritized securing validator machines.

You can see in telemetry that a large amount of validators (like at least 1/3) is on Ubuntu 20.04.6 LTS

The telemetry page unfortunately show 70% of the kernel versions as "unknown" :(

According to this page, Ubuntu 20.04.5 should have at least kernel version 5.15.

use the insecure flag (which is not ideal)

Well then, what's the purpose of having the insecure flag? The insecure flag should still apply security sandboxing on a best-effort basis, is that right?

Anyway, those were my 2 cents, I won't be against merging it as it is either :)

I think Alin has a point. Even in insecure mode, still enabling in best-effort makes senses. For secure mode: As @mrcnski said, we have multiple layers of restricting filesystem access. If we assume each and every one of them is sufficient by itself, it makes sense to say the validator is secure as long as at least one of them can be enabled.

I got it from here, but I guess that is wrong. 🤷 Someone in the comments there even proved it is wrong. I guess Ubuntu "patch" versions can bump the Linux kernel version? Anyway, that is good news. I can raise a follow-up for enforcing it.

Well that's older I would assume. I assume ubuntu LTS should get updates that include kernel upgrades.

We want validators to be secure because a validator getting compromised is a big deal, but we have the constraint that we can't actually require it. So we just make it inconvenient and scary to run insecurely.

Of course, I was suggesting to encourage them to run with all layers.

If we assume each and every one of them is sufficient by itself, it makes sense to say the validator is secure as long as at least one of them can be enabled.

Agreed, but then why do we have landlock is we think chroot is enough 😛

Agreed, but then why do we have landlock is we think chroot is enough 😛

Indeed, birdcage removed landlock because they deemed it was redundant on top of mount namespaces.

There are a few reasons we have it:

A lot of this sandboxing work was a research project for me, and I found out about namespaces/pivot_root after landlock. (Still don't fully understand the former tbh.) As I understand from the above discussion, namespaces alone are merely "good enough".

Landlock provides finer-grained control (didn't end up being useful for us, but gives us more flexibility for the design space in the future).

Landlock is still expanding its capabilities and will only get more powerful with time.

I believe it's possible that unshare/pivot_root may not be available on some systems, so it's nice to have a fallback.

I don't see any downsides to having both, and defense-in-depth has been our philosophy since the beginning stages. It's also the approach advocated by the landlock dev in the above PR:

layering of mount/Landlock definitely works. I'm suggesting to use namespaces, then seccomp (locking down the availability of user namespaces and uncommon syscalls), then Landlock (partial restrictions but could still help with fine-grained access control, if supported by the running system).

I believe it's possible that unshare/pivot_root may not be available on some systems, so it's nice to have a fallback.

I think that's highly unlikely

That being said, sounds fair 👍🏻

Thanks for the review @alindima!

Indeed, birdcage removed landlock because they deemed it was redundant on top of mount namespaces.

I found out that a different project, Emilua, has taken the opposite approach:

Linux namespaces are too convoluted and it became clear how inappropriate they really are to just create sandboxes. Starting from the just newly released version, Landlock and Capsicum were added as alternatives to Linux namespaces.

Their sandboxing docs are really good. It's a bit too late for my work to benefit from them, oh well.

s0me0ne-unkn0wn

Looks really good!

My only doubt is why we don't enforce the requirement for development mode (running --dev --alice on an insecure host warns about security but doesn't terminate). I understand it's for a better developer experience. But I ask myself again, how good is it that the DX diverges so much from the UX? We almost never test things in a production context during development. That may result in some undiscovered pitfalls and the wider the gap between DX and UX, the more pitfalls may fit into it. Just a general thought to think about.

eskimor · 2023-12-01T10:33:08Z

Looks really good!

My only doubt is why we don't enforce the requirement for development mode (running --dev --alice on an insecure host warns about security but doesn't terminate). I understand it's for a better developer experience. But I ask myself again, how good is it that the DX diverges so much from the UX? We almost never test things in a production context during development. That may result in some undiscovered pitfalls and the wider the gap between DX and UX, the more pitfalls may fit into it. Just a general thought to think about.

Good remark! Especially we should notice when the UX sucks ... The problem is, some developers us MAC and there secure mode does not work - right?

mrcnski · 2023-12-01T10:50:59Z

We almost never test things in a production context during development.

It's a really good point, but it's not realistic for devs to test everything. I think between the merge window closing and the actual release, a dedicated person/process should be doing actual testing of the node and we should be open to backporting fixes. I didn't mention it before because I didn't want to sound like I was making lame excuses for things like this and this. But like eskimor says, I and many other devs work on a Mac. I do my due diligence by testing on a Linux VPS, but we can't expect all devs to do that (and my VPS account is root so it didn't catch some things.)

Thanks for the review! 😊

mrcnski · 2023-12-01T13:12:35Z

@pepoviola zombienet tests are failing. I don't know how to check it but i guess it's because the new argument is required in CI. Can you help me update zombienet in CI to include the fix?

pepoviola · 2023-12-01T13:23:36Z

@pepoviola zombienet tests are failing. I don't know how to check it but i guess it's because the new argument is required in CI. Can you help me update zombienet in CI to include the fix?

yes, we need a new version with the fix we merge in zombienet. Let me create a new one and update the version here.
(paritytech/zombienet#1571)
Thx!

pepoviola · 2023-12-01T21:23:56Z

@mrcnski this test fails because we are first spawning a node using the latest version from parity registry and then replacing the binary with the one built on the pr, since the former doesn't have the --insecure-validator-i-know-what-i-do the cmd generated to run the node doesn't includes this flag and then when we replace the binary the execution fails (logs). Let me check how we can modify the test to cover this case.

mrcnski · 2023-12-03T11:02:39Z

Thanks @pepoviola! Can we re-generate the cmd? And not to rush you, but just note that I have to press the merge button by Wednesday to get this into the next release.

pepoviola · 2023-12-03T17:35:42Z

Thanks @pepoviola! Can we re-generate the cmd? And not to rush you, but just note that I have to press the merge button by Wednesday to get this into the next release.

This require some changes in zombienet, let me try to add the posibility to add/remove flags in the DSL and we can touch base tomorrow.
Thx!!

pepoviola · 2023-12-05T01:24:45Z

Hi @mrcnski, last commit fix the upgrade test.
Thx!

* tsv-disabling: (155 commits) Fix failing rc-automation GHA (#2648) [ci] Return CI_IMAGE variable (#2647) Support querying peer reputation (#2392) [ci] Update rust to 1.74 (#2545) Relax approval requirements on CI files (#2564) Added AllSiblingSystemParachains matcher to be used at a parachain level (#2422) Improve polkadot sdk docs (#2631) Bridges subtree update (#2602) pallet-xcm: add new flexible `transfer_assets()` call/extrinsic (#2388) [ci] Move rust-features.sh script to .gitlab folder (#2630) Bump parity-db from 0.4.10 to 0.4.12 (#2635) sp-core: Rename VrfOutput to VrfPreOutput (#2534) chore: fix typo (#2596) Bump tracing-core from 0.1.31 to 0.1.32 (#2618) chore: fixed std wasm build of xcm (#2535) Fix PRdoc that have been previously drafted with older schema (#2623) Github Workflow migrations (#1574) Bridges update subtree (#2625) PVF: Add Secure Validator Mode (#2486) statement-distribution: Add tests for incoming acknowledgements (#2498) ...

matthewmarcus · 2024-02-20T03:07:20Z

Hey @mrcnski. Not sure if this is the right place to share this and thus ask for some guidance. So please pardon me if this is out of line (I'm not a regular github user).

After updating our Kusama validator to the latest version (1.7.0 - and now 1.7.1), we're getting errors (see below) similar to those you shared above in the original post. We're running our validator on a bare metal box running Ubuntu 20.04 with the latest kernel (6.7.5). Additionally, we have SECURE BOOT disabled in the BIOS.

We thought b/c we were running an older kernel version (5.15.0-88-generic) that updating to the latest version would solve our problem. Alas, it did not. We are able to run our node using the "insecure" flag, but obviously that isn't preferable and we'd like to keep our validator secure and not put the network at risk in any way.

Might you (or anyone else) be able to give us some guidance on what we need to do to alleviate our issue?

Errors:

Running validation of malicious PVF code has a higher risk of compromising this machine.
- Cannot enable landlock (ABI 1), a Linux 5.13+ kernel security feature: not available: Could not fully enable: NotEnforced
- Cannot unshare user namespace and change root, which are Linux-specific kernel security features: not available: unshare user and mount namespaces: Operation not permitted (os error 1)
- Optional: Cannot call clone with all sandboxing flags, a Linux-specific kernel security features: not available: could not clone, errno: EPERM: Operation not permitted

mrcnski · 2024-02-20T11:21:06Z

Off the top of my head I'm not sure what the problem could be. What architecture are you on? Only x86-64 is supported.

@s0me0ne-unkn0wn did the security code change in the latest release? Maybe @maksimryndin's refactor broke something?

maksimryndin · 2024-02-20T12:45:45Z

Hey @mrcnski. Not sure if this is the right place to share this and thus ask for some guidance. So please pardon me if this is out of line (I'm not a regular github user).

After updating our Kusama validator to the latest version (1.7.0 - and now 1.7.1), we're getting errors (see below) similar to those you shared above in the original post. We're running our validator on a bare metal box running Ubuntu 20.04 with the latest kernel (6.7.5). Additionally, we have SECURE BOOT disabled in the BIOS.

We thought b/c we were running an older kernel version (5.15.0-88-generic) that updating to the latest version would solve our problem. Alas, it did not. We are able to run our node using the "insecure" flag, but obviously that isn't preferable and we'd like to keep our validator secure and not put the network at risk in any way.

Might you (or anyone else) be able to give us some guidance on what we need to do to alleviate our issue?

Errors:

Running validation of malicious PVF code has a higher risk of compromising this machine. - Cannot enable landlock (ABI 1), a Linux 5.13+ kernel security feature: not available: Could not fully enable: NotEnforced - Cannot unshare user namespace and change root, which are Linux-specific kernel security features: not available: unshare user and mount namespaces: Operation not permitted (os error 1) - Optional: Cannot call clone with all sandboxing flags, a Linux-specific kernel security features: not available: could not clone, errno: EPERM: Operation not permitted

@matthewmarcus moved the conversation to #2728 (comment)

mrcnski added 10 commits November 21, 2023 14:29

PVF: Add Secure Validator Mode

4cb5913

Merge branch 'master' into mrcnski/pvf-add-secure-validator-mode

c3d512c

Minor fix

f44ca5f

Merge branch 'master' into mrcnski/pvf-add-secure-validator-mode

69c117f

Make only one FS security feature required

cb91ed1

Refactor secure mode code a bit

61b80d8

Fix check_seccomp and check_landlock errs not being logged to stderr

ba03492

- (involved a bit of a refactor) - Move change_root to its own file - Refactor change_root errors

Merge branch 'master' into mrcnski/pvf-add-secure-validator-mode

0ce35ab

Fix unshare error (cannot run in multithreaded context)

ec9bd4b

mrcnski added the T0-node This PR/Issue is related to the topic “node”. label Nov 24, 2023

mrcnski requested review from eskimor, bkchr and s0me0ne-unkn0wn November 24, 2023 13:56

mrcnski self-assigned this Nov 24, 2023

Update some comments

090d2d9

mrcnski commented Nov 24, 2023

View reviewed changes

polkadot/node/core/pvf/common/src/worker/security/mod.rs Show resolved Hide resolved

mrcnski added 4 commits November 24, 2023 15:40

Some fixes

dc7961d

Fix bench

8a72556

Merge branch 'master' into mrcnski/pvf-add-secure-validator-mode

a4d173c

fix CI

6520dd8

This was referenced Nov 26, 2023

PVF: Refactor argument passing #1576

Open

Pass --insecure-validator-i-know-what-i-do flag by default paritytech/zombienet#1551

Closed

mrcnski requested a review from alindima November 27, 2023 12:55

mrcnski added 2 commits November 27, 2023 20:00

Remove old TODO

f5f11bd

Merge remote-tracking branch 'origin/mrcnski/pvf-add-secure-validator…

3aa8bd4

…-mode' into mrcnski/pvf-add-secure-validator-mode

alindima reviewed Nov 28, 2023

View reviewed changes

mrcnski added 3 commits November 29, 2023 11:47

Address some review comments

da7127c

Merge branch 'master' into mrcnski/pvf-add-secure-validator-mode

13f5e92

Update doc

26ca19c

eskimor approved these changes Dec 1, 2023

View reviewed changes

s0me0ne-unkn0wn approved these changes Dec 1, 2023

View reviewed changes

mrcnski added 2 commits December 1, 2023 11:53

Merge branch 'master' into mrcnski/pvf-add-secure-validator-mode

fccc36a

Fix rustdoc warning

0ffa527

bump zombienet version

8fa147f

pepoviola requested a review from a team as a code owner December 1, 2023 14:04

Baby's first prdoc

f1b3539

pepoviola added 2 commits December 4, 2023 20:14

Zombienet, add script to update cmd and fix upgrade-node test

5509b31

bump zombienet version

2209855

mrcnski added 2 commits December 5, 2023 11:43

Merge branch 'master' into mrcnski/pvf-add-secure-validator-mode

0469f56

Fix prdoc 😢

27b45a0

mrcnski enabled auto-merge (squash) December 5, 2023 11:16

mrcnski merged commit c046a9d into master Dec 5, 2023
114 of 115 checks passed

mrcnski deleted the mrcnski/pvf-add-secure-validator-mode branch December 5, 2023 12:32

github-actions bot mentioned this pull request Feb 19, 2024

Update substrate/polkadot/cumulus from v1.3.0 to v1.6.0 moondance-labs/tanssi#419

Closed

bkchr mentioned this pull request Feb 20, 2024

Got Essential task overseer failed error after upgrading Kusama and Polkadot validator to v1.5.0 #2728

Open

2 tasks

github-actions bot mentioned this pull request Mar 13, 2024

Update polkadot-sdk from v1.3.0 to v1.7.2 moonbeam-foundation/moonbeam#2703

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PVF: Add Secure Validator Mode #2486

PVF: Add Secure Validator Mode #2486

mrcnski commented Nov 24, 2023 •

edited

Loading

alindima Nov 28, 2023

mrcnski Nov 28, 2023

mrcnski Nov 28, 2023

alindima Nov 29, 2023

eskimor Nov 29, 2023

alindima Nov 29, 2023

mrcnski Nov 29, 2023

alindima Nov 29, 2023

mrcnski Nov 29, 2023

mrcnski Dec 5, 2023

s0me0ne-unkn0wn left a comment

eskimor commented Dec 1, 2023

mrcnski commented Dec 1, 2023

mrcnski commented Dec 1, 2023

pepoviola commented Dec 1, 2023 •

edited

Loading

pepoviola commented Dec 1, 2023

mrcnski commented Dec 3, 2023

pepoviola commented Dec 3, 2023

pepoviola commented Dec 5, 2023

matthewmarcus commented Feb 20, 2024 •

edited

Loading

mrcnski commented Feb 20, 2024

maksimryndin commented Feb 20, 2024

PVF: Add Secure Validator Mode #2486

PVF: Add Secure Validator Mode #2486

Conversation

mrcnski commented Nov 24, 2023 • edited Loading

Summary

Overview

Screenshots

If all capabilities are present, or the insecure flag is passed

If some optional capabilities are missing

If some required capabilities are missing

Exceptions

Other

TODO

Related

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

s0me0ne-unkn0wn left a comment

Choose a reason for hiding this comment

eskimor commented Dec 1, 2023

mrcnski commented Dec 1, 2023

mrcnski commented Dec 1, 2023

pepoviola commented Dec 1, 2023 • edited Loading

pepoviola commented Dec 1, 2023

mrcnski commented Dec 3, 2023

pepoviola commented Dec 3, 2023

pepoviola commented Dec 5, 2023

matthewmarcus commented Feb 20, 2024 • edited Loading

mrcnski commented Feb 20, 2024

maksimryndin commented Feb 20, 2024

mrcnski commented Nov 24, 2023 •

edited

Loading

pepoviola commented Dec 1, 2023 •

edited

Loading

matthewmarcus commented Feb 20, 2024 •

edited

Loading