Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

seccomp: force -ENOSYS as new default action #2737

Closed
wants to merge 2 commits into from
Closed

seccomp: force -ENOSYS as new default action #2737

wants to merge 2 commits into from

Conversation

cyphar
Copy link
Member

@cyphar cyphar commented Jan 14, 2021

Having -EPERM is the default was a fairly significant mistake from a
future-proofing standpoint in that it makes any new syscall return a
non-ignorable error (from glibc's point of view). We need to correct
this now because faccessat2(2) is something glibc critically needs to
have support for, but they're blocked on container runtimes because we
return -EPERM unconditionally (leading to confusion in glibc).

Unfortunately there are several issues which stop us from having a clean
solution to this problem:

  1. libseccomp has several limitations which require us to emulate
    behaviour we want:

    a. We cannot do logic based on syscall number, meaning we cannot
    specify a "largest known syscall number";
    b. libseccomp doesn't know in which kernel version a syscall was
    added, and has no API for "minimum kernel version" so we cannot
    simply ask libseccomp to generate sane -ENOSYS rules for us.
    c. Additional seccomp rules for the same syscall are not treated as
    distinct rules -- if rules overlap, seccomp will merge them. This
    means we cannot add per-syscall -EPERM fallbacks;
    d. There is no inverse operation for SCMP_CMP_MASKED_EQ;
    e. libseccomp does not allow you to specify multiple rules for a
    single argument, making it impossible to invert OR rules for
    arguments.

  2. The runtime-spec does not have any way of specifying:

    a. The errno for the default action;
    b. The minimum kernel version or "newest syscall at time of profile
    creation"; nor
    c. Which syscalls were intentionally excluded from the allow list
    (weird syscalls that are no longer used were excluded entirely,
    but Docker et al expect those syscalls to get EPERM not ENOSYS).

  3. Certain syscalls should not return -ENOSYS (especially only for
    certain argument combinations) because this could also trigger glibc
    confusion. This means we have to return -EPERM for certain syscalls
    but not as a global default.

  4. There is not an obvious (and reasonable) upper limit to syscall
    numbers, so we cannot create a set of rules for each syscall above
    the largest syscall number in libseccomp. This means we must handle
    inverse rules as described below.

  5. Any syscall can be specified multiple times, which can make
    generation of hotfix rules much harder.

As a result, we have to work around all of these things by coming up
with a heuristic to stop the bleeding. In the future we could hopefully
improve the situation in the runtime-spec and libseccomp.

The solution is to switch the default action errno to -ENOSYS, and
simply pick a Linux kernel version which is treated as the universal
minimum version (Linux 3.0 was picked since basically no widely-used
distribution ships an older kernel, and it contains enough common
syscalls which would cause confusion if -ENOSYS was returned).

We then generate -EPERM rules for each syscall which:

  1. Existed in Linux 3.0 (read: has a syscall number smaller than
    "setns" on that platform) and does not have any explicit rules in
    the configuration. These syscalls get an unconditional -EPERM,
    because they are assumed to have been excluded intentionally.

  2. Have an explicit conditional rule. These syscalls get a set of
    -EPERM rules which invert the logic of the conditional, to ensure
    that syscalls which are only partially permitted do not return
    -ENOSYS.

    This is trivial for all operations except SCMP_CMP_MASKED_EQ,
    which has no inverse operatin. For MASKED_EQ we generate a series of
    MASKED_EQ rules except for the value which was checked in the
    original rule.

    There are additional complications, which are all treated as simply
    "too hard" and will thus fall back to -ENOSYS:

    a. Any syscall specified in more than one seccomp rule is treated as
    "too hard to handle" and will be ignored. This is because they
    are logical OR rules but things become far too complicated if the
    action is non-trivial.
    b. Syscalls which have OR rules (the same argument specified
    multiple times) because it would require AND support in
    libseccomp, which doesn't exist.
    c. Syscalls which do not have allow rules (ACT_LOG or ACT_ALLOW),
    because other actions have different logical meanings to allow
    rules and it'd be too complicated to handle those as well for a
    hotfix.

None of the above happens if the default action is not SCMP_RET_ERRNO.

Fixes #2151
Signed-off-by: Aleksa Sarai [email protected]

@cyphar
Copy link
Member Author

cyphar commented Jan 14, 2021

I'm not sure if this works in all scenarios because I haven't tested this thoroughly yet, but I'm pushing this so folks can comment.

/cc @richfelker @jethrob @fweimer Does this seem reasonable-ish to you? This is more complicated than I'd like but it's a result of several legacy aspects of runc's seccomp code (one of which is #2735 -- and is probably a bug).

@richfelker
Copy link

@cyphar: I still don't entirely understand the need for inverse rules rather than a well-defined fallback order, but if the latter is hard, I guess they work. At least the patch does not look too large/invasive and the high-level explanation of what it's doing seems to be reasonable and in alignment with what I expected.

@cyphar
Copy link
Member Author

cyphar commented Jan 14, 2021

@richfelker As far as I understand, libseccomp doesn't have per-syscall fallback order (or at least it's not guaranteed to do so) -- if you add an unconditional -EPERM rule for a syscall it will remove any other rules present for that syscall. I'm pretty sure that the code for generating this is imperfect so we could probably hack around it (by doing MASKED_EQ of 0 to 0) but it seems a bit like playing with fire to do something like that.

Once seccomp/libseccomp#286 is resolved we can switch away from this workaround.

@cyphar
Copy link
Member Author

cyphar commented Jan 15, 2021

It appears that libseccomp doesn't support a single rule referencing the same argument more than once (this is enforced by db_col_rule_add but appears to be undocumented). This means that we cannot invert OR-ified rules properly (as per De-Morgan's laws, a negation of OR statements becomes an AND statement of negations).

Just to make sure, I've tried adding a basic fallback rule (even with the MASKED_EQ) trick and as far as I can tell it does override existing rules -- though strangely it only overrides conditional rules, and non-conditional rules are not overridden (my guess it's because they are treated separately internally). So my comment above about per-syscall overrides is also correct.

I'm starting to get a creeping feeling that this might've been simpler to implement in BPF without any other workarounds.

/cc @pcmoore Am I misunderstanding libseccomp, or are these real restrictions on filters created with libseccomp?

@cyphar
Copy link
Member Author

cyphar commented Jan 15, 2021

Okay this latest version handles things pretty well. The main thing which we cannot handle is OR rules for arguments (because of the above libseccomp restriction) but this doesn't impact the default Docker rules and thus is not a major issue -- besides I would argue the current implementation is actually a bug (see #2735).

All-in-all, I've tested this with a variety of cases and it does appear to work correctly. I'm marking this as ready for review -- PTAL @opencontainers/runc-maintainers. I'd also appreciate a review from @justincormack et al.

@cyphar cyphar marked this pull request as ready for review January 15, 2021 07:18
Having -EPERM is the default was a fairly significant mistake from a
future-proofing standpoint in that it makes any new syscall return a
non-ignorable error (from glibc's point of view). We need to correct
this now because faccessat2(2) is something glibc *critically* needs to
have support for, but they're blocked on container runtimes because we
return -EPERM unconditionally (leading to confusion in glibc).

Unfortunately there are several issues which stop us from having a clean
solution to this problem:

 1. libseccomp has several limitations which require us to emulate
    behaviour we want:

    a. We cannot do logic based on syscall number, meaning we cannot
       specify a "largest known syscall number";
    b. libseccomp doesn't know in which kernel version a syscall was
       added, and has no API for "minimum kernel version" so we cannot
       simply ask libseccomp to generate sane -ENOSYS rules for us.
    c. Additional seccomp rules for the same syscall are not treated as
       distinct rules -- if rules overlap, seccomp will merge them. This
       means we cannot add per-syscall -EPERM fallbacks;
    d. There is no inverse operation for SCMP_CMP_MASKED_EQ;
    e. libseccomp does not allow you to specify multiple rules for a
       single argument, making it impossible to invert OR rules for
       arguments.

 2. The runtime-spec does not have any way of specifying:

    a. The errno for the default action;
    b. The minimum kernel version or "newest syscall at time of profile
       creation"; nor
    c. Which syscalls were intentionally excluded from the allow list
       (weird syscalls that are no longer used were excluded entirely,
       but Docker et al expect those syscalls to get EPERM not ENOSYS).

 3. Certain syscalls should not return -ENOSYS (especially only for
    certain argument combinations) because this could also trigger glibc
    confusion. This means we have to return -EPERM for certain syscalls
    but not as a global default.

 4. There is not an obvious (and reasonable) upper limit to syscall
    numbers, so we cannot create a set of rules for each syscall above
    the largest syscall number in libseccomp. This means we must handle
    inverse rules as described below.

 5. Any syscall can be specified multiple times, which can make
    generation of hotfix rules much harder.

As a result, we have to work around all of these things by coming up
with a heuristic to stop the bleeding. In the future we could hopefully
improve the situation in the runtime-spec and libseccomp.

The solution is to switch the default action errno to -ENOSYS, and
simply pick a Linux kernel version which is treated as the universal
minimum version (Linux 3.0 was picked since basically no widely-used
distribution ships an older kernel, and it contains enough common
syscalls which would cause confusion if -ENOSYS was returned).

We then generate -EPERM rules for each syscall which:

 1. Existed in Linux 3.0 (read: has a syscall number smaller than
    "setns" on that platform) and does not have any explicit rules in
    the configuration. These syscalls get an unconditional -EPERM,
    because they are assumed to have been excluded intentionally.

 2. Have an explicit *conditional* rule. These syscalls get a set of
    -EPERM rules which invert the logic of the conditional, to ensure
    that syscalls which are only partially permitted do not return
    -ENOSYS.

    This is trivial for all operations *except* SCMP_CMP_MASKED_EQ,
    which has no inverse operatin. For MASKED_EQ we generate a series of
    MASKED_EQ rules except for the value which was checked in the
    original rule.

    There are additional complications, which are all treated as simply
    "too hard" and will thus fall back to -ENOSYS:

    a. Any syscall specified in more than one seccomp rule is treated as
       "too hard to handle" and will be ignored. This is because they
       are logical OR rules but things become far too complicated if the
       action is non-trivial.
    b. Syscalls which have OR rules (the same argument specified
       multiple times) because it would require AND support in
       libseccomp, which doesn't exist.
    c. Syscalls which do not have allow rules (ACT_LOG or ACT_ALLOW),
       because other actions have different logical meanings to allow
       rules and it'd be too complicated to handle those as well for a
       hotfix.

None of the above happens if the default action is not SCMP_RET_ERRNO.

Signed-off-by: Aleksa Sarai <[email protected]>
@cyphar cyphar requested review from AkihiroSuda and a team January 15, 2021 08:11
@cyphar
Copy link
Member Author

cyphar commented Jan 15, 2021

I have a test program I've been using, but I'm not sure what's the nicest way to convert it to an integration test:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>

#include <sys/syscall.h>
#include <linux/openat2.h>

int openat2(int dfd, const char *path, struct open_how *how, size_t usize)
{
	int ret = syscall(SYS_openat2, dfd, path, how, usize);
	if (ret < 0)
		ret = -errno;
	return ret;
}

int pidfd_open(pid_t pid, unsigned int flags)
{
	int ret = syscall(SYS_pidfd_open, pid, flags);
	if (ret < 0)
		ret = -errno;
	return ret;
}

int _dup3(int oldfd, int newfd, int flags)
{
	int ret = dup3(oldfd, newfd, flags);
	if (ret < 0)
		ret = -errno;
	return ret;
}

int fake_syscall(void)
{
	int ret = syscall(1000);
	if (ret < 0)
		ret = -errno;
	return ret;
}

#define debug_assert(lval, rval)											\
	do {																	\
		int L = (lval), R = (rval);											\
		if (L != R) {														\
			printf("debug_assert(%s == %s) failed: %d != %d\n", #lval, #rval, L, R); \
		}																	\
	} while (0)

int main(void)
{
	// Basic rule.

	// Multiple arguments with OR rules.
	debug_assert(openat2(100, NULL, NULL, 0), -EINVAL);
	debug_assert(openat2(9001, NULL, NULL, 0), -EINVAL);
	debug_assert(openat2(0, NULL, NULL, 0), -ENOSYS); // :(
	debug_assert(openat2(0, NULL, NULL, 0), -ENOSYS); // :(

	// Multiple arguments with AND rules.
	debug_assert(pidfd_open(1, 1337), -EINVAL);
	debug_assert(pidfd_open(1, 0), -EPERM);
	debug_assert(pidfd_open(500, 1337), -EPERM);
	debug_assert(pidfd_open(500, 500), -EPERM);

	// Multiple rules for the same syscall.
	debug_assert(_dup3(0, -100, 0xFFFF), -ENOSYS);
	debug_assert(_dup3(1, -100, 0xFFFF), -EINVAL);
	debug_assert(_dup3(2, -100, 0xFFFF), -ENOSYS);
	debug_assert(_dup3(3, -100, 0xFFFF), -EINVAL);

	// Out-of-bounds syscall.
	debug_assert(fake_syscall(), -ENOSYS);
}

With the seccomp section in config.json being set to:

config.json
		"seccomp": {
			"defaultAction": "SCMP_ACT_ERRNO",
			"architectures": [
				"SCMP_ARCH_X86",
				"SCMP_ARCH_X32",
				"SCMP_ARCH_X86_64"
			],
			"syscalls": [
				{
					"action": "SCMP_ACT_ALLOW",
					"names": [
						"accept",
						"accept4",
						"access",
						"adjtimex",
						"alarm",
						"arch_prctl",
						"bind",
						"brk",
						"capget",
						"capset",
						"chdir",
						"chmod",
						"chown",
						"chown32",
						"clock_adjtime",
						"clock_adjtime64",
						"clock_getres",
						"clock_getres_time64",
						"clock_gettime",
						"clock_gettime64",
						"clock_nanosleep",
						"clock_nanosleep_time64",
						"close",
						"connect",
						"copy_file_range",
						"creat",
						"dup",
						"dup2",
						"epoll_create",
						"epoll_create1",
						"epoll_ctl",
						"epoll_ctl_old",
						"epoll_pwait",
						"epoll_wait",
						"epoll_wait_old",
						"eventfd",
						"eventfd2",
						"execve",
						"execveat",
						"exit",
						"exit_group",
						"faccessat",
						"faccessat2",
						"fadvise64",
						"fadvise64_64",
						"fallocate",
						"fanotify_mark",
						"fchdir",
						"fchmod",
						"fchmodat",
						"fchown",
						"fchown32",
						"fchownat",
						"fcntl",
						"fcntl64",
						"fdatasync",
						"fgetxattr",
						"flistxattr",
						"flock",
						"fork",
						"fremovexattr",
						"fsetxattr",
						"fstat",
						"fstat64",
						"fstatat64",
						"fstatfs",
						"fstatfs64",
						"fsync",
						"ftruncate",
						"ftruncate64",
						"futex",
						"futex_time64",
						"futimesat",
						"getcpu",
						"getcwd",
						"getdents",
						"getdents64",
						"getegid",
						"getegid32",
						"geteuid",
						"geteuid32",
						"getgid",
						"getgid32",
						"getgroups",
						"getgroups32",
						"getitimer",
						"getpeername",
						"getpgid",
						"getpgrp",
						"getpid",
						"getppid",
						"getpriority",
						"getrandom",
						"getresgid",
						"getresgid32",
						"getresuid",
						"getresuid32",
						"getrlimit",
						"get_robust_list",
						"getrusage",
						"getsid",
						"getsockname",
						"getsockopt",
						"get_thread_area",
						"gettid",
						"gettimeofday",
						"getuid",
						"getuid32",
						"getxattr",
						"inotify_add_watch",
						"inotify_init",
						"inotify_init1",
						"inotify_rm_watch",
						"io_cancel",
						"ioctl",
						"io_destroy",
						"io_getevents",
						"io_pgetevents",
						"io_pgetevents_time64",
						"ioprio_get",
						"ioprio_set",
						"io_setup",
						"io_submit",
						"io_uring_enter",
						"io_uring_register",
						"io_uring_setup",
						"ipc",
						"kill",
						"lchown",
						"lchown32",
						"lgetxattr",
						"link",
						"linkat",
						"listen",
						"listxattr",
						"llistxattr",
						"_llseek",
						"lremovexattr",
						"lseek",
						"lsetxattr",
						"lstat",
						"lstat64",
						"madvise",
						"membarrier",
						"memfd_create",
						"mincore",
						"mkdir",
						"mkdirat",
						"mknod",
						"mknodat",
						"mlock",
						"mlock2",
						"mlockall",
						"mmap",
						"mmap2",
						"modify_ldt",
						"mprotect",
						"mq_getsetattr",
						"mq_notify",
						"mq_open",
						"mq_timedreceive",
						"mq_timedreceive_time64",
						"mq_timedsend",
						"mq_timedsend_time64",
						"mq_unlink",
						"mremap",
						"msgctl",
						"msgget",
						"msgrcv",
						"msgsnd",
						"msync",
						"munlock",
						"munlockall",
						"munmap",
						"nanosleep",
						"newfstatat",
						"_newselect",
						"open",
						"openat",
						"pause",
						"pidfd_send_signal",
						"pipe",
						"pipe2",
						"poll",
						"ppoll",
						"ppoll_time64",
						"prctl",
						"pread64",
						"preadv",
						"preadv2",
						"prlimit64",
						"pselect6",
						"pselect6_time64",
						"pwrite64",
						"pwritev",
						"pwritev2",
						"read",
						"readahead",
						"readlink",
						"readlinkat",
						"readv",
						"recv",
						"recvfrom",
						"recvmmsg",
						"recvmmsg_time64",
						"recvmsg",
						"remap_file_pages",
						"removexattr",
						"rename",
						"renameat",
						"renameat2",
						"restart_syscall",
						"rmdir",
						"rseq",
						"rt_sigaction",
						"rt_sigpending",
						"rt_sigprocmask",
						"rt_sigqueueinfo",
						"rt_sigreturn",
						"rt_sigsuspend",
						"rt_sigtimedwait",
						"rt_sigtimedwait_time64",
						"rt_tgsigqueueinfo",
						"sched_getaffinity",
						"sched_getattr",
						"sched_getparam",
						"sched_get_priority_max",
						"sched_get_priority_min",
						"sched_getscheduler",
						"sched_rr_get_interval",
						"sched_rr_get_interval_time64",
						"sched_setaffinity",
						"sched_setattr",
						"sched_setparam",
						"sched_setscheduler",
						"sched_yield",
						"seccomp",
						"select",
						"semctl",
						"semget",
						"semop",
						"semtimedop",
						"semtimedop_time64",
						"send",
						"sendfile",
						"sendfile64",
						"sendmmsg",
						"sendmsg",
						"sendto",
						"setfsgid",
						"setfsgid32",
						"setfsuid",
						"setfsuid32",
						"setgid",
						"setgid32",
						"setgroups",
						"setgroups32",
						"setitimer",
						"setpgid",
						"setpriority",
						"setregid",
						"setregid32",
						"setresgid",
						"setresgid32",
						"setresuid",
						"setresuid32",
						"setreuid",
						"setreuid32",
						"setrlimit",
						"set_robust_list",
						"setsid",
						"setsockopt",
						"set_thread_area",
						"set_tid_address",
						"setuid",
						"setuid32",
						"setxattr",
						"shmat",
						"shmctl",
						"shmdt",
						"shmget",
						"shutdown",
						"sigaltstack",
						"signalfd",
						"signalfd4",
						"sigprocmask",
						"sigreturn",
						"socket",
						"socketcall",
						"socketpair",
						"splice",
						"stat",
						"stat64",
						"statfs",
						"statfs64",
						"statx",
						"symlink",
						"symlinkat",
						"sync",
						"sync_file_range",
						"syncfs",
						"sysinfo",
						"tee",
						"tgkill",
						"time",
						"timer_create",
						"timer_delete",
						"timer_getoverrun",
						"timer_gettime",
						"timer_gettime64",
						"timer_settime",
						"timer_settime64",
						"timerfd_create",
						"timerfd_gettime",
						"timerfd_gettime64",
						"timerfd_settime",
						"timerfd_settime64",
						"times",
						"tkill",
						"truncate",
						"truncate64",
						"ugetrlimit",
						"umask",
						"uname",
						"unlink",
						"unlinkat",
						"utime",
						"utimensat",
						"utimensat_time64",
						"utimes",
						"vfork",
						"vmsplice",
						"wait4",
						"waitid",
						"waitpid",
						"write",
						"writev"
					]
				},
				{
					"action": "SCMP_ACT_ALLOW",
					"names": [
						"dup3"
					],
					"args": [
						{
							"index": 0,
							"value": 1,
							"op": "SCMP_CMP_EQ"
						}
					]
				},
				{
					"action": "SCMP_ACT_ALLOW",
					"names": [
						"dup3"
					],
					"args": [
						{
							"index": 0,
							"value": 2,
							"op": "SCMP_CMP_GT"
						}
					]
				},
				{
					"action": "SCMP_ACT_ALLOW",
					"names": [
						"pidfd_open"
					],
					"args": [
						{
							"index": 0,
							"value": 1,
							"op": "SCMP_CMP_EQ"
						},
						{
							"index": 1,
							"value": 1337,
							"op": "SCMP_CMP_EQ"
						}
					]
				},
				{
					"action": "SCMP_ACT_ALLOW",
					"names": [
						"openat2"
					],
					"args": [
						{
							"index": 0,
							"value": 100,
							"op": "SCMP_CMP_EQ"
						},
						{
							"index": 0,
							"value": 9001,
							"op": "SCMP_CMP_EQ"
						}
					]
				},
				{
					"action": "SCMP_ACT_ALLOW",
					"names": [
						"clone"
					],
					"args": [
						{
							"index": 0,
							"value": 2114060288,
							"op": "SCMP_CMP_MASKED_EQ"
						}
					]
				}
			]
		}

@giuseppe
Copy link
Member

it seems to work well with the default Podman seccomp profile, that has some weird handling for socket: https://github.com/containers/common/blob/master/pkg/seccomp/seccomp.json#L791-L897

@AkihiroSuda
Copy link
Member

I have a test program I've been using, but I'm not sure what's the nicest way to convert it to an integration test:

I think we can just compile and run the C code from bats

@cyphar
Copy link
Member Author

cyphar commented Jan 15, 2021

@giuseppe That example is actually ignored entirely by the workaround code -- you will get -ENOSYS as a fallback because we can't generate a proper rule-set for the fallback. Hopefully that doesn't break anything 🤞.

@cyphar
Copy link
Member Author

cyphar commented Jan 15, 2021

I hit my first -ENOSYS issue -- renameat2 in glibc hides -ENOSYS and will return a fake -EINVAL if you have non-zero flag arguments. I've worked around it but boy do I really hope there aren't many more regressions like this...

@giuseppe
Copy link
Member

@giuseppe That example is actually ignored entirely by the workaround code -- you will get -ENOSYS as a fallback because we can't generate a proper rule-set for the fallback. Hopefully that doesn't break anything crossed_fingers.

I see. I guess it works because socket doesn't rely on the default return value in this case, and it will be allowed by the next rule.

Differently, both personality and clone get -ENOSYS by default, but I think these syscalls are more difficult to cause issues

@cyphar
Copy link
Member Author

cyphar commented Jan 15, 2021

clone shouldn't be getting ENOSYS -- while the podman/docker profile has multiple references to it, only one rule is actually generated (at least according to my tests). clone should still be getting EPERM but yeah the personality configuration appears to trigger ENOSYS too...

@fweimer-rh
Copy link

I hit my first -ENOSYS issue -- renameat2 in glibc hides -ENOSYS and will return a fake -EINVAL if you have non-zero flag arguments. I've worked around it but boy do I really hope there aren't many more regressions like this...

mlock2 does this as well. The idea is to mirror what the kernel would do if it encounters an unknown/unsupported flag. If the system call does not have a flag argument, then all flags are unsupported.

I don't see how this could be a problem. Could you please elaborate?

You can make a direct system call using syscall (SYS_renameat2, …) to avoid this, but SYS_renameat2 might not be defined if it's missing from the kernel headers (likewise the renameat2 function is only available since glibc 2.28).

@cyphar
Copy link
Member Author

cyphar commented Jan 15, 2021

@fweimer-rh The (potential) issue is that programs which used to get -EPERM when a seccomp profile was violated (such as "no clone(CLONE_NEWUSER) for containers") may now start getting strange errors like -EINVAL or -ENOENT depending on the glibc fallback code. While it is still an error, it is possible that userspace programs have fallbacks for -EPERM but not for -EINVAL (or -ENOSYS). runc has fallback code that uses os.IsPermission which won't catch -EINVAL nor -ENOENT.

I don't think there is a nice solution for this outside of making libseccomp be able to handle this whole problem more nicely, but it is something that does worry me a fair bit...

@fweimer-rh
Copy link

@cyphar The EINVAL behavior was specifically motivated by the existing kernel behavior. clone(2) says this:

EINVAL

CLONE_NEWUSER was specified in the flags mask, but the kernel was not configured with the CONFIG_USER_NS option.

My mental model is that seccomp is essentially deconfiguring otherwise supported kernel interfaces, so returning EINVAL for filtered flags is not wrong per se.

(Note that glibc doesn't do anything like that with clone3 currently, so this is hypothetical at this point, but clone(2) is fairly well-documented, so it's a good example.)

@cyphar
Copy link
Member Author

cyphar commented Jan 15, 2021

@fweimer-rh

My mental model is that seccomp is essentially deconfiguring otherwise supported kernel interfaces, so returning EINVAL for filtered flags is not wrong per se.

I appreciate this is a bit of a philosophical discussion, but my mental model is that we are acting like an ad-hoc MAC or LSM-like system which is explicitly restricting features. This isn't true for unknown syscalls, but it is definitely true for cases like clone(CLONE_NEWUSER).

@pcmoore
Copy link

pcmoore commented Jan 15, 2021

It appears that libseccomp doesn't support a single rule referencing the same argument more than once (this is enforced by db_col_rule_add but appears to be undocumented).

My hope was that the seccomp_rule_add(3) manpage made this clear (excerpt below), suggestions on how to improve the documentation are always welcome.

When adding syscall argument comparisons to the filter it is important to remember that while it is possible to have multiple comparisons in a single rule, you can only compare each argument once in a single rule. In other words, you can not have multiple comparisons of the 3rd syscall argument in a single rule.

@cyphar
Copy link
Member Author

cyphar commented Jan 15, 2021

@pcmoore It seems that text is only available in man '3+1' seccomp_rule_add on my distribution, not man 3 secomp_rule_add. So it is documented, just not in the version of the documentation I was reading. 😬

@cyphar
Copy link
Member Author

cyphar commented Jan 15, 2021

Seems like the --debug test is flaky, it's failed about 30% of the time I've re-run CI on this PR...

@cyphar
Copy link
Member Author

cyphar commented Jan 15, 2021

Hmm, now the test is failing on CentOS 7 with no warnings or any output other than runc run exiting with a status of 218. Oh and it looks like the GHA integration tests aren't catching test failures under Vagrant? 😬

@pcmoore
Copy link

pcmoore commented Jan 15, 2021

@pcmoore It seems that text is only available in man '3+1' seccomp_rule_add on my distribution, not man 3 secomp_rule_add. So it is documented, just not in the version of the documentation I was reading.

Sometimes it's easy to miss things in the docs too; I've done it more times than I care to remember.

The man 3+1 <blah> is new to me, and a quick search through man's manpage doesn't show anything jumping out. Odd. Out of curiosity, what distro are you using?

@justincormack
Copy link
Contributor

This seems totally the wrong place to do this, runc is (sigh) just doing what it is told and if you have to second guess its behaviour too thats even worse. There are only a few seccomp policies in the world, just fix them.

@richfelker
Copy link

It appears that libseccomp doesn't support a single rule referencing the same argument more than once (this is enforced by db_col_rule_add but appears to be undocumented).

My hope was that the seccomp_rule_add(3) manpage made this clear (excerpt below), suggestions on how to improve the documentation are always welcome.

I don't think this is a documentation problem. The underlying problem is that the DSL/libseccomp API for this is just utterly misdesigned in a way that's anti-intuitive, harder to use than, and less powerful than, raw BPF. Is there no way it can be fixed to be somewhat less bad while maintaining compatibility with important caller expectations? If not, can the necessary fixups just be done on the resulting BPF rather than via the deficient libseccomp API?

@pcmoore
Copy link

pcmoore commented Jan 15, 2021

I don't think this is a documentation problem. The underlying problem is that the DSL/libseccomp API for this is just utterly misdesigned in a way that's anti-intuitive, harder to use than, and less powerful than, raw BPF. Is there no way it can be fixed to be somewhat less bad while maintaining compatibility with important caller expectations? If not, can the necessary fixups just be done on the resulting BPF rather than via the deficient libseccomp API?

That libseccomp library is just awful isn't it? As I understand it the libseccomp maintainers are open to accepting patches to improve the library; which is good, because they are clearly a bunch of idiots who don't know what they are doing.

Alternatively you could simply write your own seccomp filters in raw BPF, that's clearly what a real programmer would do in this case. The libseccomp maintainers openly say that libseccomp is not for everyone or every use case; they say the goal is to lower the bar for the common seccomp use cases in hopes that more applications/developers will make use of seccomp. However, what they should really be saying is that libseccomp is always worse than writing raw BPF and only idiots should use it.

@cyphar
Copy link
Member Author

cyphar commented Jan 16, 2021

@pcmoore

The man 3+1 is new to me, and a quick search through man's manpage doesn't show anything jumping out. Odd. Out of curiosity, what distro are you using?

openSUSE. The seccomp_rule_add(3+1) man page is an entirely separate doc and has several paragraphs that are missing from seccomp_rule_add(3).

@justincormack

This seems totally the wrong place to do this, runc is (sigh) just doing what it is told and if you have to second guess its behaviour too thats even worse. There are only a few seccomp policies in the world, just fix them.

So switch the default errno to -ENOSYS and then add explicit -EPERM rules? The issue is that the current implementation of this is a workaround (I would hope it would only exist for a few months at most) -- the long-term goal would be that seccomp profiles indicate what their minimum kernel version is, so that the -EPERM/-ENOSYS split is not handled by the profile author explicitly.

The other (potentially simpler) solution is to temporarily implement this in BPF ourselves until libseccomp has the minimum kernel feature we need (there are some other features that would be nice like supporting multiple AND rules for the same argument, but they aren't mandatory).

You also can't specify the return value in the default action, so we will need to change runc somehow.

Vagrantfile.centos7 Outdated Show resolved Hide resolved
@cyphar
Copy link
Member Author

cyphar commented Jan 16, 2021

@richfelker

If not, can the necessary fixups just be done on the resulting BPF rather than via the deficient libseccomp API?

Leaving aside the way you phrased your point, I just tried my hand at this and it's at least somewhat non-trivial to get right. Because libseccomp doesn't have per-syscall fallback paths everything goes through a final reta instruction at the end (except for per-syscall SECCOMP_RET_ERRNO rules maybe?) -- the result is that we would need to figure out how to disentangle the different failures when patching. We would also need to do this in an architecture-specific way and we'd still have issues with holes in the syscall table.

All of this is far more complicated than just making the default -ENOSYS and returning -EPERM for syscalls present in Linux 3.0 but not mentioned in the filter. For the cases where we are returning -ENOSYS but we previously returned -EPERM, this PR does handle more of those cases but not all of them -- and ideally this hotfix should be as simple as possible without breaking containers.


So for now it may just be simplest to do the most basic -ENOSYS switch with trivial -EPERM handling (return -EPERM for any syscall not mentioned in the profile at all but was present in Linux 3.0). I think that this PR will lead to less breakages than the most trivial -EPERM handling, but maybe we should see what breaks before we commit to more complicated filter modifications.

After that I can work on cooking up our own BPF generation that handles these cases in the way we would like them to be handled. It is a bit sad we'll have to reinvent most of libseccomp's architecture handling, but we will still use the libseccomp syscall name lookup and architecture information to avoid having to handle all of that ourselves too... And we will need kernel version information anyway (which I will work on implementing in libseccomp).

I do think long-term we could switch back to libseccomp but there are a few other limitations we would ideally like to have fixed in libseccomp (the multiple-rule-for-a-single-argument restriction would be great to remove -- separately from this whole issue).

The slightly positive thing is that we can use the golang.org/x/net/bpf emulation code to do unit testing of our generated programs.

@cyphar
Copy link
Member Author

cyphar commented Jan 17, 2021

#2746 is a hotfix variant of this PR that only does -EPERM injection for the "easy" case. I am working on the custom BPF generation separately...

@giuseppe
Copy link
Member

You also can't specify the return value in the default action, so we will need to change runc somehow.

do you think it makes sense to update the runtime specs now to allow overriding the default action?

@cyphar
Copy link
Member Author

cyphar commented Jan 17, 2021

@giuseppe I assume you meant the default errno for the default action? That's probably not a bad idea to do anyway, but that's not actually enough to solve this problem -- ideally the ENOSYS behaviour should be auto-generated by the container runtime because it'd be too complicated for users of runc to generate it (not to mention the whole thing is a footgun).

So we will also need to add either a "minimal kernel version" field to the spec or otherwise define the -ENOSYS behaviour we need.

There are also several aspects of runc's seccomp implementation that aren't really in line with the runtime-spec -- such as #2735 (though the runtime-spec's definition of seccomp rules is woefully vague so you could argue that it's compliant purely because it's so ill-defined). We need to fix those too.

@giuseppe
Copy link
Member

@giuseppe I assume you meant the default errno for the default action? That's probably not a bad idea to do anyway, but that's not actually enough to solve this problem -- ideally the ENOSYS behaviour should be auto-generated by the container runtime because it'd be too complicated for users of runc to generate it (not to mention the whole thing is a footgun).

yes, I was referring to the errno for the default action, but perhaps it is not necessary with the approach in #2746? Would exposing lastSyscall (and default to setns if it is not specified) be better?

@cyphar
Copy link
Member Author

cyphar commented Jan 18, 2021

@giuseppe A better solution is something more like seccomp/libseccomp#11 -- you specify the minimum kernel version of the profile -- any syscalls newer than that kernel version get -ENOSYS rather than -EPERM. In principle you don't really need lastSyscall (or even a "minimum kernel version") because that information is already present in the profile (the newest syscall in the profile is the latest one) -- but in order for this to be robust we need kernel version information in libseccomp (syscalls aren't always added sequentially to the syscall table).

But I would prefer we don't touch the runtime-spec until libseccomp has the features we need to make a decision about what the right API should be. The current seccomp API was pretty much wholesale copied from the internal runc implementation, which is quite worrying because the runc implementation has weird footguns like #2735 and #1847. I don't want a(nother) runc hotfix to be codified in the spec.

@fweimer-rh
Copy link

What is the impact of the Linux 3.0 cut-off point? I'm asking because modern glibc assumes at least kernel 3.2 (main differences are prlimit64 and accept4 on some architectures, I think). If I understand things correctly, it means that getrlimit fails with EPERM with the 3.2 limit, and with ENOSYS with the 3.0 limit because the implementation calls prlimit64 internally, unless there is a rule for prlimit64 in the policy. In theory, getrlimit and pr64limit with a zero PID and a null pointer for the new limit should be treated identically. But maybe this has to be punted to policy authors.

A complete different concern is who maintains the name-to-syscall mapping. @cyphar You said you wanted to get this from libseccomp. However, downstream, at least we don't treat libseccomp as a package that we need to keep updating continuously for each system call that is added to the upstream (!) kernel. Maybe this is something that we need to change downstream, and I will have to start a discussion about that.

@cyphar
Copy link
Member Author

cyphar commented Jan 18, 2021

@fweimer-rh Well this is intended as a hotfix, if you think 3.2 is a better cutoff point (the fact it's the minimum version required for glibc is probably the best reason to pick it) we can use that instead. Especially if you think a Linux 3.0 cut-off will break glibc installations. But we don't plan to have this workaround stick around for a very long time -- which is why I've hacked up #2746 as an even simpler implementation of this PR so that we can have something merged.

The longer-term goal is to get rid of the hard-coded hotfix and instead either:

  1. Get policy authors to specify what kernel version the profile was written for --effectively indicating what set of syscalls had been audited when creating the profile. Any syscall that existed in the specified kernel version will get the default action (-EPERM) but any newer syscalls (syscalls which did not exist in the syscall table in that kernel version) will get -ENOSYS to be future-compatible. This is the most "correct" solution but requires users to specify the kernel version.

  2. Auto-detect what kernel version the profile was written for by taking the newest syscall referenced in the profile and using its syscall version as the target kernel version. This requires no additional configuration, but will give the wrong results in some cases.

I think the way we may implement this eventually is to have (1) be the preferred method but if no kernel version is specified we fall back to (2). The reason we need to do this on a kernel version level rather than syscall number level is that the syscall table can have holes in it, and thus any simplistic "just check if the syscall number is larger than X" checks can be wrong -- this happened recently with close_range -- it was added after openat2 despite openat2 having a larger syscall number.

However this long-term solution requires features in libseccomp (or some other source) which don't currently exist -- see below.

A complete different concern is who maintains the name-to-syscall mapping. @cyphar You said you wanted to get this from libseccomp.

We already use libseccomp for this purpose (for two reasons -- there is no real kernel API for it, and we only care about syscall names in the context of seccomp). If we change sources of this information that'd be fine so long as the mapping didn't change for old syscalls.

I agree that this is less than ideal because it requires libseccomp to be updated fairly regularly, but right now these are the tools we have at our disposal (and in practice it hasn't been a huge problem -- the libseccomp maintainers are pretty reactive to new syscall additions). However it should also be noted that for libseccomp's usecase we also need to have the following information which probably won't exist in a hypothetical kernel API:

  1. Syscall numbers for different architectures because libseccomp supports generating a filter that works on more than one architecture. libseccomp already has this information.
  2. In what kernel version a syscall first appeared. This is not currently provided by libseccomp, but is something they are working on (RFE: support "maximum kernel version" seccomp/libseccomp#11) and will be necessary for the long-term solution I described above.

@cyphar
Copy link
Member Author

cyphar commented Jan 18, 2021

I've thought of a nicer interim solution to this problem -- inspired by @richfelker, except we prepend our -ENOSYS handling -- which I'll implement in this PR tomorrow.

But we still need a better long-term solution.

@cyphar
Copy link
Member Author

cyphar commented Jan 19, 2021

#2750 is a cleaner (though still not ideal) solution to this problem. Closing in favour of that implementation.

@cyphar cyphar closed this Jan 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

seccomp filter should return ENOSYS for unknown syscalls
8 participants