Merge pull request #28951 from saschagrunert/seccomp-default-blog

Add seccomp default feature blog post
kubernetes · Aug 24, 2021 · 157f1d7 · 157f1d7
2 parents 6e7b621 + 84e472e
commit 157f1d7
Showing 1 changed file with 267 additions and 0 deletions.
diff --git a/content/en/blog/_posts/2021-08-25-seccomp-default.md b/content/en/blog/_posts/2021-08-25-seccomp-default.md
@@ -0,0 +1,267 @@
+---
+layout: blog
+title: "Enable seccomp for all workloads with a new v1.22 alpha feature"
+date: 2021-08-25
+slug: seccomp-default
+---
+
+**Author:** Sascha Grunert, Red Hat
+
+This blog post is about a new Kubernetes feature introduced in v1.22, which adds
+an additional security layer on top of the existing seccomp support. Seccomp is
+a security mechanism for Linux processes to filter system calls (syscalls) based
+on a set of defined rules. Applying seccomp profiles to containerized workloads
+is one of the key tasks when it comes to enhancing the security of the
+application deployment. Developers, site reliability engineers and
+infrastructure administrators have to work hand in hand to create, distribute
+and maintain the profiles over the application's life cycle.
+
+The [`securityContext`][seccontext] field of Pods and their containers can be
+used to adjust security-related configurations of the workload. Kubernetes
+introduced dedicated [seccomp-related API fields][seccontext] in this
+`SecurityContext` with the [graduation of seccomp to General Availability
+(GA)][ga] in v1.19.0. This enhancement made it easier to specify whether the
+whole pod or a specific container should run as:
+
+[seccontext]: /docs/reference/kubernetes-api/workload-resources/pod-v1/#security-context-1
+[ga]: https://kubernetes.io/blog/2020/08/26/kubernetes-release-1.19-accentuate-the-paw-sitive/#graduated-to-stable
+
+- `Unconfined`: seccomp will not be enabled
+- `RuntimeDefault`: the container runtime's default profile will be used
+- `Localhost`: a node-local profile will be applied, which is referenced by a
+  path relative to the seccomp profile root (`<kubelet-root-dir>/seccomp`) of
+  the kubelet
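+
+As a rough sketch of how these types combine, the manifest below sets
+`RuntimeDefault` for the whole pod and overrides it with a `Localhost` profile
+for one container. The pod name, container names and the profile path
+`profiles/example.json` are hypothetical and only serve as an illustration:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: example-pod # hypothetical name
+spec:
+  securityContext:
+    seccompProfile:
+      type: RuntimeDefault # applies to all containers unless overridden
+  containers:
+    - name: default-profile
+      image: nginx:1.21
+    - name: custom-profile
+      image: redis:6.2
+      securityContext:
+        seccompProfile:
+          type: Localhost
+          # hypothetical file, resolved relative to <kubelet-root-dir>/seccomp
+          localhostProfile: profiles/example.json
+```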
+
+With the graduation of seccomp, nothing has changed from an overall security
+perspective, because `Unconfined` is still the default. This is reasonable from
+the perspective of upgrade paths and backwards compatibility between
+Kubernetes releases. But it also means that a workload is more likely to run
+without seccomp at all, which should be fixed in the long term.
+
+## `SeccompDefault` to the rescue
+
+Kubernetes v1.22.0 introduces a new kubelet [feature gate][gate]
+`SeccompDefault`, which has been added in `alpha` state like every other new
+feature. This means that it is disabled by default and can be enabled manually
+for every single Kubernetes node.
+
+[gate]: /docs/reference/command-line-tools-reference/feature-gates
+
+What does the feature do? Well, it just changes the default seccomp profile from
+`Unconfined` to `RuntimeDefault`. Unless the pod manifest specifies otherwise,
+the feature adds a stricter set of security constraints by using the default
+profile of the container runtime. These profiles may differ between runtimes
+like [CRI-O][crio] or [containerd][ctrd]. They also differ depending on the
+hardware architecture. But generally speaking, those default profiles allow a
+common set of syscalls while blocking the more dangerous ones, which are
+unlikely to be needed or unsafe to use in a containerized application.
+
+[crio]: https://github.com/cri-o/cri-o/blob/fe30d62/vendor/github.com/containers/common/pkg/seccomp/default_linux.go#L45
+[ctrd]: https://github.com/containerd/containerd/blob/e1445df/contrib/seccomp/seccomp_default.go#L51
+
+### Enabling the feature
+
+Two kubelet configuration changes have to be made to enable the feature:
+
+1. **Enable the feature gate** by setting `SeccompDefault=true` via the command
+   line (`--feature-gates`) or the [kubelet configuration][kubelet] file.
+2. **Turn on the feature** by adding the `--seccomp-default` command line flag
+   or by setting `seccompDefault: true` in the [kubelet
+   configuration][kubelet] file.
+
+[kubelet]: /docs/tasks/administer-cluster/kubelet-config-file
+
+The kubelet will error on startup if only one of the above steps has been done.
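+
+When both steps are done through the configuration file, the result might look
+like the following minimal sketch. It assumes a kubelet configuration file
+using the `kubelet.config.k8s.io/v1beta1` API and leaves out every unrelated
+setting:
+
+```yaml
+apiVersion: kubelet.config.k8s.io/v1beta1
+kind: KubeletConfiguration
+featureGates:
+  SeccompDefault: true # step 1: enable the feature gate
+seccompDefault: true # step 2: turn on the feature
+```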
+
+### Trying it out
+
+If the feature is enabled on a node, then you can create a new workload like
+this:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: test-pod
+spec:
+  containers:
+    - name: test-container
+      image: nginx:1.21
+```
+
+Now it is possible to inspect the used seccomp profile by using
+[`crictl`][crictl] while investigating the container's [runtime
+specification][rspec]:
+
+[crictl]: https://github.com/kubernetes-sigs/cri-tools
+[rspec]: https://github.com/opencontainers/runtime-spec/blob/0c021c1/config-linux.md#seccomp
+
+```bash
+CONTAINER_ID=$(sudo crictl ps -q --name=test-container)
+sudo crictl inspect "$CONTAINER_ID" | jq .info.runtimeSpec.linux.seccomp
+```
+
+```yaml
+{
+  "defaultAction": "SCMP_ACT_ERRNO",
+  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"],
+  "syscalls": [
+    {
+      "names": ["_llseek", "_newselect", "accept", …, "write", "writev"],
+      "action": "SCMP_ACT_ALLOW"
+    },
+    …
+  ]
+}
+```
+
+You can see that the lower level container runtime ([CRI-O][crio-home] and
+[runc][runc] in our case) successfully applied the default seccomp profile.
+This profile denies all syscalls by default, while allowing commonly used ones
+like [`accept`][accept] or [`write`][write].
+
+[crio-home]: https://github.com/cri-o/cri-o
+[runc]: https://github.com/opencontainers/runc
+[accept]: https://man7.org/linux/man-pages/man2/accept.2.html
+[write]: https://man7.org/linux/man-pages/man2/write.2.html
+
+Please note that the feature will not influence any Kubernetes API for now.
+Therefore, it is not possible to retrieve the used seccomp profile via `kubectl`
+`get` or `describe` if the [`SeccompProfile`][api] field is unset within the
+`SecurityContext`.
+
+[api]: https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#security-context-1
+
+The feature also works when using multiple containers within a pod, for example
+if you create a pod like this:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: test-pod
+spec:
+  containers:
+    - name: test-container-nginx
+      image: nginx:1.21
+      securityContext:
+        seccompProfile:
+          type: Unconfined
+    - name: test-container-redis
+      image: redis:6.2
+```
+
+then you should see that the `test-container-nginx` runs without a seccomp profile:
+
+```bash
+sudo crictl inspect $(sudo crictl ps -q --name=test-container-nginx) |
+    jq '.info.runtimeSpec.linux.seccomp == null'
+true
+```
+
+Whereas the container `test-container-redis` runs with `RuntimeDefault`:
+
+```bash
+sudo crictl inspect $(sudo crictl ps -q --name=test-container-redis) |
+    jq '.info.runtimeSpec.linux.seccomp != null'
+true
+```
+
+The same applies to the pod itself, which also runs with the default profile:
+
+```bash
+sudo crictl inspectp $(sudo crictl pods -q --name test-pod) |
+    jq '.info.runtimeSpec.linux.seccomp != null'
+true
+```
+
+### Upgrade strategy
+
+It is recommended to enable the feature in multiple steps, each of which comes
+with its own risks and mitigations.
+
+#### Feature gate enabling
+
+Enabling the feature gate at the kubelet level will not turn on the feature, but
+will make it possible to do so by using the `seccompDefault` kubelet
+configuration or the `--seccomp-default` CLI flag. This can be done by an
+administrator for the whole cluster or only a set of nodes.
+
+#### Testing the application
+
+If you're trying this within a dedicated test environment, you have to ensure
+that the application code does not trigger syscalls blocked by the
+`RuntimeDefault` profile before enabling the feature on a node. This can be done
+by:
+
+- _Recommended_: Analyzing the code (manually or by running the application with
+  [strace][strace]) for any executed syscalls which may be blocked by the
+  default profiles. If that's the case, then you can override the default by
+  explicitly setting the pod or container to run as `Unconfined`. Alternatively,
+  you can create a custom seccomp profile (see optional step below) based on
+  the default by adding the additional syscalls to the
+  `"action": "SCMP_ACT_ALLOW"` section.
+
+- _Recommended_: Manually set the profile on the target workload and use a
+  rolling upgrade to deploy into production. Roll back the deployment if the
+  application does not work as intended.
+
+- _Optional_: Run the application against an end-to-end test suite to trigger
+  all relevant code paths with `RuntimeDefault` enabled. If a test fails, use
+  the same mitigation as mentioned above.
+
+- _Optional_: Create a custom seccomp profile based on the default and change
+  its default action from `SCMP_ACT_ERRNO` to `SCMP_ACT_LOG`. This means that
+  the seccomp filter for unknown syscalls will have no effect on the application
+  at all, but the system logs will now indicate which syscalls may be blocked.
+  This requires at least Linux kernel version 4.14 as well as a recent
+  [runc][runc] release. Monitor the application host's audit logs (defaults to
+  `/var/log/audit/audit.log`) or syslog entries (defaults to `/var/log/syslog`)
+  for syscalls via `type=SECCOMP` (for audit) or `type=1326` (for syslog).
+  Compare the syscall ID with those [listed in the Linux Kernel
+  sources][syscalls] and add them to the custom profile. Be aware that custom
+  audit policies may lead to missing syscalls, depending on the configuration
+  of auditd.
+
+- _Optional_: Use cluster additions like the [Security Profiles Operator][spo]
+  for profiling the application via its [log enrichment][logs] capabilities or
+  recording a profile by using its [recording feature][rec]. This makes the
+  above mentioned manual log investigation obsolete.
+
+[syscalls]: https://github.com/torvalds/linux/blob/7bb7f2a/arch/x86/entry/syscalls/syscall_64.tbl
+[spo]: https://github.com/kubernetes-sigs/security-profiles-operator
+[logs]: https://github.com/kubernetes-sigs/security-profiles-operator/blob/c90ef3a/installation-usage.md#using-the-log-enricher
+[rec]: https://github.com/kubernetes-sigs/security-profiles-operator/blob/c90ef3a/installation-usage.md#record-profiles-from-workloads-with-profilerecordings
+[strace]: https://man7.org/linux/man-pages/man1/strace.1.html
+
+#### Deploying the modified application
+
+Based on the outcome of the application tests, it may be required to change the
+application deployment by either specifying `Unconfined` or a custom seccomp
+profile. This is not the case if the application works as intended with
+`RuntimeDefault`.
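+
+If a change is needed, it could look roughly like the sketch below, which pins
+a custom profile in the pod template of a Deployment. The Deployment name, the
+labels and the profile path `profiles/example-app.json` are hypothetical. The
+commented-out line shows the `Unconfined` alternative:
+
+```yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: example-app # hypothetical name
+spec:
+  replicas: 2
+  selector:
+    matchLabels:
+      app: example-app
+  template:
+    metadata:
+      labels:
+        app: example-app
+    spec:
+      securityContext:
+        seccompProfile:
+          # custom profile, resolved relative to <kubelet-root-dir>/seccomp
+          type: Localhost
+          localhostProfile: profiles/example-app.json # hypothetical path
+          # alternative if no custom profile exists yet:
+          # type: Unconfined
+      containers:
+        - name: example-app
+          image: nginx:1.21
+```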
+
+#### Enable the kubelet configuration
+
+If everything went well, then the feature is ready to be enabled by the kubelet
+configuration or its corresponding CLI flag. This should be done on a per-node
+basis to reduce the overall risk of missing a syscall during the investigations
+when running the application tests. If it's possible to monitor audit logs
+within the cluster, then it's recommended to do this to catch any seccomp
+events that were missed. If the application works as intended then the feature
+can be enabled for further nodes within the cluster.
+
+## Conclusion
+
+Thank you for reading this blog post! I hope you enjoyed seeing how the usage of
+seccomp profiles has evolved in Kubernetes over the past releases as much as I
+did. On your own cluster, change the default seccomp profile to
+`RuntimeDefault` (using this new feature) and see the security benefits, and, of
+course, feel free to reach out any time for feedback or questions.
+
+---
+
+_Editor's note: If you have any questions or feedback about this blog post, feel
+free to reach out via the [Kubernetes slack in #sig-node][slack]._
+
+[slack]: https://kubernetes.slack.com/messages/sig-node