112 changes: 112 additions & 0 deletions enhancements/security/file-integrity-operator.md
---
title: OpenShift node file integrity monitoring
authors:
- "@mrogers950"
reviewers:
- "@cgwalters"
- "@ashcrow"
- "@jhrozek"
approvers:
- "@JAORMX"
creation-date: 2019-10-21
last-updated: 2019-11-04
status: provisional
---

# Cluster Node File Integrity Operator

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift/docs]

## Open Questions [optional]

## Summary

This enhancement describes a new security feature for OpenShift. Many security-conscious customers want to be informed when files on a host's filesystem are modified in a way that is unexpected, as this may indicate an attack or compromise. It proposes a "file-integrity-operator" that provides file integrity monitoring of select files on the host filesystems of the cluster nodes. It periodically runs a verification check on the watched files and provides logs of any changes.
Comment (Member):

Summarizing some points I raised in internal discussion: I just don't see this implementation as worth it.

Security is full of tradeoffs - I can add three locks to my front door for example. Is it worth it? Probably not...it'd be a daily usability hit for a very marginal benefit.

Humans (including humans attempting to compromise computer systems) will follow the path of least resistance. If they see 3 locks on my front door, they'll check the back door, where I didn't put 3 locks.

In this case, the back door is basically killing or compromising the AIDE daemonset.

This is going to be a very minor speedup to any attacker that has studied the system beforehand.

Further, it raises the risk of a lot of false positives. How for example would an AIDE system distinguish between "attacker compromised /usr/bin/bash" and "OSTree update changed /usr/bin/bash". Similarly for files in /etc that may be changed via MachineConfig, or certificates on the host that end up being rotated, etc.

Really a lot of this boils down to:

providing a log of files that have been modified since the initial run of the DaemonSet pods.

Is just not even close to sufficient.

Finally, another thing is that "periodically scan the whole filesystem" is a known way to cause performance hits. It means that files that were unused are suddenly brought into the page cache, potentially evicting hot files. It causes I/O contention.

I understand we're trying to meet a compliance standard, and we also can't let the perfect be the enemy of the good. We have to start somewhere, and I (and our customers) certainly appreciate the efforts here.

But my bottom line is that this implementation overall will cause more problems (in false positives and also periodic perf hits) than it will solve.

Comment (Contributor):

To give a little more context on where this initiative started: this is not the ultimate file-integrity silver bullet that will solve all of our issues. The initiative came from the need to comply with federal regulations (through the FedRAMP program, the moderate baseline to be precise), which require such a system to be in place to ensure file integrity. So, from a compliance point of view, either you're compliant and you can sell to folks that require this sort of thing (US public sector, finance, healthcare), or you're not and you can't sell. AIDE was chosen as a first approach since this is how we currently make RHEL compliant. Given that this is an operator, we can further iterate by creating another provider, called by the operator, that would give stronger security assurances. But to begin with, let's just enable customers to comply with regulations so they can use OpenShift at all.

On the other hand, this is not being recommended as a default, it would be something you enable through OperatorHub.

Comment (Contributor Author):

> Summarizing some points I raised in internal discussion: I just don't see this implementation as worth it.
>
> Security is full of tradeoffs - I can add three locks to my front door for example. Is it worth it? Probably not... it'd be a daily usability hit for a very marginal benefit.
>
> Humans (including humans attempting to compromise computer systems) will follow the path of least resistance. If they see 3 locks on my front door, they'll check the back door, where I didn't put 3 locks.
>
> In this case, the back door is basically killing or compromising the AIDE daemonset.
>
> This is going to be a very minor speedup to any attacker that has studied the system beforehand.

While this might be true, the same could be said about basically any OS-level security component that is not backed by an HSM or similar hardware root of trust, and even then that just moves the goalposts - you then have to trust that your hardware vendor is not selling its keys to a nation state. Given a sophisticated enough attacker it all falls apart, so you can't really include this worst-case scenario when developing a threat model.

Like Juan mentioned there is the possibility of extending this to different file integrity providers but we want to tackle the FedRAMP "moderate" baseline initially (we also would not want to make the HSM/TPM a barrier for entry to be compliant if the spec does not call for it). Even if at a practical level the AIDE solution only provides the ability to better post-mortem a compromise because you have a log of the files that changed, that is still valuable to the organization.

> Further, it raises the risk of a lot of false positives. How for example would an AIDE system distinguish between "attacker compromised /usr/bin/bash" and "OSTree update changed /usr/bin/bash". Similarly for files in /etc that may be changed via MachineConfig, or certificates on the host that end up being rotated, etc.

We will need to come up with a good strategy for false positives. In the case of an OSTree update, this sounds like something the file integrity operator might be able to handle if it can detect that there was a cluster update and update the checksum database.


## Motivation

In addition to the reasons stated in the Summary section, the FedRAMP gap assessment of OpenShift/RHCOS identified that to fulfill several NIST SP 800-53 security controls we need to continuously run integrity checks on configuration files (CM-3 & CM-6), as well as on critical system paths and binaries (boot configuration, drivers, firmware, libraries) (SI-7). Besides verifying the files, we need to report which files changed and in what manner, so that the organization can determine whether the change was authorized. To fulfill these controls, the file integrity checks need to use a state-of-the-practice integrity-checking mechanism (e.g., parity checks, cyclical redundancy checks, cryptographic hashes), and if cryptographic hashes are used, the algorithms must be FIPS-approved.

## Goals

Provide a way for security-conscious customers to be alerted when changes are made to files on the host operating system in a way that is satisfactory for FedRAMP compliance.

## Proposal

The proposed design and current [Proof-of-concept operator](https://github.com/mrogers950/file-integrity-operator) are as follows:
* Deploying node monitoring pods - The file-integrity-operator deploys daemonSets that run a privileged [AIDE](https://aide.github.io/) pod on each master and worker. This AIDE pod does a hostmount of / to the /hostroot directory in the pod. The privileged access for the AIDE pod is needed for the hostmount, so an SELinux policy is applied that will restrict write access to files other than the AIDE database and log.
* A worker node may be RHCOS or RHEL (UPI). This means there will potentially be different default AIDE configurations. When deploying daemonSets for the workers, the operator will try to determine the OS type and deploy with the appropriate config.
* Scan database initialization - AIDE works off of a database of file checksums. This database must be initialized during the first run of the AIDE pods, and at times may need to be re-initialized if the AIDE configuration changes.
* Running scans - The AIDE process runs in a loop in the pod, periodically running integrity checks (customizable with Spec.ScanInterval).
* Log scan results - The AIDE process can write scan results to syslog, files, and standard output. How we handle the logs depends on the approach we want to take:
* Approach A: Log to syslog only and leave collection up to the clusterlogging components. This is likely the preferred method.
* Approach B: Log to files on the host, and expose them to the user through configMaps.
* AIDE Configuration - A user-provided AIDE configuration can be provided in order to allow customers to modify the integrity check policy.
* The admin first creates a configMap containing their aide.conf.
* A FileIntegrity CR is posted that defines the Spec.Config items.
* Spec.Config.Name: The name of a configMap that contains the admin-provided aide.conf
* Spec.Config.Namespace: The namespace of the configMap
* Spec.Config.Key: The data key in the configMap that holds the aide.conf
* In most cases the provided aide.conf will be specific to a standalone host and not tailored for the operator. On reconcile, the operator reads the configuration from the configMap and applies a few conversions allowing it to work with the pod configuration.
* Prefix /hostroot/ to each file selection line.
* Change database and log path parameters.
* Change allowed checksum types to FIPS-approved variants.
* After conversion the operator's AIDE configuration configMap is updated with the new config, and the daemonSet's pods are restarted.
* "Provider" types - Including a provider field in the API definition leaves us room to later define a different integrity method. Suggested future methods are fs-verity and IMA/EVM. Initially, this always defaults to "aide".
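The aide.conf conversion steps above can be sketched as a line-by-line rewrite. This is a minimal illustration only, not the operator's actual code: the function name and the target database/log paths under `/hostroot/etc/kubernetes/` are hypothetical, and the FIPS checksum rewriting is omitted.

```go
// Hypothetical sketch of the operator's aide.conf conversion: prefix file
// selection lines with the /hostroot hostmount and redirect the database
// and log paths so a host-tailored aide.conf works inside the AIDE pods.
package main

import (
	"fmt"
	"strings"
)

// convertAideConf rewrites an admin-provided aide.conf for in-pod use.
func convertAideConf(conf string) string {
	var out []string
	for _, line := range strings.Split(conf, "\n") {
		trimmed := strings.TrimSpace(line)
		switch {
		case strings.HasPrefix(trimmed, "database=file:"):
			// Redirect the checksum database to a pod-managed location
			// (path is illustrative).
			out = append(out, "database=file:/hostroot/etc/kubernetes/aide.db.gz")
		case strings.HasPrefix(trimmed, "report_url=file:"):
			// Redirect the scan log likewise.
			out = append(out, "report_url=file:/hostroot/etc/kubernetes/aide.log")
		case strings.HasPrefix(trimmed, "/"):
			// File selection lines get the hostmount prefix.
			out = append(out, "/hostroot"+trimmed)
		default:
			// Comments, macros, and group definitions pass through unchanged.
			out = append(out, line)
		}
	}
	return strings.Join(out, "\n")
}

func main() {
	in := "database=file:/var/lib/aide/aide.db.gz\n/etc p+u+g\n# comment"
	fmt.Println(convertAideConf(in))
}
```

A real converter would also need to handle negation (`!`) and equals (`=`) selection lines and rewrite checksum groups to FIPS-approved variants, as described above.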

### API Specification

```yaml
apiVersion: file-integrity.openshift.io/v1alpha1
kind: FileIntegrity
metadata:
  name: example-fileintegrity
spec:
  provider: aide
  scanInterval: 5m
  config:
    name:
    namespace:
    key:
```
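For illustration, the `config` fields reference an admin-provided configMap such as the following. All names, the namespace, and the aide.conf contents here are hypothetical:

```yaml
# Hypothetical admin-provided configMap holding a standalone-host aide.conf.
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-aide-conf           # referenced by spec.config.name
  namespace: openshift-file-integrity  # referenced by spec.config.namespace
data:
  aide.conf: |                 # referenced by spec.config.key
    database=file:/var/lib/aide/aide.db.gz
    /etc p+u+g+sha512
```

On reconcile the operator would read `data["aide.conf"]` from this configMap and apply the conversions described in the Proposal section.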

### Test Plan

* Basic functionality
1. Install file-integrity-operator.
2. Ensure operator roll-out, check for running daemonSet pods.
3. Modify a file on the host and verify detection of the change.
* Configuration
* Unit test aide.conf conversion functions.
* Verify the aide.conf can be changed and propagated to the daemonSet pods.
* Cluster upgrade testing

### Graduation Criteria

**Note:** *Section not required until targeted at a release.*

TBD

### Upgrade / Downgrade Strategy

The operator will handle configuration and image versioning for its operand. AIDE and its configuration are mature and not expected to have breaking changes between releases. Because of this stability, the container image we use for AIDE will likely not need to be upgraded between every OpenShift 4 release.

### Version Skew Strategy

The operator is intended to be the sole controller of its operand resources (configmaps, daemonSets, AIDE container image versions), so there should not be version skew issues.

## Implementation History

* Initial POC at https://github.com/mrogers950/file-integrity-operator

## Drawbacks

* After a cluster upgrade, new versions of the node OS will result in false positives as packages are updated.
  * One possible solution is to save the pre-upgrade database and logs, and re-initialize the AIDE database after the upgrade.
* AIDE runs periodically; the longer the interval, the higher the chance of missing a file change from an actual attack.