Ceph and Faraday
Containers are stateless, but applications that run on Hyades often need to store state. Since that state has to survive machine failure, data needs to be replicated across nodes in a consistent fashion. There are a number of upstream projects that implement this kind of disk cluster, but out of all the possible alternatives, Ceph seems to be the only real option for Hyades.
Ceph is a good fit for Hyades because it pools storage across a large number of machines and can provide multiple different volumes from that one unified pool. It can serve both block storage and object storage from this pool of resources (Ceph's filesystem support is not secure enough for our needs), and it appears to be decently well written and actively maintained.
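To make the object-storage side concrete, here is a minimal sketch of writing an object into a Ceph pool using the go-ceph bindings. The pool name "hyades" and the object name are illustrative, not anything Hyades actually uses, and the cluster settings are read from the machine's standard Ceph configuration.

```go
package main

import (
	"log"

	"github.com/ceph/go-ceph/rados"
)

func main() {
	// Connect using the standard Ceph config and keyring on this machine.
	conn, err := rados.NewConn()
	if err != nil {
		log.Fatal(err)
	}
	if err := conn.ReadDefaultConfigFile(); err != nil {
		log.Fatal(err)
	}
	if err := conn.Connect(); err != nil {
		log.Fatal(err)
	}
	defer conn.Shutdown()

	// A single cluster can expose many pools; each pool hands back an
	// I/O context for reading and writing named objects.
	ioctx, err := conn.OpenIOContext("hyades") // pool name is illustrative
	if err != nil {
		log.Fatal(err)
	}
	defer ioctx.Destroy()

	if err := ioctx.WriteFull("example-object", []byte("replicated state")); err != nil {
		log.Fatal(err)
	}
	log.Println("stored example-object in pool hyades")
}
```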
However, Ceph is woefully insecure in a large number of ways. Ceph uses CephX as its protocol for secure communication; CephX is a hand-rolled reimplementation of Kerberos with unspecified changes. Most importantly, it doesn't provide confidentiality, so any data sent over it (including configuration data and confidential user data) would be leaked to any network attacker.
See our page on The Keysystem for details on our threat model.
The worst part is that CephX uses symmetric keys for authentication (because it's based on Kerberos), and in certain cases transmits these symmetric keys over other CephX channels... which means that a network attacker who can read messages can also spoof them. Even if we encrypted data before it entered the Ceph cluster, an attacker could still perform arbitrary deletions or other operations on the stored data.
To combat this, we wrap Ceph in a theoretically airtight "safety blanket" known as Faraday, a component that we're developing ourselves. Faraday is a management layer for a WireGuard overlay network; WireGuard is a modern secure VPN, essentially an IPsec alternative. This lets us rely on CephX only for authentication, with WireGuard protecting the actual transmission of data and keys.
The reason that Faraday is needed, and not just WireGuard, is that WireGuard has only two topologies: having all nodes connect through a central server, or having all nodes communicate with all other nodes directly. While the latter is what we want (we can't have our disk cluster bottlenecked or made unreliable by a single central server), WireGuard doesn't provide any easy way to add or remove nodes. Faraday is simply WireGuard plus a custom subsystem to monitor the network topology and update the WireGuard configuration of each node as the topology changes.
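As a rough sketch of what "updating the WireGuard configuration of each node" means in practice, the following uses the wgctrl library to replace a device's peer set with the current topology in one call. The Node type and its fields are illustrative stand-ins, not Faraday's actual data model.

```go
package faraday

import (
	"net"

	"golang.zx2c4.com/wireguard/wgctrl"
	"golang.zx2c4.com/wireguard/wgctrl/wgtypes"
)

// Node is an illustrative stand-in for whatever Faraday tracks per peer.
type Node struct {
	PublicKey wgtypes.Key
	Endpoint  *net.UDPAddr
	OverlayIP net.IPNet // the node's address inside the overlay network
}

// syncPeers replaces the device's peer set with the given topology, so a
// single call handles both newly added and newly removed nodes.
func syncPeers(device string, topology []Node) error {
	client, err := wgctrl.New()
	if err != nil {
		return err
	}
	defer client.Close()

	peers := make([]wgtypes.PeerConfig, 0, len(topology))
	for _, n := range topology {
		peers = append(peers, wgtypes.PeerConfig{
			PublicKey:         n.PublicKey,
			Endpoint:          n.Endpoint,
			ReplaceAllowedIPs: true,
			AllowedIPs:        []net.IPNet{n.OverlayIP},
		})
	}
	return client.ConfigureDevice(device, wgtypes.Config{
		ReplacePeers: true, // peers absent from the topology are dropped
		Peers:        peers,
	})
}
```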
There are two types of changes to the network topology: additions and removals. Faraday handles them differently. Additions to the network are propagated to all nodes through a central server. Removals from the network (which can happen not just when a machine is intentionally turned off, but also when it experiences a hardware failure or some other kind of fault) are tracked by the peers themselves.
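As a hypothetical sketch of that split (not Faraday's real code), a node might record additions immediately as the central server announces them, while only ever suspecting removals locally, based on how long a peer has been silent:

```go
package faraday

import (
	"sync"
	"time"
)

// Mesh tracks peer liveness from a single node's point of view. All
// names here are illustrative, not Faraday's actual types.
type Mesh struct {
	mu       sync.Mutex
	lastSeen map[string]time.Time // peer public key -> last contact
}

func NewMesh() *Mesh {
	return &Mesh{lastSeen: make(map[string]time.Time)}
}

// OnNodeAdded handles an addition announced by the central server.
func (m *Mesh) OnNodeAdded(pubkey string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.lastSeen[pubkey] = time.Now()
	// ...followed by adding the peer to the local WireGuard config.
}

// MarkSeen records direct contact with a peer (e.g. a handshake).
func (m *Mesh) MarkSeen(pubkey string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.lastSeen[pubkey] = time.Now()
}

// SuspectedOffline lists peers that have gone quiet. They are only
// suspects: actual removal requires confirmation (see below).
func (m *Mesh) SuspectedOffline(timeout time.Duration) []string {
	m.mu.Lock()
	defer m.mu.Unlock()
	var quiet []string
	for key, t := range m.lastSeen {
		if time.Since(t) > timeout {
			quiet = append(quiet, key)
		}
	}
	return quiet
}
```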
Since the central server is only used to track changes in network topology, momentary failure of the central server will not be harmful to the cluster. Even if the central server reboots, it will be able to reconstruct the network topology as it receives periodic health updates from nodes.
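Because health updates carry everything the server needs, its state can be treated as a throwaway cache. A hypothetical sketch of that idea: each update refreshes a timestamp, so a freshly rebooted server converges back to the true topology as reports arrive.

```go
package faraday

import (
	"sync"
	"time"
)

// Registry is an illustrative model of the central server's state.
type Registry struct {
	mu       sync.Mutex
	lastSeen map[string]time.Time // node name -> last health update
}

func NewRegistry() *Registry {
	return &Registry{lastSeen: make(map[string]time.Time)}
}

// HandleHealthUpdate records that a node checked in. After a reboot the
// map starts empty and refills as reports arrive, so no durable state
// is needed to recover the topology view.
func (r *Registry) HandleHealthUpdate(node string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.lastSeen[node] = time.Now()
}

// Online lists the nodes that have checked in within the given window.
func (r *Registry) Online(window time.Duration) []string {
	r.mu.Lock()
	defer r.mu.Unlock()
	var up []string
	for node, t := range r.lastSeen {
		if time.Since(t) <= window {
			up = append(up, node)
		}
	}
	return up
}
```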
We don't use the central server to track removals (except to communicate cluster information to new nodes) because broken links in the network can cause situations where the central server can't communicate with a node (and hence thinks that it's offline) but other nodes can. In such cases, we would still like those other nodes to communicate with that node. Similarly, if the central server has just rebooted, we don't want it to tell individual nodes that a certain slow node is gone just because it hasn't checked in yet.
There is also the case where the link between two nodes is momentarily broken. With peer-tracked removals alone, the two nodes would stop communicating with each other even after the link starts working again. To prevent this, a node will check with the central server that a peer is indeed offline before considering that peer removed from the network.
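Put as a decision rule, a peer is removed only when both the local view and the central server agree that it is offline. A hypothetical sketch, with the two inputs supplied by the caller:

```go
package faraday

import "time"

// shouldRemove decides whether to drop a peer from the mesh. Both inputs
// are hypothetical hooks: lastSeen is when the peer last spoke to us
// directly, and serverSaysOffline asks the central server for its view.
func shouldRemove(lastSeen time.Time, timeout time.Duration,
	serverSaysOffline func() (bool, error)) bool {
	if time.Since(lastSeen) < timeout {
		return false // the direct link is healthy; nothing to do
	}
	offline, err := serverSaysOffline()
	if err != nil {
		return false // server unreachable: err on the side of keeping the peer
	}
	return offline // remove only when the server agrees the node is gone
}
```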
The main repository for Faraday is on GitHub. Refer to it for information on how Faraday is architected; its README is likely a more up-to-date source than this page.