Built with NixOS and Kubernetes.
When adding a new node, you need to create a bootable USB drive, which requires building an ISO file. If you are building on an Apple Silicon Mac (or another non-x86_64 architecture), see this post. Otherwise, follow the steps below.
cd iso
nix build .#nixosConfigurations.exampleIso.config.system.build.isoImage
The resulting ISO will be in the result directory. Burn that ISO to a USB drive, then boot the new node from the USB drive.
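The burn step above can be sketched as a small helper that only prints a dd command instead of running it, so the target device can be checked first. It assumes the image sits under result/iso/ and that GNU dd is available; the function name usb_write_cmd and the device /dev/sdX are hypothetical placeholders.

```shell
# Print (do not run) a dd command for the first ISO found under a build result.
# Assumption: the image lands in <result>/iso/*.iso; adjust if your layout differs.
usb_write_cmd() {
  result_dir="$1"   # path to the `result` symlink or directory
  device="$2"       # target USB device, e.g. /dev/sdX (placeholder)
  iso=""
  for f in "$result_dir"/iso/*.iso; do
    # An unmatched glob stays literal, so check that the file really exists.
    [ -e "$f" ] && { iso="$f"; break; }
  done
  if [ -z "$iso" ]; then
    echo "no ISO found under $result_dir/iso" >&2
    return 1
  fi
  # The options shown are GNU dd options; other dd implementations may differ.
  echo "sudo dd if=$iso of=$device bs=4M status=progress conv=fsync"
}

# Example: usb_write_cmd ./result /dev/sdX
```

Double-check the device name before running the printed command, since dd overwrites the target without asking.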
Node secrets are managed with sops-nix.
On first boot, each node generates a key located at /etc/ssh/ssh_host_ed25519_key.pub. After the install, the key is converted to age and printed in the terminal. Copy the printed age key and add it to the .sops.yaml file.
To create your own key for local development, generate an SSH key and convert it to age:
nix-shell -p ssh-to-age --run 'cat /YOUR/KEY/PATH.pub | ssh-to-age'
Then add the output to the .sops.yaml file.
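For orientation, here is a sketch of what a .sops.yaml with one local key and one node key might look like, printed from a shell function. It follows the layout commonly shown in the sops-nix documentation, but this repository's actual file may differ; the anchor names (admin, node1), the path_regex, and the key values are all placeholders.

```shell
# Print a sample .sops.yaml layout.
# Assumption: anchor names, path_regex, and key values are illustrative only.
sample_sops_yaml() {
  cat <<'EOF'
keys:
  - &admin age1REPLACE_WITH_YOUR_PUBLIC_KEY
  - &node1 age1REPLACE_WITH_NODE_PUBLIC_KEY
creation_rules:
  - path_regex: secrets/[^/]+\.yaml$
    key_groups:
      - age:
          - *admin
          - *node1
EOF
}

sample_sops_yaml
```

Each age public key produced by ssh-to-age gets its own entry under keys and is then referenced in the matching creation rule.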
To create a keys.txt for local secrets management, run the following command:
nix run nixpkgs#ssh-to-age -- -private-key -i ~/YOUR/KEY/PATH > keys.txt
This is needed if you're updating the secrets locally.
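A quick sanity check on the generated file can catch a wrong key path before sops is involved. The sketch below relies on age private keys starting with AGE-SECRET-KEY-; the function name has_age_private_key is hypothetical.

```shell
# Return 0 if the given file is readable and contains an age private key line.
# age private keys start with "AGE-SECRET-KEY-".
has_age_private_key() {
  [ -r "$1" ] && grep -q '^AGE-SECRET-KEY-' "$1"
}

# Example: has_age_private_key ./keys.txt && echo "keys.txt looks usable"
```

Because keys.txt holds a private key, it should stay out of version control.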
To update the new keys across all nodes, run the following command:
nix-shell -p sops --run "SOPS_AGE_KEY_FILE=./keys.txt sops updatekeys secrets/secrets.yaml"
Then commit the changes to the .sops.yaml file and the nodes will be updated on their next rebuild.
Make sure you have nix installed locally. Then:
- Add the new node and its IP to flake.nix.
- Boot the node from the ISO created in Build NixOS ISO. Ensure that the node is reachable at 192.168.100.199. If you get permission errors, you may have to add your key to the ISO config file.
- Execute the following command on your local machine:
SSH_PRIVATE_KEY="$(cat ./nixos_cluster)"$'\n' nix run github:nix-community/nixos-anywhere --extra-experimental-features "nix-command flakes" -- --flake '.#cluster-node-NUMBER' [email protected]
- Once the node boots, ssh into the node and run the following command:
nix-shell -p ssh-to-age --run 'cat /etc/ssh/ssh_host_ed25519_key.pub | ssh-to-age'
Copy the resulting age key to the .sops.yaml file and regenerate secrets (see Secrets Management), then update the node.
If you have the repository cloned on a node (for example, to work on changes without committing), run the following to update from the local source:
sudo nixos-rebuild switch --flake '.#cluster-node-NUMBER'
Then to update each node in the cluster:
sudo nixos-rebuild switch --flake '.#cluster-node-NUMBER' --use-remote-sudo --target-host cluster@cluster-node-NUMBER
This will also update secrets on each node.
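Updating several nodes in a row can be sketched as a loop over node numbers. The helper below only prints the remote rebuild command from above for each number (a dry run), so the list can be reviewed before anything runs; the function name print_rebuild_cmds and the node numbers 1 2 3 are assumptions for illustration.

```shell
# Print the remote rebuild command for each node number passed as an argument.
# This is a dry run: nothing is executed, the commands are only echoed.
print_rebuild_cmds() {
  for n in "$@"; do
    echo "sudo nixos-rebuild switch --flake '.#cluster-node-$n' --use-remote-sudo --target-host cluster@cluster-node-$n"
  done
}

# Hypothetical node numbers; replace with the nodes defined in flake.nix.
print_rebuild_cmds 1 2 3
```

Once the printed commands look right, they can be run one at a time or piped to sh.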
To pull new changes from the repository without cloning it onto the node, just run:
sudo nixos-rebuild switch --flake github:NelsonDane/clarkson-nixos-cluster#cluster-node-NUMBER
All nodes can ssh into each other using the included ssh_config. There is a key located in .sops.yaml that is available at /run/secrets/cluster_talk.
You don't have to update each node manually: every node pulls and applies new changes from this repository every day at 3:30am.
A GitHub Action runs this every day at 3am automatically.
For convenience, the following aliases are available when ssh'd into a node:
c -> clear
k -> kubectl
h -> helm
hf -> helmfile
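To get the same shortcuts on a local machine, the mapping above can be written as ordinary shell aliases. This is only a sketch of equivalent definitions, not necessarily how the nodes define them.

```shell
# Local equivalents of the node aliases listed above.
alias c='clear'
alias k='kubectl'
alias h='helm'
alias hf='helmfile'
```

Adding these lines to a shell startup file such as ~/.bashrc makes them available in every new interactive session.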
For distributed storage, we use Longhorn. To install Longhorn, run the following command:
cd helm
hf apply
To see the GUI, go to http://192.168.100.61 in your browser.
To get MetalLB working (if it isn't already), run the following commands:
cd helm/kustomize
kubectl apply -k .
To see IPs:
k get svc -A
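On a busy cluster the full service list is long, so it can help to narrow it to the services that actually received an external IP. The sketch below filters the plain-text output with awk and assumes the default column order of kubectl get svc -A (NAMESPACE, NAME, TYPE, CLUSTER-IP, EXTERNAL-IP, PORT(S), AGE); the function name lb_ips is hypothetical.

```shell
# Read `kubectl get svc -A` output on stdin and print "namespace/name external-ip"
# for services of type LoadBalancer. The header row (line 1) is skipped.
lb_ips() {
  awk 'NR > 1 && $3 == "LoadBalancer" { print $1 "/" $2, $5 }'
}

# Example: kubectl get svc -A | lb_ips
```

A service whose EXTERNAL-IP column shows a pending state instead of an address has not yet been assigned an IP.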
Slurm is configured using the Slurm Helm Chart. To pull the submodules for Slurm, run the following command:
git submodule update --init --recursive
Then to install Slurm, run:
cd helm/slurm-k8s-cluster
h install slurm slurm-cluster-chart
And then to apply changes after initial install, run:
h upgrade slurm slurm-cluster-chart
The Slurm GUI is available at https://192.168.100.82
To add a new app or service, find a helm chart and add it to helm/helmfile.yaml. Then run:
cd helm
hf apply
And it will be installed on the cluster.
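As a rough guide to what such an addition looks like, the function below prints a minimal helmfile entry: a chart repository plus a release that uses it. Every name and the URL are placeholders (example-repo, example-app, charts.example.com), and the existing helm/helmfile.yaml may already define repositories and releases that the new entries should be merged into.

```shell
# Print a minimal helmfile.yaml fragment with one repository and one release.
# Assumption: all names and the URL are placeholders, not real charts.
sample_helmfile_entry() {
  cat <<'EOF'
repositories:
  - name: example-repo
    url: https://charts.example.com
releases:
  - name: example-app
    namespace: example
    chart: example-repo/example-app
EOF
}

sample_helmfile_entry
```

The chart value takes the form repository-name/chart-name, so it must match a name listed under repositories.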