
Staged deployments? Specializations? #254

Open
max06 opened this issue Dec 11, 2024 · 0 comments
max06 commented Dec 11, 2024

Good morning!

I'm currently trying to deploy an etcd cluster on 3 nodes using colmena. It works great, except when handling different lifecycle stages.

To make it a bit more graphical, imagine this:

You have 3 clean, fresh nodes. You use the following Nix config to deploy the cluster:

    services.etcd = {
      enable = true;
      initialClusterState = "new";
      initialCluster = ["node1=http://192.168.27.200:2380" "node2=http://192.168.27.201:2380" "node3=http://192.168.27.202:2380"];
      listenPeerUrls = ["http://0.0.0.0:2380"];
      listenClientUrls = ["http://0.0.0.0:2379"];
      advertiseClientUrls = ["http://${config.hive.ip}:2379"];
      initialAdvertisePeerUrls = ["http://${config.hive.ip}:2380"];
    };

You run this using colmena apply, and you're happy with your shiny new etcd cluster.

Suddenly node1 fails with a broken root disk. You replace the disk and install NixOS with a minimal config, and now you're trying to get your previous config deployed again.

First issue: your cluster isn't new anymore. initialClusterState needs to be changed to existing, otherwise the other two nodes won't accept the new node1. Changing the Nix config by hand isn't a viable option; you don't want your teammates editing code to accommodate different lifecycle stages. My current "hot idea" is using NixOS specializations: a default one for the regular runtime configuration (with existing, to replace failed nodes), and a bootstrap one for... bootstrapping. I just haven't figured out yet how to use them with colmena.
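Something like this is what I have in mind, as a sketch (the bootstrap name and module layout here are arbitrary choices of mine; the specialisation option itself is standard NixOS, but I don't know yet how colmena interacts with it):

    { config, lib, ... }:
    {
      services.etcd = {
        enable = true;
        # Day-2 default: nodes (re)join an already-running cluster.
        initialClusterState = "existing";
        # ...same initialCluster / listen / advertise options as above...
      };

      # Alternative configuration for first-time cluster creation only.
      specialisation.bootstrap.configuration = {
        services.etcd.initialClusterState = lib.mkForce "new";
      };
    }

    # Activating the specialization on a node (outside of colmena):
    #   /run/current-system/specialisation/bootstrap/bin/switch-to-configuration switch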

Second issue: even after setting initialClusterState = "existing", the other two nodes will reject the reinstalled first node. You need to remove the broken member from the cluster and add it again before things work. This is definitely not an issue caused by Nix, but it highlights the problem very well: there's a lot of software out there preferring runtime configuration changes through APIs, CLIs, and more. Declarative configuration? Nobody's got time for that.

This is not meant to be a rant. I'm looking for ideas, best practices, the "nix way" of doing things. Issue 1 is the bigger problem for me right now. The second issue can be worked around using scripts, defined procedures, maybe Ansible: "Before you reinstall that node, run the following command on a functional cluster member: etcdctl member remove...". A lot of documentation, little automation.
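For the second issue, the documented procedure at least fits in a few lines. A sketch, assuming etcdctl v3 and the member name node1 (member IDs and peer URLs depend on your cluster):

    # On a healthy cluster member, before reinstalling node1:
    MEMBER_ID=$(etcdctl member list | awk -F', ' '/node1/ {print $1}')
    etcdctl member remove "$MEMBER_ID"

    # After reinstalling node1, re-register it so the cluster will accept it:
    etcdctl member add node1 --peer-urls=http://192.168.27.200:2380

    # Finally, deploy node1 with initialClusterState = "existing" and start etcd.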

Someone please save me from more headache...
