
Staged deployments? Specializations? #254

Open
max06 opened this issue Dec 11, 2024 · 0 comments
max06 commented Dec 11, 2024

Good morning!

I'm currently trying to deploy an etcd cluster on 3 nodes using colmena. It works great, except when handling different lifecycle stages.

To make it a bit more graphical, imagine this:

You have 3 clean, fresh nodes. You use the following Nix config to deploy the cluster:

    services.etcd = {
      enable = true;
      initialClusterState = "new";
      initialCluster = ["node1=http://192.168.27.200:2380" "node2=http://192.168.27.201:2380" "node3=http://192.168.27.202:2380"];
      listenPeerUrls = ["http://0.0.0.0:2380"];
      listenClientUrls = ["http://0.0.0.0:2379"];
      advertiseClientUrls = ["http://${config.hive.ip}:2379"];
      initialAdvertisePeerUrls = ["http://${config.hive.ip}:2380"];
    };

You run this using colmena apply, and you're happy with your shiny new etcd cluster.

Suddenly node1 fails with a broken root disk. You replace the disk and install NixOS with a minimal config, and now you're trying to get your previous config deployed again.

First issue: your cluster isn't new anymore. initialClusterState needs to be changed to existing, otherwise the other two nodes won't accept the new node1. Changing the Nix config by hand isn't a viable option; you don't want your teammates editing code to accommodate different lifecycle stages. My current "hot idea" is using NixOS specializations: a default one for the regular runtime configuration (with existing, to replace failed nodes), and a bootstrap one for... bootstrapping. I just haven't figured out yet how to use them with colmena.
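Something like this is what I have in mind, as a sketch (the bootstrap name and module layout here are arbitrary choices of mine; the specialisation option itself is standard NixOS, but I don't know yet how colmena interacts with it):

    { config, lib, ... }:
    {
      services.etcd = {
        enable = true;
        # Day-2 default: nodes (re)join an already-running cluster.
        initialClusterState = "existing";
        # ...same initialCluster / listen / advertise options as above...
      };

      # Alternative configuration for first-time cluster creation only.
      specialisation.bootstrap.configuration = {
        services.etcd.initialClusterState = lib.mkForce "new";
      };
    }

    # Activating the specialization on a node (outside of colmena):
    #   /run/current-system/specialisation/bootstrap/bin/switch-to-configuration switch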

Second issue: even after setting initialClusterState = "existing", the other two nodes will reject the reinstalled first node. You need to remove the broken member from the cluster and add it again before things work. This is definitely not an issue caused by Nix, but it highlights the problem very well: there's a lot of software out there preferring runtime configuration changes through APIs, CLIs, and more. Declarative configuration? Nobody's got time for that.

This is not meant to be a rant. I'm looking for ideas, best practices, the "nix way" of doing things. Issue 1 is the bigger problem for me right now. The second issue can be worked around using scripts, defined procedures, maybe Ansible: "Before you reinstall that node, run the following command on a functional cluster member: etcdctl member remove...". A lot of documentation, little automation.
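For the second issue, the documented procedure at least fits in a few lines. A sketch, assuming etcdctl v3 and the member name node1 (member IDs and peer URLs depend on your cluster):

    # On a healthy cluster member, before reinstalling node1:
    MEMBER_ID=$(etcdctl member list | awk -F', ' '/node1/ {print $1}')
    etcdctl member remove "$MEMBER_ID"

    # After reinstalling node1, re-register it so the cluster will accept it:
    etcdctl member add node1 --peer-urls=http://192.168.27.200:2380

    # Finally, deploy node1 with initialClusterState = "existing" and start etcd.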

Someone please save me from more headache...
