Infinite loop under dynamic config situation #235
Comments
Yep, it's fine to add those dependencies. From this comment, you can see we've had to fix a (simpler) form of this issue in the past: Terraform (single-threaded at the time) could run the ssh connection step before the matchbox setup. This is a case where Terraform's ssh can't fast-fail (it's correctly waiting for nodes to come online).
Hi,
Indeed. But this might allow situations where the controllers need to wait for the workers to get their secrets before starting their own installation (or the reverse, depending on the number of nodes). Let's take this situation as an example: 3 controllers, 10 workers. We have these Terraform resources:
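For illustration, the resources in play would look roughly like this (copy-controller-secrets and copy-worker-secrets are the names used in ssh.tf; the matchbox_group names, variables, and attributes here are my assumptions, not copied from the module):

```hcl
variable "controller_macs" { type = list(string) }
variable "worker_macs" { type = list(string) }

# Matchbox branch: one group per machine; can only be applied once the
# MAC addresses are known. (Names and attributes are illustrative.)
resource "matchbox_group" "controller" {
  count   = 3
  name    = "controller-${count.index}"
  profile = "install"
  selector = {
    mac = var.controller_macs[count.index]
  }
}

resource "matchbox_group" "worker" {
  count   = 10
  name    = "worker-${count.index}"
  profile = "install"
  selector = {
    mac = var.worker_macs[count.index]
  }
}

# SSH branch: loops that wait for each node's sshd before copying secrets.
resource "null_resource" "copy-controller-secrets" {
  count = 3
  # connection and provisioner blocks elided
}

resource "null_resource" "copy-worker-secrets" {
  count = 10
  # connection and provisioner blocks elided
}
```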
If one copy-secrets step has to wait for the other, then a single slow or failed node* on the first side blocks the installation of the whole other side.

*This might be an issue, since some machines abandon the PXE boot after too many tries. I had some cases where I needed to connect to the local console just to hit a key and resume the boot. We would have this graphical representation:
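The diagram itself is not preserved in this record; a rough sketch of the chained scheme under discussion (my reconstruction, not the original figure) could be:

```
matchbox_group.worker (x10) ──▶ copy-worker-secrets (x10) ──▶ copy-controller-secrets (x3)
                                                                        ▲
matchbox_group.controller (x3) ─────────────────────────────────────────┘
```

Here a single worker stuck in its PXE loop delays the controllers' installation as well.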
Another potential issue is that this would increase the time to deploy the Typhoon cluster, since the workers' OS and the controllers' OS (or the reverse) would be deployed sequentially. In particular, a physical box can take up to 5 min for each boot. It becomes even more relevant if, to simplify, we consider a single Terraform thread: Terraform would then deploy every node one after another. These are the reasons why I would recommend completing all the matchbox setup before starting any ssh connection loops.
Appendix: For code readability, in one of our Terraform scripts, we are using a …
You could have a PXE loop, but fair enough. Feel free to include the two explicit dependencies in your PR. Having the ssh loops wait on the full matchbox setup seems reasonable.

Appendix: Dear god, I'd rather not know what you're using it for. Whatever floats your boat I guess. 🙈
Sorry for not having taken the time to PR, and thank you for this fix!
Bug
Under some circumstances, the Typhoon bare-metal module can jump into a "wait forever" loop, if:

- the cluster has more than 10 nodes (more than Terraform's -parallelism setting), and
- the nodes' MAC addresses are provided dynamically, with some delay.
Environment
Problem / Context
Typhoon is used as a module in a larger Terraform deployment. The MAC addresses of the nodes are provided dynamically by another module.
In a real-world use case, the MAC addresses can be provided by another component, such as a machine provisioner that allocates the physical machines.
The Typhoon job graph is composed of 2 main branches:

- the matchbox setup branch: https://github.com/poseidon/typhoon/blob/master/bare-metal/container-linux/kubernetes/groups.tf
- the ssh connection loops branch: https://github.com/poseidon/typhoon/blob/master/bare-metal/container-linux/kubernetes/ssh.tf
Matchbox operations can be achieved as soon as the mac_address values are provided. The ssh connection loops can start without any dependency.

By default, Terraform runs 10 concurrent threads to apply operations, which is governed by the -parallelism option.

Current Behavior
Let's consider a cluster with more than 10 nodes.

If the machine provisioner takes some time to return the MAC addresses (which is the case here), Terraform will allocate its 10 available threads to ssh connection operations, which leads to an infinite waiting loop: with 12 workers plus 1 controller, for example, there are more than 10 ssh connection loops, so every thread ends up busy with one. In the meantime, all the booting nodes are waiting for Matchbox to be configured before they can start CoreOS and then sshd. After some seconds, the machine provisioner returns the MAC addresses needed for a proper matchbox setup, but it's too late: no Terraform thread is available to apply these operations.
Steps to Reproduce
This is a simple way to reproduce the issue, by simulating a delayed machine/MAC address provisioner; a sketch of such a simulation follows.
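The test configuration itself isn't included above; here is a minimal sketch (in current Terraform syntax, with invented placeholder MAC values) of how a delayed provisioner can be simulated using the latency and nodes variables referenced in the commands below:

```hcl
variable "latency" {
  description = "Seconds before the simulated provisioner returns the MAC addresses"
  default     = 0
}

variable "nodes" {
  description = "Number of worker nodes"
  default     = 8
}

# Simulated machine provisioner: its attributes only become readable by
# other resources after the local-exec sleep has finished.
resource "null_resource" "slow_mac_provider" {
  triggers = {
    macs = join(",", [for i in range(var.nodes) : format("52:54:00:00:00:%02x", i)])
  }

  provisioner "local-exec" {
    command = "sleep ${var.latency}"
  }
}

# Deriving the module's MAC inputs from this resource forces the
# matchbox_group resources to wait for the simulated delay.
output "worker_macs" {
  value = split(",", null_resource.slow_mac_provider.triggers.macs)
}
```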
To check whether the matchbox config has been deployed, I run ls /var/lib/matchbox/groups on the matchbox server, and rm /var/lib/matchbox/groups/europe-worker-node-* to clean the config between runs.

terraform apply -var latency=0 -var nodes=8
This test should properly deploy 8 + 1 (the controller node) matchbox configs.

terraform apply -var latency=0 -var nodes=12
This test should properly deploy 12 + 1 matchbox configs.

terraform apply -var latency=15 -var nodes=8
After ~15 sec, this test should properly deploy 8 + 1 matchbox configs.

terraform apply -var latency=15 -var nodes=12
This test should jump into a "wait forever" loop.

Workaround: increase the number of Terraform threads:

terraform apply -var latency=15 -var nodes=12 -parallelism=15
After ~15 sec, this test should properly deploy 12 + 1 matchbox configs.
Unfortunately, this side effect means the Terraform bare-metal module can only be used with a static configuration when there are more than 10 nodes.
Suggested solution

I suggest making sure that all ssh connection operations start only after all matchbox operations have completed.

I propose to update https://github.com/poseidon/typhoon/blob/master/bare-metal/container-linux/kubernetes/ssh.tf by adding a depends_on condition to both copy-controller-secrets and copy-worker-secrets.

Note: the matchbox_group resources are the only ones which actually depend on the nodes' MAC addresses.
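For illustration, the change would look roughly like this (a sketch in current Terraform syntax; the null_resource names come from ssh.tf, while the variable and matchbox_group references are assumptions that may differ from the module):

```hcl
# ssh.tf (sketch): both secret-copying loops now wait for the whole
# matchbox branch before opening any ssh connection.
resource "null_resource" "copy-controller-secrets" {
  count = length(var.controller_names)

  # The matchbox_group resources are the only ones that consume the
  # dynamically provided MAC addresses, so depending on them is enough.
  depends_on = [
    matchbox_group.controller,
    matchbox_group.worker,
  ]

  # connection and provisioner blocks unchanged ...
}

resource "null_resource" "copy-worker-secrets" {
  count = length(var.worker_names)

  depends_on = [
    matchbox_group.controller,
    matchbox_group.worker,
  ]

  # connection and provisioner blocks unchanged ...
}
```

With these dependencies, Terraform cannot schedule any ssh connection loop until every matchbox group exists, so the thread-starvation scenario above can no longer occur.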