Infinite loop under dynamic config situation #235

Closed
johnmarcou opened this issue Jun 4, 2018 · 4 comments

Comments

johnmarcou commented Jun 4, 2018

Bug

Under some circumstances, the Typhoon module can get stuck in a "wait forever" loop, if:

  • Typhoon is used as a module in a larger Terraform deployment
  • the MAC addresses are provided dynamically by another Terraform module/resource
  • the cluster size exceeds 10 nodes

Environment

  • Platform: bare-metal
  • OS: container-linux
  • Terraform: Terraform v0.11.7

Problem / Context

Typhoon is used as a module in a global Terraform deployment. The MAC addresses of the nodes are provided dynamically by another module.

In a real-world use case, the MAC addresses can be provided by:

  • a physical machine provisioner Terraform module: picks a node, fires some IPMI commands, and returns the MAC address
  • a virtual machine provisioner Terraform module: creates and starts a VM, and returns the MAC address

The Typhoon job graph is composed of 2 main branches (see the simplified sketch below):

  • Matchbox operations, which can be carried out as soon as the mac_address values are provided.
  • SSH connection loops, which can start without any dependencies.
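
A minimal sketch of the two branches, assuming simplified resource names and arguments (this is not the exact Typhoon code; worker_macs, worker_domains and the kubeconfig variable are placeholders):

# Branch 1: matchbox groups, which need the MAC addresses as input.
resource "matchbox_group" "worker" {
  count   = "${length(var.worker_macs)}"
  name    = "${format("worker-%d", count.index)}"
  profile = "${matchbox_profile.worker.name}"

  selector {
    mac = "${element(var.worker_macs, count.index)}"
  }
}

# Branch 2: SSH secret copies, with no explicit dependency on the groups above,
# so Terraform is free to schedule them first when threads are scarce.
resource "null_resource" "copy-worker-secrets" {
  count = "${length(var.worker_domains)}"

  connection {
    type    = "ssh"
    host    = "${element(var.worker_domains, count.index)}"
    user    = "core"
    timeout = "60m"
  }

  provisioner "file" {
    content     = "${var.kubeconfig}"    # placeholder for the secret being copied
    destination = "$HOME/kubeconfig"
  }
}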

By default, Terraform runs 10 concurrent threads to apply operations, which is governed by the -parallelism option.

Current Behavior

Let's consider a cluster with more than 10 nodes.

If the machine provisioner takes some time to return the MAC addresses (which is the case here), Terraform allocates its 10 available threads to SSH connection operations, which leads to an infinite waiting loop.

Indeed, all the Terraform threads are busy waiting for SSH connections. In the meantime, all booting nodes are waiting for Matchbox to be configured before they can install CoreOS and start sshd.

After a few seconds, the machine provisioner returns the MAC addresses needed for a proper Matchbox setup, but it is too late: there are no Terraform threads left to apply these operations.

Steps to Reproduce

Here is a simple way to reproduce the issue by simulating a delayed machine/MAC address provisioner.

# Providers are setup in a providers.tf file

variable "nodes" {
  description = "Number of worker node to deploy"
  default     = "2"
}

variable "latency" {
  description = "Latency in second before machine provisioner returns"
  default     = "0"
}

# Generate a dynamic list of workers, and render the name
data "null_data_source" "workers" {
  count = "${var.nodes}"

  inputs = {
    name   = "${format("worker-node-%d", count.index + 1)}"
    domain = "${format("worker-node-%d.example.com", count.index + 1)}"
  }
}

# Delayed machine provisioner simulator
resource "null_resource" "mac_address" {
  count = "${var.nodes}"

  provisioner "local-exec" {
    command = "sleep ${var.latency}"
  }
}

module "bare-metal-mercury" {
  source = "git::https://github.com/poseidon/typhoon//bare-metal/container-linux/kubernetes?ref=v1.10.3"

  providers = {
    local    = "local.default"
    null     = "null.default"
    template = "template.default"
    tls      = "tls.default"
  }

  # bare-metal
  cluster_name           = "europe"
  matchbox_http_endpoint = "http://matchbox.example.com"
  os_channel             = "coreos-stable"
  os_version             = "1632.3.0"

  # configuration
  k8s_domain_name    = "node1.example.com"
  ssh_authorized_key = "ssh-rsa AAAAB3Nz..."
  asset_dir          = "."

  # machines
  controller_names   = ["node1"]
  controller_macs    = ["52:54:00:a1:9c:ae"]
  controller_domains = ["node1.example.com"]

  worker_names   = "${data.null_data_source.workers.*.inputs.name}"
  # Note: this value is an ID, resulting in an invalid MAC address format on the
  # matchbox server, but that is not what we are testing at this stage.
  worker_macs    = "${null_resource.mac_address.*.id}"
  worker_domains = "${data.null_data_source.workers.*.inputs.domain}"
}

To check whether the matchbox config has been deployed, I run ls /var/lib/matchbox/groups on the matchbox server, and rm /var/lib/matchbox/groups/europe-worker-node-* to clean up the config between runs.

  1. terraform apply -var latency=0 -var nodes=8
    This test should properly deploy 8 + 1 (the controller node) matchbox configs.

  2. terraform apply -var latency=0 -var nodes=12
    This test should properly deploy 12 + 1 matchbox configs.

  3. terraform apply -var latency=15 -var nodes=8
    After ~15 sec, this test should properly deploy 8 + 1 matchbox configs.

  4. terraform apply -var latency=15 -var nodes=12
    This test should get stuck in a "wait forever" loop.

  5. Workaround: increase the number of Terraform threads:
    terraform apply -var latency=15 -var nodes=12 -parallelism=15
    After ~15 sec, this test should properly deploy 12 + 1 matchbox configs.

Unfortunately, this side effect means the Typhoon bare-metal module can only be used with a static configuration when there are more than 10 nodes.
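
For clarity, a "static configuration" here means passing literal values to the module, e.g. (illustrative names and MAC addresses, not taken from the original report):

  # Static configuration: literal MACs known up front, so the matchbox groups
  # have no dependency on another resource.
  worker_names   = ["worker-node-1", "worker-node-2"]
  worker_macs    = ["52:54:00:b2:2f:86", "52:54:00:c3:61:77"]
  worker_domains = ["worker-node-1.example.com", "worker-node-2.example.com"]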

Suggested solution

I suggest making sure that all SSH connection operations start only after all matchbox operations have completed.

I propose updating https://github.com/poseidon/typhoon/blob/master/bare-metal/container-linux/kubernetes/ssh.tf by adding a depends_on clause to both copy-controller-secrets and copy-worker-secrets.

resource "null_resource" "copy-controller-secrets" {
...
  depends_on = [
    "matchbox_group.install",
    "matchbox_group.controller",
    "matchbox_group.worker",
  ]
...
}
resource "null_resource" "copy-worker-secrets" {
...
  depends_on = [
    "matchbox_group.install",
    "matchbox_group.controller",
    "matchbox_group.worker",
  ]
...
}

Note: these matchbox_group resources are the only ones which actually depend on the nodes' MAC addresses.

johnmarcou changed the title from "Infinite loop under dynamic node clusters situation" to "Infinite loop under dynamic config situation" on Jun 4, 2018

dghubble commented Jun 5, 2018

Yep, it's fine to add depends_on for this case. Thanks for the detailed writeup.

From this comment, you can see we've had to fix a (simpler) form of this issue in the past. Terraform (single-threaded at the time) could run bootkube-start before secrets were placed on machines. The dependency ordering was fixed using depends_on between the copies. Your case is similar and could plausibly occur regardless of whether a user sets the controller_* or worker_* variables dynamically, if 1) you're unlucky and 2) the number of worker nodes exceeds the number of threads Terraform uses.

This is a case where Terraform's SSH wait can't fail fast (it's correctly waiting for nodes to come online). The copy-controller-secrets step could be made to depend directly on matchbox_group.controller, but that would still possibly allow matchbox_group.install to be done last. The proposed depends_on should be the right approach here. 👍 I think you can delete two of the dependencies though (copy-controller-secrets depending on worker, copy-worker-secrets depending on controller).
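
For reference, the reduced dependency set would look roughly like this in ssh.tf (a sketch of the suggestion above, not the merged change):

resource "null_resource" "copy-controller-secrets" {
...
  depends_on = [
    "matchbox_group.install",
    "matchbox_group.controller",
  ]
...
}
resource "null_resource" "copy-worker-secrets" {
...
  depends_on = [
    "matchbox_group.install",
    "matchbox_group.worker",
  ]
...
}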

johnmarcou commented Jun 7, 2018

Hi,
Thanks for your reply.

I think you can delete two of the dependencies though (copy-controller-secrets depending on worker, copy-worker-secrets depending on controller).

Indeed. But this might allow situations where the controllers need to wait until the workers get their secrets before starting their own installation (or the reverse, depending on the number of nodes).

Let's take this situation as an example: 3 controllers, 10 workers. We have these Terraform resources:

  • 3 controller matchbox resources
  • 3 controller SSH resources
  • 10 worker matchbox resources
  • 10 worker SSH resources

If worker SSH depends on worker matchbox, and controller SSH depends on controller matchbox, then there is some probability of this situation:

  • the 13 nodes are booting up
  • Terraform sets up the 10 worker matchbox resources: there is a low (but non-zero) chance that Terraform picks exactly these 10 matchbox resources to apply first, which unlocks the 10 worker SSH resources
  • Terraform waits for the 10 worker SSH connections
  • the 10 worker nodes find their matchbox config and install CoreOS
  • the 3 controller nodes don't find a matchbox config and enter a PXE boot retry loop ... *
  • Terraform connects to the 10 workers and pushes the secrets
  • Terraform sets up the 3 controller matchbox resources
  • Terraform waits for the 3 controller SSH connections
  • the 3 controller nodes find their matchbox config and install CoreOS
  • Terraform connects to the 3 controllers and pushes the secrets

*This might be an issue, since some machines abandon the PXE boot after many tries. I have had cases where I needed to connect to the local console just to hit Enter at the prompt No bootable device found, press Enter to retry.

We would have this graphical representation:

Worker nodes          [Boot]......[Install------]...[Reboot]...[Ready]
Worker Terraform      ..[Matchbox][SSH------------------------][Secrets]
Controllers nodes     [Boot].....................................................[Install------]...[Reboot]...[Ready]
Controllers Terraform .................................................[Matchbox][SSH------------------------][Secrets]

Another potential issue is that this would increase the time to deploy the Typhoon cluster, since the worker OSes and the controller OSes (or the reverse) would be deployed sequentially. In particular, a physical box can take up to 5 minutes for each boot.

This becomes more relevant if, to simplify, we use a single Terraform thread: Terraform would configure matchbox, then SSH, in sequence for each node group (workers, then controllers).

These are the reasons why I would recommend completing all matchbox setup before starting any SSH connection loops.

Worker nodes          [Boot]......[Install------]...[Reboot]...[Ready]
Worker Terraform      ..[Matchbox].[SSH-----------------------][Secrets]
Controllers nodes     [Boot]......[Install------]...[Reboot]...[Ready]
Controllers Terraform ...[Matchbox]....................................[SSH][Secrets]

Appendix: for code readability, in one of our Terraform scripts we use a null_resource as a "synchronization barrier":

# This meta-resource waits for all low-level add-ons to be deployed before continuing.
# Stage 1 add-ons depend on this resource.
resource "null_resource" "addons-stage0" {
  depends_on = [
    "helm_release.metallb",
    "helm_release.external-dns",
    "helm_release.pv-provider",
    "helm_release.ingress-internal",
    "helm_release.ingress-external",
  ]

  provisioner "local-exec" {
    # Wait for low-level components to be ready
    command = "sleep 10"
  }
  provisioner "local-exec" {
    # Wait for low-level components to clean old resources (DNS records, PV...)
    command = "sleep 60"
    when    = "destroy"
  }
}

# Deploy the dashboard add-on after all low-level components are deployed.
# On terraform destroy, the dashboard will be uninstalled,
# but external-dns will still be there (for 60s) to clean the DNS zone.
resource "helm_release" "dashboard" {
  count      = "${local.addons_activation["dashboard"] ? 1 : 0}"
  depends_on = ["null_resource.addons-stage0"]
  name       = "dashboard"
  namespace  = "kube-system"
  chart      = "stable/kubernetes-dashboard"
  version    = "${var.versions["dashboard"]}"
}

dghubble commented Jun 8, 2018

You could have a PXE loop, but fair enough. Feel free to include the two explicit dependencies in your PR. Having the SSH null_resource steps at the end is the intent after all.

Appendix: Dear god, I'd rather not know what you're using it for. Whatever floats your boat I guess. 🙈

johnmarcou commented

Sorry it took me some time to open the PR, and thank you for this fix!
