Infinite loop under dynamic config situation #235

Closed
johnmarcou opened this issue Jun 4, 2018 · 4 comments

Comments

johnmarcou commented Jun 4, 2018

Bug

Under some circumstances, the Typhoon module can get stuck in a "wait forever" loop, if:

  • Typhoon is used as a module in a larger Terraform deployment
  • the MAC addresses are provided dynamically by another Terraform module/resource
  • the cluster size exceeds 10 nodes

Environment

  • Platform: bare-metal
  • OS: container-linux
  • Terraform: Terraform v0.11.7

Problem / Context

Typhoon is used as a module in a global Terraform deployment. The MAC addresses of the nodes are provided dynamically by another module.

In a real-world use case, the MAC addresses can be provided by:

  • a physical machine provisioner Terraform module: picks a node, fires some IPMI commands, and returns the MAC address
  • a virtual machine provisioner Terraform module: creates and starts a VM, and returns the MAC address

The Typhoon job graph is composed of 2 main branches (see the simplified sketch below):

  • Matchbox operations, which can be carried out as soon as the mac_address values are provided.
  • SSH connection loops, which can start without any dependencies.
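
A minimal sketch of the two branches, assuming simplified resource names and arguments (this is not the exact Typhoon code; worker_macs, worker_domains and the kubeconfig variable are placeholders):

# Branch 1: matchbox groups, which need the MAC addresses as input.
resource "matchbox_group" "worker" {
  count   = "${length(var.worker_macs)}"
  name    = "${format("worker-%d", count.index)}"
  profile = "${matchbox_profile.worker.name}"

  selector {
    mac = "${element(var.worker_macs, count.index)}"
  }
}

# Branch 2: SSH secret copies, with no explicit dependency on the groups above,
# so Terraform is free to schedule them first when threads are scarce.
resource "null_resource" "copy-worker-secrets" {
  count = "${length(var.worker_domains)}"

  connection {
    type    = "ssh"
    host    = "${element(var.worker_domains, count.index)}"
    user    = "core"
    timeout = "60m"
  }

  provisioner "file" {
    content     = "${var.kubeconfig}"    # placeholder for the secret being copied
    destination = "$HOME/kubeconfig"
  }
}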

By default, Terraform runs 10 concurrent threads to apply operations, which is governed by the -parallelism option.

Current Behavior

Let's consider a cluster with more than 10 nodes.

If the machine provisioner takes some time to return the MAC addresses (which is the case here), Terraform allocates its 10 available threads to SSH connection operations, which leads to an infinite waiting loop.

Indeed, all the Terraform threads are busy waiting for SSH connections. In the meantime, all booting nodes are waiting for Matchbox to be configured before they can install CoreOS and start sshd.

After a few seconds, the machine provisioner returns the MAC addresses needed for a proper Matchbox setup, but it is too late: there are no Terraform threads left to apply these operations.

Steps to Reproduce

Here is a simple way to reproduce the issue by simulating a delayed machine/MAC address provisioner.

# Providers are setup in a providers.tf file

variable "nodes" {
  description = "Number of worker node to deploy"
  default     = "2"
}

variable "latency" {
  description = "Latency in second before machine provisioner returns"
  default     = "0"
}

# Generate a dynamic list of workers, and render the name
data "null_data_source" "workers" {
  count = "${var.nodes}"

  inputs = {
    name   = "${format("worker-node-%d", count.index + 1)}"
    domain = "${format("worker-node-%d.example.com", count.index + 1)}"
  }
}

# Delayed machine provisioner simulator
resource "null_resource" "mac_address" {
  count = "${var.nodes}"

  provisioner "local-exec" {
    command = "sleep ${var.latency}"
  }
}

module "bare-metal-mercury" {
  source = "git::https://github.com/poseidon/typhoon//bare-metal/container-linux/kubernetes?ref=v1.10.3"

  providers = {
    local    = "local.default"
    null     = "null.default"
    template = "template.default"
    tls      = "tls.default"
  }

  # bare-metal
  cluster_name           = "europe"
  matchbox_http_endpoint = "http://matchbox.example.com"
  os_channel             = "coreos-stable"
  os_version             = "1632.3.0"

  # configuration
  k8s_domain_name    = "node1.example.com"
  ssh_authorized_key = "ssh-rsa AAAAB3Nz..."
  asset_dir          = "."

  # machines
  controller_names   = ["node1"]
  controller_macs    = ["52:54:00:a1:9c:ae"]
  controller_domains = ["node1.example.com"]

  worker_names   = "${data.null_data_source.workers.*.inputs.name}"
  # Note: this value is an ID, resulting in an invalid MAC address format on the
  # matchbox server, but that is not what we are testing at this stage.
  worker_macs    = "${null_resource.mac_address.*.id}"
  worker_domains = "${data.null_data_source.workers.*.inputs.domain}"
}

To check whether the matchbox config has been deployed, I run ls /var/lib/matchbox/groups on the matchbox server, and rm /var/lib/matchbox/groups/europe-worker-node-* to clean up the config between runs.

  1. terraform apply -var latency=0 -var nodes=8
    This test should properly deploy 8 + 1 (the controller node) matchbox configs.

  2. terraform apply -var latency=0 -var nodes=12
    This test should properly deploy 12 + 1 matchbox configs.

  3. terraform apply -var latency=15 -var nodes=8
    After ~15 sec, this test should properly deploy 8 + 1 matchbox configs.

  4. terraform apply -var latency=15 -var nodes=12
    This test should get stuck in a "wait forever" loop.

  5. Workaround: increase the number of Terraform threads:
    terraform apply -var latency=15 -var nodes=12 -parallelism=15
    After ~15 sec, this test should properly deploy 12 + 1 matchbox configs.

Unfortunately, this side effect means the Typhoon bare-metal module can only be used with a static configuration when there are more than 10 nodes.
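
For clarity, a "static configuration" here means passing literal values to the module, e.g. (illustrative names and MAC addresses, not taken from the original report):

  # Static configuration: literal MACs known up front, so the matchbox groups
  # have no dependency on another resource.
  worker_names   = ["worker-node-1", "worker-node-2"]
  worker_macs    = ["52:54:00:b2:2f:86", "52:54:00:c3:61:77"]
  worker_domains = ["worker-node-1.example.com", "worker-node-2.example.com"]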

Suggested solution

I suggest making sure that all SSH connection operations start only after all matchbox operations have completed.

I propose updating https://github.com/poseidon/typhoon/blob/master/bare-metal/container-linux/kubernetes/ssh.tf by adding a depends_on clause to both copy-controller-secrets and copy-worker-secrets.

resource "null_resource" "copy-controller-secrets" {
...
  depends_on = [
    "matchbox_group.install",
    "matchbox_group.controller",
    "matchbox_group.worker",
  ]
...
}
resource "null_resource" "copy-worker-secrets" {
...
  depends_on = [
    "matchbox_group.install",
    "matchbox_group.controller",
    "matchbox_group.worker",
  ]
...
}

Note: these matchbox_group resources are the only ones which actually depend on the nodes' MAC addresses.

johnmarcou changed the title from "Infinite loop under dynamic node clusters situation" to "Infinite loop under dynamic config situation" on Jun 4, 2018

dghubble commented Jun 5, 2018

Yep, it's fine to add depends_on for this case. Thanks for the detailed writeup.

From this comment, you can see we've had to fix a (simpler) form of this issue in the past. Terraform (single-threaded at the time) could run bootkube-start before secrets were placed on machines. The dependency ordering was fixed using depends_on between the copies. Your case is similar and could plausibly occur regardless of whether a user sets the controller_* or worker_* variables dynamically, if 1) you're unlucky and 2) the number of worker nodes exceeds the number of threads Terraform uses.

This is a case where Terraform's SSH wait can't fail fast (it's correctly waiting for nodes to come online). The copy-controller-secrets step could be made to depend directly on matchbox_group.controller, but that would still possibly allow matchbox_group.install to be done last. The proposed depends_on should be the right approach here. 👍 I think you can delete two of the dependencies though (copy-controller-secrets depending on worker, copy-worker-secrets depending on controller).
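
For reference, the reduced dependency set would look roughly like this in ssh.tf (a sketch of the suggestion above, not the merged change):

resource "null_resource" "copy-controller-secrets" {
...
  depends_on = [
    "matchbox_group.install",
    "matchbox_group.controller",
  ]
...
}
resource "null_resource" "copy-worker-secrets" {
...
  depends_on = [
    "matchbox_group.install",
    "matchbox_group.worker",
  ]
...
}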

johnmarcou commented Jun 7, 2018

Hi,
Thanks for your reply.

I think you can delete two of the dependencies though (copy-controller-secrets depending on worker, copy-worker-secrets depending on controller).

Indeed. But this might allow situations where the controllers need to wait until the workers get their secrets before starting their own installation (or the reverse, depending on the number of nodes).

Let's take this situation as an example: 3 controllers, 10 workers. We have these Terraform resources:

  • 3 controller matchbox resources
  • 3 controller SSH resources
  • 10 worker matchbox resources
  • 10 worker SSH resources

If worker SSH depends on worker matchbox, and controller SSH depends on controller matchbox, then there is some probability of this situation:

  • the 13 nodes are booting up
  • Terraform sets up the 10 worker matchbox resources: there is a low (but non-zero) chance that Terraform picks exactly these 10 matchbox resources to apply first, which unlocks the 10 worker SSH resources
  • Terraform waits for the 10 worker SSH connections
  • the 10 worker nodes find their matchbox config and install CoreOS
  • the 3 controller nodes don't find a matchbox config and enter a PXE boot retry loop ... *
  • Terraform connects to the 10 workers and pushes the secrets
  • Terraform sets up the 3 controller matchbox resources
  • Terraform waits for the 3 controller SSH connections
  • the 3 controller nodes find their matchbox config and install CoreOS
  • Terraform connects to the 3 controllers and pushes the secrets

*This might be an issue, since some machines abandon the PXE boot after many tries. I have had cases where I needed to connect to the local console just to hit Enter at the prompt No bootable device found, press Enter to retry.

We would have this graphical representation:

Worker nodes          [Boot]......[Install------]...[Reboot]...[Ready]
Worker Terraform      ..[Matchbox][SSH------------------------][Secrets]
Controllers nodes     [Boot].....................................................[Install------]...[Reboot]...[Ready]
Controllers Terraform .................................................[Matchbox][SSH------------------------][Secrets]

Another potential issue is that this would increase the time to deploy the Typhoon cluster, since the worker OSes and the controller OSes (or the reverse) would be deployed sequentially. In particular, a physical box can take up to 5 minutes for each boot.

This becomes more relevant if, to simplify, we use a single Terraform thread: Terraform would configure matchbox, then SSH, in sequence for each node group (workers, then controllers).

These are the reasons why I would recommend completing all matchbox setup before starting any SSH connection loops.

Worker nodes          [Boot]......[Install------]...[Reboot]...[Ready]
Worker Terraform      ..[Matchbox].[SSH-----------------------][Secrets]
Controllers nodes     [Boot]......[Install------]...[Reboot]...[Ready]
Controllers Terraform ...[Matchbox]....................................[SSH][Secrets]

Appendix: for code readability, in one of our Terraform scripts we use a null_resource as a "synchronization barrier":

# This meta-resource waits for all low-level add-ons to be deployed before continuing.
# Stage 1 add-ons depend on this resource.
resource "null_resource" "addons-stage0" {
  depends_on = [
    "helm_release.metallb",
    "helm_release.external-dns",
    "helm_release.pv-provider",
    "helm_release.ingress-internal",
    "helm_release.ingress-external",
  ]

  provisioner "local-exec" {
    # Wait for low-level components to be ready
    command = "sleep 10"
  }
  provisioner "local-exec" {
    # Wait for low-level components to clean old resources (DNS records, PV...)
    command = "sleep 60"
    when    = "destroy"
  }
}

# Deploy the dashboard add-on after all low-level components are deployed.
# On terraform destroy, the dashboard will be uninstalled,
# but external-dns will still be there (for 60s) to clean the DNS zone.
resource "helm_release" "dashboard" {
  count      = "${local.addons_activation["dashboard"] ? 1 : 0}"
  depends_on = ["null_resource.addons-stage0"]
  name       = "dashboard"
  namespace  = "kube-system"
  chart      = "stable/kubernetes-dashboard"
  version    = "${var.versions["dashboard"]}"
}

dghubble commented Jun 8, 2018

You could have a PXE loop, but fair enough. Feel free to include the two explicit dependencies in your PR. Having the SSH null_resource steps at the end is the intent after all.

Appendix: Dear god, I'd rather not know what you're using it for. Whatever floats your boat I guess. 🙈

johnmarcou commented

Sorry it took me some time to open the PR, and thank you for this fix!
