Conversation

@wmorgan6796
Contributor

PR o'clock

Description

This adds support for the newly announced AWS EKS Managed Node Groups. The allowed configuration options are still fairly limited, but this provides initial support.


@wmorgan6796 wmorgan6796 mentioned this pull request Nov 20, 2019
@wbertelsen
Contributor

wbertelsen commented Nov 21, 2019

I tried this out, and I'm not sure it's compatible with manage_worker_iam_resources: true. I'm getting:

Error: error creating EKS Node Group (redacted:redacted): InvalidParameterException: The provided nodeRole is invalid.
status code: 400, request id:xx

@wmorgan6796
Contributor Author

manage_worker_iam_resources

This is because Managed Node Groups will not work without a role that has certain policies attached; the required policies are listed here: EKS Node IAM Role. I can add a change so that the role is created correctly.
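For reference, the role it expects looks roughly like this (a minimal sketch based on the EKS documentation; the resource names here are illustrative, not the module's):

resource "aws_iam_role" "managed_workers" {
  name = "eks-managed-node-group" # illustrative name

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

# The three managed policies required by the EKS node IAM role documentation.
resource "aws_iam_role_policy_attachment" "workers_eks_worker_node" {
  role       = aws_iam_role.managed_workers.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
}

resource "aws_iam_role_policy_attachment" "workers_cni" {
  role       = aws_iam_role.managed_workers.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
}

resource "aws_iam_role_policy_attachment" "workers_ecr_read_only" {
  role       = aws_iam_role.managed_workers.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
}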

Contributor

@max-rocket-internet max-rocket-internet left a comment


Like it @wmorgan6796 💚

Since this is a very important change and many people, especially more basic users, will likely opt for MNGs in the future, I think it would be good to create an example under examples/managed_node_groups.

This would be a good starting point for beginners and would also ensure that the syntax is covered by the CI checks.

Could you add that also? Make it quite basic, maybe copying roughly what is in the basic example?
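For illustration, something roughly like this (the module source path and node group argument names here are assumptions, borrowing the names discussed later in this thread):

module "eks" {
  source = "../.."

  cluster_name = "managed-node-groups-example" # illustrative
  vpc_id       = module.vpc.vpc_id
  subnets      = module.vpc.private_subnets

  worker_group_managed_node_groups = [
    {
      name                        = "example"
      instance_type               = "m5.large"
      node_group_desired_capacity = 1
      node_group_min_capacity     = 1
      node_group_max_capacity     = 2
    },
  ]
}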

@wmorgan6796
Contributor Author

@max-rocket-internet I have added the example and fixed a bug that @wbertelsen found as well.

@pierresteiner
Contributor

We are also really looking forward to this PR, awesome work.

I don't see anything in the changes regarding the aws-auth config map. I haven't played much with the product, but I was expecting some simplification there, since the control plane already knows about the workers.

@wmorgan6796
Contributor Author

@pierresteiner the aws-auth configmap isn't for the workers; it's for delegating cluster access to IAM roles. Managed Node Groups don't simplify that.

@lcycon

lcycon commented Nov 23, 2019

Looks like there is an issue with the remote_access attribute. By my read of things, it appears that a remote_access block with empty key_name and security group ID list is rewritten to a null remote access block by the AWS API. Because of this, Terraform believes it must recreate the workers on every apply.

One workaround is to omit the remote_access block entirely when it is unused.

@wmorgan6796
Contributor Author

I'm unsure whether it's possible to completely omit a block when it's unused. Maybe it would be a good idea to note this in the docs? This seems more like a bug in either the AWS provider or the AWS API. Any suggested edits would be welcome.

@wmorgan6796 wmorgan6796 reopened this Nov 23, 2019
@lcycon

lcycon commented Nov 23, 2019

I agree this seems like a bug with the provider.

I think one potential workaround for the module is to declare two versions of each worker group (one with the remote access and one that omits it entirely). You may then be able to conditionally choose one over the other (via count trickery) depending on whether a key name is set.
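Another option might be a Terraform 0.12 dynamic block that omits remote_access entirely when no key name is set; a rough sketch using the lookup pattern seen elsewhere in this thread (not necessarily what this PR should do):

# Inside the aws_eks_node_group resource: only render remote_access
# when a key name is actually provided.
dynamic "remote_access" {
  for_each = lookup(each.value, "key_name", "") != "" ? [each.value["key_name"]] : []

  content {
    ec2_ssh_key               = remote_access.value
    source_security_group_ids = lookup(each.value, "source_security_group_ids", [])
  }
}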

@pierresteiner
Contributor

@pierresteiner the aws-auth configmap isn't for the workers; it's for delegating cluster access to IAM roles. Managed Node Groups don't simplify that.

It is actually both: prior to managed node groups, it was also necessary to grant permissions to the worker role so that workers could join the cluster. I don't know if that is still needed (https://docs.aws.amazon.com/eks/latest/userguide/add-user-role.html), though.
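For context, the kind of mapRoles entry that lets a worker role join the cluster looks roughly like this (illustrative only; the module's actual aws-auth handling may differ):

resource "kubernetes_config_map" "aws_auth" {
  metadata {
    name      = "aws-auth"
    namespace = "kube-system"
  }

  data = {
    # Map the worker node role so kubelets can register with the cluster.
    mapRoles = yamlencode([
      {
        rolearn  = aws_iam_role.managed_workers.arn # assumed role reference
        username = "system:node:{{EC2PrivateDNSName}}"
        groups   = ["system:bootstrappers", "system:nodes"]
      },
    ])
  }
}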

@wmorgan6796
Contributor Author

It is actually both: prior to managed node groups, it was also necessary to grant permissions to the worker role so that workers could join the cluster. I don't know if that is still needed (https://docs.aws.amazon.com/eks/latest/userguide/add-user-role.html), though.

Hmm, I wasn't aware of that. I'm hesitant to change too many things as I'm not familiar with the true inner workings of some parts of this module.

@wmorgan6796
Contributor Author

I agree this seems like a bug with the provider.

I think one potential workaround for the module is to declare two versions of each worker group (one with the remote access and one that omits it entirely). You may then be able to conditionally choose one over the other (via count trickery) depending on whether a key name is set.

I think I mitigated this issue in my latest commit: if the SSH key isn't set, source_security_group_ids is automatically set to an empty list, regardless of what the user puts there.
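Roughly the shape of that mitigation (a sketch, not the PR's exact expression):

# If no SSH key is set, force source_security_group_ids to an empty list
# so the remote_access block stays stable between plans.
source_security_group_ids = lookup(each.value, "key_name", "") != "" ? lookup(each.value, "source_security_group_ids", []) : []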

@splieth
Contributor

splieth commented Nov 25, 2019

Shouldn't the worker group also have a create_before_destroy = true on it? I guess right now all nodes would be destroyed before new ones would be created.

Edit:
This would also allow rolling updates on worker groups, which would be awesome!
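In other words, something like this on the node group resource (a sketch with placeholder values, not this PR's code):

resource "aws_eks_node_group" "workers" {
  cluster_name    = aws_eks_cluster.this.name
  node_group_name = "example"
  node_role_arn   = aws_iam_role.managed_workers.arn # assumed role reference
  subnet_ids      = var.subnets

  scaling_config {
    desired_size = 1
    max_size     = 2
    min_size     = 1
  }

  lifecycle {
    # Create the replacement node group before destroying the old one,
    # which also enables rolling replacement of worker groups.
    create_before_destroy = true
  }
}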

}

resource "random_pet" "managed_node_groups" {
  count = local.worker_group_managed_node_group_count
Contributor

@eytanhanig eytanhanig Nov 28, 2019


Every occurrence of count = local.worker_group_managed_node_group_count is a strong candidate for switching to for_each = toset(var.workers_additional_policies).

This will also let you replace lookup(var.worker_group_managed_node_groups[count.index], "name", count.index) with the more elegant each.key["name"].

Contributor Author


Can I replace that lookup though? I would still need to provide a default value.

Contributor


Good point. Given that node_group_name is a required argument for the eks_node_group resource, it seems reasonable to ask that module users specify it along with things like the max size.

Contributor Author

@wmorgan6796 wmorgan6796 Nov 28, 2019


The given "for_each" argument value is unsuitable: "for_each" supports maps
and sets of strings, but you have provided a set containing type object.

Can't use for_each on the random_pets or anything that references that object.

Contributor

@eytanhanig eytanhanig Nov 28, 2019


@wmorgan6796 I forgot that for_each requires that the key it iterates over be a string. Fortunately node groups must have unique names, so we can just create a map that uses that name as the index:

locals {
  node_groups = { for obj in var.worker_group_managed_node_groups : obj.name => obj }
}

Then when using for_each you should be able to do each.value["instance_type"].

Contributor Author


@eytanhanig I can't seem to fix this error when using for_each:

Error: Invalid function argument

  on ../../node_groups.tf line 88, in resource "aws_eks_node_group" "workers":
  88:       [
    |----------------
    | count.index is 0
    | random_pet.node_groups is object with 1 attribute "example"
    | var.node_groups is tuple with 1 element

Invalid value for "list" parameter: element 2: string required.

Can you take a look at it and see what you can find?

Contributor

@eytanhanig eytanhanig Nov 28, 2019


Because you're switching from count to for_each, var.node_groups[count.index] becomes each.value.

Here's an example of how to implement this:

resource "random_pet" "node_groups" {
  for_each = local.node_groups

  separator = "-"
  length    = 2

  keepers = {
    instance_type = lookup(each.value, "instance_type", local.workers_group_defaults["instance_type"])

    ec2_ssh_key = lookup(each.value, "key_name", local.workers_group_defaults["key_name"])

    source_security_group_ids = join("-", compact(
      lookup(each.value, "source_security_group_ids", local.workers_group_defaults["source_security_group_id"])
    ))

    node_group_name = join("-", [var.cluster_name, each.value["name"]])
  }
}

resource "aws_eks_node_group" "workers" {
  for_each = local.node_groups

  node_group_name = join("-", [var.cluster_name, each.key, random_pet.node_groups[each.key].id])

  cluster_name    = var.cluster_name
  node_role_arn   = lookup(each.value, "iam_role_arn", aws_iam_role.node_groups[0].arn)
  subnet_ids      = lookup(each.value, "subnets", local.workers_group_defaults["subnets"])

  scaling_config {
    desired_size = lookup(each.value, "node_group_desired_capacity", local.workers_group_defaults["asg_desired_capacity"])
    max_size     = lookup(each.value, "node_group_max_capacity", local.workers_group_defaults["asg_max_size"])
    min_size     = lookup(each.value, "node_group_min_capacity", local.workers_group_defaults["asg_min_size"])
  }

  ami_type        = lookup(each.value, "ami_type", null)
  disk_size       = lookup(each.value, "root_volume_size", null)
  instance_types  = [lookup(each.value, "instance_type", null)]
  labels          = lookup(each.value, "node_group_k8s_labels", null)
  release_version = lookup(each.value, "ami_release_version", null)

  # This sometimes breaks idempotency as described in https://github.com/terraform-providers/terraform-provider-aws/issues/11063
  remote_access {
    ec2_ssh_key               = lookup(each.value, "key_name", "") != "" ? each.value["key_name"] : null
    source_security_group_ids = lookup(each.value, "key_name", "") != "" ? lookup(each.value, "source_security_group_ids", []) : null
  }

  version = aws_eks_cluster.this.version

  tags = lookup(each.value, "node_group_additional_tags", null)

  lifecycle {
    create_before_destroy = true
  }
}

@wmorgan6796
Contributor Author

@eytanhanig @max-rocket-internet I updated the minimum versions of Terraform so that we can make use of for_each blocks for resources. But the linting job in the GitHub CI only has Terraform 0.12.2. Should that be updated, or should I revert that change?

@eytanhanig
Contributor

@eytanhanig @max-rocket-internet I updated the minimum versions of Terraform so that we can make use of for_each blocks for resources. But the linting job in the GitHub CI only has Terraform 0.12.2. Should that be updated, or should I revert that change?

It appears that you'll need to update the image used in lint.yml.

@max-rocket-internet
Contributor

I updated the minimum versions of Terraform so that we can make use of for_each blocks for resources. But the linting job in the GitHub CI only has Terraform 0.12.2. Should that be updated, or should I revert that change?

Yes, please also update that: https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/.github/workflows/lint.yml#L59

@eytanhanig
Contributor

eytanhanig commented Nov 29, 2019

I believe I've found a bug in the AWS provider resource aws_eks_node_group: idempotency is broken (subsequent applies attempt to create and destroy the node group) when remote_access is used in any of the following ways:

  • remote_access {}
  • remote_access {} with ec2_ssh_key = ""
  • remote_access {} with ec2_ssh_key = null

It does become idempotent when used with a valid SSH key. I've not tested it with permutations of the source_security_group_ids argument.

To account for this, I believe we are limited to the following options:

  • Completely remove the remote_access configuration block.
  • Require the inclusion of a valid SSH key.
  • Wait for a fix to become available before merging this PR.

I've opened a ticket with the AWS provider here: hashicorp/terraform-provider-aws#11063
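For reference, a minimal repro of the non-idempotent case described above (resource names and values are illustrative):

resource "aws_eks_node_group" "repro" {
  cluster_name    = aws_eks_cluster.this.name
  node_group_name = "repro"
  node_role_arn   = aws_iam_role.managed_workers.arn
  subnet_ids      = var.subnets

  scaling_config {
    desired_size = 1
    max_size     = 1
    min_size     = 1
  }

  # With no SSH key, the API returns remote_access as null, so every
  # subsequent plan wants to recreate the node group.
  remote_access {}
}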

node_groups.tf Outdated
aws_eks_cluster.this.name,
lookup(var.node_groups[count.index], "name", count.index),
random_pet.node_groups[count.index].id
random_pet.node_groups[lookup(var.node_groups[count.index], "name", count.index)]


Hello, I'm getting this error:

Error: Invalid function argument

  on .terraform/modules/eks/node_groups.tf line 88, in resource "aws_eks_node_group" "workers":
  88:       [
    |----------------
    | aws_eks_cluster.this.name is "toro-dev2"
    | count.index is 0
    | random_pet.node_groups is object with 1 attribute "standard"
    | var.node_groups is tuple with 1 element

Invalid value for "list" parameter: element 2: string required.

Looks like .id is missing here:

random_pet.node_groups[lookup(var.node_groups[count.index], "name", count.index)].id

Thanks!

Contributor

@eytanhanig eytanhanig Nov 29, 2019


Try this:

resource "random_pet" "node_groups" {
  for_each = local.node_groups

  separator = "-"
  length    = 2

  keepers = {
    instance_type = lookup(each.value, "instance_type", local.workers_group_defaults["instance_type"])

    ec2_ssh_key = lookup(each.value, "key_name", local.workers_group_defaults["key_name"])

    source_security_group_ids = join("-", compact(
      lookup(each.value, "source_security_group_ids", local.workers_group_defaults["source_security_group_id"])
    ))

    node_group_name = join("-", [var.cluster_name, each.value["name"]])
  }
}

resource "aws_eks_node_group" "workers" {
  for_each = local.node_groups

  node_group_name = join("-", [var.cluster_name, each.key, random_pet.node_groups[each.key].id])

  cluster_name    = var.cluster_name
  node_role_arn   = lookup(each.value, "iam_role_arn", aws_iam_role.node_groups[0].arn)
  subnet_ids      = lookup(each.value, "subnets", local.workers_group_defaults["subnets"])

  scaling_config {
    desired_size = lookup(each.value, "node_group_desired_capacity", local.workers_group_defaults["asg_desired_capacity"])
    max_size     = lookup(each.value, "node_group_max_capacity", local.workers_group_defaults["asg_max_size"])
    min_size     = lookup(each.value, "node_group_min_capacity", local.workers_group_defaults["asg_min_size"])
  }

  ami_type        = lookup(each.value, "ami_type", null)
  disk_size       = lookup(each.value, "root_volume_size", null)
  instance_types  = [lookup(each.value, "instance_type", null)]
  labels          = lookup(each.value, "node_group_k8s_labels", null)
  release_version = lookup(each.value, "ami_release_version", null)

  # This sometimes breaks idempotency as described in https://github.com/terraform-providers/terraform-provider-aws/issues/11063
  remote_access {
    ec2_ssh_key               = lookup(each.value, "key_name", "") != "" ? each.value["key_name"] : null
    source_security_group_ids = lookup(each.value, "key_name", "") != "" ? lookup(each.value, "source_security_group_ids", []) : null
  }

  version = aws_eks_cluster.this.version

  tags = lookup(each.value, "node_group_additional_tags", null)

  lifecycle {
    create_before_destroy = true
  }
}

Contributor Author


@eytanhanig I added the changes you suggested and the terraform plan seems to work.

@wmorgan6796
Contributor Author

What's the status of this?

Contributor

@max-rocket-internet max-rocket-internet left a comment


NICE thanks @wmorgan6796!

I fixed a conflict in the changelog and a small fmt issue.

@github-actions

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 20, 2022