EKS 1.28 creation failure using 19.19.1 -- NodeCreationFailure: Instances failed to join the Kubernetes cluster #2819

Closed
rajali opened this issue Nov 16, 2023 · 10 comments

Comments

rajali commented Nov 16, 2023

Description

I am getting this error on a fresh cluster creation:

Error: waiting for EKS Node Group (###) create: unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: ####: NodeCreationFailure: Instances failed to join the kubernetes cluster

Versions

  • Module version [Required]: 19.19.1

  • Terraform version:
    Terraform v1.6.4
    on linux_amd64

  • Provider version(s):

  • provider registry.terraform.io/hashicorp/aws v5.25.0
  • provider registry.terraform.io/hashicorp/cloudinit v2.3.2
  • provider registry.terraform.io/hashicorp/external v2.3.1
  • provider registry.terraform.io/hashicorp/kubernetes v2.23.0
  • provider registry.terraform.io/hashicorp/local v2.4.0
  • provider registry.terraform.io/hashicorp/null v3.2.1
  • provider registry.terraform.io/hashicorp/template v2.2.0
  • provider registry.terraform.io/hashicorp/time v0.9.1
  • provider registry.terraform.io/hashicorp/tls v4.0.4

Reproduction Code [Required]

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.25.0"
    }
    external = {
      source  = "hashicorp/external"
      version = "~> 2.3.1"
    }
    local = {
      source  = "hashicorp/local"
      version = ">= 2.4.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.23.0"
    }
    template = {
      source  = "hashicorp/template"
      version = "~> 2.2.0"
    }
  }
}

# Configure the AWS Provider
provider "aws" {
  region = "eu-central-1"
  default_tags {
    tags = {
      Name        = local.name
      Environment = "Test"
    }
  }
}

provider "kubernetes" {
  host                   = module.eks.cluster_endpoint
  cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)

  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    command     = "aws"
    # This requires the awscli to be installed locally where Terraform is executed
    args = ["eks", "get-token", "--cluster-name", module.eks.cluster_name]
  }
}



locals {
  name            = "private-only"
  cluster_name    = "eks-managed-${local.name}"
  cluster_version = 1.28
  # iam_role_additional_policies expects a map(string) in module v19.x
  additional_policies = {
    AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
    additional                   = aws_iam_policy.additional.arn
  }

  all_cidr = "0.0.0.0/0"

  vpc_cidr = "10.0.0.0/16"
  azs      = slice(data.aws_availability_zones.available.names, 0, 3)

}

data "aws_default_tags" "tags_all" {}
data "aws_availability_zones" "available" {}
data "aws_caller_identity" "current" {}

data "aws_ami" "eks_default" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amazon-eks-node-${local.cluster_version}-v*"]
  }
}



module "eks" {
  source       = "terraform-aws-modules/eks/aws"
  version      = "19.19.1"
  cluster_name = local.cluster_name
  cluster_tags = {
    Name = local.cluster_name
  }
  cluster_version = local.cluster_version

  cluster_endpoint_private_access = true
  cluster_endpoint_public_access  = false
  cluster_ip_family               = "ipv4"


  enable_irsa                  = true
  iam_role_name                = "${local.cluster_name}-cluster"
  iam_role_additional_policies = local.additional_policies

  cluster_addons = {
    coredns = {
      resolve_conflicts = "OVERWRITE"
      most_recent       = true

      timeouts = {
        create = "25m"
        delete = "10m"
      }
    }
    kube-proxy = {
      resolve_conflicts = "OVERWRITE"
      most_recent       = true
    }
    vpc-cni = {
      resolve_conflicts        = "OVERWRITE"
      most_recent              = true
      before_compute           = true
      service_account_role_arn = module.vpc_cni_irsa.iam_role_arn
      configuration_values = jsonencode({
        env = {
          ENABLE_PREFIX_DELEGATION = "true"
          WARM_PREFIX_TARGET       = "1"
        }
      })
    }
  }


  // default values are used here
  cluster_timeouts = {
    create = "80m"
    update = "80m"
    delete = "80m"
  }

  create_kms_key = false
  cluster_encryption_config = {
    provider_key_arn = module.kms.key_arn
    resources        = ["secrets"]
  }

  cluster_enabled_log_types       = ["api", "audit", "authenticator"]
  create_cloudwatch_log_group     = true
  cloudwatch_log_group_kms_key_id = module.kms.key_arn


  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  //  Set source_node_security_group = true inside rules to set the node_security_group as source
  cluster_security_group_name = "${local.cluster_name}-cluster"
  cluster_security_group_additional_rules = {

    ingress_nodes_ephemeral_ports_tcp = {
      description                = "Nodes on ephemeral ports"
      protocol                   = "tcp"
      from_port                  = 1025
      to_port                    = 65535
      type                       = "ingress"
      source_node_security_group = true
    }

    ingress_source_security_group_id_22 = {
      description              = "Ingress from another computed security group for port 22"
      protocol                 = "tcp"
      from_port                = 22
      to_port                  = 22
      type                     = "ingress"
      source_security_group_id = aws_security_group.additional.id
    }

    ingress_source_security_group_id_443 = {
      description              = "Ingress from another computed security group for port 443"
      protocol                 = "tcp"
      from_port                = 443
      to_port                  = 443
      type                     = "ingress"
      source_security_group_id = aws_security_group.additional.id
    }

    ingress_source_security_group_id_2049 = {
      description              = "Allows inbound NFS traffic from another computed security group"
      protocol                 = "tcp"
      from_port                = 2049
      to_port                  = 2049
      type                     = "ingress"
      source_security_group_id = aws_security_group.additional.id
    }

  }
  cluster_security_group_tags = {
    Name = "${local.cluster_name}-cluster"
  }

  node_security_group_name = "${local.cluster_name}-node"
  // Extend node-to-node security group rules
  node_security_group_additional_rules = {

    ingress_self_all = {
      description = "Node to node all ports/protocols"
      protocol    = "-1"
      from_port   = 0
      to_port     = 0
      type        = "ingress"
      self        = true
    }

    ingress_source_security_group_id_22 = {
      description              = "Ingress from another computed security group"
      protocol                 = "tcp"
      from_port                = 22
      to_port                  = 22
      type                     = "ingress"
      source_security_group_id = aws_security_group.additional.id
    }

    ingress_source_security_group_id_443 = {
      description              = "Ingress from another computed security group two"
      protocol                 = "tcp"
      from_port                = 443
      to_port                  = 443
      type                     = "ingress"
      source_security_group_id = aws_security_group.additional.id
    }

    ingress_source_security_group_id_2049 = {
      description              = "Allows inbound NFS traffic"
      protocol                 = "tcp"
      from_port                = 2049
      to_port                  = 2049
      type                     = "ingress"
      source_security_group_id = aws_security_group.additional.id
    }

    egress_all = {
      description = "Node all egress"
      protocol    = "-1"
      from_port   = 0
      to_port     = 0
      type        = "egress"
      cidr_blocks = [local.all_cidr]
    }
  }

  node_security_group_tags = {
    Name                                          = "${local.cluster_name}-node"
    "kubernetes.io/cluster/${local.cluster_name}" = null
  }


  eks_managed_node_group_defaults = {
    ami_type                   = "AL2_x86_64"
    ami_id                     = data.aws_ami.eks_default.image_id
    instance_types             = ["m6i.large", "m5.large", "m5n.large", "m5zn.large"]
    iam_role_attach_cni_policy = true

    attach_cluster_primary_security_group = false
    vpc_security_group_ids                = [aws_security_group.allow_access_from_lb.id]
    create_iam_role                       = true
    iam_role_additional_policies          = local.additional_policies

    ebs_optimized           = true
    disable_api_termination = false

    enable_bootstrap_user_data = true
    pre_bootstrap_user_data    = <<-EOT
        echo "foo"
        export FOO=bar
      EOT


    update_config = {
      max_unavailable_percentage = 50
    }

    block_device_mappings = {
      xvda = {
        device_name = "/dev/xvda"
        ebs = {
          volume_size           = 50
          volume_type           = "gp3"
          iops                  = 3000
          throughput            = 150
          encrypted             = true
          kms_key_id            = module.kms.key_arn
          delete_on_termination = true
        }
      }
    }

    desired_size = 1
    min_size     = 1
    max_size     = 3

    metadata_options = {
      http_endpoint               = "enabled"
      http_tokens                 = "required"
      http_put_response_hop_limit = 2
      instance_metadata_tags      = "disabled"
    }

  }

  eks_managed_node_groups = {
    on_demand_system_a = {
      name                 = "${local.cluster_name}-on-demand-system-a"
      iam_role_name        = "${local.cluster_name}-on-demand-system-a"
      launch_template_name = "${local.cluster_name}-on-demand-system-a"
      capacity_type        = "ON_DEMAND"



      force_update_version = false

      iam_role_tags = merge(
        data.aws_default_tags.tags_all.tags,
        {
          Name = "${local.cluster_name}-on-demand-system-a"
        }
      )

      launch_template_tags = merge(
        data.aws_default_tags.tags_all.tags,
        {
          Name = "${local.cluster_name}-on-demand-system-a"
        }
      )


    }

    on_demand_system_b = {
      name                 = "eks-${local.cluster_name}-on-demand-system-b"
      iam_role_name        = "eks-${local.cluster_name}-on-demand-system-b"
      launch_template_name = "eks-${local.cluster_name}-on-demand-system-b"
      capacity_type        = "ON_DEMAND"



      force_update_version = false

      iam_role_tags = merge(
        data.aws_default_tags.tags_all.tags,
        {
          Name = "eks-${local.cluster_name}-on-demand-system-b"
        }
      )

      launch_template_tags = merge(
        data.aws_default_tags.tags_all.tags,
        {
          Name = "eks-${local.cluster_name}-on-demand-system-b"
        }
      )


    }
  }


}

################################################################################
# Supporting resources
################################################################################



module "vpc_cni_irsa" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
  version = "5.30.2"

  role_name_prefix      = "${local.cluster_name}-vpc-cni-irsa"
  attach_vpc_cni_policy = true
  vpc_cni_enable_ipv4   = true

  oidc_providers = {
    main = {
      provider_arn               = module.eks.oidc_provider_arn
      namespace_service_accounts = ["kube-system:aws-node"]
    }
  }
}

resource "aws_security_group_rule" "primary_cluster_eks_sg" {
  security_group_id = module.eks.cluster_primary_security_group_id
  type              = "ingress"
  from_port         = 443
  to_port           = 443
  protocol          = "tcp"
  cidr_blocks       = [local.vpc_cidr]
}

resource "aws_security_group" "additional" {
  name_prefix = "${local.cluster_name}-additional"
  vpc_id      = module.vpc.vpc_id
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = [local.vpc_cidr]
  }

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [local.vpc_cidr]
  }

  tags = {
    Name = "${local.cluster_name}-additional"
  }
}

resource "aws_security_group" "allow_access_from_lb" {
  name        = "${local.cluster_name}-allow-access-from-lb"
  description = "Allow full access from LB(s) to worker nodes."
  vpc_id      = module.vpc.vpc_id

  tags = {
    Name = "${local.cluster_name}-allow-access-from-lb"
  }
}

resource "aws_security_group" "replacement_lb" {
  name        = "${local.cluster_name}-replacement-lb"
  description = "Custom SG for LB."
  vpc_id      = module.vpc.vpc_id

  tags = {
    Name = "${local.cluster_name}-replacement-lb"
  }

  ingress {

    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [local.vpc_cidr]
  }
  ingress {
    description = "Allows inbound NFS traffic"
    from_port   = 2049
    to_port     = 2049
    protocol    = "tcp"
    cidr_blocks = [local.vpc_cidr]
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = [local.vpc_cidr]
  }
}
resource "aws_security_group_rule" "allow_access_from_lb" {
  type                     = "ingress"
  security_group_id        = aws_security_group.allow_access_from_lb.id
  source_security_group_id = aws_security_group.replacement_lb.id
  from_port                = 0
  to_port                  = 0
  protocol                 = "-1"
  description              = "Allow full access from the lb(s) to worker nodes."
}

resource "aws_iam_policy" "additional" {
  name = "${local.name}-additional"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = [
          "ec2:Describe*",
        ]
        Effect   = "Allow"
        Resource = "*"
      },
    ]
  })
}

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 4.0"

  name = local.name
  cidr = local.vpc_cidr

  azs             = local.azs
  private_subnets = [for k, v in local.azs : cidrsubnet(local.vpc_cidr, 4, k)]


  enable_nat_gateway = false
  single_nat_gateway = false



  private_subnet_tags = {
    "kubernetes.io/role/internal-elb" = 1
  }
}

module "kms" {
  source  = "terraform-aws-modules/kms/aws"
  version = "~> 1.5"

  aliases               = ["eks/${local.name}"]
  description           = "${local.name} cluster encryption key"
  enable_default_policy = true
  key_owners            = [data.aws_caller_identity.current.arn]
}

data "aws_iam_policy_document" "generic_endpoint_policy" {
  statement {
    effect    = "Deny"
    actions   = ["*"]
    resources = ["*"]

    principals {
      type        = "*"
      identifiers = ["*"]
    }

    condition {
      test     = "StringNotEquals"
      variable = "aws:SourceVpc"

      values = [module.vpc.vpc_id]
    }
  }
}

module "vpc_endpoints" {
  source  = "terraform-aws-modules/vpc/aws//modules/vpc-endpoints"

  vpc_id = module.vpc.vpc_id

  create_security_group      = true
  security_group_name_prefix = "${local.name}-vpc-endpoints-"
  security_group_description = "VPC endpoint security group"
  security_group_rules = {
    ingress_https = {
      description = "HTTPS from VPC"
      cidr_blocks = [module.vpc.vpc_cidr_block]
    }
  }

  endpoints = {
    s3 = {
      service = "s3"
      tags    = { Name = "s3-vpc-endpoint" }
    },
    ecr_api = {
      service             = "ecr.api"
      private_dns_enabled = true
      subnet_ids          = module.vpc.private_subnets
      policy              = data.aws_iam_policy_document.generic_endpoint_policy.json
    },
    ecr_dkr = {
      service             = "ecr.dkr"
      private_dns_enabled = true
      subnet_ids          = module.vpc.private_subnets
      policy              = data.aws_iam_policy_document.generic_endpoint_policy.json
    }
  }


}

Steps to reproduce the behavior:

Run terraform apply

Expected behavior

The cluster should be created, with the nodes showing an "Active" status in the EKS console.

Actual behavior

  • The launch templates, ASGs, and the corresponding EC2 instances are created
  • The nodes fail to join the cluster, and the following error is returned after more than 25 minutes:

Error: waiting for EKS Node Group (###) create: unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: ####: NodeCreationFailure: Instances failed to join the kubernetes cluster

@rajali rajali changed the title EKS 1.28 creation failure using 19.19.1 -- NodeCreationFailure: Instances failed to join the kubernetes cluster EKS 1.28 creation failure using 19.19.1 -- NodeCreationFailure: Instances failed to join the Kubernetes cluster Nov 16, 2023
@trc-ikeskin

We are currently running into this exact same issue. Any updates on this?

@bryantbiggs
Member

I need a reproduction in order to help troubleshoot

rajali commented Nov 20, 2023

@bryantbiggs I have re-added the Terraform code for reproduction. You may need to update the VPC configuration for the vpc module. We do not use terraform-aws-modules/vpc internally, so I couldn't come up with a complete private VPC creation here.

Just to be clear, the same VPC settings have worked in the past with older EKS versions. The VPC has all the right tags, including
"kubernetes.io/cluster/${local.cluster_name}": "shared"

@bryantbiggs
Member

We only need a minimal example that reproduces the issue - are all of those configurations required? Try removing configs; you might find the issue on your own.

rajali commented Nov 20, 2023

Within the eks module, yes, all of the configuration is required.

The only additional configuration above is the use of specific names and tags for the launch templates and other provisioned resources.

I have already gone through multiple deployments, removing a lot of configuration that is not required for the cluster to work.

Now, the only remaining option is to test it without using terraform-aws-modules/eks. A minimal, module-free test could look roughly like the sketch below.
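This is hypothetical and was not actually run; aws_iam_role.node stands in for a node role with the standard AmazonEKSWorkerNodePolicy, AmazonEKS_CNI_Policy, and AmazonEC2ContainerRegistryReadOnly policies attached, which is not shown here:

resource "aws_eks_node_group" "module_free_test" {
  cluster_name    = local.cluster_name
  node_group_name = "${local.cluster_name}-module-free-test"
  node_role_arn   = aws_iam_role.node.arn # hypothetical node role, not defined in the reproduction code
  subnet_ids      = module.vpc.private_subnets

  instance_types = ["m6i.large"]
  capacity_type  = "ON_DEMAND"

  scaling_config {
    desired_size = 1
    min_size     = 1
    max_size     = 1
  }
}

If a bare node group like this joins while the module-managed ones do not, the problem is in the module configuration; if it also fails, the issue is in the surrounding VPC/IAM setup.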

@trc-ikeskin

In our case, the problem was an S3 gateway endpoint that received Bottlerocket's ECR requests through a prefix-list route in another VPC (centralized egress via TGW).

For some weird reason, S3 gateway endpoints only forward traffic for the VPC they are deployed in, so this effectively created a blackhole route for our S3-destined traffic.

I found this by checking the system logs of the EC2 instances. @rajali, maybe it's worth checking in your case too?

So, all in all, not a bug in this module, but still rather interesting (at least as far as I'm concerned).
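For anyone debugging a similar fully private setup: the s3 endpoint block in the reproduction code above does not set route_table_ids, so that gateway endpoint is never attached to the private route tables. Below is a minimal sketch of wiring it up, assuming the same terraform-aws-modules vpc-endpoints submodule; it is not a confirmed fix for this issue.

module "vpc_endpoints" {
  source = "terraform-aws-modules/vpc/aws//modules/vpc-endpoints"

  vpc_id = module.vpc.vpc_id

  endpoints = {
    s3 = {
      service      = "s3"
      service_type = "Gateway"
      # Gateway endpoints attach to route tables, not subnets; without this
      # association, S3-bound traffic (e.g. ECR image layers) from the private
      # subnets has no route to S3 in a VPC with no NAT gateway.
      route_table_ids = module.vpc.private_route_table_ids
      tags            = { Name = "s3-vpc-endpoint" }
    }
  }
}

Keeping the gateway endpoint in the cluster's own VPC and on its own route tables also avoids the cross-VPC prefix-list blackhole described above.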

rajali commented Nov 20, 2023

Thanks @trc-ikeskin, the cluster is in the same VPC as the S3 gateway endpoint.

@bryantbiggs
Member

Closing this issue for now, unless a minimal, reproducible example is supplied that demonstrates the issue.


I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 30, 2023