
[EKS|kops] [NLB stability]: Fixes for kube-controller not available on EKS yet #62

Closed
patrickleet opened this issue Dec 13, 2018 · 18 comments
Labels
EKS Amazon Elastic Kubernetes Service

Comments

@patrickleet

patrickleet commented Dec 13, 2018

Tell us about your request
What do you want us to build?

I want Network Load Balancers to work on EKS without breaking security groups when nodes change, and I want that before 1.13, which includes the fix for kube-controller-manager, because 1.13 is obviously a long way out given that 1.11 was just made available.

Without this, EKS is not usable.

Which service(s) is this request for?
This could be Fargate, ECS, EKS, ECR

EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
What outcome are you trying to achieve, ultimately, and why is it hard/impossible to do right now? What is the impact of not having this problem solved? The more details you can provide, the better we'll be able to understand and solve the problem.

I have to monitor my cluster for new nodes, which break the security group and bring the whole cluster down.

Are you currently working around this issue?
How are you currently solving this problem?

Finding an "unhealthy" node among the NLB targets and modifying that node's security group to re-allow traffic on the health-check port.
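
For reference, a rough sketch of what that looks like with the AWS CLI (the security group ID, port, and CIDR below are placeholders, not values from my cluster):

```bash
# Re-allow the NLB health-check port on the nodes' security group.
# The group ID and port are placeholders; use the node security group ID and
# the health-check port shown for the unhealthy targets in the target group.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 31234 \
  --cidr 10.0.0.0/16   # placeholder: the VPC CIDR the health checks come from
```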

Additional context
Anything else we should know?

It's fixed in 1.13... apparently.

Attachments
If you think you might have additional information that you'd like to include via an attachment, please do - we'll take a look. (Remember to remove any personally-identifiable information.)

Yup, here are some GitHub issue links:
kubernetes/kubernetes#64148
kubernetes/kubernetes#68422

@patrickleet patrickleet added the Proposed Community submitted issue label Dec 13, 2018
@patrickleet patrickleet changed the title [service] [request]: describe request here EKS NLB stability: Fixes for kube-controller not available on EKS yet Dec 13, 2018
@patrickleet
Author

I've seen users report this on kops as well.

@patrickleet patrickleet changed the title EKS NLB stability: Fixes for kube-controller not available on EKS yet [EKS] [NLB stability]: Fixes for kube-controller not available on EKS yet Dec 13, 2018
@mjhoffman65

I'm running into this issue as well and am looking for a way to fix it

@patrickleet
Author

The thread in Kubernetes Slack (linked above) has a workaround for now: just fix the security group.

@patrickleet
Author

patrickleet commented Dec 17, 2018

For convenience:

@nukepuppy

nukepuppy [Dec 11th at 10:25 AM]
this is for those experiencing issues in AWS where the nodes behind the NLB go `unhealthy` for no reason other than nodes going in and out of the target group


nukepuppy [6 days ago]
for anyone interested.. the temporary workaround (it's fixed in 1.13/1.14 depending on the k8s release schedule).. in the meantime, if you have an NLB with unhealthy nodes, the reason is a bug in kube-controller-manager.. it does not properly update the security group attached to the NODES (there is no security group on NLBs), so the port the NLB uses for health checks isn't allowed.. you can add that port to the nodes' security group as a quick fix to get your nodes healthy again.. (edited)


patrickleet [6 days ago]
thanks for finding this

patrickleet [6 days ago]
wonder when eks will support 1.13 (edited)

James Strachan [6 days ago]
:+1:

patrickleet [6 days ago]
I bet this is why my SSL wasn't working the other day - I modified the NLB to allow traffic for port 443

patrickleet [6 days ago]
that probably changed the security group, and then everything stopped working, and I blamed SSL

patrickleet [6 days ago]
... maybe?


nukepuppy [6 days ago]
likely yes lol

nukepuppy [6 days ago]
since every time you add/remove a node.. you risk the kube-controller-manager deciding to overwrite what's there

nukepuppy [6 days ago]
and if you had touched the security group it's all using (for the nodes)


nukepuppy [6 days ago]
it might just go crazy and wipe it


patrickleet [6 days ago]
@nukepuppy if you have it handy in the AWS UI, any chance you could take a screenshot of which SG it is?


patrickleet [6 days ago]
this also prevents autospotting from working properly


patrickleet [5 days ago]
@Thien Le ^


patrickleet [5 days ago]
does this help


nukepuppy [5 days ago]
@patrickleet it's easy... since NLBs don't get a security group.. it's the security group of the NODES inside the target group.. so literally just go to one of the nodes.. look at its security group... it will have the familiar weird ports for allowing access into the node ports (3000x or whatever)


nukepuppy [5 days ago]
so go to your target group... click one of the unhealthy nodes... then look at the security group attached... this is the security group that the kube-controller-manager is managing on the fly for you.. when nodes go in and out


nukepuppy [5 days ago]
on the AWS console: EC2... go to Load Balancers.. locate the Type: network one... that is the one for you.. look in the "Listeners" tab.. it will show a target group.. click the target group.. something like k8s-tgXXXXXXX... then click "Targets".. click on the hyperlinked instance-id there.. the security groups listed there will have something probably identical to your cluster name, like `nodes.YOURSTUFF.kops.k8s.local` or similar.. in there you will find the security group settings...


nukepuppy [5 days ago]
the health check that marks it as unhealthy will have the port that you need to update there


nukepuppy [5 days ago]
I've also done it by just killing the pods one at a time from `kubectl get pods -n kube-system | grep kube-controller-manager`.. just do the other route first to get a feel for it working again..
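
The console walkthrough above can also be done from the AWS CLI; here's a rough sketch (the target group name and instance ID are placeholders):

```bash
# Rough sketch of the console walkthrough above (names below are placeholders).

# 1. Find the target group ARN behind the NLB listener (something like k8s-tgXXXXXXX).
TG_ARN=$(aws elbv2 describe-target-groups \
  --names k8s-tgXXXXXXX \
  --query 'TargetGroups[0].TargetGroupArn' --output text)

# 2. List the unhealthy targets (instance IDs) in that target group.
aws elbv2 describe-target-health \
  --target-group-arn "$TG_ARN" \
  --query 'TargetHealthDescriptions[?TargetHealth.State==`unhealthy`].Target.Id' \
  --output text

# 3. Look up the security groups attached to one of those instances; that's the
#    group kube-controller-manager is managing, and where the health-check port
#    needs to be re-allowed.
aws ec2 describe-instances \
  --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[0].Instances[0].SecurityGroups' --output table
```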

@mjhoffman65

@patrickleet thanks for referencing a fix! I'm seeing this issue due to cluster-autoscaling, so ideally I'd need an automated fix because the cluster scales many times per day.

@patrickleet
Author

patrickleet commented Dec 18, 2018

yea agree - anyone from AWS care to comment?

@patrickleet
Author

guess not

@patrickleet
Author

Some activity on kubernetes/kubernetes#68422

Looks like Kube 1.14 will have the fix, and @M00nF1sh is cherry-picking to 1.11 and 1.13, so it may be available sooner


@tabern tabern added the EKS Amazon Elastic Kubernetes Service label Jan 16, 2019
@patrickleet
Author

@tabern I know I wrote EKS, and it has just been labeled as such, but this affects kops as well. It's an NLB-with-Kubernetes issue more than an EKS-specific one.

@patrickleet patrickleet changed the title [EKS] [NLB stability]: Fixes for kube-controller not available on EKS yet [EKS|kops] [NLB stability]: Fixes for kube-controller not available on EKS yet Jan 17, 2019
@patrickleet
Author

cherry-picked into 1.11 kubernetes/kubernetes#72981

@patrickleet
Author

patrickleet commented Jan 18, 2019

Looks to be fixed in v1.11.6
kubernetes/kubernetes@v1.11.6...release-1.11

Commit Hash 5f4aa7c

So whenever that becomes available through EKS.. :)

@ghost

ghost commented Jan 18, 2019

Assuming this will be a platform update in EKS 1.11, i.e. eks.2?

@patrickleet
Author

any rough timeline on when the next EKS release might be?

@patrickleet
Author

For those on kops, this is released and working according to other users. Still no word on the next EKS release.


kubernetes/kubernetes#64148

@tabern
Contributor

tabern commented Mar 6, 2019

Will be fixed by #188 and #24

@tabern tabern removed the Proposed Community submitted issue label Mar 6, 2019
@tiffanyfay

It should be fixed now. #188

@astrived

astrived commented May 5, 2020

Fixed with #188 and the 1.11.8 platform rollout.

@astrived astrived closed this as completed May 5, 2020