
[EKS|kops] [NLB stability]: Fixes for kube-controller not available on EKS yet #62

Closed
patrickleet opened this issue Dec 13, 2018 · 18 comments
Labels
EKS Amazon Elastic Kubernetes Service

Comments

@patrickleet

patrickleet commented Dec 13, 2018

Tell us about your request
What do you want us to build?

I want Network Load Balancers to work on EKS without breaking security groups when nodes change, and I want that before 1.13, which includes the fix for kube-controller-manager, because 1.13 is obviously a long way out given that 1.11 was just made available.

Without this, EKS is not usable.

Which service(s) is this request for?
This could be Fargate, ECS, EKS, ECR

EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
What outcome are you trying to achieve, ultimately, and why is it hard/impossible to do right now? What is the impact of not having this problem solved? The more details you can provide, the better we'll be able to understand and solve the problem.

I have to monitor my cluster for new nodes, which break the security group and bring the whole cluster down.

Are you currently working around this issue?
How are you currently solving this problem?

Finding an "unhealthy" node among the NLB targets and modifying that node's security group to re-allow traffic on the health-check port.
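
For reference, a rough sketch of what that looks like with the AWS CLI (the security group ID, port, and CIDR below are placeholders, not values from my cluster):

```bash
# Re-allow the NLB health-check port on the nodes' security group.
# The group ID and port are placeholders; use the node security group ID and
# the health-check port shown for the unhealthy targets in the target group.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 31234 \
  --cidr 10.0.0.0/16   # placeholder: the VPC CIDR the health checks come from
```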

Additional context
Anything else we should know?

It's fixed in 1.13... apparently.

Attachments
If you think you might have additional information that you'd like to include via an attachment, please do - we'll take a look. (Remember to remove any personally-identifiable information.)

Yup, here are some GitHub issue links:
kubernetes/kubernetes#64148
kubernetes/kubernetes#68422

@patrickleet patrickleet added the Proposed Community submitted issue label Dec 13, 2018
@patrickleet patrickleet changed the title [service] [request]: describe request here EKS NLB stability: Fixes for kube-controller not available on EKS yet Dec 13, 2018
@patrickleet
Author

I've seen users report this on kops as well.

@patrickleet patrickleet changed the title EKS NLB stability: Fixes for kube-controller not available on EKS yet [EKS] [NLB stability]: Fixes for kube-controller not available on EKS yet Dec 13, 2018
@mjhoffman65

I'm running into this issue as well and am looking for a way to fix it

@patrickleet
Author

The thread in Kubernetes Slack (linked above) has a workaround for now: just fix the security group.

@patrickleet
Author

patrickleet commented Dec 17, 2018

For convenience:

@nukepuppy

nukepuppy [Dec 11th at 10:25 AM]
this is for those experiencing issues in AWS where the nodes behind the NLB go `unhealthy` for no reason other than nodes going in and out of the target group


nukepuppy [6 days ago]
for anyone interested.. the temporary workaround (it's fixed in 1.13/1.14 depending on the k8s release schedule).. in the meantime, if you have an NLB with unhealthy nodes, the reason is a bug in kube-controller-manager.. it does not properly update the security group attached to the NODES (there is no security group on NLBs), so the port the NLB uses for health checks isn't allowed.. you can add that port to the nodes' security group as a quick fix to get your nodes healthy again.. (edited)


patrickleet [6 days ago]
thanks for finding this

patrickleet [6 days ago]
wonder when eks will support 1.13 (edited)

James Strachan [6 days ago]
:+1:

patrickleet [6 days ago]
I bet this is why my SSL wasn't working the other day - I modified the NLB to allow traffic for port 443

patrickleet [6 days ago]
that probably changed the security group, and then everything stopped working, and I blamed SSL

patrickleet [6 days ago]
... maybe?


nukepuppy [6 days ago]
likely yes lol

nukepuppy [6 days ago]
since every time you add/remove a node.. you risk the kube-controller-manager deciding to overwrite what's there

nukepuppy [6 days ago]
and if you had touched the security group it's all using (for the nodes)


nukepuppy [6 days ago]
it might just go crazy and wipe it


patrickleet [6 days ago]
@nukepuppy if you have it handy in the AWS UI, any chance you could take a screenshot of which SG it is?


patrickleet [6 days ago]
this also prevents autospotting from working properly


patrickleet [5 days ago]
@Thien Le ^


patrickleet [5 days ago]
does this help


nukepuppy [5 days ago]
@patrickleet it's easy... since NLBs don't get a security group.. it's the security group of the NODES inside the target group.. so literally just go to one of the nodes.. look at its security group... it will have the familiar weird ports for allowing access into the node ports (3000x or whatever)


nukepuppy [5 days ago]
so go to your target group... click one of the unhealthy nodes... then look at the security group attached... this is the security group that the kube-controller-manager is managing on the fly for you.. when nodes go in and out


nukepuppy [5 days ago]
on the AWS console: EC2... go to Load Balancers.. locate the Type: network one... that is the one for you.. look in the "Listeners" tab.. it will show a target group.. click the target group.. something like k8s-tgXXXXXXX... then click "Targets".. click on the hyperlinked instance-id there.. the security groups listed there will have something probably identical to your cluster name, like `nodes.YOURSTUFF.kops.k8s.local` or similar.. in there you will find the security group settings...


nukepuppy [5 days ago]
the health check that marks it as unhealthy will have the port that you need to update there


nukepuppy [5 days ago]
I've also done it by just killing the pods one at a time from `kubectl get pods -n kube-system | grep kube-controller-manager`.. just do the other route first to get a feel for it working again..
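
The console walkthrough above can also be done from the AWS CLI; here's a rough sketch (the target group name and instance ID are placeholders):

```bash
# Rough sketch of the console walkthrough above (names below are placeholders).

# 1. Find the target group ARN behind the NLB listener (something like k8s-tgXXXXXXX).
TG_ARN=$(aws elbv2 describe-target-groups \
  --names k8s-tgXXXXXXX \
  --query 'TargetGroups[0].TargetGroupArn' --output text)

# 2. List the unhealthy targets (instance IDs) in that target group.
aws elbv2 describe-target-health \
  --target-group-arn "$TG_ARN" \
  --query 'TargetHealthDescriptions[?TargetHealth.State==`unhealthy`].Target.Id' \
  --output text

# 3. Look up the security groups attached to one of those instances; that's the
#    group kube-controller-manager is managing, and where the health-check port
#    needs to be re-allowed.
aws ec2 describe-instances \
  --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[0].Instances[0].SecurityGroups' --output table
```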

@mjhoffman65

@patrickleet thanks for referencing a fix! I'm seeing this issue due to cluster-autoscaling, so ideally I'd need an automated fix because the cluster scales many times per day.

@patrickleet
Author

patrickleet commented Dec 18, 2018

yea agree - anyone from AWS care to comment?

@patrickleet
Author

guess not

@patrickleet
Author

Some activity on kubernetes/kubernetes#68422

Looks like Kube 1.14 will have the fix, and @M00nF1sh is cherry-picking to 1.11 and 1.13, so it may be available sooner


@tabern tabern added the EKS Amazon Elastic Kubernetes Service label Jan 16, 2019
@patrickleet
Author

@tabern I know I wrote EKS, and it has just been labeled as such, but this affects kops as well. It's an NLB-with-Kubernetes issue more than an EKS-specific one.

@patrickleet patrickleet changed the title [EKS] [NLB stability]: Fixes for kube-controller not available on EKS yet [EKS|kops] [NLB stability]: Fixes for kube-controller not available on EKS yet Jan 17, 2019
@patrickleet
Author

cherry-picked into 1.11 kubernetes/kubernetes#72981

@patrickleet
Author

patrickleet commented Jan 18, 2019

Looks to be fixed in v1.11.6
kubernetes/kubernetes@v1.11.6...release-1.11

Commit Hash 5f4aa7c

So whenever that becomes available through EKS.. :)

@ghost

ghost commented Jan 18, 2019

Assuming this will be a platform update in EKS 1.11, i.e. eks.2?

@patrickleet
Author

any rough timeline on when the next EKS release might be?

@patrickleet
Author

For those on kops, this is released and working according to other users. Still no word on the next EKS release.


kubernetes/kubernetes#64148

@tabern
Contributor

tabern commented Mar 6, 2019

Will be fixed by #188 and #24

@tabern tabern removed the Proposed Community submitted issue label Mar 6, 2019
@tiffanyfay

It should be fixed now. #188

@astrived

astrived commented May 5, 2020

Fixed with #188 and the 1.11.8 platform rollout.

@astrived astrived closed this as completed May 5, 2020