
Continuation and retries after failures #188

Closed
errordeveloper opened this issue Sep 6, 2018 · 9 comments
Labels
help wanted Extra attention is needed kind/feature New feature or request needs-investigation stale

Comments

@errordeveloper
Contributor

Right now, when we fail to create a cluster, we bail out and expect the user to delete the partially created resources (and eksctl delete cluster handles this okay), but in certain cases (e.g. #185) it'd be perfectly reasonable to let the user continue somehow.

We will certainly need to add continuation eventually for eksctl apply (i.e. Cluster API), but it may be useful to add it already now and let the user retry eksctl create until it succeeds.
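The "retry until it succeeds" half can already be approximated externally today (a hypothetical sketch, not part of eksctl; it just reruns a command and assumes rerunning is safe, which is exactly the guarantee continuation support would have to provide):

```python
import subprocess
import time

def retry(cmd, attempts=3, delay=5):
    """Rerun cmd (a list of argv strings) until it exits 0,
    or give up after the given number of attempts."""
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return True
        print(f"attempt {attempt}/{attempts} failed (exit {result.returncode})")
        time.sleep(delay)
    return False

# e.g. retry(["eksctl", "create", "cluster", "--name", "demo"])
```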

@errordeveloper errordeveloper added kind/feature New feature or request help wanted Extra attention is needed labels Sep 6, 2018
@mumoshu
Contributor

mumoshu commented Nov 30, 2018

@errordeveloper Hey!

As far as I know, CloudFormation consistently fails to update already-failed stacks, saying something like "can't update a stack in state CREATE_FAILED".

What should we do then? Perhaps trigger a deletion of the failed stack after prompting the user?
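That prompt-then-delete decision could be sketched roughly like this (hypothetical: the status strings are real CloudFormation stack statuses, but decide_action and its action names are invented for illustration):

```python
# Hypothetical sketch: map a CloudFormation stack status to the action
# eksctl could take before retrying. The status strings are real
# CloudFormation values; the action names are made up for illustration.

def decide_action(stack_status: str) -> str:
    # Stacks in these terminal failure states cannot be updated; they
    # must be deleted (after prompting the user) and recreated.
    unrecoverable = {"CREATE_FAILED", "ROLLBACK_COMPLETE", "ROLLBACK_FAILED"}
    if stack_status in unrecoverable:
        return "prompt-delete-and-recreate"
    if stack_status.endswith("_IN_PROGRESS"):
        return "wait"
    if stack_status in {"CREATE_COMPLETE", "UPDATE_COMPLETE"}:
        return "continue"
    return "manual-intervention"
```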

@mumoshu
Contributor

mumoshu commented Nov 30, 2018

You've probably already considered that in #105?

@errordeveloper
Contributor Author

errordeveloper commented Nov 30, 2018 via email

@mumoshu
Contributor

mumoshu commented Nov 30, 2018

I have no specific idea for eksctl, but generally speaking there are various reasons, such as:

  • hitting a resource limit (VPC, EC2 instances, subnets, security groups, etc.)
  • an invalid subnet/security group specification on the launch config (as far as I remember)
  • if you use cfn-signal, a timed-out CFN wait handle also results in a failed stack

@mumoshu
Contributor

mumoshu commented Dec 5, 2018

So I now believe I missed the point.

The granularity of steps to be continued in the scope of this issue would look like the below, right?

  1. create control-plane
  2. create node-group
  3. add/update aws-auth configmap
  4. write/update kubeconfig

Let's say you ran eksctl create cluster and then Ctrl-C'ed between steps 1 and 2. You'd want to continue from step 2.

One idea to achieve this would be to tag the CFN stack to track the progress/status of the creation. That is, a tag named eksctl.cluster.k8s.io/v1alpha1/phase progresses from creating-cluster to creating-node-group, creating-auth-configmap, updating-kubeconfig, and finally created.

Then, a command like eksctl create EXACT_SAME_SET_OF_FLAGS --name CLUSTER_NAME --continue can be invoked to continue the process.

Repeating EXACT_SAME_SET_OF_FLAGS on a continuation would be hard for users. We'd better introduce a declarative spec for EKS clusters (not necessarily Cluster API-based, #19) first, so that continuing would be a matter of just rerunning eksctl apply -f all.yaml. all.yaml would contain a cluster and one or more nodegroup(s).
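The phase-tag continuation above could be sketched like this (hypothetical: the tag key and phase names follow the proposal in this comment, but the remaining_steps helper and its resume semantics — rerunning the phase that was in progress — are assumptions):

```python
# Hypothetical sketch of resuming cluster creation from a phase tag
# stored on the CFN stack (key: eksctl.cluster.k8s.io/v1alpha1/phase).
# Phase names follow the proposal above; the helper is illustrative.

PHASES = [
    "creating-cluster",        # 1. create control plane
    "creating-node-group",     # 2. create node group
    "creating-auth-configmap", # 3. add/update aws-auth configmap
    "updating-kubeconfig",     # 4. write/update kubeconfig
    "created",                 # terminal: nothing left to do
]

def remaining_steps(phase_tag):
    """Return the phases still to run. The recorded phase is assumed
    to be the one in progress when we stopped, so it is rerun."""
    if phase_tag == "created":
        return []
    if phase_tag not in PHASES:
        # Missing or unknown tag: start from the beginning.
        return PHASES[:-1]
    return PHASES[PHASES.index(phase_tag):-1]
```

So a run interrupted between steps 1 and 2 would resume with the node group instead of recreating the control plane — assuming the tag is advanced as each step starts.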

@nithu0115

nithu0115 commented Mar 8, 2019

To add to @mumoshu's reasons for failure: while creating an EKS cluster using eksctl, if another team member uses the (existing) subnets created via eksctl for other resources, say EC2 instances, the CloudFormation stack would fail to delete and would end up in a DELETE_FAILED state because of dependency issues.

@martina-if martina-if changed the title Continuation Continuation and retries after failures Mar 20, 2020
@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the stale label Jan 27, 2021
@github-actions
Contributor

github-actions bot commented Feb 1, 2021

This issue was closed because it has been stalled for 5 days with no activity.

@github-actions github-actions bot closed this as completed Feb 1, 2021
@GaboFDC

GaboFDC commented Mar 18, 2021

Bumping this; it should be an important feature to have.

5 participants