
Continuation and retries after failures #188

Closed
errordeveloper opened this issue Sep 6, 2018 · 9 comments
Labels
help wanted Extra attention is needed kind/feature New feature or request needs-investigation stale

Comments

@errordeveloper
Contributor

Right now, when we fail to create a cluster, we bail out and expect the user to delete the partially created resources (and eksctl delete cluster handles this okay), but in certain cases (e.g. #185) it'd be perfectly reasonable to let the user continue somehow.

We will certainly need to add continuation eventually for eksctl apply (i.e. Cluster API), but it may be useful to add it already now and let the user retry eksctl create until it succeeds.
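The "retry until it succeeds" half can already be approximated externally today (a hypothetical sketch, not part of eksctl; it just reruns a command and assumes rerunning is safe, which is exactly the guarantee continuation support would have to provide):

```python
import subprocess
import time

def retry(cmd, attempts=3, delay=5):
    """Rerun cmd (a list of argv strings) until it exits 0,
    or give up after the given number of attempts."""
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return True
        print(f"attempt {attempt}/{attempts} failed (exit {result.returncode})")
        time.sleep(delay)
    return False

# e.g. retry(["eksctl", "create", "cluster", "--name", "demo"])
```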

@errordeveloper errordeveloper added kind/feature New feature or request help wanted Extra attention is needed labels Sep 6, 2018
@mumoshu
Contributor

mumoshu commented Nov 30, 2018

@errordeveloper Hey!

As far as I know, CloudFormation consistently fails to update already-failed stacks, saying something like "can't update a stack in state CREATE_FAILED".

What should we do then? Perhaps trigger a deletion of the failed stack after prompting the user?
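That prompt-then-delete decision could be sketched roughly like this (hypothetical: the status strings are real CloudFormation stack statuses, but decide_action and its action names are invented for illustration):

```python
# Hypothetical sketch: map a CloudFormation stack status to the action
# eksctl could take before retrying. The status strings are real
# CloudFormation values; the action names are made up for illustration.

def decide_action(stack_status: str) -> str:
    # Stacks in these terminal failure states cannot be updated; they
    # must be deleted (after prompting the user) and recreated.
    unrecoverable = {"CREATE_FAILED", "ROLLBACK_COMPLETE", "ROLLBACK_FAILED"}
    if stack_status in unrecoverable:
        return "prompt-delete-and-recreate"
    if stack_status.endswith("_IN_PROGRESS"):
        return "wait"
    if stack_status in {"CREATE_COMPLETE", "UPDATE_COMPLETE"}:
        return "continue"
    return "manual-intervention"
```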

@mumoshu
Contributor

mumoshu commented Nov 30, 2018

You've probably already considered that in #105?

@errordeveloper
Contributor Author

errordeveloper commented Nov 30, 2018 via email

@mumoshu
Contributor

mumoshu commented Nov 30, 2018

I have no specific idea for eksctl, but generally speaking there are various reasons, such as:

  • hitting a resource limit (VPC, EC2 instances, subnets, security groups, etc.)
  • an invalid subnet/security group specification on the launch config (as far as I remember)
  • if you use cfn-signal, a timed-out CFN wait handle also results in a failed stack

@mumoshu
Contributor

mumoshu commented Dec 5, 2018

So I now believe I missed the point.

The granularity of steps to be continued in the scope of this issue would look like the below, right?

  1. create control-plane
  2. create node-group
  3. add/update aws-auth configmap
  4. write/update kubeconfig

Let's say you ran eksctl create cluster and then Ctrl-C'ed between steps 1 and 2. You'd want to continue from step 2.

One idea to achieve this would be to tag the CFN stack to track the progress/status of the creation. That is, a tag named eksctl.cluster.k8s.io/v1alpha1/phase progresses from creating-cluster to creating-node-group, creating-auth-configmap, updating-kubeconfig, and finally created.

Then, a command like eksctl create EXACT_SAME_SET_OF_FLAGS --name CLUSTER_NAME --continue can be invoked to continue the process.

Repeating EXACT_SAME_SET_OF_FLAGS on a continuation would be hard for users. We'd better introduce a declarative spec for EKS clusters (not necessarily Cluster API-based, #19) first, so that continuing would be a matter of just rerunning eksctl apply -f all.yaml. all.yaml would contain a cluster and one or more nodegroup(s).
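The phase-tag continuation above could be sketched like this (hypothetical: the tag key and phase names follow the proposal in this comment, but the remaining_steps helper and its resume semantics — rerunning the phase that was in progress — are assumptions):

```python
# Hypothetical sketch of resuming cluster creation from a phase tag
# stored on the CFN stack (key: eksctl.cluster.k8s.io/v1alpha1/phase).
# Phase names follow the proposal above; the helper is illustrative.

PHASES = [
    "creating-cluster",        # 1. create control plane
    "creating-node-group",     # 2. create node group
    "creating-auth-configmap", # 3. add/update aws-auth configmap
    "updating-kubeconfig",     # 4. write/update kubeconfig
    "created",                 # terminal: nothing left to do
]

def remaining_steps(phase_tag):
    """Return the phases still to run. The recorded phase is assumed
    to be the one in progress when we stopped, so it is rerun."""
    if phase_tag == "created":
        return []
    if phase_tag not in PHASES:
        # Missing or unknown tag: start from the beginning.
        return PHASES[:-1]
    return PHASES[PHASES.index(phase_tag):-1]
```

So a run interrupted between steps 1 and 2 would resume with the node group instead of recreating the control plane — assuming the tag is advanced as each step starts.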

@nithu0115

nithu0115 commented Mar 8, 2019

To add to @mumoshu's reasons for failure: while creating an EKS cluster using eksctl, if another team member uses the (existing) subnets created via eksctl for other resources, say EC2 instances, the CloudFormation stack would fail to delete and would end up in a DELETE_FAILED state because of dependency issues.

@martina-if martina-if changed the title Continuation Continuation and retries after failures Mar 20, 2020
@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the stale label Jan 27, 2021
@github-actions
Contributor

github-actions bot commented Feb 1, 2021

This issue was closed because it has been stalled for 5 days with no activity.

@github-actions github-actions bot closed this as completed Feb 1, 2021
@GaboFDC

GaboFDC commented Mar 18, 2021

Bumping this; it should be an important feature to have.

5 participants