
Retry AWS commands that may fail and increase insufficient timeouts #6775

Status: Closed (wanted to merge 2 commits)

Conversation

carlossg
Contributor

AWS has lately been experiencing increasing flakiness, or "eventual consistency", which forces us to retry a lot of operations. Describing the resources until they show as available is not even enough, as a following call to e.g. setTags will still fail because the resource doesn't exist yet.

I have added more debug logging, which could be removed if needed.

@josh-padnick

I'd just like to add that we run automated tests for most of our Terraform templates, and we have also seen the recent "increasing flakiness" to which @carlossg alludes. It made me realize just how many bugs there are in Terraform around AWS eventual consistency. Excited to see this get merged.

@jrnt30
Contributor

jrnt30 commented May 31, 2016

@josh-padnick Are those results something that you could share or create discrete issues per resource where you are experiencing the issues?

@josh-padnick

@jrnt30 I'd be happy to report the specific issues. But I'd prefer to report them in a single issue since managing duplicate issues is already a challenge. For example, here are relevant issues I found from a single search:

Maybe I'll just create one comment in #5335 per error and periodically update the comments to indicate frequency of occurrence. I'm open to other options.

@jrnt30
Contributor

jrnt30 commented May 31, 2016

I understand that; I just think it's easier, from an implementation perspective and for getting PRs merged, if there is a specific resource provider that needs to be adjusted. There are many AWS providers, and given that the API can respond considerably differently (I had an aws_security_group_rule today that took ~3 minutes; the next took sub-second...), I don't know that we will really be able to identify them all together. I think the other "blanket" rule is a good example.

What are the acceptance criteria under which that issue could be closed, if we have so many different resources? I do think a generalized "tracking" issue for eventual consistency would help, though.

ec2err, ok := err.(awserr.Error)
if !ok {
log.Printf("[INFO] RetryableError setting tags InternetGateway ID: %s %s", d.Id(), err)
return resource.RetryableError(err)
Contributor

If it's not an EC2 error, it's a RetryableError? It seems like this should be the if ok block, and the not-ok case should just fall through to the NonRetryableError below, unless I'm mistaken here.

Contributor

Seems like most of your retries follow this pattern. Are you assuming this situation arises outside of AWS, and so should be retried?

Contributor Author

I copied it from other files, IIRC, but I can change the logic.

Contributor

My thinking was along these lines:

If the error is not nil, we check whether it's an AWS error. If it is not an AWS error, we return it as non-retryable. I think that is correct, agree?

@catsby
Contributor

catsby commented Jun 1, 2016

Hey all –

Describing the resources until they show as available is not even enough
as a following call to eg. setTags will fail because the resource doesn't
exist.

Do you have example error/warning output for this? I don't know that I've seen many eventual-consistency issues around setting tags. As an aside, 5 minutes to set tags seems crazy; has anyone seen such times?

I didn't see any in #5335. #6813 does mention it, but it's a fairly large case, I think there is more at play there. Specifically, both subnet and route_table have logic to poll for availability.

@catsby catsby added the waiting-response An issue/pull request is waiting for a response from the community label Jun 1, 2016
@carlossg
Contributor Author

carlossg commented Jun 1, 2016

@catsby the log error is the same as in #6105. That PR didn't fix the issue, as the setTags call was still failing.

The long retries were necessary for a 3-4 day period about 3 weeks ago; since then AWS has returned to normal.

We just don't want to have to retry terraform applies, and having the timeouts hardcoded makes them really difficult to configure and forces us to use longer ones just in case.

@carlossg
Contributor Author

carlossg commented Jun 2, 2016

logs from a couple errors happening again today

15:10:54 Initializing environment terraform
15:11:00 [vpc] aws_vpc.tiger: Creating...
15:11:00 [vpc]   cidr_block:                 "" => "10.16.0.0/16"
15:11:00 [vpc]   default_network_acl_id:     "" => "<computed>"
15:11:00 [vpc]   default_security_group_id:  "" => "<computed>"
15:11:00 [vpc]   dhcp_options_id:            "" => "<computed>"
15:11:00 [vpc]   enable_classiclink:         "" => "<computed>"
15:11:00 [vpc]   enable_dns_hostnames:       "" => "1"
15:11:00 [vpc]   enable_dns_support:         "" => "1"
15:11:00 [vpc]   main_route_table_id:        "" => "<computed>"
15:11:00 [vpc]   tags.#:                     "" => "3"
15:11:00 [vpc]   tags.Name:                  "" => "pse-upgrade"
15:11:00 [vpc]   tags.cloudbees:pse:cluster: "" => "pse-upgrade"
15:11:00 [vpc]   tags.tiger:cluster:         "" => "pse-upgrade"
15:11:00 [vpc] aws_vpc.tiger: Creation complete
15:11:00 [vpc] aws_internet_gateway.tiger: Creating...
15:11:00 [vpc]   tags.#:                     "0" => "3"
15:11:00 [vpc]   tags.Name:                  "" => "pse-upgrade"
15:11:00 [vpc]   tags.cloudbees:pse:cluster: "" => "pse-upgrade"
15:11:00 [vpc]   tags.tiger:cluster:         "" => "pse-upgrade"
15:11:00 [vpc]   vpc_id:                     "" => "vpc-af5b86c8"
15:11:00 [vpc] aws_subnet.tiger: Creating...
15:11:00 [vpc]   availability_zone:          "" => "<computed>"
15:11:00 [vpc]   cidr_block:                 "" => "10.16.0.0/16"
15:11:00 [vpc]   map_public_ip_on_launch:    "" => "1"
15:11:00 [vpc]   tags.#:                     "" => "3"
15:11:00 [vpc]   tags.Name:                  "" => "pse-upgrade"
15:11:00 [vpc]   tags.cloudbees:pse:cluster: "" => "pse-upgrade"
15:11:00 [vpc]   tags.tiger:cluster:         "" => "pse-upgrade"
15:11:00 [vpc]   vpc_id:                     "" => "vpc-af5b86c8"
15:11:00 [vpc] aws_subnet.tiger: Creation complete
15:11:00 [vpc] Error applying plan:
15:11:00 [vpc] 
15:11:00 [vpc] 1 error(s) occurred:
15:11:00 [vpc] 
15:11:00 
[vpc] * aws_internet_gateway.tiger: InvalidInternetGatewayID.NotFound: The internetGateway ID 'igw-6f47020b' does not exist
15:11:00 [vpc]  status code: 400, request id: 
15:11:00 [vpc] 
15:16:06 [vpc] aws_internet_gateway.tiger: Creating...
15:16:06 [vpc] tags.#:                     "0" => "3"
15:16:06 [vpc] tags.Name:                  "" => "pse-integration"
15:16:06 [vpc] tags.cloudbees:pse:cluster: "" => "pse-integration"
15:16:06 [vpc] tags.tiger:cluster:         "" => "pse-integration"
15:16:06 [vpc] vpc_id:                     "" => "vpc-2c55884b"
15:16:07 [vpc] aws_subnet.tiger: Creation complete
15:16:17 [vpc] aws_internet_gateway.tiger: Still creating... (10s elapsed)
15:16:26 [vpc] aws_internet_gateway.tiger: Still creating... (20s elapsed)
15:16:36 [vpc] aws_internet_gateway.tiger: Still creating... (30s elapsed)
15:16:46 [vpc] aws_internet_gateway.tiger: Still creating... (40s elapsed)
15:16:56 [vpc] aws_internet_gateway.tiger: Still creating... (50s elapsed)
15:17:06 [vpc] aws_internet_gateway.tiger: Still creating... (1m0s elapsed)
15:17:06 [vpc] Error applying plan:
15:17:06 [vpc] 
15:17:06 [vpc] 1 error(s) occurred:
15:17:06 [vpc] 
15:17:06 [vpc] * aws_internet_gateway.tiger: Error waiting for internet gateway (igw-5e46033a) to attach: timeout while waiting for state to become '[available]'

@carlossg
Contributor Author

carlossg commented Jun 3, 2016

@catsby I have addressed your comments. I need to run some further tests, but please let me know if the code looks good.

@bkc1

bkc1 commented Jun 22, 2016

#7038

@billcrook

Is there an ETA on this? It is hindering the adoption of Terraform in my organization.

jwbowler added a commit to memsql/terraform that referenced this pull request Jun 24, 2016
Summary:
There are five steps that should happen when we create a security group:

1. Create the group
2. Revoke the default egress rule
3. Add egress rules
4. Add ingress rules
5. Set tags

A Terraform "create" action consists of all 5, and an "update" action consists of 3-5.

This patch makes two changes:

* Wrap steps 2, 3, and 4 in retry logic (1 and 5 are already wrapped)
* Refactor the "create" logic so we don't recheck for SG existence between steps 2 and 3

This revision is based on the branch in the following PR: hashicorp#6775. That branch fixes the internet-gateway flakiness we've been seeing, and also adds retry logic to the "set tags" step for all AWS resources.

Test Plan: make test

Reviewers: hurshal, tyler, areece, carl

Reviewed By: carl

Subscribers: engineering-list

JIRA Issues: AP-499

Differential Revision: https://grizzly.memsql.com/D16013
@brikis98
Contributor

What happened to this PR?

I'm still intermittently seeing The internetGateway ID 'igw-2e6d314a' does not exist errors, which makes running Terraform in an automated setting (e.g. a CI job) very flaky.

@carlossg
Contributor Author

Rebased against master

@jwbowler
Contributor

jwbowler commented Jul 6, 2016

@carlossg I noticed that after the rebase, the timeout on IGAttachStateRefreshFunc changed from 5 minutes to 2 minutes. If you don't mind an anecdote with no evidence: I've been using the old version of your PR, and I've seen Internet Gateways take 2-3 minutes to successfully create, on multiple occasions.

jwbowler added a commit to memsql/terraform that referenced this pull request Jul 6, 2016
@carlossg
Contributor Author

carlossg commented Jul 7, 2016

@jwbowler I have increased it again to 4 minutes

@carlossg
Contributor Author

carlossg commented Aug 8, 2016

Split into #7890 & #7891 for easier digestion.

@carlossg carlossg closed this Aug 8, 2016
@ghost

ghost commented Apr 23, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators Apr 23, 2020
Labels: bug, provider/aws, waiting-response
8 participants