
Retry AWS commands that may fail and increase insufficient timeouts #6775

Status: Closed (wanted to merge 2 commits)

Conversation

carlossg
Contributor

AWS has lately been experiencing increasing flakiness, or "eventual consistency", which forces us to retry a lot of operations. Describing the resources until they show as available is not even enough, as a following call to e.g. setTags will still fail because the resource doesn't exist yet.

I have added more debug logging, which could be removed if needed.

@josh-padnick

I'd just like to add that we run automated tests for most of our Terraform templates, and we have also seen the recent "increasing flakiness" to which @carlossg alludes. It made me realize just how many bugs there are in Terraform around AWS eventual consistency. Excited to see this get merged.

@jrnt30
Contributor

jrnt30 commented May 31, 2016

@josh-padnick Are those results something that you could share or create discrete issues per resource where you are experiencing the issues?

@josh-padnick

@jrnt30 I'd be happy to report the specific issues. But I'd prefer to report them in a single issue since managing duplicate issues is already a challenge. For example, here are relevant issues I found from a single search:

Maybe I'll just create one comment in #5335 per error and periodically update the comments to indicate frequency of occurrence. I'm open to other options.

@jrnt30
Contributor

jrnt30 commented May 31, 2016

I understand that; I just think it's easier, from an implementation perspective and for getting PRs merged, if there is a specific resource provider that needs to be adjusted. There are many AWS providers, and given that the API can respond considerably differently (I had an aws_security_group_rule today that took ~3 minutes; the next took sub-second...), I don't know that we will really be able to identify them all together. I think the other "blanket" rule is a good example.

What are the acceptance criteria under which that issue could be closed, if we have so many different resources? I do think a generalized "tracking" issue for eventual consistency would help, though.

ec2err, ok := err.(awserr.Error)
if !ok {
log.Printf("[INFO] RetryableError setting tags InternetGateway ID: %s %s", d.Id(), err)
return resource.RetryableError(err)
Contributor

If it's not an EC2 error, it's a RetryableError? It seems like this should be the if ok block, and the not-ok case should just fall through to the NonRetryableError below, unless I'm mistaken here.

Contributor

Seems like most of your retries follow this pattern. Are you assuming this situation arises outside of AWS, and so should be retried?

Contributor Author

I copied it from other files, IIRC, but I can change the logic.

Contributor

My thinking was along these lines:

If the error is not nil, we check whether it's an AWS error. If it is not an AWS error, we return it as non-retryable. I think that is correct, agree?

@catsby
Contributor

catsby commented Jun 1, 2016

Hey all –

Describing the resources until they show as available is not even enough
as a following call to eg. setTags will fail because the resource doesn't
exist.

Do you have example error/warning output for this? I don't know that I've seen many eventual-consistency issues around setting tags. As an aside, 5 minutes to set tags seems crazy; has anyone seen such times?

I didn't see any in #5335. #6813 does mention it, but it's a fairly large case, I think there is more at play there. Specifically, both subnet and route_table have logic to poll for availability.

@catsby catsby added the waiting-response An issue/pull request is waiting for a response from the community label Jun 1, 2016
@carlossg
Contributor Author

carlossg commented Jun 1, 2016

@catsby the log error is the same as in #6105. That PR didn't fix the issue, as the setTags call was still failing.

The long retries were necessary for a 3-4 day period about 3 weeks ago; since then AWS has returned to normal.

We just don't want to have to retry terraform applies, and having the timeouts hardcoded makes them really difficult to configure and forces us to use longer ones just in case.

@carlossg
Contributor Author

carlossg commented Jun 2, 2016

logs from a couple errors happening again today

15:10:54 Initializing environment terraform
15:11:00 [vpc] aws_vpc.tiger: Creating...
15:11:00 [vpc]   cidr_block:                 "" => "10.16.0.0/16"
15:11:00 [vpc]   default_network_acl_id:     "" => "<computed>"
15:11:00 [vpc]   default_security_group_id:  "" => "<computed>"
15:11:00 [vpc]   dhcp_options_id:            "" => "<computed>"
15:11:00 [vpc]   enable_classiclink:         "" => "<computed>"
15:11:00 [vpc]   enable_dns_hostnames:       "" => "1"
15:11:00 [vpc]   enable_dns_support:         "" => "1"
15:11:00 [vpc]   main_route_table_id:        "" => "<computed>"
15:11:00 [vpc]   tags.#:                     "" => "3"
15:11:00 [vpc]   tags.Name:                  "" => "pse-upgrade"
15:11:00 [vpc]   tags.cloudbees:pse:cluster: "" => "pse-upgrade"
15:11:00 [vpc]   tags.tiger:cluster:         "" => "pse-upgrade"
15:11:00 [vpc] aws_vpc.tiger: Creation complete
15:11:00 [vpc] aws_internet_gateway.tiger: Creating...
15:11:00 [vpc]   tags.#:                     "0" => "3"
15:11:00 [vpc]   tags.Name:                  "" => "pse-upgrade"
15:11:00 [vpc]   tags.cloudbees:pse:cluster: "" => "pse-upgrade"
15:11:00 [vpc]   tags.tiger:cluster:         "" => "pse-upgrade"
15:11:00 [vpc]   vpc_id:                     "" => "vpc-af5b86c8"
15:11:00 [vpc] aws_subnet.tiger: Creating...
15:11:00 [vpc]   availability_zone:          "" => "<computed>"
15:11:00 [vpc]   cidr_block:                 "" => "10.16.0.0/16"
15:11:00 [vpc]   map_public_ip_on_launch:    "" => "1"
15:11:00 [vpc]   tags.#:                     "" => "3"
15:11:00 [vpc]   tags.Name:                  "" => "pse-upgrade"
15:11:00 [vpc]   tags.cloudbees:pse:cluster: "" => "pse-upgrade"
15:11:00 [vpc]   tags.tiger:cluster:         "" => "pse-upgrade"
15:11:00 [vpc]   vpc_id:                     "" => "vpc-af5b86c8"
15:11:00 [vpc] aws_subnet.tiger: Creation complete
15:11:00 [vpc] Error applying plan:
15:11:00 [vpc] 
15:11:00 [vpc] 1 error(s) occurred:
15:11:00 [vpc] 
15:11:00 
[vpc] * aws_internet_gateway.tiger: InvalidInternetGatewayID.NotFound: The internetGateway ID 'igw-6f47020b' does not exist
15:11:00 [vpc]  status code: 400, request id: 
15:11:00 [vpc] 
15:16:06 [vpc] aws_internet_gateway.tiger: Creating...
15:16:06 [vpc] tags.#:                     "0" => "3"
15:16:06 [vpc] tags.Name:                  "" => "pse-integration"
15:16:06 [vpc] tags.cloudbees:pse:cluster: "" => "pse-integration"
15:16:06 [vpc] tags.tiger:cluster:         "" => "pse-integration"
15:16:06 [vpc] vpc_id:                     "" => "vpc-2c55884b"
15:16:07 [vpc] aws_subnet.tiger: Creation complete
15:16:17 [vpc] aws_internet_gateway.tiger: Still creating... (10s elapsed)
15:16:26 [vpc] aws_internet_gateway.tiger: Still creating... (20s elapsed)
15:16:36 [vpc] aws_internet_gateway.tiger: Still creating... (30s elapsed)
15:16:46 [vpc] aws_internet_gateway.tiger: Still creating... (40s elapsed)
15:16:56 [vpc] aws_internet_gateway.tiger: Still creating... (50s elapsed)
15:17:06 [vpc] aws_internet_gateway.tiger: Still creating... (1m0s elapsed)
15:17:06 [vpc] Error applying plan:
15:17:06 [vpc] 
15:17:06 [vpc] 1 error(s) occurred:
15:17:06 [vpc] 
15:17:06 [vpc] * aws_internet_gateway.tiger: Error waiting for internet gateway (igw-5e46033a) to attach: timeout while waiting for state to become '[available]'

@carlossg
Contributor Author

carlossg commented Jun 3, 2016

@catsby I have addressed your comments. I need to run some further tests, but please let me know if the code looks good.

@bkc1

bkc1 commented Jun 22, 2016

#7038

@billcrook

Is there an ETA on this? It is hindering the adoption of Terraform in my organization.

jwbowler added a commit to memsql/terraform that referenced this pull request Jun 24, 2016
Summary:
There are five steps that should happen when we create a security group:

1. Create the group
2. Revoke the default egress rule
3. Add egress rules
4. Add ingress rules
5. Set tags

A Terraform "create" action consists of all 5, and an "update" action consists of 3-5.

This patch makes two changes:

* Wrap steps 2, 3, and 4 in retry logic (1 and 5 are already wrapped)
* Refactor the "create" logic so we don't recheck for SG existence between steps 2 and 3

This revision is based on the branch in the following PR: hashicorp#6775. That branch fixes the internet-gateway flakiness we've been seeing, and also adds retry logic to the "set tags" step for all AWS resources.

Test Plan: make test

Reviewers: hurshal, tyler, areece, carl

Reviewed By: carl

Subscribers: engineering-list

JIRA Issues: AP-499

Differential Revision: https://grizzly.memsql.com/D16013
@brikis98
Contributor

What happened to this PR?

I'm still intermittently seeing The internetGateway ID 'igw-2e6d314a' does not exist errors, which makes running Terraform in an automated setting (e.g. a CI job) very flaky.

@carlossg
Contributor Author

Rebased against master

@jwbowler
Contributor

jwbowler commented Jul 6, 2016

@carlossg I noticed that after the rebase, the timeout on IGAttachStateRefreshFunc changed from 5 minutes to 2 minutes. If you don't mind an anecdote with no evidence: I've been using the old version of your PR, and I've seen Internet Gateways take 2-3 minutes to successfully create, on multiple occasions.

jwbowler added a commit to memsql/terraform that referenced this pull request Jul 6, 2016
@carlossg
Contributor Author

carlossg commented Jul 7, 2016

@jwbowler I have increased it again to 4 minutes

@carlossg
Contributor Author

carlossg commented Aug 8, 2016

Split into #7890 & #7891 for easier digestion.

@carlossg carlossg closed this Aug 8, 2016
@ghost

ghost commented Apr 23, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators Apr 23, 2020
Labels: bug, provider/aws, waiting-response
8 participants