Intermittent error using s3 state #4709

matt-richardson · 2018-05-31T06:27:59Z

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform Version

Terraform 0.11.7
aws 1.17.0

Affected Resource(s)

s3 backend

Terraform Configuration Files

(names have been changed to protect the innocent...)

provider "aws" { 
  region = "${var.Region}"
  version = "~> 1.17.0"
}

terraform {
  backend "s3" {
    profile              = "TheCloud"
    bucket               = "the-cloud-terraform-state"
    key                  = "instance-terraform.tfstate"
    region               = "us-west-2"
    workspace_key_prefix = "instances"
  }
}

# Retrieve state data from S3
data "terraform_remote_state" "state" {
  backend = "s3"

  config {
    profile              = "TheCloud"
    bucket               = "the-cloud-terraform-state"
    key                  = "instance-terraform.tfstate"
    region               = "us-west-2"
    workspace_key_prefix = "instances"
  }
}
...

Output

...
aws_launch_configuration.Instance: Refreshing state... (ID: instance-launch-config-20180531060133610900000001) 
aws_autoscaling_group.Instance: Refreshing state... (ID: Instance-204aaaba-fb33-48e8-88b2-aa190e763b71) 
Error: Error refreshing state: 1 error(s) occurred: 
* data.terraform_remote_state.state: 1 error(s) occurred: 
* data.terraform_remote_state.state: data.terraform_remote_state.state: error loading the remote state: RequestError: send request failed 
caused by: Get https://the-cloud-terraform-state.s3.us-west-2.amazonaws.com/?prefix=instances%2F: dial tcp 54.231.177.36:443: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. 
...

Expected Behavior

Terraform to work as expected, consistently

Actual Behavior

Intermittent error talking to s3

Steps to Reproduce

It's intermittent :(

terraform apply


bflad · 2018-05-31T17:47:52Z

At first glance, it appears that maybe the code in the S3 backend in Terraform core is not retrying or retrying enough in that situation, but you would need to see the Terraform debug logging to confirm.

matt-richardson · 2018-06-05T02:15:47Z

How does one share the debug log when it's got lots of stuff in it that is private (i.e. auth tokens, etc.)?

Just had a similarish occurrence, and the debug log ended with:

2018/06/05 11:13:26 [INFO] command: backend *s3.Backend is not enhanced, wrapping in local
2018/06/05 11:13:26 [DEBUG] [aws-sdk-go] DEBUG: Request s3/ListObjects Details:
---[ REQUEST POST-SIGN ]-----------------------------
GET /?prefix=instances%2F HTTP/1.1
Host: the-cloud-terraform-state.s3.us-west-2.amazonaws.com
User-Agent: aws-sdk-go/1.12.75 (go1.10.1; windows; 386) APN/1.0 HashiCorp/1.0 Terraform/0.11.7
Authorization: XXX
X-Amz-Content-Sha256: XXX
X-Amz-Date: 20180605T011326Z
Accept-Encoding: gzip

-----------------------------------------------------
2018/06/05 11:13:47 [DEBUG] [aws-sdk-go] DEBUG: Send Request s3/ListObjects failed, not retrying, error RequestError: send request failed
caused by: Get https://the-cloud-terraform-state.s3.us-west-2.amazonaws.com/?prefix=instances%2F: dial tcp 54.231.177.36:443: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2018/06/05 11:13:47 [DEBUG] plugin: waiting for all plugin processes to complete...

The "not retrying" looks rather suspicious here.

stopthatastronaut · 2018-06-06T01:37:32Z

Got a log here, also looks like failure to retry

-----------------------------------------------------
2018/06/06 11:15:02 [DEBUG] [aws-sdk-go] DEBUG: Send Request s3/GetObject failed, not retrying, error RequestError: send request failed
caused by: Get https://[bucketnameredacted].s3.us-west-2.amazonaws.com/hosted-instances/hosted-instance-d74bcf9a-9dbd-4dd9-9f4f-a71e2974261b/hostedinstance-terraform.tfstate: dial tcp 54.231.177.36:443: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2018/06/06 11:15:02 [DEBUG] plugin: waiting for all plugin processes to complete...

Not entirely sure if this is the correct bit, I'm retaining the log at my end if anyone wants to delve.

matt-richardson · 2018-06-06T03:36:40Z

This looks related - hashicorp/terraform#16243. Probably just needs some more error cases

mlafeldt · 2018-09-19T09:16:31Z

We've been experiencing the same issue in our CI pipeline, and have narrowed it down to two problems.

1. ListObjects doesn't retry on any error

2018/09/18 14:01:03 [DEBUG] [aws-sdk-go] DEBUG: Send Request s3/ListObjects failed, not retrying, error RequestError: send request failed
caused by: Get https://s3.eu-central-1.amazonaws.com/OUR-BUCKET?prefix=env%3A%2F: EOF
2018/09/18 14:01:03 [DEBUG] plugin: waiting for all plugin processes to complete...
Error loading state: RequestError: send request failed
caused by: Get https://s3.eu-central-1.amazonaws.com/OUR-BUCKET?prefix=env%3A%2F: EOF

See https://github.com/hashicorp/terraform/blob/49d62d3a1b99abf65711f5c8fdf2396931044db3/backend/remote-state/s3/backend_state.go#L29

2. GetObject only retries on S3-specific errors

2018/09/18 15:39:38 [DEBUG] [aws-sdk-go] DEBUG: Send Request s3/GetObject failed, not retrying, error RequestError: send request failed
caused by: Get https://s3.eu-central-1.amazonaws.com/OUR-BUCKET/dev/ae-auth/certificate/terraform.tfstate: EOF

See https://github.com/hashicorp/terraform/blob/49d62d3a1b99abf65711f5c8fdf2396931044db3/backend/remote-state/s3/client.go#L104

/cc @bracki

johnnyplaydrums · 2018-09-21T16:44:07Z

We've also noticed a huge uptick in the number of "RequestError: send request failed" errors when trying to load remote state. Allowing a retry on RequestError should alleviate the issue greatly. I will try to reproduce in debug mode and add the logs here.

lanmalkieri · 2018-09-21T22:13:01Z

+1 here, hitting lots of this, even with max_retries set to 10 in the aws provider block.

Ry-K · 2018-09-25T14:38:26Z

Same on v0.11.8, aws provider 1.36.0 w/ S3 backend.

rifelpet · 2018-09-25T17:47:30Z

Also seeing this issue intermittently on an EC2 instance in the same region as the s3 bucket. 0.11.8, 1.36.0. Sometimes it is failing to load the main state file from s3 and other times it is failing to load a terraform_remote_state also from s3.

$ terraform init
Error refreshing state: RequestError: send request failed
caused by: Get https://my-s3-bucket.s3.us-west-2.amazonaws.com/path/to/terraform.state: EOF

$ terraform plan
* data.terraform_remote_state.my-remote-state: data.terraform_remote_state.my-remote-state: error loading the remote state: RequestError: send request failed
caused by: Get https://my-s3-bucket.s3.us-west-2.amazonaws.com/?prefix=env%3A%2F: EOF

$ terraform plan
* data.terraform_remote_state.my-remote-state: data.terraform_remote_state.my-remote-state: RequestError: send request failed
caused by: Get https://my-s3-bucket.s3.us-west-2.amazonaws.com/path/to/terraform.state: EOF

And occasionally states are failing to be uploaded which results in the state file and dynamo lock table getting out of sync and needing to be manually repaired:

$ terraform apply terraform.plan
Failed to load backend: Error writing state: failed to upload state: RequestError: send request failed
caused by: Put https://my-s3-bucket.s3.us-west-2.amazonaws.com/path/to/terraform.state: EOF

The s3 backend code (that seems to be used by the terraform_remote_state datasource as well) appears to do some basic retrying but this EOF issue is either not getting caught by s3ErrCodeInternalError or is occurring twice in a row. It would be great to make the retry logic more intelligent with an exponential backoff of some sort. If needed I can try to repro this with more verbose logging so we can get the AWS error code.

brikis98 · 2018-09-25T22:32:17Z

Seeing this very often when using terragrunt with the apply-all command that runs several Terraform instances in parallel across different modules. It's so bad that I can't do deployments from my own laptop anymore and have to instead deploy an EC2 instance and git clone my code there. Is it possible AWS/S3 is throttling? Or is there some concurrency issue with Go?

akrapfl · 2018-09-26T18:40:39Z

Same here. Yesterday was HORRIBLE, with a ton of failures executing terragrunt apply-all across many projects, AWS accounts, and buckets. Things seem improved today, but not completely. We are using terraform 0.11.7 and aws provider 1.37.0. We have rolled the aws provider back to 1.33.0 as a precaution. Applies seem faster, but there is no guarantee it solved anything because of the intermittent nature of the issue.

bflad · 2018-09-26T19:05:26Z

For issues relating to version 1.34.0+ of the AWS provider plan/apply being much slower, you might be interested in this thread potentially related to DNS handling and Go 1.11 (v1.34.0 was the first release on Go 1.11): #5822 (comment)

brikis98 · 2018-09-26T19:35:10Z

It's possible 1.34.0 and Go 1.11 have made this issue worse, but it definitely existed before these versions too, so these are definitely not the root cause.

richardgavel · 2018-09-26T19:50:17Z

We turned on DEBUG level logging in terraform, and the most common thing we see is errors like the following. It's fairly consistent but not totally consistent. Usually it fails in the terraform init, but sometimes it gets farther and fails on a subsequent call when it needs to hit sts (or iam) again.

The weird thing is that with the DEBUG logging I can see the actual request being made, and I cannot recreate the error by resubmitting it via Postman, even going so far as to update my hosts file to make sure it uses the same IP address. Granted, it's my machine and not the one failing, but it's on the same network.

[DEBUG] [aws-sdk-go] DEBUG: Send Request sts/GetCallerIdentity failed, not retrying, error RequestError: send request failed caused by: Post https://sts.amazonaws.com/: dial tcp 54.239.29.25:443: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

subzero112233 · 2018-10-04T16:07:56Z

any news regarding this?

holybit · 2018-10-12T15:15:36Z

Fairly fed up with this bug, which has been ongoing for our team of ~10 for at least two full months. At times I spend whole days working with Terraform, supporting our developers moving out of the co-lo datacenter to AWS, and it's been very frustrating of late. Terraform is often slow to start up and I get RequestError: messages, conservatively, 20% of the time.

We are seriously considering migrating remote state from S3 to Consul, but have no idea if that would help. It could be that the issue is AWS throttling STS and/or S3.

brikis98 · 2018-10-12T20:48:46Z

As an ugly workaround for this issue, we've added automatic retries to Terragrunt for errors known to be intermittent / transient: https://github.com/gruntwork-io/terragrunt#auto-retry
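
The general idea behind this kind of workaround, rerunning a command when its output contains a message known to be transient, could be sketched in Go as below. This is not Terragrunt's implementation; the function name and the pattern list are assumptions, with the patterns taken from error messages quoted in this issue.

```go
package main

import "regexp"

// retryablePatterns lists error messages treated as transient.
// Illustrative only: both strings come from errors quoted in this issue.
var retryablePatterns = []*regexp.Regexp{
	regexp.MustCompile(`RequestError: send request failed`),
	regexp.MustCompile(`TLS handshake timeout`),
}

// shouldRetry reports whether the captured command output contains any
// message from retryablePatterns. Hypothetical helper.
func shouldRetry(output string) bool {
	for _, p := range retryablePatterns {
		if p.MatchString(output) {
			return true
		}
	}
	return false
}
```

A wrapper would capture the output of a failed terraform run and rerun the command only when shouldRetry returns true, so genuine configuration errors still fail immediately.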

DrHashi · 2018-10-18T13:42:51Z

Still happens in 0.11.8

djschnei21 · 2018-10-21T16:30:01Z

We recently started seeing the S3 remote state error "Error loading state: RequestError: send request failed" for all buckets that reside in a different region than our Terraform deployer (which lives in us-east-1). It's trying to "Get https://REDACTED.s3.us-west-2.amazonaws.com/......." which fails and has its connection reset. Interestingly enough, when I curl that same bucket from the Terraform deployer I get a (35) SSL connect error. When I curl the same bucket without specifying the region in the URL ("REDACTED.s3.amazonaws.com"), it works fine. My working theory is that the way these URLs are generated has changed recently and the region-specific URLs break SSL?

hbceylan · 2018-11-27T01:07:59Z

same issue OS X Terraform v0.11.10

puzzloholic · 2018-12-05T10:16:36Z

My case is on Terraform v0.11.10 and it is not intermittent. Every terraform init command fails with the same error:

Initializing the backend...
Error loading state: RequestError: send request failed
caused by: Get https://[REDACTED].s3.ap-southeast-1.amazonaws.com/?prefix=env%3A%2F: dial tcp 52.219.36.19:443: connect: connection refused

marcelloromani · 2019-01-06T21:27:44Z

+1 here, hitting lots of this, even with max_retries set to 10 in the aws provider block.

Thank you, we just lost 16 upvotes to the original report.

"Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request"

Yes, this rant is noise too.

bflad · 2019-01-09T18:21:35Z

I have submitted a pull request upstream, which should resolve this: hashicorp/terraform#19951

bflad · 2019-01-11T15:29:02Z

Hi Folks 👋

The upstream pull request mentioned above, hashicorp/terraform#19951, has been merged and will release with the next Terraform 0.12 release, likely 0.12-beta1. It should enable automatic retries (5 by default) for temporary networking issues and other retry-able API errors in the S3 backend and terraform_remote_state data source when using S3.

If you have continuing issues with the S3 backend or terraform_remote_state data source when using S3 after the next Terraform 0.12 release that contains that change, or would like to see the change backported into a Terraform 0.11 release, please file an issue upstream: https://github.com/hashicorp/terraform/issues

Since there are a lot of varying issues/symptoms reported here, I am going to proactively lock this issue to encourage fresh bug reports for further triage if necessary.

bflad added the bug and upstream-terraform labels May 31, 2018

johnnyplaydrums mentioned this issue Sep 25, 2018

Intermittent remote S3 state failure hashicorp/terraform#10779

Closed

vcalmic mentioned this issue Sep 25, 2018

Error configuring the backend "s3": RequestError: [...] TLS handshake timeout tfxor/terrahub#299

Closed

bflad mentioned this issue Jan 9, 2019

backend/s3: Configure AWS Client MaxRetries and provide enhanced S3 NoSuchBucket error message hashicorp/terraform#19951

Merged

bflad added the terraform-0.12 label Jan 11, 2019

bflad closed this as completed Jan 11, 2019

hashicorp locked and limited conversation to collaborators Jan 11, 2019
