Intermittent error using s3 state #4709
At first glance, it appears that maybe the code in the S3 backend in Terraform core is not retrying or retrying enough in that situation, but you would need to see the Terraform debug logging to confirm.
How does one share the debug log when it's got lots of stuff in it that is private (i.e. auth tokens etc.)? Just had a similarish occurrence, and the debug log ended with:
…
Got a log here, also looks like failure to retry
Not entirely sure if this is the correct bit, I'm retaining the log at my end if anyone wants to delve.
This looks related - hashicorp/terraform#16243. Probably just needs some more error cases covered.
We've been experiencing the same issue in our CI pipeline, and have narrowed it down to two problems:
1. ListObjects doesn't retry on any error.
2. GetObject only retries on S3-specific errors.
/cc @bracki
We've also noticed a huge uptick in the number of …
+1 here, hitting lots of this, even with max_retries set to 10 in the aws provider block.
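For reference, the provider-level setting mentioned here looks like the sketch below (region value is an example). As this thread suggests, in these Terraform versions the provider's `max_retries` does not appear to cover the S3 backend's own state operations, which would explain why raising it does not help:

```hcl
provider "aws" {
  region      = "us-east-1" # example value
  max_retries = 10          # retries provider API calls; per this thread, it
                            # does not appear to cover S3 backend state reads
}
```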
Same on v0.11.8, aws provider 1.36.0 w/ S3 backend.
Also seeing this issue intermittently on an EC2 instance in the same region as the s3 bucket. 0.11.8, 1.36.0. Sometimes it is failing to load the main state file from s3 and other times it is failing to load a terraform_remote_state also from s3.
And occasionally states are failing to be uploaded which results in the state file and dynamo lock table getting out of sync and needing to be manually repaired:
The s3 backend code (which seems to be used by the terraform_remote_state data source as well) appears to do some basic retrying, but this EOF issue is either not getting caught by …
Seeing this very often when using terragrunt with the …
Same here. Yesterday was HORRIBLE, with a ton of failures executing terragrunt apply-all across many projects, AWS accounts, and buckets. Things seem improved today, but not completely. We are using terraform 0.11.7 and aws provider 1.37.0. We have backed it down to aws provider 1.33.0 as a precaution. Applies seem faster, but no guarantee it solved anything, because of the intermittent nature of the issue.
For issues relating to version 1.34.0+ of the AWS provider …
It's possible 1.34.0 and Go 1.11 have made this issue worse, but it definitely existed before these versions too, so these are definitely not the root cause.
We turned on DEBUG level logging in terraform, and the most common thing we see is errors like the following. It's fairly consistent but not totally consistent. Usually it fails in the terraform init, but sometimes it gets farther and fails on a subsequent call when it needs to hit sts (or iam) again. The weird thing is that with the DEBUG logging I can see the actual request being made, and I cannot recreate the error by resubmitting it via Postman, even going so far as to update my hosts file to make sure I hit the same IP address. Granted, it's my machine and not the one failing, but same network.
Any news regarding this?
Fairly fed up with this bug, which has been ongoing for our team of ~10 for at least two full months. At times I spend whole days working with Terraform supporting our developers moving out of the co-lo datacenter to AWS, and it's been very frustrating of late. Terraform is often slow to start up, and I get … We are seriously considering migrating remote state S3 -> Consul, but no idea if that would help. It could be that the issue is AWS throttling STS and/or S3.
As an ugly workaround for this issue, we've added automatic retries to Terragrunt for errors known to be intermittent / transient: https://github.com/gruntwork-io/terragrunt#auto-retry
Still happens in 0.11.8.
We recently started seeing the s3 remote state "Error loading state: RequestError: send request failed" error for all buckets that reside in a different region than our terraform deployer (which lives in us-east-1). It's trying to "Get https://REDACTED.s3.us-west-2.amazonaws.com/.......", which fails and has its connection reset. Interestingly enough, when I curl that same bucket from the terraform deployer I get a (35) SSL connect error. When I curl the same bucket without specifying the region in the URL ("REDACTED.s3.amazonaws.com") it curls fine. My working theory is that the way these URLs are generated changed recently, and the region-specific URLs break SSL.
Same issue on OS X, Terraform v0.11.10.
My case is on Terraform v0.11.10, and it is not intermittent: every terraform init command fails with the same error.
Thank you, we just lost 16 upvotes to the original report. "Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request" Yes, this rant is noise too.
I have submitted a pull request upstream, which should resolve this: hashicorp/terraform#19951 |
Hi Folks 👋

The upstream pull request mentioned above, hashicorp/terraform#19951, has been merged and will release with the next Terraform 0.12 release, likely 0.12-beta1. It should enable automatic retries (5 by default) for temporary networking issues and other retry-able API errors in the S3 backend and …

If you have continuing issues with the S3 backend or …

Since there are a lot of varying issues/symptoms reported here, I am going to proactively lock this issue to encourage fresh bug reports for further triage if necessary.
Community Note
Terraform Version
Terraform 0.11.7
aws 1.17.0
Affected Resource(s)
Terraform Configuration Files
(names have been changed to protect the innocent...)
Output
Expected Behavior
Terraform to work as expected, consistently
Actual Behavior
Intermittent error talking to s3
Steps to Reproduce
It's intermittent :(
terraform apply