Spot instance creation times out #8164
Comments
Okay, I think I see what is going on here. The retry timeout that was raised before is irrelevant to the issue I'm seeing, because the AWS API is accepting the spot request:
So the creation loop (which is what checks for the race) finishes: https://github.com/terraform-providers/terraform-provider-aws/blob/master/aws/resource_aws_spot_instance_request.go#L179
Then the code moves on to the next retry block: https://github.com/terraform-providers/terraform-provider-aws/blob/master/aws/resource_aws_spot_instance_request.go#L208
This retry block doesn't tolerate the IAM race, so it fails:
It seems like this can be fixed by retrying on |
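To illustrate why the second block fails, here is a minimal sketch of a state poller that only accepts a fixed set of pending states. It is a simplified, hypothetical model and not the provider's actual wait helper; `waitForState` and `statusSequence` are names made up for this example. Any status outside the pending list, such as `bad-parameters`, ends the wait immediately with an "unexpected state" error.

```go
package main

import (
	"fmt"
)

// waitForState is a simplified, hypothetical poller (not the provider's real
// helper). It calls refresh up to maxPolls times and returns nil once the
// target state is seen. A state that is neither pending nor the target is
// treated as a hard failure.
func waitForState(refresh func() string, pending []string, target string, maxPolls int) error {
	isPending := make(map[string]bool, len(pending))
	for _, s := range pending {
		isPending[s] = true
	}
	for i := 0; i < maxPolls; i++ {
		state := refresh()
		switch {
		case state == target:
			return nil
		case isPending[state]:
			continue // still in progress, poll again
		default:
			return fmt.Errorf("unexpected state '%s', wanted target '%s'", state, target)
		}
	}
	return fmt.Errorf("timed out waiting for target '%s'", target)
}

// statusSequence returns a fake refresh function that replays the given
// states in order and then keeps returning the last one.
func statusSequence(states ...string) func() string {
	i := 0
	return func() string {
		s := states[i]
		if i < len(states)-1 {
			i++
		}
		return s
	}
}

func main() {
	pending := []string{"pending-evaluation", "pending-fulfillment"}
	// The request is accepted, then flips to bad-parameters once the
	// instance profile lookup fails.
	refresh := statusSequence("pending-evaluation", "bad-parameters")
	fmt.Println(waitForState(refresh, pending, "fulfilled", 10))
}
```

In this model, adding `bad-parameters` to the pending list would not help on its own: a failed request never leaves that status, so the poller would simply run until it times out.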
My fix didn't work - it still fails, it just takes the full timeout now (even with a very high, 20-minute timeout set). It's very strange, though, since a number of other spot instances created at the same time do work - only this one never converges.
Fixes hashicorp#8164

Due to IAM eventual consistency, it is possible for a spot instance to be successfully requested and then later return `bad-parameters`, indicating the IAM role does not exist - but the instance never recovers.

```
<DescribeSpotInstanceRequestsResponse xmlns="http://ec2.amazonaws.com/doc/2016-11-15/">
  <state>failed</state>
  <fault>
    <code>InvalidParameterValue</code>
    <message>Value (dcos-terraform-ci-90913f42df22592c-instance_profile) for parameter iamInstanceProfile.name is invalid. Invalid IAM Instance Profile name</message>
  </fault>
  <status>
    <code>bad-parameters</code>
    <updateTime>2019-04-23T20:22:00.000Z</updateTime>
    <message>Your Spot request failed due to bad parameters.</message>
  </status>
...
}) to resolve: unexpected state 'bad-parameters', wanted target 'fulfilled'. last error: %!s(<nil>)
```

This means that retrying the spot instance request is insufficient: even if the request is successfully accepted, it can error later. Because of this, this PR makes it so that if `bad-parameters` is returned, we request a new spot instance and try again.
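The approach described in this commit message could be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the PR: the `spotAPI` interface, `requestUntilFulfilled`, `errBadParameters`, and the `fakeAPI` test double are all hypothetical names, and the sketch assumes the failed request is cancelled before a new one is made.

```go
package main

import (
	"errors"
	"fmt"
)

// errBadParameters models the terminal "bad-parameters" spot request status.
var errBadParameters = errors.New("spot request status: bad-parameters")

// spotAPI is a hypothetical stand-in for the EC2 calls involved; it is not
// the provider's real interface.
type spotAPI interface {
	Request() (id string, err error)
	WaitFulfilled(id string) error
	Cancel(id string) error
}

// requestUntilFulfilled requests a spot instance and waits for fulfilment.
// If the wait ends in bad-parameters, it cancels the failed request and
// issues a fresh one, up to maxAttempts times. Other errors are returned
// immediately.
func requestUntilFulfilled(api spotAPI, maxAttempts int) (string, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		id, err := api.Request()
		if err != nil {
			return "", err
		}
		err = api.WaitFulfilled(id)
		if err == nil {
			return id, nil
		}
		if !errors.Is(err, errBadParameters) {
			return "", err
		}
		lastErr = err
		// Assumption: clean up the failed request before trying again.
		if cerr := api.Cancel(id); cerr != nil {
			return "", fmt.Errorf("cancelling failed request %s: %w", id, cerr)
		}
	}
	return "", fmt.Errorf("gave up after %d attempts: %w", maxAttempts, lastErr)
}

// fakeAPI fails the first failFirst requests with bad-parameters, imitating
// an instance profile that has not propagated yet.
type fakeAPI struct {
	failFirst int
	requests  int
	cancelled []string
}

func (f *fakeAPI) Request() (string, error) {
	f.requests++
	return fmt.Sprintf("sir-%d", f.requests), nil
}

func (f *fakeAPI) WaitFulfilled(id string) error {
	if f.requests <= f.failFirst {
		return errBadParameters
	}
	return nil
}

func (f *fakeAPI) Cancel(id string) error {
	f.cancelled = append(f.cancelled, id)
	return nil
}

func main() {
	api := &fakeAPI{failFirst: 2}
	id, err := requestUntilFulfilled(api, 5)
	fmt.Println(id, err, api.cancelled)
}
```

Bounding the number of attempts matters here because a genuinely invalid instance profile name produces the same `bad-parameters` status as the IAM race, and that case should still fail rather than loop forever.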
@jbarrick-mesosphere I believe I'm hitting the same issue every now and then (once every ~2 weeks maybe) in CI. My spot requests are also configured with IAM and the result we get is the same. I've started looking into this as it was reported that once this happens it's impossible to either apply or destroy anymore - the one time I've managed to look into it was a few hours later and the problem was gone. I think your commit message on the PR linked to this issue explains the issue we're seeing very well - until the spot request is gone from AWS the failed spot instance resource in the state seems to be in a strange status where it kind of exists but returns
HashiCorp folks, is there any plan to merge the PR with the fix, #8556?
…r_ready_timeout hashicorp#8164" This reverts commit 675ffcc.
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
When creating a spot instance in AWS, I am occasionally seeing timeouts:
This was originally reported here: #3554. The issue was closed since a "fix" was merged, but the fix is not really permanent:
Another commenter reported seeing this as well: #3554 (comment)