Spot instance creation times out #8164

jbarrick-mesosphere · 2019-04-02T22:35:35Z

When creating a spot instance in AWS, I am occasionally seeing timeouts:

[00:14:57]* aws_spot_instance_request.instance.1: Error requesting spot instances: timeout while waiting for state to become 'success' (timeout: 1m0s)

This was originally reported here: #3554. The issue was closed since a "fix" was merged, but the fix is not really permanent:

increase timeout from 15 secs to 1 minute, which matches the timeouts
for other places where we deal with IAM "eventually consistent"
semantics

Another commenter also indicated they were seeing this as well: #3554 (comment)

The text was updated successfully, but these errors were encountered:

…timeout hashicorp#8164

jbarrick-mesosphere · 2019-04-23T23:30:57Z

Okay, I think I see what is going on here.

The retry timeout that was raised before is irrelevant to the issue I'm seeing, because the AWS API is accepting the spot request:

2019-04-23T20:21:59.712Z [DEBUG] plugin.terraform-provider-aws: 2019/04/23 20:21:59 [DEBUG] waiting for spot bid to resolve... this may take several minutes.
2019-04-23T20:21:59.712Z [DEBUG] plugin.terraform-provider-aws: 2019/04/23 20:21:59 [DEBUG] Waiting for state to become: [fulfilled]
2019-04-23T20:21:59.742Z [DEBUG] plugin.terraform-provider-aws: 2019/04/23 20:21:59 [DEBUG] [aws-sdk-go] DEBUG: Response ec2/RequestSpotInstances Details:
...
2019-04-23T20:21:59.742Z [DEBUG] plugin.terraform-provider-aws: -----------------------------------------------------
2019-04-23T20:21:59.742Z [DEBUG] plugin.terraform-provider-aws: 2019/04/23 20:21:59 [DEBUG] [aws-sdk-go] <?xml version="1.0" encoding="UTF-8"?>
2019-04-23T20:21:59.742Z [DEBUG] plugin.terraform-provider-aws: <RequestSpotInstancesResponse xmlns="http://ec2.amazonaws.com/doc/2016-11-15/">
2019-04-23T20:21:59.742Z [DEBUG] plugin.terraform-provider-aws:     <requestId>ab72a552-e604-41cb-84eb-21d862112f90</requestId>
2019-04-23T20:21:59.742Z [DEBUG] plugin.terraform-provider-aws:     <spotInstanceRequestSet>
2019-04-23T20:21:59.742Z [DEBUG] plugin.terraform-provider-aws:         <item>
2019-04-23T20:21:59.742Z [DEBUG] plugin.terraform-provider-aws:             <spotInstanceRequestId>sir-c1jrav3m</spotInstanceRequestId>
2019-04-23T20:21:59.742Z [DEBUG] plugin.terraform-provider-aws:             <spotPrice>0.100000</spotPrice>
2019-04-23T20:21:59.742Z [DEBUG] plugin.terraform-provider-aws:             <type>one-time</type>
2019-04-23T20:21:59.742Z [DEBUG] plugin.terraform-provider-aws:             <state>open</state>
2019-04-23T20:21:59.742Z [DEBUG] plugin.terraform-provider-aws:             <status>
2019-04-23T20:21:59.742Z [DEBUG] plugin.terraform-provider-aws:                 <code>pending-evaluation</code>
2019-04-23T20:21:59.742Z [DEBUG] plugin.terraform-provider-aws:                 <updateTime>2019-04-23T20:21:59.000Z</updateTime>
2019-04-23T20:21:59.742Z [DEBUG] plugin.terraform-provider-aws:                 <message>Your Spot request has been submitted for review, and is pending evaluation.</message>

So the creation loop (which is what checks for the race) finishes: https://github.com/terraform-providers/terraform-provider-aws/blob/master/aws/resource_aws_spot_instance_request.go#L179

Then the code moves onto the next retry block: https://github.com/terraform-providers/terraform-provider-aws/blob/master/aws/resource_aws_spot_instance_request.go#L208

This retry block doesn't tolerate the IAM race and so it fails:

2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws: 2019/04/23 20:22:09 [DEBUG] [aws-sdk-go] <?xml version="1.0" encoding="UTF-8"?>
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws: <DescribeSpotInstanceRequestsResponse xmlns="http://ec2.amazonaws.com/doc/2016-11-15/">
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws:     <requestId>671cba80-354c-4a1c-8c17-35abdfea757b</requestId>
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws:     <spotInstanceRequestSet>
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws:         <item>
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws:             <spotInstanceRequestId>sir-c1jrav3m</spotInstanceRequestId>
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws:             <spotPrice>0.100000</spotPrice>
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws:             <type>one-time</type>
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws:             <state>failed</state>
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws:             <fault>
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws:                 <code>InvalidParameterValue</code>
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws:                 <message>Value (dcos-terraform-ci-90913f42df22592c-instance_profile) for parameter iamInstanceProfile.name is invalid. Invalid IAM Instance Profile name</message>
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws:             </fault>
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws:             <status>
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws:                 <code>bad-parameters</code>
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws:                 <updateTime>2019-04-23T20:22:00.000Z</updateTime>
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws:                 <message>Your Spot request failed due to bad parameters.</message>
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws:             </status>
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws:             <validUntil>2019-04-30T20:21:59.000Z</validUntil>
......
</DescribeSpotInstanceRequestsResponse>
:

* aws_spot_instance_request.instance: Error while waiting for spot request ({
  BlockDurationMinutes: 120,
  CreateTime: 2019-04-23 20:21:59 +0000 UTC,
  InstanceInterruptionBehavior: "terminate",
  LaunchSpecification: {
    BlockDeviceMappings: [{
        DeviceName: "/dev/sda1",
        Ebs: {
          DeleteOnTermination: true,
          VolumeSize: 120,
          VolumeType: "gp2"
        }
      }],
    EbsOptimized: false,
    IamInstanceProfile: {
      Name: "dcos-terraform-ci-90913f42df22592c-instance_profile"
    },
    ImageId: "ami-3ecc8f46",
    InstanceType: "m4.large",
    KeyName: "terraform-ci-90913f42df22592c-deployer-key",
    Monitoring: {
      Enabled: false
    },
    NetworkInterfaces: [{
        AssociatePublicIpAddress: true,
        DeviceIndex: 0,
        SubnetId: "subnet-0eef635a09248f35c"
      }],
    Placement: {
      AvailabilityZone: "us-west-2a",
      GroupName: ""
    },
    SecurityGroups: [{
        GroupName: "default"
      }]
  },
  ProductDescription: "Linux/UNIX",
  SpotInstanceRequestId: "sir-c1jrav3m",
  SpotPrice: "0.100000",
  State: "open",
  Status: {
    Code: "pending-evaluation",
    Message: "Your Spot request has been submitted for review, and is pending evaluation.",
    UpdateTime: 2019-04-23 20:21:59 +0000 UTC
  },
  Type: "one-time"
}) to resolve: unexpected state 'bad-parameters', wanted target 'fulfilled'. last error: %!s(<nil>)

It seems like this can be fixed by retrying on bad-parameters? In the event of a legitimate failure, the error will appear after the timeout.
I have created a fork of the provider to test and will submit a PR if my theory pans out.

jbarrick-mesosphere · 2019-04-24T17:16:45Z

My fix didn't work - it still fails, it just takes the full timeout now (even with a very high, 20 minute timeout set). But it's very strange since there are a number of other spot instances created at the same time that do work - it's only the one that never converges.

Fixes hashicorp#8164 Due to IAM eventual consistency, it is possible for spot instances to be successfully requested and then later return `bad-parameters` indicating the IAM role does not exist - but the instance never recovers. ``` <DescribeSpotInstanceRequestsResponse xmlns="http://ec2.amazonaws.com/doc/2016-11-15/"> <state>failed</state> <fault> <code>InvalidParameterValue</code> <message>Value (dcos-terraform-ci-90913f42df22592c-instance_profile) for parameter iamInstanceProfile.name is invalid. Invalid IAM Instance Profile name</message> </fault> <status> <code>bad-parameters</code> <updateTime>2019-04-23T20:22:00.000Z</updateTime> <message>Your Spot request failed due to bad parameters.</message> </status> ... }) to resolve: unexpected state 'bad-parameters', wanted target 'fulfilled'. last error: %!s(<nil>) ``` This means that the retries on requesting the spot instance is insufficient - it appears that even if it is successfully accepted, it can error later. Because of this, this PR makes it so that if `bad-parameters` is returned, we request a new spot instance and try again.

…timeout hashicorp#8164

Fixes hashicorp#8164 Due to IAM eventual consistency, it is possible for spot instances to be successfully requested and then later return `bad-parameters` indicating the IAM role does not exist - but the instance never recovers. ``` <DescribeSpotInstanceRequestsResponse xmlns="http://ec2.amazonaws.com/doc/2016-11-15/"> <state>failed</state> <fault> <code>InvalidParameterValue</code> <message>Value (dcos-terraform-ci-90913f42df22592c-instance_profile) for parameter iamInstanceProfile.name is invalid. Invalid IAM Instance Profile name</message> </fault> <status> <code>bad-parameters</code> <updateTime>2019-04-23T20:22:00.000Z</updateTime> <message>Your Spot request failed due to bad parameters.</message> </status> ... }) to resolve: unexpected state 'bad-parameters', wanted target 'fulfilled'. last error: %!s(<nil>) ``` This means that the retries on requesting the spot instance is insufficient - it appears that even if it is successfully accepted, it can error later. Because of this, this PR makes it so that if `bad-parameters` is returned, we request a new spot instance and try again.

choppedpork · 2020-08-18T15:43:38Z

@jbarrick-mesosphere I believe I'm hitting the same issue every now and then (once every ~2 weeks maybe) in CI. My spot requests are also configured with IAM and the result we get is the same. I've started looking into this as it was reported that once this happens it's impossible to either apply or destroy anymore - the one time I've managed to look into it was a few hours later and the problem was gone.

I think your commit message on the PR linked to this issue explains the issue we're seeing very well - until the spot request is gone from AWS the failed spot instance resource in the state seems to be in a strange status where it kind of exists but returns nulls for its attributes:

Error: Null value found in list

  on main.tf line 74, in resource "aws_route53_record" "consul":
  74:   records = module.node.private_ips

Null values are not allowed for this attribute value.

shanlashari · 2022-02-21T20:23:28Z

hashi folks, is there any plan to merge the PR for the fix #8556

cc @breathingdust

…r_ready_timeout hashicorp#8164" This reverts commit 675ffcc.

github-actions · 2023-07-07T02:20:03Z

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

nywilken added the service/ec2 Issues and PRs that pertain to the ec2 service. label Apr 3, 2019

jbarrick-mesosphere added a commit to jbarrick-mesosphere/terraform-provider-aws that referenced this issue Apr 3, 2019

Make spot instance resource timeout configurable with wait_for_ready_…

675ffcc

…timeout hashicorp#8164

jbarrick-mesosphere mentioned this issue Apr 3, 2019

Make the ready timeout configurable for ELB and spot instance requests to address #564 and #8164 #8180

Closed

jbarrick-mesosphere mentioned this issue May 7, 2019

Workaround IAM role creation race with spot instances #8556

Closed

jbarrick-mesosphere added a commit to jbarrick-mesosphere/terraform-provider-aws that referenced this issue May 7, 2019

Make spot instance resource timeout configurable with wait_for_ready_…

c598650

…timeout hashicorp#8164

aeschright added the needs-triage Waiting for first response or review from a maintainer. label Jun 24, 2019

breathingdust added bug Addresses a defect in current functionality. and removed needs-triage Waiting for first response or review from a maintainer. labels Sep 17, 2021

ewbankkit mentioned this issue Jun 6, 2023

Workaround IAM instance profile creation race with spot instances #31800

Merged

ewbankkit closed this as completed in #31800 Jun 6, 2023

ewbankkit mentioned this issue Jun 7, 2023

Revert #31800 #31820

Merged

ewbankkit added a commit to brittandeyoung/terraform-provider-aws that referenced this issue Jun 15, 2023

Revert "Make spot instance resource timeout configurable with wait_fo…

8e74c33

…r_ready_timeout hashicorp#8164" This reverts commit 675ffcc.

github-actions bot locked as resolved and limited conversation to collaborators Jul 7, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Spot instance creation times out #8164

Spot instance creation times out #8164

jbarrick-mesosphere commented Apr 2, 2019

jbarrick-mesosphere commented Apr 23, 2019

jbarrick-mesosphere commented Apr 24, 2019

choppedpork commented Aug 18, 2020

shanlashari commented Feb 21, 2022 •

edited

Loading

github-actions bot commented Jul 7, 2023

Spot instance creation times out #8164

Spot instance creation times out #8164

Comments

jbarrick-mesosphere commented Apr 2, 2019

jbarrick-mesosphere commented Apr 23, 2019

jbarrick-mesosphere commented Apr 24, 2019

choppedpork commented Aug 18, 2020

shanlashari commented Feb 21, 2022 • edited Loading

github-actions bot commented Jul 7, 2023

shanlashari commented Feb 21, 2022 •

edited

Loading