Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spot instance creation times out #8164

Closed
jbarrick-mesosphere opened this issue Apr 2, 2019 · 5 comments · Fixed by #31800
Closed

Spot instance creation times out #8164

jbarrick-mesosphere opened this issue Apr 2, 2019 · 5 comments · Fixed by #31800
Labels
bug Addresses a defect in current functionality. service/ec2 Issues and PRs that pertain to the ec2 service.

Comments

@jbarrick-mesosphere
Copy link
Contributor

When creating a spot instance in AWS, I am occasionally seeing timeouts:

[00:14:57]* aws_spot_instance_request.instance.1: Error requesting spot instances: timeout while waiting for state to become 'success' (timeout: 1m0s)

This was originally reported here: #3554. The issue was closed since a "fix" was merged, but the fix is not really permanent:

increase timeout from 15 secs to 1 minute, which matches the timeouts
for other places where we deal with IAM "eventually consistent"
semantics

Another commenter also indicated they were seeing this as well: #3554 (comment)

@nywilken nywilken added the service/ec2 Issues and PRs that pertain to the ec2 service. label Apr 3, 2019
jbarrick-mesosphere added a commit to jbarrick-mesosphere/terraform-provider-aws that referenced this issue Apr 3, 2019
@jbarrick-mesosphere
Copy link
Contributor Author

Okay, I think I see what is going on here.

The retry timeout that was raised before is irrelevant to the issue I'm seeing, because the AWS API is accepting the spot request:

2019-04-23T20:21:59.712Z [DEBUG] plugin.terraform-provider-aws: 2019/04/23 20:21:59 [DEBUG] waiting for spot bid to resolve... this may take several minutes.
2019-04-23T20:21:59.712Z [DEBUG] plugin.terraform-provider-aws: 2019/04/23 20:21:59 [DEBUG] Waiting for state to become: [fulfilled]
2019-04-23T20:21:59.742Z [DEBUG] plugin.terraform-provider-aws: 2019/04/23 20:21:59 [DEBUG] [aws-sdk-go] DEBUG: Response ec2/RequestSpotInstances Details:
...
2019-04-23T20:21:59.742Z [DEBUG] plugin.terraform-provider-aws: -----------------------------------------------------
2019-04-23T20:21:59.742Z [DEBUG] plugin.terraform-provider-aws: 2019/04/23 20:21:59 [DEBUG] [aws-sdk-go] <?xml version="1.0" encoding="UTF-8"?>
2019-04-23T20:21:59.742Z [DEBUG] plugin.terraform-provider-aws: <RequestSpotInstancesResponse xmlns="http://ec2.amazonaws.com/doc/2016-11-15/">
2019-04-23T20:21:59.742Z [DEBUG] plugin.terraform-provider-aws:     <requestId>ab72a552-e604-41cb-84eb-21d862112f90</requestId>
2019-04-23T20:21:59.742Z [DEBUG] plugin.terraform-provider-aws:     <spotInstanceRequestSet>
2019-04-23T20:21:59.742Z [DEBUG] plugin.terraform-provider-aws:         <item>
2019-04-23T20:21:59.742Z [DEBUG] plugin.terraform-provider-aws:             <spotInstanceRequestId>sir-c1jrav3m</spotInstanceRequestId>
2019-04-23T20:21:59.742Z [DEBUG] plugin.terraform-provider-aws:             <spotPrice>0.100000</spotPrice>
2019-04-23T20:21:59.742Z [DEBUG] plugin.terraform-provider-aws:             <type>one-time</type>
2019-04-23T20:21:59.742Z [DEBUG] plugin.terraform-provider-aws:             <state>open</state>
2019-04-23T20:21:59.742Z [DEBUG] plugin.terraform-provider-aws:             <status>
2019-04-23T20:21:59.742Z [DEBUG] plugin.terraform-provider-aws:                 <code>pending-evaluation</code>
2019-04-23T20:21:59.742Z [DEBUG] plugin.terraform-provider-aws:                 <updateTime>2019-04-23T20:21:59.000Z</updateTime>
2019-04-23T20:21:59.742Z [DEBUG] plugin.terraform-provider-aws:                 <message>Your Spot request has been submitted for review, and is pending evaluation.</message>

So the creation loop (which is what checks for the race) finishes: https://github.com/terraform-providers/terraform-provider-aws/blob/master/aws/resource_aws_spot_instance_request.go#L179

Then the code moves onto the next retry block: https://github.com/terraform-providers/terraform-provider-aws/blob/master/aws/resource_aws_spot_instance_request.go#L208

This retry block doesn't tolerate the IAM race and so it fails:

2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws: 2019/04/23 20:22:09 [DEBUG] [aws-sdk-go] <?xml version="1.0" encoding="UTF-8"?>
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws: <DescribeSpotInstanceRequestsResponse xmlns="http://ec2.amazonaws.com/doc/2016-11-15/">
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws:     <requestId>671cba80-354c-4a1c-8c17-35abdfea757b</requestId>
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws:     <spotInstanceRequestSet>
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws:         <item>
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws:             <spotInstanceRequestId>sir-c1jrav3m</spotInstanceRequestId>
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws:             <spotPrice>0.100000</spotPrice>
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws:             <type>one-time</type>
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws:             <state>failed</state>
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws:             <fault>
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws:                 <code>InvalidParameterValue</code>
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws:                 <message>Value (dcos-terraform-ci-90913f42df22592c-instance_profile) for parameter iamInstanceProfile.name is invalid. Invalid IAM Instance Profile name</message>
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws:             </fault>
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws:             <status>
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws:                 <code>bad-parameters</code>
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws:                 <updateTime>2019-04-23T20:22:00.000Z</updateTime>
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws:                 <message>Your Spot request failed due to bad parameters.</message>
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws:             </status>
2019-04-23T20:22:09.846Z [DEBUG] plugin.terraform-provider-aws:             <validUntil>2019-04-30T20:21:59.000Z</validUntil>
......
</DescribeSpotInstanceRequestsResponse>
:

* aws_spot_instance_request.instance: Error while waiting for spot request ({
  BlockDurationMinutes: 120,
  CreateTime: 2019-04-23 20:21:59 +0000 UTC,
  InstanceInterruptionBehavior: "terminate",
  LaunchSpecification: {
    BlockDeviceMappings: [{
        DeviceName: "/dev/sda1",
        Ebs: {
          DeleteOnTermination: true,
          VolumeSize: 120,
          VolumeType: "gp2"
        }
      }],
    EbsOptimized: false,
    IamInstanceProfile: {
      Name: "dcos-terraform-ci-90913f42df22592c-instance_profile"
    },
    ImageId: "ami-3ecc8f46",
    InstanceType: "m4.large",
    KeyName: "terraform-ci-90913f42df22592c-deployer-key",
    Monitoring: {
      Enabled: false
    },
    NetworkInterfaces: [{
        AssociatePublicIpAddress: true,
        DeviceIndex: 0,
        SubnetId: "subnet-0eef635a09248f35c"
      }],
    Placement: {
      AvailabilityZone: "us-west-2a",
      GroupName: ""
    },
    SecurityGroups: [{
        GroupName: "default"
      }]
  },
  ProductDescription: "Linux/UNIX",
  SpotInstanceRequestId: "sir-c1jrav3m",
  SpotPrice: "0.100000",
  State: "open",
  Status: {
    Code: "pending-evaluation",
    Message: "Your Spot request has been submitted for review, and is pending evaluation.",
    UpdateTime: 2019-04-23 20:21:59 +0000 UTC
  },
  Type: "one-time"
}) to resolve: unexpected state 'bad-parameters', wanted target 'fulfilled'. last error: %!s(<nil>)

It seems like this can be fixed by retrying on bad-parameters? In the event of a legitimate failure, the error will appear after the timeout.
I have created a fork of the provider to test and will submit a PR if my theory pans out.

@jbarrick-mesosphere
Copy link
Contributor Author

My fix didn't work - it still fails, it just takes the full timeout now (even with a very high, 20 minute timeout set). But it's very strange since there are a number of other spot instances created at the same time that do work - it's only the one that never converges.

jbarrick-mesosphere added a commit to jbarrick-mesosphere/terraform-provider-aws that referenced this issue May 7, 2019
Fixes hashicorp#8164

Due to IAM eventual consistency, it is possible for spot instances to be successfully requested and then
later return `bad-parameters` indicating the IAM role does not exist - but the instance never recovers.

```
<DescribeSpotInstanceRequestsResponse xmlns="http://ec2.amazonaws.com/doc/2016-11-15/">
  <state>failed</state>
  <fault>
      <code>InvalidParameterValue</code>
      <message>Value (dcos-terraform-ci-90913f42df22592c-instance_profile) for parameter iamInstanceProfile.name is invalid. Invalid IAM Instance Profile name</message>
  </fault>
  <status>
      <code>bad-parameters</code>
      <updateTime>2019-04-23T20:22:00.000Z</updateTime>
      <message>Your Spot request failed due to bad parameters.</message>
  </status>
...
}) to resolve: unexpected state 'bad-parameters', wanted target 'fulfilled'. last error: %!s(<nil>)
```

This means that the retries on requesting the spot instance is insufficient - it appears that even if it is successfully
accepted, it can error later. Because of this, this PR makes it so that if `bad-parameters` is returned, we request a new
spot instance and try again.
jbarrick-mesosphere added a commit to jbarrick-mesosphere/terraform-provider-aws that referenced this issue May 7, 2019
jbarrick-mesosphere added a commit to jbarrick-mesosphere/terraform-provider-aws that referenced this issue May 7, 2019
Fixes hashicorp#8164

Due to IAM eventual consistency, it is possible for spot instances to be successfully requested and then
later return `bad-parameters` indicating the IAM role does not exist - but the instance never recovers.

```
<DescribeSpotInstanceRequestsResponse xmlns="http://ec2.amazonaws.com/doc/2016-11-15/">
  <state>failed</state>
  <fault>
      <code>InvalidParameterValue</code>
      <message>Value (dcos-terraform-ci-90913f42df22592c-instance_profile) for parameter iamInstanceProfile.name is invalid. Invalid IAM Instance Profile name</message>
  </fault>
  <status>
      <code>bad-parameters</code>
      <updateTime>2019-04-23T20:22:00.000Z</updateTime>
      <message>Your Spot request failed due to bad parameters.</message>
  </status>
...
}) to resolve: unexpected state 'bad-parameters', wanted target 'fulfilled'. last error: %!s(<nil>)
```

This means that the retries on requesting the spot instance is insufficient - it appears that even if it is successfully
accepted, it can error later. Because of this, this PR makes it so that if `bad-parameters` is returned, we request a new
spot instance and try again.
@aeschright aeschright added the needs-triage Waiting for first response or review from a maintainer. label Jun 24, 2019
@choppedpork
Copy link

@jbarrick-mesosphere I believe I'm hitting the same issue every now and then (once every ~2 weeks maybe) in CI. My spot requests are also configured with IAM and the result we get is the same. I've started looking into this as it was reported that once this happens it's impossible to either apply or destroy anymore - the one time I've managed to look into it was a few hours later and the problem was gone.

I think your commit message on the PR linked to this issue explains the issue we're seeing very well - until the spot request is gone from AWS the failed spot instance resource in the state seems to be in a strange status where it kind of exists but returns nulls for its attributes:

Error: Null value found in list

  on main.tf line 74, in resource "aws_route53_record" "consul":
  74:   records = module.node.private_ips

Null values are not allowed for this attribute value.

@breathingdust breathingdust added bug Addresses a defect in current functionality. and removed needs-triage Waiting for first response or review from a maintainer. labels Sep 17, 2021
@shanlashari
Copy link

shanlashari commented Feb 21, 2022

hashi folks, is there any plan to merge the PR for the fix #8556

cc @breathingdust

ewbankkit added a commit to brittandeyoung/terraform-provider-aws that referenced this issue Jun 15, 2023
@github-actions
Copy link

github-actions bot commented Jul 7, 2023

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 7, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Addresses a defect in current functionality. service/ec2 Issues and PRs that pertain to the ec2 service.
Projects
None yet
6 participants