
OCPBUGS-34975: aws: terraform: add spot instance support for masters#8349

Merged
openshift-merge-bot[bot] merged 3 commits into openshift:master from r4f4:aws-terraform-spot-master
May 31, 2024

Conversation

@r4f4
Contributor

@r4f4 r4f4 commented May 5, 2024

Spot instances can result in savings for short-lived clusters. If the control plane machine manifests have been edited with spot instance information, enable the use of spot instances in the Terraform config. Note that max price information is ignored, since setting it is not advised.
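As an illustration of the manifest edit mentioned above, adding a `spotMarketOptions` stanza to each master Machine manifest is what opts the control plane into spot instances. The file name and surrounding fields below are abbreviated and only a sketch of the shape, not taken from this PR:

```yaml
# Illustrative fragment of a master Machine manifest
# (e.g. openshift/99_openshift-cluster-api_master-machines-0.yaml).
# An empty spotMarketOptions requests a spot instance; per this PR,
# a maxPrice value would be ignored even if set.
apiVersion: machine.openshift.io/v1beta1
kind: Machine
spec:
  providerSpec:
    value:
      kind: AWSMachineProviderConfig
      spotMarketOptions: {}
```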

@openshift-ci openshift-ci bot requested review from mtulio and patrickdillon May 5, 2024 18:45
@r4f4
Contributor Author

r4f4 commented May 5, 2024

Depends on openshift/release#51664 for testing spot instance deployments.

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 5, 2024
@r4f4 r4f4 force-pushed the aws-terraform-spot-master branch from 00d3baa to 08abdde Compare May 6, 2024 15:00
@2uasimojo
Member

2uasimojo commented May 21, 2024

Implements (part of) RFE-5545

@r4f4
Contributor Author

r4f4 commented May 23, 2024

/retitle CORS-3523: aws: terraform: add spot instance support for masters

@openshift-ci openshift-ci bot changed the title aws: terraform: add spot instance support for masters CORS-3523: aws: terraform: add spot instance support for masters May 23, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 23, 2024
@openshift-ci-robot
Contributor

openshift-ci-robot commented May 23, 2024

@r4f4: This pull request references CORS-3523 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.

Details

In response to this:

Spot instances can result in savings for short-lived clusters. If the control plane machine manifests have been edited with spot instance information, enable the use of spot instances in the Terraform config. Note that max price information is ignored, since setting it is not advised.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@r4f4
Contributor Author

r4f4 commented May 23, 2024

/retitle CORS-3524, CORS-3523: aws: terraform: add spot instance support for masters

@openshift-ci openshift-ci bot changed the title CORS-3523: aws: terraform: add spot instance support for masters CORS-3524, CORS-3523: aws: terraform: add spot instance support for masters May 23, 2024
@2uasimojo
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 23, 2024
r4f4 added 2 commits May 23, 2024 21:20
Spot instances can result in savings for short-lived clusters.
If the control plane machine manifests have been edited with spot
instance information, enable the use of spot instances in the Terraform
config. Note that max price information is ignored, since setting it
is not advised.
@r4f4 r4f4 force-pushed the aws-terraform-spot-master branch from 3d2f7f9 to 377bc14 Compare May 23, 2024 19:21
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label May 23, 2024
Comment on lines 9 to 11
Contributor

I suppose grep works fine here, and perhaps we don't need to over-engineer, but would it be more convenient to structure this output, say in JSON or something?

Contributor Author

@patrickdillon what about a --json argument for that?

Contributor Author

Done

Member

Cool; though the current use case still requires:

  • parsing the output with something (be it grep or jq or [[ "$(...)" == *'"terraform-spot-masters"'* ]])
  • accounting for what happens when invoking this on an older installer binary (stdout is empty, RC is nonzero)

...so YAGNI for now 🤷

The only thing I can think of that would make it easier to consume would be something like `openshift-install is-hidden-feature-supported terraform-spot-masters`, which exits zero if yes, nonzero if no. That way it can reliably be consumed as:

if ! openshift-install is-hidden-feature-supported terraform-spot-masters >/dev/null 2>&1; then
    ...
fi

...because I'll get a nonzero RC on older installers as well as newer ones that have this code but disclaim the support.

Member

to be clear: I'm not asking for ⬆️, just thinking aloud :)
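For illustration, the exit-code pattern discussed above can be sketched with a stub standing in for the installer binary. Both the `is-hidden-feature-supported` subcommand name and the stub are hypothetical (taken from the thinking-aloud above), not real `openshift-install` interfaces:

```shell
# Sketch of the exit-code-based feature check. openshift_install_stub stands
# in for a real openshift-install binary; the subcommand is hypothetical.
openshift_install_stub() {
  case "$1/$2" in
    is-hidden-feature-supported/terraform-spot-masters) return 0 ;;  # newer installer: feature present
    *) return 1 ;;  # older installer, or feature not supported
  esac
}

supports_spot_masters() {
  # A single nonzero RC covers both "binary predates the subcommand"
  # and "subcommand exists but disclaims the feature".
  openshift_install_stub is-hidden-feature-supported terraform-spot-masters >/dev/null 2>&1
}

if supports_spot_masters; then
  echo "terraform-spot-masters supported"
else
  echo "falling back: no spot masters"
fi
```

The design point is that the consumer never parses stdout; the exit status alone answers the question, which is what makes the check safe against older binaries whose stdout is empty.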

For now this will be used to detect installer versions that support Spot
instance features. The command is marked as `hidden` since it's
supposed to be used only internally.
@r4f4 r4f4 force-pushed the aws-terraform-spot-master branch from 377bc14 to 8ec547f Compare May 23, 2024 19:50
@r4f4
Contributor Author

r4f4 commented May 23, 2024

/hold cancel
The way we're implementing this, openshift/release#51664 and this PR can merge in any order.

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 23, 2024
@r4f4
Contributor Author

r4f4 commented May 23, 2024

/override ci/prow/okd-images
Known issue

@openshift-ci
Contributor

openshift-ci bot commented May 23, 2024

@r4f4: Overrode contexts on behalf of r4f4: ci/prow/okd-images

Details

In response to this:

/override ci/prow/okd-images
Known issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci
Contributor

openshift-ci bot commented May 23, 2024

@r4f4: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/okd-e2e-aws-ovn-upgrade 8ec547f link false /test okd-e2e-aws-ovn-upgrade
ci/prow/e2e-aws-custom-security-groups 8ec547f link false /test e2e-aws-custom-security-groups

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@2uasimojo
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 28, 2024
@2uasimojo
Member

/lgtm cancel

Discussion still open around the hidden subcommand.

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label May 28, 2024
@patrickdillon
Contributor

/lgtm cancel

Discussion still open around the hidden subcommand.

Yes, I am hesitant to add a new command to the installer for such a specific use, and I'm not certain that the problem requires a new command. I think it would help to have more context for the motivation. Reading through the Slack thread, I identified the motivation as:

My goal was to put in a check so that, if you (accidentally or out of ignorance) tried to do this (tf + spot masters) on a .z without your PR, it would puke.

I am not sure if this is intended to be used in CI or for QE running manual tests or both. I would imagine in both of these cases we're pretty much only testing the latest z streams anyway, so my impression is that we're solving a transient problem with code that will live on much longer.

Can we lay out in more detail the problem we're tackling with the command?

@2uasimojo
Member

Yes, I am hesitant to add a new command to the installer for such a specific use.

Ack, and likewise.

And I'm not certain that the problem requires a new command.

I would love for us to come up with some other solution.

Can we lay out in more detail the problem we're tackling with the command?

  • If I pass an install-config with controlPlane spotMarketOptions into an installer with either CAPI featuregate flaggage or the terraform changes from this PR, it is honored.
  • If I pass an install-config with controlPlane spotMarketOptions into any other installer currently in existence, it is silently ignored.
  • Unchecked, this can easily result in a scenario where I think I'm saving money by enabling spot masters, but I'm actually not.

I am not sure if this is intended to be used in CI or for QE running manual tests or both.

Yes, the primary use case is for CI jobs, where we have different personas responsible for the install-config vs the job config. It is via env vars to the latter that spot instance config (for masters and/or workers) is injected by tooling into the former. It is for this use case that we need protection, again because if I provide that env var but don't actually get spot masters:

  • I haven't saved $$ as I thought I should.
  • If something goes wrong, things could get confusing as I triage on the assumption that I'm using spot masters.

I imagine QE would find this similarly useful if they have scripting around generating install-configs. Not so much if they're doing everything by hand.

I would imagine in both of these cases we're pretty much only testing the latest z streams anyway, so my impression is that we're solving a transient problem with code that will live on much longer.

This is a sane point. If we intend to backport this change to all living y-streams, and CI/QE exclusively tests only the latest (nightlies and) z-streams thereon, I agree the window of usefulness is smallish. Are those assumptions valid? In particular, do we really not run CI on old-but-still-supported z-streams?

@r4f4
Contributor Author

r4f4 commented May 29, 2024

/lgtm cancel

Discussion still open around the hidden subcommand.

Yes, I am hesitant to add a new command to the installer for such a specific use, and I'm not certain that the problem requires a new command. I think it would help to have more context for the motivation. Reading through the Slack thread, I identified the motivation as:

My goal was to put in a check so that, if you (accidentally or out of ignorance) tried to do this (tf + spot masters) on a .z without your PR, it would puke.

I am not sure if this is intended to be used in CI or for QE running manual tests or both. I would imagine in both of these cases we're pretty much only testing the latest z streams anyway, so my impression is that we're solving a transient problem with code that will live on much longer.

Can we lay out in more detail the problem we're tackling with the command?

That's not totally true, because we have to modify the IPI steps, which are shared among all OCP versions. So we need a way to identify which versions of the installer support the spot instances feature (the other option is that older installers silently ignore the spot feature even though the master manifests were changed, which is what we were trying to avoid).

@patrickdillon
Contributor

@r4f4 @2uasimojo what about solving these problems with tests? In CI, run a post-install test that checks the env var and, if set, ensures spot instances are running. Share this tactic with QE.

@patrickdillon
Contributor

patrickdillon commented May 29, 2024

@r4f4 @2uasimojo what about solving these problems with tests? In CI, run a post-install test that checks the env var and, if set, ensures spot instances are running. Share this tactic with QE.

Sorry, I was in a rush to share this idea and realize I did it in an entirely unconvincing and simplistic way. I'm not sure how large the surface area in CI is for failing this test; if it is large, the test would need to be optional (not sure if this is possible) to be informative (this would give us an idea of how to expand coverage), and if it is massive, perhaps this idea would not work. This is perhaps less of a silent fail than the proposed solution?

@r4f4
Contributor Author

r4f4 commented May 30, 2024

@r4f4 @2uasimojo what about solving these problems with tests? In CI, run a post-install test that checks the env var and, if set, ensures spot instances are running. Share this tactic with QE.

Sorry, I was in a rush to share this idea and realize I did it in an entirely unconvincing and simplistic way. I'm not sure how large the surface area in CI is for failing this test; if it is large, the test would need to be optional (not sure if this is possible) to be informative (this would give us an idea of how to expand coverage), and if it is massive, perhaps this idea would not work. This is perhaps less of a silent fail than the proposed solution?

Remember that backports in the installer take way longer than we wish they would, so I think failing tests is out of the question because we'd fail too many. Besides, a mechanism for detecting available features would be useful for other CI-only functionality like public-only subnets and BYOIPv4 (which is active for certain regions but which we silently ignore until we add support in the CAPI path).

It doesn't need to be a hidden command, but I think it'll be very useful to have something.

@2uasimojo
Member

solving these problems with tests? In CI, run a post-install test that checks the env var and, if set, ensures spot instances are running.

Interesting idea. Digging in...

IIUC this would look like:

  • Include a post-install step in the IPI flow where we're adding support for spot masters.
  • The post-install step would, if SPOT_MASTERS is set, inspect the masters on the created cluster and make sure they're spots, failing the setup if they're not.
  • To be useful, this step would need to always be part of the IPI setup flow.
  • It would thus affect, what, hundreds? of CI jobs.
  • ...which, if misconfigured, would be creating entire clusters -- with expensive on-demand instances -- and quickly destroying them (without running the meat of whatever broader test this is part of) because they're misconfigured.
  • And these tests would take ~40m to fail, vs "immediately".
  • I suppose we would catch this misconfig for a given job at pj-rehearse time... assuming we ran a pj-rehearse for every job we try to switch this on for, which doesn't sound like it would be reasonable for the scale we're talking about.
  • Also, anecdotally, given the rate at which job failures are ignored in OSCI, I'm not even sure we would catch or react to this with any efficiency.

In summary: IMHO making the failure fast and cheap is moar better than making it slow and expensive.
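For concreteness, the post-install check being weighed in this thread could look roughly like the sketch below. `list_master_lifecycles` is a hypothetical helper (a real job might derive it from `oc get machines` output or the AWS API); a stub is used here so the sketch is self-contained:

```shell
# Hedged sketch of the proposed post-install verification: fail the job when
# spot masters were requested (SPOT_MASTERS=true) but not delivered.
# list_master_lifecycles is a hypothetical helper that prints one lifecycle
# ("spot" or "on-demand") per master node; stubbed out for illustration.
list_master_lifecycles() {
  printf 'spot\nspot\nspot\n'  # stub: pretend all three masters are spots
}

check_spot_masters() {
  if [ "${SPOT_MASTERS:-false}" != "true" ]; then
    return 0  # spot masters not requested; nothing to verify
  fi
  for lifecycle in $(list_master_lifecycles); do
    if [ "$lifecycle" != "spot" ]; then
      echo "SPOT_MASTERS=true but a master is ${lifecycle}" >&2
      return 1
    fi
  done
  return 0
}

SPOT_MASTERS=true
check_spot_masters && echo "masters verified as spot"
```

This is the slow-feedback path the comment above argues against: the check only fires after a full (expensive) cluster install, whereas the hidden-subcommand approach fails fast before any instances are created.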

@patrickdillon
Contributor

patrickdillon commented May 31, 2024

solving these problems with tests? In CI, run a post-install test that checks the env var and, if set, ensures spot instances are running.

Interesting idea. Digging in...

IIUC this would look like: Include a post-install step in the IPI flow where we're adding support for spot masters.

Write/run a test, not add a step. I haven't written such a test before, so I don't know how it's done offhand (we'd need to nail this down to be serious). The test would fail if the job config sets the env var for spot instances but the resulting cluster isn't running (on) them. Ideally we'd be able to use something like CI Search or Sippy to identify failing jobs.

But I agree that it is a much slower feedback loop. This discussion helped me understand how we're going to use this to roll out the spot instances.

/approve

@2uasimojo feel free to re-lgtm; we are probably going to need to backport this pretty far, I assume.

@patrickdillon
Contributor

/approve

@openshift-ci
Contributor

openshift-ci bot commented May 31, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: patrickdillon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 31, 2024
@2uasimojo
Member

Awesome, let’s do it

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 31, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit e7a8199 into openshift:master May 31, 2024
2uasimojo added a commit to 2uasimojo/release that referenced this pull request May 31, 2024
Add a new variable for the AWS IPI flows, `$SPOT_MASTERS`. When using
CAPI installs (`featureGates[].ClusterAPIInstall=true`) *or* OCP
versions containing openshift/installer#8349
this can be set to `'true'` to inject `spotMarketOptions` into master
machine manifests.

The existing `$SPOT_INSTANCES` variable is unchanged: as before, it only
results in *worker* nodes using spot instances. (We may at some point
wish to rename this to `$SPOT_WORKERS` for clarity.)

NOTE: Spot instances are unreliable. Using them may cause additional
flakes in your tests.
@2uasimojo
Member

/cherry-pick release-4.15 release-4.14 release-4.13 release-4.12

@openshift-cherrypick-robot

@2uasimojo: #8349 failed to apply on top of branch "release-4.15":

Applying: aws: terraform: add spot instance support for masters
Using index info to reconstruct a base tree...
M	data/data/aws/bootstrap/main.tf
M	data/data/aws/cluster/main.tf
M	data/data/aws/variables-aws.tf
Falling back to patching base and 3-way merge...
Auto-merging data/data/aws/variables-aws.tf
CONFLICT (content): Merge conflict in data/data/aws/variables-aws.tf
Auto-merging data/data/aws/cluster/main.tf
Auto-merging data/data/aws/bootstrap/main.tf
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 aws: terraform: add spot instance support for masters
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

Details

In response to this:

/cherry-pick release-4.15 release-4.14 release-4.13 release-4.12

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-installer-altinfra-container-v4.17.0-202405312141.p0.ge7a8199.assembly.stream.el9 for distgit ose-installer-altinfra.
All builds following this will include this PR.

2uasimojo added a commit to 2uasimojo/release that referenced this pull request Jun 3, 2024
Add a new variable for the AWS IPI flows, `$SPOT_MASTERS`. When using
CAPI installs (installer will generate the `cluster-api` directory) *or*
OCP versions containing openshift/installer#8349
this can be set to `'true'` to inject `spotMarketOptions` into master
machine manifests.

The existing `$SPOT_INSTANCES` variable is unchanged: as before, it only
results in *worker* nodes using spot instances. (We may at some point
wish to rename this to `$SPOT_WORKERS` for clarity.)

NOTE: Spot instances are unreliable. Using them may cause additional
flakes in your tests.
@r4f4
Contributor Author

r4f4 commented Jun 4, 2024

/cherry-pick release-4.16

@openshift-cherrypick-robot

@r4f4: new pull request created: #8526

Details

In response to this:

/cherry-pick release-4.16

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@r4f4
Contributor Author

r4f4 commented Jun 5, 2024

/retitle OCPBUGS-34975: aws: terraform: add spot instance support for masters

@openshift-ci openshift-ci bot changed the title CORS-3524, CORS-3523: aws: terraform: add spot instance support for masters OCPBUGS-34975: aws: terraform: add spot instance support for masters Jun 5, 2024
@openshift-ci-robot
Contributor

@r4f4: Jira Issue OCPBUGS-34975: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-34975 has been moved to the MODIFIED state.

Details

In response to this:

Spot instances can result in savings for short-lived clusters. If the control plane machine manifests have been edited with spot instance information, enable the use of spot instances in the Terraform config. Note that max price information is ignored, since setting it is not advised.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-merge-bot bot pushed a commit to openshift/release that referenced this pull request Jul 22, 2024
* Enable spot instances for AWS masters

Add a new variable for the AWS IPI flows, `$SPOT_MASTERS`. When using
CAPI installs (installer will generate the `cluster-api` directory) *or*
OCP versions containing openshift/installer#8349
this can be set to `'true'` to inject `spotMarketOptions` into master
machine manifests.

The existing `$SPOT_INSTANCES` variable is unchanged: as before, it only
results in *worker* nodes using spot instances. (We may at some point
wish to rename this to `$SPOT_WORKERS` for clarity.)

NOTE: Spot instances are unreliable. Using them may cause additional
flakes in your tests.

* installer / master, 4.17, 4.16: AWS spot instances

Convert all AWS IPI-based presubmits in the openshift-installer legacy
(terraform) and altinfra (CAPI) test suites to use spot instances for
branches:
- master
- release-4.18
- release-4.17
- release-4.16

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
