OCPBUGS-34975: aws: terraform: add spot instance support for masters#8349
|
Depends on openshift/release#51664 for testing spot instances deployments. /hold |
|
Implements (part of) RFE-5545 |
|
/retitle CORS-3523: aws: terraform: add spot instance support for masters |
|
@r4f4: This pull request references CORS-3523 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set. |
|
/retitle CORS-3524, CORS-3523: aws: terraform: add spot instance support for masters |
|
/lgtm |
Spot instances can result in savings for short-lived clusters.
If the control plane machine manifests have been edited with spot instance information, enable the use of spot instances in the terraform config. Note that max price information is ignored, since setting it is not advised.
cmd/openshift-install/features.go
I suppose grep works fine here, and perhaps we don't need to over-engineer, but would it be more convenient to structure this output, say in JSON or something?
@patrickdillon what about a --json argument for that?
Cool; though the current use case still requires:
- parsing the output with something (be it `grep` or `jq` or `[[ "$(...)" == *'"terraform-spot-masters"'* ]]`)
- accounting for what happens when invoking this on an older installer binary (stdout is empty, RC is nonzero)

...so YAGNI for now 🤷

The only thing I can think of that would make it easier to consume would be something like `openshift-install is-hidden-feature-supported terraform-spot-masters` which exits zero if yes, nonzero if no. That way it can reliably be consumed as:

    if ! openshift-install is-hidden-feature-supported terraform-spot-masters >/dev/null 2>&1; then
        ...
    fi

...because I'll get a nonzero RC on older installers as well as newer ones that have this code but disclaim the support.
to be clear: I'm not asking for ⬆️, just thinking aloud :)
For now this will be used to detect installer versions that support Spot instance features. The command is marked as `hidden` since it's supposed to be used only internally.
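A CI step could consume such a command roughly as sketched below. This is only an illustration: the helper name `installer_supports_feature` is made up, and the subcommand is passed in as an argument because its name is not given in this thread. The only behaviour relied on is what the review comments describe: the feature name appears quoted in the output, and older installers print nothing and exit nonzero.

```shell
# Hypothetical helper, not part of the installer.
#   $1 = path to the installer binary
#   $2 = name of the hidden subcommand (an assumption; supply the real one)
#   $3 = feature name, e.g. terraform-spot-masters
installer_supports_feature() {
  local installer="$1" subcommand="$2" feature="$3" output
  # Older installers do not know the subcommand: empty stdout, nonzero RC.
  output="$("$installer" "$subcommand" 2>/dev/null)" || return 1
  # Look for the quoted feature name anywhere in the output.
  case "$output" in
    *"\"${feature}\""*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Because an unknown subcommand and a missing feature both yield a nonzero return, the caller only needs a single `if` to decide whether to inject spot configuration.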
|
/hold cancel |
|
/override ci/prow/okd-images |
|
@r4f4: Overrode contexts on behalf of r4f4: ci/prow/okd-images |
|
|
/lgtm |
|
/lgtm cancel Discussion still open around the hidden subcommand. |
Yes, I am hesitant to add a new command to the installer for such a specific use. And I'm not certain that the problem requires a new command. I think it would help to have more context for the motivation. Reading through the slack thread I identified the motivation as:
I am not sure if this is intended to be used in CI or for QE running manual tests or both. I would imagine in both of these cases we're pretty much only testing the latest z-streams anyway, so my impression is that we're solving a transient problem with code that will live on much longer. Can we lay out in more detail the problem we're tackling with the command? |
Ack, and likewise.
I would love for us to come up with some other solution.
If I pass an install-config with controlPlane spotMarketOptions into an installer with either CAPI featuregate flaggage or the terraform changes from this PR, it is honored.
Yes, the primary use case is for CI jobs, where we have different personas responsible for the install-config vs the job config. It is via env vars to the latter that spot instance config (for masters and/or workers) is injected by tooling into the former. It is for this use case that we need protection, again because if I provide that env var but don't actually get spot masters:
I imagine QE would find this similarly useful if they have scripting around generating install-configs. Not so much if they're doing everything by hand.
This is a sane point. If we intend to backport this change to all living y-streams, and CI/QE exclusively tests only the latest (nightlies and) z-streams thereon, I agree the window of usefulness is smallish. Are those assumptions valid? In particular, do we really not run CI on old-but-still-supported z-streams? |
That's not totally true because we have to modify the IPI steps, which are shared among all OCP versions. So we need a way to identify which versions of the installer support the spot instances feature (the other option is that older installers will silently ignore the spot feature, even though the master manifests were changed. That's what we were trying to avoid). |
|
@r4f4 @2uasimojo what about solving these problems with tests? In ci run a post-install test that checks the envvar and if set ensures spot instances are running. Share this tactic with qe |
sorry was in a rush to share this idea but I realize I did it in an entirely unconvincing and simple way. i'm not sure how large the surface area in ci is for failing this test, but if it is large, the test would need to be optional (not sure if this is possible) to be informative (this would give us an idea of how to expand coverage), if it is massive perhaps this idea would not work. this is perhaps less of a silent fail than the proposed solution is? |
Remember that backports in the installer take way longer than we wish they would. So I think failing tests is out of the question because we'd fail too many. Besides, a mechanism for detecting available features would be useful for other CI-only functionality like public-only subnets and BYOIPv4 (which is active for certain regions but we are silently ignoring until we add support in the CAPI path). It doesn't need to be a hidden command but I think it'll be very useful to have something. |
Interesting idea. Digging in... IIUC this would look like: In summary: IMHO making the failure fast and cheap is moar better than making it slow and expensive. |
write/run a test--not add a step. I haven't written such a test before so I don't know how it's done offhand (would need to nail this down to be serious). The test would fail if the job config sets the envvar for spot instances but the resulting cluster isn't running (on) them. Ideally would be able to use something like ci search or sippy to identify failing jobs. But I agree that it is a much slower feedback loop.

This discussion helped me understand how we're going to use this to roll out the spot instances.

/approve

@2uasimojo feel free to re-lgtm

we are probably going to need to backport this pretty far I assume |
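For reference, the post-install check floated in this thread might look something like the sketch below. It is not an existing test. It assumes `oc` is logged in to the cluster, that master Machines carry the `machine.openshift.io/cluster-api-machine-role=master` label, and that the machine-api AWS provider marks spot-backed Machines with the `machine.openshift.io/interruptible-instance` label; the function name `check_spot_masters` is hypothetical.

```shell
# Hypothetical post-install check: fail when spot masters were requested
# via $SPOT_MASTERS but not every master Machine is marked interruptible.
check_spot_masters() {
  # Nothing to verify unless the job asked for spot masters.
  [ "${SPOT_MASTERS:-}" = "true" ] || return 0
  local ns="openshift-machine-api"
  local role="machine.openshift.io/cluster-api-machine-role=master"
  local total spot
  # Count all master Machines, then only those labelled as interruptible.
  total="$(oc get machines -n "$ns" -l "$role" -o name | wc -l | tr -d ' ')"
  spot="$(oc get machines -n "$ns" \
    -l "${role},machine.openshift.io/interruptible-instance" -o name | wc -l | tr -d ' ')"
  [ "$total" -gt 0 ] && [ "$spot" -eq "$total" ]
}
```

As noted in the discussion, a check like this only reports the problem after a full install, so the feedback loop is much slower than detecting support up front.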
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: patrickdillon |
|
Awesome, let’s do it /lgtm |
Add a new variable for the AWS IPI flows, `$SPOT_MASTERS`. When using CAPI installs (`featureGates[].ClusterAPIInstall=true`) *or* OCP versions containing openshift/installer#8349 this can be set to `'true'` to inject `spotMarketOptions` into master machine manifests. The existing `$SPOT_INSTANCES` variable is unchanged: as before, it only results in *worker* nodes using spot instances. (We may at some point wish to rename this to `$SPOT_WORKERS` for clarity.) NOTE: Spot instances are unreliable. Using them may cause additional flakes in your tests.
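A minimal sketch of what such an injection step could look like follows. It is not the actual step from openshift/release. It assumes GNU `sed`, that the master Machine manifests live under `openshift/99_openshift-cluster-api_master-machines-*.yaml` in the install directory, and that each has a `value:` key on its own line under `providerSpec`; the function name `inject_spot_masters` is hypothetical. An empty `spotMarketOptions: {}` is used so that no max price is set.

```shell
# Hypothetical helper: add an empty spotMarketOptions stanza to each
# master Machine manifest when SPOT_MASTERS is 'true'.
#   $1 = install directory containing the generated manifests
inject_spot_masters() {
  local dir="$1" manifest
  [ "${SPOT_MASTERS:-}" = "true" ] || return 0
  for manifest in "$dir"/openshift/99_openshift-cluster-api_master-machines-*.yaml; do
    # The glob stays literal when nothing matches; skip in that case.
    [ -e "$manifest" ] || continue
    # Leave manifests alone if they already request spot instances.
    grep -q 'spotMarketOptions' "$manifest" && continue
    # Insert the stanza one level below `value:`, reusing its indentation
    # (GNU sed: \n in the replacement is a newline).
    sed -i -E 's/^( *)value:[[:space:]]*$/&\n\1  spotMarketOptions: {}/' "$manifest"
  done
}
```

Running it twice leaves the manifest unchanged the second time, which keeps the step safe to re-run.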
|
/cherry-pick release-4.15 release-4.14 release-4.13 release-4.12 |
|
@2uasimojo: #8349 failed to apply on top of branch "release-4.15". |
|
[ART PR BUILD NOTIFIER] This PR has been included in build ose-installer-altinfra-container-v4.17.0-202405312141.p0.ge7a8199.assembly.stream.el9 for distgit ose-installer-altinfra. |
Add a new variable for the AWS IPI flows, `$SPOT_MASTERS`. When using CAPI installs (installer will generate the `cluster-api` directory) *or* OCP versions containing openshift/installer#8349 this can be set to `'true'` to inject `spotMarketOptions` into master machine manifests. The existing `$SPOT_INSTANCES` variable is unchanged: as before, it only results in *worker* nodes using spot instances. (We may at some point wish to rename this to `$SPOT_WORKERS` for clarity.) NOTE: Spot instances are unreliable. Using them may cause additional flakes in your tests.
|
/cherry-pick release-4.16 |
|
@r4f4: new pull request created: #8526 |
|
/retitle OCPBUGS-34975: aws: terraform: add spot instance support for masters |
|
@r4f4: Jira Issue OCPBUGS-34975: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-34975 has been moved to the MODIFIED state. |
* Enable spot instances for AWS masters

  Add a new variable for the AWS IPI flows, `$SPOT_MASTERS`. When using CAPI installs (installer will generate the `cluster-api` directory) *or* OCP versions containing openshift/installer#8349 this can be set to `'true'` to inject `spotMarketOptions` into master machine manifests. The existing `$SPOT_INSTANCES` variable is unchanged: as before, it only results in *worker* nodes using spot instances. (We may at some point wish to rename this to `$SPOT_WORKERS` for clarity.) NOTE: Spot instances are unreliable. Using them may cause additional flakes in your tests.

* installer / master, 4.17, 4.16: AWS spot instances

  Convert all AWS IPI-based presubmits in the openshift-installer legacy (terraform) and altinfra (CAPI) test suites to use spot instances for branches:
  - master
  - release-4.18
  - release-4.17
  - release-4.16