Skip to content

Comments

Enable spot instances for AWS masters#51664

Closed
2uasimojo wants to merge 1 commit intoopenshift:masterfrom
2uasimojo:spot-masters-capi
Closed

Enable spot instances for AWS masters#51664
2uasimojo wants to merge 1 commit intoopenshift:masterfrom
2uasimojo:spot-masters-capi

Conversation

@2uasimojo
Copy link
Member

@2uasimojo 2uasimojo commented May 3, 2024

Add a new variable for the AWS IPI flows, $SPOT_MASTERS. When using CAPI installs (featureGates[].ClusterAPIInstall=true) this can be set to 'true' to inject spotMarketOptions into master machine manifests.

The existing $SPOT_INSTANCES variable is unchanged: as before, it only results in worker nodes using spot instances. (We may at some point wish to rename this to $SPOT_WORKERS for clarity.)

NOTE: Spot instances are unreliable. Using them may cause additional flakes in your tests.

Needed by RFE-5545

@openshift-ci openshift-ci bot requested review from deepsm007 and xueqzhan May 3, 2024 23:19
@2uasimojo
Copy link
Member Author

/hold for testing
/assign @mtulio
/cc @elmiko @JoelSpeed @tangyisheng2 @r4f4

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 3, 2024
@2uasimojo 2uasimojo force-pushed the spot-masters-capi branch from ccc7d19 to 89f6ca9 Compare May 3, 2024 23:24
@2uasimojo
Copy link
Member Author

/pj-rehearse

Sure, let's try five at random.

@openshift-ci-robot
Copy link
Contributor

@2uasimojo: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-ci-robot
Copy link
Contributor

@2uasimojo, pj-rehearse: failed to create rehearsal jobs ERROR:

failed to ensure imagestreamtags in cluster build05: failed waiting for imagestreamtag logging/5.2:logging-elasticsearch6 to appear: get failed: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline

If the problem persists, please contact Test Platform.

Copy link
Contributor

@mtulio mtulio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Eric, overall looks good. I have some questions about Machine and CPMS manifest changes.

Are you planning to create a new job using installer altinfra(CAPI) setting the var SPOT_MASTERS?

Let's trigger an existing job to validate the spot on workers:

/pj-rehearse pull-ci-openshift-machine-api-provider-aws-release-4.16-e2e-aws

echo "Spot instances for masters can only be used with CAPI installs."
exit 1
fi
manifests="${dir}/cluster-api/machines/10_inframachine_*.yaml"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may need to glob the three machine and CPMS manifests too.

What about creating one list?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added CPMS for the reason @r4f4 points out below; but leaving out the Machine manifests for the reason I point out below :)

@r4f4
Copy link
Contributor

r4f4 commented May 4, 2024

Triggering a capi job:
/pj-rehearse pull-ci-openshift-installer-master-altinfra-e2e-aws-ovn

@openshift-ci-robot
Copy link
Contributor

@r4f4: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

documentation: "Use AWS Spot Instances for worker nodes. Set to 'true' to opt into spot instances. Explicitly set to 'false' to opt out. Leave unset for the default, which may change."
documentation: "Use AWS Spot Instances for *worker* nodes. Set to 'true' to opt into spot instances. Explicitly set to 'false' to opt out. Leave unset for the default, which may change."
- name: SPOT_MASTERS
default: "false"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about

Suggested change
default: "false"
default: "true"

plus a /hold so we can rehearse this change for a while with capi/aws jobs and fix any issues? Then we can flip it back to false before merge and add the SPOT_MASTERS=true to the capi jobs definitions.

@mtulio
Copy link
Contributor

mtulio commented May 4, 2024

/pj-rehearse pull-ci-openshift-machine-api-provider-aws-release-4.16-e2e-aws

@openshift-ci-robot
Copy link
Contributor

@mtulio: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@2uasimojo
Copy link
Member Author

Not sure why this is happening (haven't dug in yet):

unmarshal master 0: json: cannot unmarshal string into Go struct field AWSMachineProviderConfig.spotMarketOptions of type v1beta1.SpotMarketOptions

@r4f4
Copy link
Contributor

r4f4 commented May 6, 2024

Not sure why this is happening (haven't dug in yet):

unmarshal master 0: json: cannot unmarshal string into Go struct field AWSMachineProviderConfig.spotMarketOptions of type v1beta1.SpotMarketOptions

@2uasimojo Oh, I know this one because I made the same mistake. Here https://github.com/openshift/release/pull/51664/files#diff-65cce809dc5ddaa91711f6709acd41e5c24b735c073708e60f10aae28faf8712R511 you're saving the spotMarketOptions as a string '{}' but it's supposed to be a json object {}.

@2uasimojo
Copy link
Member Author

Not sure why this is happening (haven't dug in yet):

unmarshal master 0: json: cannot unmarshal string into Go struct field AWSMachineProviderConfig.spotMarketOptions of type v1beta1.SpotMarketOptions

@2uasimojo Oh, I know this one because I made the same mistake. Here https://github.com/openshift/release/pull/51664/files#diff-65cce809dc5ddaa91711f6709acd41e5c24b735c073708e60f10aae28faf8712R511 you're saving the spotMarketOptions as a string '{}' but it's supposed to be a json object {}.

Cool, thanks. I suspected that might be it, but I don't understand how yq can tell the difference by the time shell parsing is done.

[efried@efried Downloads]$ echo '{}' | od -x
0000000 7d7b 000a
0000003
[efried@efried Downloads]$ echo "{}" | od -x
0000000 7d7b 000a
0000003
[efried@efried Downloads]$ echo {} | od -x
0000000 7d7b 000a
0000003

🤷

@2uasimojo 2uasimojo force-pushed the spot-masters-capi branch from 89f6ca9 to c707fdb Compare May 7, 2024 17:12
@2uasimojo
Copy link
Member Author

Not sure why this is happening (haven't dug in yet):

unmarshal master 0: json: cannot unmarshal string into Go struct field AWSMachineProviderConfig.spotMarketOptions of type v1beta1.SpotMarketOptions

@2uasimojo Oh, I know this one because I made the same mistake. Here https://github.com/openshift/release/pull/51664/files#diff-65cce809dc5ddaa91711f6709acd41e5c24b735c073708e60f10aae28faf8712R511 you're saving the spotMarketOptions as a string '{}' but it's supposed to be a json object {}.

Cool, thanks. I suspected that might be it, but I don't understand how yq can tell the difference by the time shell parsing is done.

[efried@efried Downloads]$ echo '{}' | od -x
0000000 7d7b 000a
0000003
[efried@efried Downloads]$ echo "{}" | od -x
0000000 7d7b 000a
0000003
[efried@efried Downloads]$ echo {} | od -x
0000000 7d7b 000a
0000003

🤷

Well, it turns out it is difficult or impossible to add an empty dict using yq v3. I tried --tag '!!map' but that just added the !!map tag to the string value, which may or may not be parseable by installer. So I cheated and used !!str (which does not inject the actual tag) and blew out "spotMarketOptions": {"maxPrice": ""} which is hopefully equivalent.

@2uasimojo
Copy link
Member Author

/pj-rehearse pull-ci-openshift-machine-api-provider-aws-release-4.16-e2e-aws

@openshift-ci-robot
Copy link
Contributor

@2uasimojo: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@2uasimojo 2uasimojo force-pushed the spot-masters-capi branch from c707fdb to 37f1da9 Compare May 7, 2024 17:54
@r4f4
Copy link
Contributor

r4f4 commented May 7, 2024

Let's try a capi job:
/pj-rehearse pull-ci-openshift-installer-master-altinfra-e2e-aws-ovn

@2uasimojo 2uasimojo force-pushed the spot-masters-capi branch from fb51d21 to f594928 Compare June 3, 2024 21:59
@openshift-ci-robot openshift-ci-robot removed the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label Jun 3, 2024
@openshift-ci-robot
Copy link
Contributor

[REHEARSALNOTIFIER]
@2uasimojo: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

Test name Repo Type Reason
pull-ci-netobserv-network-observability-operator-release-4.18-e2e-operator netobserv/network-observability-operator presubmit Registry content changed
pull-ci-netobserv-network-observability-operator-release-4.17-e2e-operator netobserv/network-observability-operator presubmit Registry content changed
pull-ci-netobserv-network-observability-operator-release-4.16-e2e-operator netobserv/network-observability-operator presubmit Registry content changed
pull-ci-openshift-hive-master-e2e-azure openshift/hive presubmit Registry content changed
pull-ci-openshift-hive-mce-2.5-e2e-azure openshift/hive presubmit Registry content changed
pull-ci-openshift-hive-mce-2.6-e2e-azure openshift/hive presubmit Registry content changed
pull-ci-openshift-hive-mce-2.4-e2e-azure openshift/hive presubmit Registry content changed
pull-ci-openshift-hive-mce-2.2-e2e-azure openshift/hive presubmit Registry content changed
pull-ci-openshift-hive-mce-2.3-e2e-azure openshift/hive presubmit Registry content changed
pull-ci-openshift-hive-master-e2e-gcp openshift/hive presubmit Registry content changed
pull-ci-openshift-hive-mce-2.5-e2e-gcp openshift/hive presubmit Registry content changed
pull-ci-openshift-hive-mce-2.6-e2e-gcp openshift/hive presubmit Registry content changed
pull-ci-openshift-hive-mce-2.4-e2e-gcp openshift/hive presubmit Registry content changed
pull-ci-openshift-hive-mce-2.2-e2e-gcp openshift/hive presubmit Registry content changed
pull-ci-openshift-hive-mce-2.3-e2e-gcp openshift/hive presubmit Registry content changed
pull-ci-openshift-knative-serving-release-v1.12-412-perf-tests-aws-412 openshift-knative/serving presubmit Registry content changed
pull-ci-openshift-knative-serving-release-v1.12-412-test-e2e-aws-412 openshift-knative/serving presubmit Registry content changed
pull-ci-openshift-knative-serving-release-next-416-test-e2e-aws-416 openshift-knative/serving presubmit Registry content changed
pull-ci-openshift-knative-serving-release-v1.11-416-perf-tests-aws-416 openshift-knative/serving presubmit Registry content changed
pull-ci-openshift-knative-serving-release-v1.11-416-test-e2e-aws-416 openshift-knative/serving presubmit Registry content changed
pull-ci-openshift-knative-serving-release-v1.11-416-test-e2e-tls-aws-416 openshift-knative/serving presubmit Registry content changed
pull-ci-openshift-knative-serving-release-next-412-test-e2e-aws-412 openshift-knative/serving presubmit Registry content changed
pull-ci-openshift-knative-serving-release-v1.14-412-perf-tests-aws-412 openshift-knative/serving presubmit Registry content changed
pull-ci-openshift-knative-serving-release-v1.14-412-test-e2e-aws-412 openshift-knative/serving presubmit Registry content changed
pull-ci-openshift-knative-serving-release-v1.14-412-test-e2e-tls-aws-412 openshift-knative/serving presubmit Registry content changed

A total of 15375 jobs have been affected by this change. The above listing is non-exhaustive and limited to 25 jobs.

A full list of affected jobs can be found here

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse abort to abort all active rehearsals

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@2uasimojo
Copy link
Member Author

/pj-rehearse ack

Rehearsals done via #51721

/hold cancel

@openshift-ci-robot
Copy link
Contributor

@2uasimojo: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 4, 2024
@openshift-ci-robot openshift-ci-robot added the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label Jun 4, 2024
@2uasimojo
Copy link
Member Author

/assign @patrickdillon

@openshift-bot
Copy link
Contributor

Issues in openshift/release go stale after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 15d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 4, 2024
@openshift-bot
Copy link
Contributor

Stale issue in openshift/release rot after 15d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 15d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 19, 2024
@r4f4
Copy link
Contributor

r4f4 commented Jul 19, 2024

/remove-lifecycle rotten

@openshift-ci openshift-ci bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 19, 2024
Copy link
Contributor

@r4f4 r4f4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 19, 2024
@patrickdillon
Copy link
Contributor

/approve

/hold
for https://github.com/openshift/release/pull/51664/files#r1589939386

Feel free to remove this hold as you see fit, but not sure if you want to do what is discussed in the comment.

@openshift-ci openshift-ci bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jul 22, 2024
@mtulio
Copy link
Contributor

mtulio commented Jul 22, 2024

/lgtm

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 22, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 2uasimojo, mtulio, patrickdillon, r4f4

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@r4f4
Copy link
Contributor

r4f4 commented Jul 25, 2024

This work has merged as part of #51721

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 25, 2024
@openshift-merge-robot
Copy link
Contributor

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@r4f4
Copy link
Contributor

r4f4 commented Jul 25, 2024

/close

@openshift-ci openshift-ci bot closed this Jul 25, 2024
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 25, 2024

@r4f4: Closed this PR.

Details

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. rehearsals-ack Signifies that rehearsal jobs have been acknowledged

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants