@cgwalters
Member

We had an event when we were starting an OS update, but nothing
when it was completed - one could implicitly get that by looking
at the next event, but that's a bit fragile.

And since then we started doing a lot more stuff with the OS,
so let's add an event emitted before and after all OS changes
so we can consistently get e.g. timing information about it.

Relates to #1962
around getting better data about timing during upgrades.

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 3, 2020
Contributor

@kikisdeliveryservice kikisdeliveryservice left a comment

couple of questions :)

Contributor

question, do we have an event somewhere that says what we are upgrading to?

Contributor

following up on... myself: in https://github.com/openshift/machine-config-operator/pull/1962/files we have an event that tells us the targeted config but not the OS version - adding the OS version might be nice?

Member Author

We can determine the content by looking at the rendered machineconfigs. With the events I'm more interested in timing, not content.

@kikisdeliveryservice
Contributor

adding hold since we are post-FF

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 3, 2020
@cgwalters
Member Author

> adding hold since we are post-FF

This is part of https://bugzilla.redhat.com/show_bug.cgi?id=1852047 - not a new feature. Let's not go overboard with the holds - I think cleanly split up PRs is better than the alternative that holding would force of having a single gigantic PR with the one BZ.

@kikisdeliveryservice
Contributor

kikisdeliveryservice commented Aug 3, 2020

> This is part of https://bugzilla.redhat.com/show_bug.cgi?id=1852047 - not a new feature. Let's not go overboard with the holds - I think cleanly split up PRs is better than the alternative that holding would force of having a single gigantic PR with the one BZ.

I had no idea since the BZ isn't in the title. Please add the related BZ to the PRs so we are allowed to merge post-FF.

@kikisdeliveryservice kikisdeliveryservice changed the title daemon: Add events before/after all OS changes Bug 1852047 : daemon: Add events before/after all OS changes Aug 3, 2020
@kikisdeliveryservice kikisdeliveryservice changed the title Bug 1852047 : daemon: Add events before/after all OS changes Bug 1852047: daemon: Add events before/after all OS changes Aug 3, 2020
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Aug 3, 2020
@openshift-ci-robot
Contributor

@cgwalters: This pull request references Bugzilla bug 1852047, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)
Details

In response to this:

Bug 1852047: daemon: Add events before/after all OS changes

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kikisdeliveryservice
Contributor

/bugzilla refresh

@openshift-ci-robot
Contributor

@kikisdeliveryservice: This pull request references Bugzilla bug 1852047, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)
Details

In response to this:

/bugzilla refresh

@kikisdeliveryservice
Contributor

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 3, 2020
@cgwalters
Member Author

Is Prow actually enforcing BZs right now? It doesn't look like it - if we're truly requiring a separate BZ per PR that heavily penalizes the workflow of having nice clean split up PRs, so hopefully we can just roll with attaching a bunch of PRs to the same BZ.

@kikisdeliveryservice
Contributor

> Is Prow actually enforcing BZs right now? It doesn't look like it - if we're truly requiring a separate BZ per PR that heavily penalizes the workflow of having nice clean split up PRs, so hopefully we can just roll with attaching a bunch of PRs to the same BZ

I have no problem with multiple PRs on a BZ, the issue was that this PR had no BZ linked whatsoever.

@kikisdeliveryservice
Contributor

/test e2e-aws

Contributor

@kikisdeliveryservice kikisdeliveryservice left a comment

Is this verifiable in the mustgather artifacts? (if so can you link?)

@cgwalters
Member Author

> Is this verifiable in the mustgather artifacts? (if so can you link?)

Definitely, I put up my jq query from this comment over here: https://github.com/cgwalters/homegit/blob/master/bin/oc-prow-mco

Example invocation on the e2e-gcp-upgrade run from this PR:

walters@toolbox /v/s/w/s/g/o/ostree (fix-ci)> oc-prow-mco https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1977/pull-ci-openshift-machine-config-operator-master-e2e-gcp-upgrade/1290312062802595840/artifacts/e2e-gcp-upgrade/
+ '[' -z https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1977/pull-ci-openshift-machine-config-operator-master-e2e-gcp-upgrade/1290312062802595840/artifacts/e2e-gcp-upgrade/ ']'
+ baseurl=https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1977/pull-ci-openshift-machine-config-operator-master-e2e-gcp-upgrade/1290312062802595840/artifacts/e2e-gcp-upgrade/
++ mktemp -d -t oc-prow-mco.XXXXXX
+ tmpdir=/tmp/oc-prow-mco.x0B1dP
+ cd /tmp/oc-prow-mco.x0B1dP
+ curl --show-error --fail -LO https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1977/pull-ci-openshift-machine-config-operator-master-e2e-gcp-upgrade/1290312062802595840/artifacts/e2e-gcp-upgrade//events.json
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   229  100   229    0     0   2573      0 --:--:-- --:--:-- --:--:--  2573
100   256  100   256    0     0   1651      0 --:--:-- --:--:-- --:--:--  1651
100 26.9M  100 26.9M    0     0  41.3M      0 --:--:-- --:--:-- --:--:-- 41.3M
+ jq '.items | map(select(.source.component == "machineconfigcontroller-nodecontroller" or .source.component == "machineconfigdaemon")) | sort_by(.firstTimestamp | fromdate) | map(.firstTimestamp + " " + .reason + " " + .involvedObject.kind + " " + .involvedObject.name + ": " + .message)'
[
  "2020-08-03T15:57:18Z NodeDone Node ci-op-ng66kb8m-28de9-sf9pl-master-0: Setting node ci-op-ng66kb8m-28de9-sf9pl-master-0, currentConfig rendered-master-732609a07c1cb806f1b9d3542fe9b35c to Done",
  "2020-08-03T15:57:18Z NodeDone Node ci-op-ng66kb8m-28de9-sf9pl-master-2: Setting node ci-op-ng66kb8m-28de9-sf9pl-master-2, currentConfig rendered-master-732609a07c1cb806f1b9d3542fe9b35c to Done",
  "2020-08-03T15:57:19Z NodeDone Node ci-op-ng66kb8m-28de9-sf9pl-master-1: Setting node ci-op-ng66kb8m-28de9-sf9pl-master-1, currentConfig rendered-master-732609a07c1cb806f1b9d3542fe9b35c to Done",
  "2020-08-03T16:04:35Z NodeDone Node ci-op-ng66kb8m-28de9-sf9pl-worker-b-4qfbr: Setting node ci-op-ng66kb8m-28de9-sf9pl-worker-b-4qfbr, currentConfig rendered-worker-7534de5b79db421daf59de0d8f57544c to Done",
  "2020-08-03T16:04:41Z NodeDone Node ci-op-ng66kb8m-28de9-sf9pl-worker-d-t4tdr: Setting node ci-op-ng66kb8m-28de9-sf9pl-worker-d-t4tdr, currentConfig rendered-worker-7534de5b79db421daf59de0d8f57544c to Done",
  "2020-08-03T16:04:44Z NodeDone Node ci-op-ng66kb8m-28de9-sf9pl-worker-c-vwq7s: Setting node ci-op-ng66kb8m-28de9-sf9pl-worker-c-vwq7s, currentConfig rendered-worker-7534de5b79db421daf59de0d8f57544c to Done",
  "2020-08-03T16:35:43Z Drain Node ci-op-ng66kb8m-28de9-sf9pl-master-0: Draining node to update config.",
  "2020-08-03T16:35:43Z Drain Node ci-op-ng66kb8m-28de9-sf9pl-worker-d-t4tdr: Draining node to update config.",
  "2020-08-03T16:36:13Z OSUpdateStarted Node ci-op-ng66kb8m-28de9-sf9pl-worker-d-t4tdr: Upgrading OS",
  "2020-08-03T16:36:13Z OSUpdateStaged Node ci-op-ng66kb8m-28de9-sf9pl-worker-d-t4tdr: Changes to OS staged",
  "2020-08-03T16:36:14Z PendingConfig Node ci-op-ng66kb8m-28de9-sf9pl-worker-d-t4tdr: Written pending config rendered-worker-74e1aaab55bda9243933662d3470dbd0",
  "2020-08-03T16:36:14Z Reboot Node ci-op-ng66kb8m-28de9-sf9pl-worker-d-t4tdr: Node will reboot into config rendered-worker-74e1aaab55bda9243933662d3470dbd0",
  "2020-08-03T16:36:39Z OSUpdateStarted Node ci-op-ng66kb8m-28de9-sf9pl-master-0: Upgrading OS",
  "2020-08-03T16:36:39Z OSUpdateStaged Node ci-op-ng66kb8m-28de9-sf9pl-master-0: Changes to OS staged",
  "2020-08-03T16:36:40Z PendingConfig Node ci-op-ng66kb8m-28de9-sf9pl-master-0: Written pending config rendered-master-8a5e07ed1a7b9e16be927f3903832ac2",
  "2020-08-03T16:36:40Z Reboot Node ci-op-ng66kb8m-28de9-sf9pl-master-0: Node will reboot into config rendered-master-8a5e07ed1a7b9e16be927f3903832ac2",
  "2020-08-03T16:37:27Z NodeDone Node ci-op-ng66kb8m-28de9-sf9pl-worker-d-t4tdr: Setting node ci-op-ng66kb8m-28de9-sf9pl-worker-d-t4tdr, currentConfig rendered-worker-74e1aaab55bda9243933662d3470dbd0 to Done",
  "2020-08-03T16:38:02Z Drain Node ci-op-ng66kb8m-28de9-sf9pl-worker-b-4qfbr: Draining node to update config.",
  "2020-08-03T16:38:10Z NodeDone Node ci-op-ng66kb8m-28de9-sf9pl-master-0: Setting node ci-op-ng66kb8m-28de9-sf9pl-master-0, currentConfig rendered-master-8a5e07ed1a7b9e16be927f3903832ac2 to Done",
  "2020-08-03T16:38:16Z Drain Node ci-op-ng66kb8m-28de9-sf9pl-master-1: Draining node to update config.",
  "2020-08-03T16:39:27Z OSUpdateStarted Node ci-op-ng66kb8m-28de9-sf9pl-worker-b-4qfbr: Upgrading OS",
  "2020-08-03T16:39:27Z OSUpdateStaged Node ci-op-ng66kb8m-28de9-sf9pl-worker-b-4qfbr: Changes to OS staged",
  "2020-08-03T16:39:28Z PendingConfig Node ci-op-ng66kb8m-28de9-sf9pl-worker-b-4qfbr: Written pending config rendered-worker-74e1aaab55bda9243933662d3470dbd0",
  "2020-08-03T16:39:28Z Reboot Node ci-op-ng66kb8m-28de9-sf9pl-worker-b-4qfbr: Node will reboot into config rendered-worker-74e1aaab55bda9243933662d3470dbd0",
  "2020-08-03T16:40:38Z NodeDone Node ci-op-ng66kb8m-28de9-sf9pl-worker-b-4qfbr: Setting node ci-op-ng66kb8m-28de9-sf9pl-worker-b-4qfbr, currentConfig rendered-worker-74e1aaab55bda9243933662d3470dbd0 to Done",
  "2020-08-03T16:40:44Z Drain Node ci-op-ng66kb8m-28de9-sf9pl-worker-c-vwq7s: Draining node to update config.",
  "2020-08-03T16:42:25Z OSUpdateStarted Node ci-op-ng66kb8m-28de9-sf9pl-worker-c-vwq7s: Upgrading OS",
  "2020-08-03T16:42:25Z OSUpdateStaged Node ci-op-ng66kb8m-28de9-sf9pl-worker-c-vwq7s: Changes to OS staged",
  "2020-08-03T16:42:25Z PendingConfig Node ci-op-ng66kb8m-28de9-sf9pl-worker-c-vwq7s: Written pending config rendered-worker-74e1aaab55bda9243933662d3470dbd0",
  "2020-08-03T16:42:25Z Reboot Node ci-op-ng66kb8m-28de9-sf9pl-worker-c-vwq7s: Node will reboot into config rendered-worker-74e1aaab55bda9243933662d3470dbd0",
  "2020-08-03T16:43:26Z OSUpdateStarted Node ci-op-ng66kb8m-28de9-sf9pl-master-1: Upgrading OS",
  "2020-08-03T16:43:26Z OSUpdateStaged Node ci-op-ng66kb8m-28de9-sf9pl-master-1: Changes to OS staged",
  "2020-08-03T16:43:26Z PendingConfig Node ci-op-ng66kb8m-28de9-sf9pl-master-1: Written pending config rendered-master-8a5e07ed1a7b9e16be927f3903832ac2",
  "2020-08-03T16:43:26Z Reboot Node ci-op-ng66kb8m-28de9-sf9pl-master-1: Node will reboot into config rendered-master-8a5e07ed1a7b9e16be927f3903832ac2",
  "2020-08-03T16:43:37Z NodeDone Node ci-op-ng66kb8m-28de9-sf9pl-worker-c-vwq7s: Setting node ci-op-ng66kb8m-28de9-sf9pl-worker-c-vwq7s, currentConfig rendered-worker-74e1aaab55bda9243933662d3470dbd0 to Done",
  "2020-08-03T16:44:57Z NodeDone Node ci-op-ng66kb8m-28de9-sf9pl-master-1: Setting node ci-op-ng66kb8m-28de9-sf9pl-master-1, currentConfig rendered-master-8a5e07ed1a7b9e16be927f3903832ac2 to Done",
  "2020-08-03T16:45:03Z Drain Node ci-op-ng66kb8m-28de9-sf9pl-master-2: Draining node to update config.",
  "2020-08-03T16:46:04Z OSUpdateStarted Node ci-op-ng66kb8m-28de9-sf9pl-master-2: Upgrading OS",
  "2020-08-03T16:46:04Z OSUpdateStaged Node ci-op-ng66kb8m-28de9-sf9pl-master-2: Changes to OS staged",
  "2020-08-03T16:46:04Z PendingConfig Node ci-op-ng66kb8m-28de9-sf9pl-master-2: Written pending config rendered-master-8a5e07ed1a7b9e16be927f3903832ac2",
  "2020-08-03T16:46:04Z Reboot Node ci-op-ng66kb8m-28de9-sf9pl-master-2: Node will reboot into config rendered-master-8a5e07ed1a7b9e16be927f3903832ac2",
  "2020-08-03T16:47:29Z NodeDone Node ci-op-ng66kb8m-28de9-sf9pl-master-2: Setting node ci-op-ng66kb8m-28de9-sf9pl-master-2, currentConfig rendered-master-8a5e07ed1a7b9e16be927f3903832ac2 to Done"
]
walters@toolbox /v/s/w/s/g/o/ostree (fix-ci)> 
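
For readers who would rather not use jq, the filter in the transcript above can be approximated in Python. This is a minimal sketch, not the contents of the `oc-prow-mco` script: the function names (`parse_ts`, `mco_event_lines`, `load_events`) are made up for illustration, and it assumes `events.json` has the same `.items[]` layout and second-resolution UTC timestamps that the jq expression relies on.

```python
import json
from datetime import datetime

# Source components selected by the jq filter in the transcript.
MCO_COMPONENTS = {"machineconfigcontroller-nodecontroller", "machineconfigdaemon"}


def parse_ts(ts):
    # Assumes timestamps like 2020-08-03T16:36:13Z, which is what jq's `fromdate` accepts.
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def mco_event_lines(events):
    """Keep MCO events, sort by firstTimestamp, and format one line per event."""
    items = [
        e for e in events["items"]
        if e.get("source", {}).get("component") in MCO_COMPONENTS
    ]
    items.sort(key=lambda e: parse_ts(e["firstTimestamp"]))
    return [
        "{} {} {} {}: {}".format(
            e["firstTimestamp"],
            e["reason"],
            e["involvedObject"]["kind"],
            e["involvedObject"]["name"],
            e["message"],
        )
        for e in items
    ]


def load_events(path):
    # Hypothetical helper: read a previously downloaded events.json from disk.
    with open(path) as f:
        return json.load(f)
```

Calling `print("\n".join(mco_event_lines(load_events("events.json"))))` on a downloaded dump should give the same kind of listing as the jq output, one event per line.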

@cgwalters
Member Author

(Though that operates on the "extracted" dump that Prow handles, not the must-gather, which seems to write the events in YAML. They also recently added a custom HTML renderer for events, which helps with searching for exactly these kinds of things.)
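
Since the point of these events is timing, here is a rough sketch of how the gap between the before and after events could be computed per node from the same extracted `events.json` items. It assumes the reason names `OSUpdateStarted` and `OSUpdateStaged` seen in the example output in this thread; the function name `os_update_durations` and its parameters are illustrative only, and event timestamps only have one-second resolution.

```python
from datetime import datetime


def parse_ts(ts):
    # Assumes second-resolution UTC timestamps such as 2020-08-03T16:36:13Z.
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def os_update_durations(items, start_reason="OSUpdateStarted", end_reason="OSUpdateStaged"):
    """Return {node_name: [seconds, ...]}, pairing each start event with the
    next end event recorded for the same node."""
    # On identical timestamps, sort start events first so a same-second pair still matches.
    ordered = sorted(
        items,
        key=lambda e: (parse_ts(e["firstTimestamp"]), e["reason"] != start_reason),
    )
    pending = {}    # node name -> timestamp of the unmatched start event
    durations = {}  # node name -> list of elapsed seconds
    for e in ordered:
        node = e["involvedObject"]["name"]
        ts = parse_ts(e["firstTimestamp"])
        if e["reason"] == start_reason:
            pending[node] = ts
        elif e["reason"] == end_reason and node in pending:
            elapsed = (ts - pending.pop(node)).total_seconds()
            durations.setdefault(node, []).append(elapsed)
    return durations
```

Events with other reasons are ignored, and an end event with no preceding start on that node is skipped rather than guessed at.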

@cgwalters
Member Author

cgwalters commented Aug 5, 2020

Maybe we should store scripts to extract data from must-gathers and PRs here in the MCO git?

@kikisdeliveryservice
Contributor

/retest

Contributor

only thing is we lost the OSImageURL in the event from line 1315 below. can we re-add? otherwise lgtm

Member Author

Per above though with these events it's mostly about timing, so removing the URL was intentional.

I think this really leads into a larger ergonomic issue in that we should really be including something like a version number in MachineConfigs, and perhaps even include separate versions for our "base" MCs versus anything user provided.

For example, for status we'd report "All nodes are updated to config v7 (rendered-<md5here>)" or so, where we increment that counter each time we generate a new rendered MC.

Similarly we should be showing the ostree version number and not just the sha256.

I can re-add it if you really want, I just personally didn't find it useful - the more interesting information is the MachineConfig names which already contains the osimageurl.
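
To make the version-counter idea above concrete, here is a purely hypothetical sketch; nothing like this exists in the MCO as far as this thread shows. The class name `RenderedConfigVersions` and its methods are invented for illustration, and it simply hands out an incrementing number the first time each rendered config name is seen.

```python
class RenderedConfigVersions:
    """Hypothetical counter: assigns v1, v2, ... to rendered config names in
    the order they are first generated."""

    def __init__(self):
        self._versions = {}  # rendered config name -> version number

    def record(self, rendered_name):
        # Only a rendered config that has not been seen before bumps the counter.
        if rendered_name not in self._versions:
            self._versions[rendered_name] = len(self._versions) + 1
        return self._versions[rendered_name]

    def status(self, rendered_name):
        # Raises KeyError if the config was never recorded.
        version = self._versions[rendered_name]
        return "All nodes are updated to config v{} ({})".format(version, rendered_name)
```

A real design would also need to persist the counter somewhere and decide whether "base" and user-provided MCs get separate counters, which this sketch does not attempt.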

Member Author

OK the next problem with doing what you're asking is machineConfigDiff doesn't have the target...and it leads into the question of why are we special casing osImageURL in the event? Why not also show the kargs that changed? Why not try to show which systemd units? It wouldn't scale - I think these events should just be about timing.

Member Author

Or to rephrase, the data in this PR is more useful because we're seeing if there are karg/unit changes too which we didn't have before - if we care to look up exactly what those are, we can look at the rendered config.

Adding osImageURL would make the event much more noisy when we don't need it.

Contributor

fair argument, we have other sources for checking which osImageURL is being used, like the MCD log or the rendered MC. My main motivation for adding osImageURL is that it is the image we pull to perform the upgrade, extensions, and kernel switch.

Member Author

@cgwalters cgwalters Aug 18, 2020

OK updated with a compromise - we keep the old event with the osimageurl and only emit it if we're pulling the oscontainer, while adding two new before/after events.
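
The shape of that compromise can be sketched roughly as follows. This is not the daemon's code (which is Go) and should not be read as its actual control flow: the function, its parameters, the ordering of the events, and the `OLD_EVENT_REASON` placeholder are all assumptions, and only the `OSUpdateStarted`/`OSUpdateStaged` reasons and messages are taken from the example output earlier in this thread.

```python
# Placeholder for the pre-existing event that carries the osImageURL;
# the real reason string is not shown in this thread.
OLD_EVENT_REASON = "OldOSImageEvent"


def apply_os_changes(emit, pulls_oscontainer, os_image_url, apply_changes):
    """Emit a 'before' event, optionally the older osImageURL event, run the
    changes, then emit an 'after' event.

    emit(reason, message) stands in for whatever event recorder is in use.
    """
    emit("OSUpdateStarted", "Upgrading OS")
    if pulls_oscontainer:
        # Only mention the image URL when the oscontainer is actually pulled.
        emit(OLD_EVENT_REASON, "Updating OS to {}".format(os_image_url))
    # If this raises, no 'after' event is emitted in this sketch.
    apply_changes()
    emit("OSUpdateStaged", "Changes to OS staged")
```

With this split, a change that only touches kernel arguments or units would produce just the before/after pair, while an OS image update would also include the URL-bearing event.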

Contributor

Im cool with the compromise :)

Contributor

im ok with the compromise :)

Contributor

apparently im doubly ok with it 😆

@sinnykumari
Contributor

I don't have a strong opinion, but I slightly agree with Kirsten on retaining the OSImageURL information in the event. Otherwise, lgtm

@kikisdeliveryservice
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 18, 2020
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, kikisdeliveryservice, sinnykumari

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [cgwalters,kikisdeliveryservice,sinnykumari]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@kikisdeliveryservice
Contributor

/skip

@kikisdeliveryservice
Contributor

/test e2e-aws

@openshift-ci-robot
Contributor

openshift-ci-robot commented Aug 18, 2020

@cgwalters: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-aws-scaleup-rhel7 | d30777475283f3aceb5540824429816a8cfba198 | link | /test e2e-aws-scaleup-rhel7 |
| ci/prow/okd-e2e-aws | df727ce | link | /test okd-e2e-aws |
| ci/prow/e2e-aws-workers-rhel7 | df727ce | link | /test e2e-aws-workers-rhel7 |
| ci/prow/e2e-ovn-step-registry | df727ce | link | /test e2e-ovn-step-registry |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit d3d361b into openshift:master Aug 19, 2020
@openshift-ci-robot
Contributor

@cgwalters: Some pull requests linked via external trackers have merged: openshift/machine-config-operator#1977, openshift/machine-config-operator#1971. The following pull requests linked via external trackers have not merged:

Details

In response to this:

Bug 1852047: daemon: Add events before/after all OS changes
