
Conversation

@deads2k (Contributor) commented Aug 7, 2020

These pods will run alongside our existing CI e2e pods (this avoids risk to job state) and provide a way to contribute test, platform, configuration, and version independent observability to our CI runs.

The interaction with the existing e2e flow is designed to be minimal, to reduce risk and effort.

This (or something like it) will allow collecting data the TRT has determined is missing on upgrade and install failures. Most notably, logs for every pod that ever existed (loki) and recording for all intermediate state for operator status (resourcewatcher). Both tools have been built, but we've been unable to get them enabled in the right place for every test, platform, configuration, and version tuple.

/assign @stevekuznetsov
/cc @mfojtik @sttts @soltysh

@openshift-ci-robot added the `approved` label (Indicates a PR has been approved by an approver from all required OWNERS files.) Aug 7, 2020
@stevekuznetsov left a comment

Do we expect some or any of the following:

  • multiple observers per cluster
  • multiple instances of one observer against a single test cluster
  • the container image running the observer pod to be built from the repository under test


### Changes to CI
This is a sketch of a possible path.
1. Allow a non-dptp-developer to produce a pod template that mounts a `secret/<job>-kubeconfig` that will later contain a kubeconfig.


Why does it need to be a Pod template? Our current interface for developers looks like a (container image, commands) tuple. Is that sufficient? What do we need users to be able to configure?

Contributor Author

> Why does it need to be a Pod template? Our current interface for developers looks like a (container image, commands) tuple. Is that sufficient? What do we need users to be able to configure?

I'm not sure. But I do know that podtemplates are standard kube structs, with their existing documentation and clear semantics. Resource requirements come to mind. Env vars probably.

I think it's actually easier for me to provide a podtemplate since my development is likely to be done using a pod.
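As a rough sketch of what that could look like (the image, label, and mount path below are placeholders, not values defined anywhere in this proposal), the standard `corev1` types already carry everything mentioned so far:

```go
// Hypothetical observer pod template built from stock Kubernetes API structs.
// Nothing here is prescribed by the proposal; it only shows that resource
// requirements, env vars, and the kubeconfig secret mount come for free.
package main

import (
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func observerPodTemplate(job string) corev1.PodTemplateSpec {
	return corev1.PodTemplateSpec{
		ObjectMeta: metav1.ObjectMeta{
			Labels: map[string]string{"ci-observer": "resourcewatcher"}, // placeholder label
		},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyNever,
			Volumes: []corev1.Volume{{
				Name: "kubeconfig",
				VolumeSource: corev1.VolumeSource{
					// The secret starts out empty and is filled in by CI later.
					Secret: &corev1.SecretVolumeSource{SecretName: job + "-kubeconfig"},
				},
			}},
			Containers: []corev1.Container{{
				Name:         "observer",
				Image:        "quay.io/example/resourcewatcher:latest", // placeholder image
				Env:          []corev1.EnvVar{{Name: "KUBECONFIG", Value: "/var/run/kubeconfig/kubeconfig"}},
				VolumeMounts: []corev1.VolumeMount{{Name: "kubeconfig", MountPath: "/var/run/kubeconfig"}},
				Resources: corev1.ResourceRequirements{
					Requests: corev1.ResourceList{
						corev1.ResourceCPU:    resource.MustParse("100m"),
						corev1.ResourceMemory: resource.MustParse("200Mi"),
					},
				},
			}},
		},
	}
}

func main() {
	out, _ := json.MarshalIndent(observerPodTemplate("e2e-aws"), "", "  ")
	fmt.Println(string(out))
}
```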


Most if not all of those points of variance are already handled by existing types in the CI system. I don't want to expose e.g. CAP_SYS_ADMIN or host mount ...


This is important. s/pod template/step reference/g

### Changes to CI
This is a sketch of a possible path.
1. Allow a non-dptp-developer to produce a pod template that mounts a `secret/<job>-kubeconfig` that will later contain a kubeconfig.
2. When the CI pod starts today (the one with setup containers and stuff), also create an instance of each pod template,


This is too specific and not useful. It is not important for the parallel observer process to start at the time that the job starts.


Is the important requirement here that the process be running and ready to notice $KUBECONFIG being present when it's possible?

let's say `pod/<job>-resourcewatcher` and an empty `secret/<job>-kubeconfig`.
3. As soon as a kubeconfig is available (these go into a known location in setup container today), write that
`kubeconfig` into every `secret/<job>-kubeconfig` (you probably want to label them).
4. When it is time for collection, the existing pod (I think it's teardown container), writes a new data entry at `.data['teardown']` into every


What is important about this mechanism?

Contributor Author

> What is important about this mechanism?

It's a simple notification that meets the minimum bar of "easy to provide and easy to observe in an e2e-observer pod". I don't really care how it's done, but this way is easy.
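Concretely, "easy to observe" could be as little as the loop below: secret keys show up as files in the mounted volume, so the observer just waits for a `teardown` file to appear (the mount path and poll interval are made up for the sketch):

```go
// Hypothetical observer-side wait for the teardown notification written into
// the mounted secret/<job>-kubeconfig.
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

func waitForTeardown(ctx context.Context, secretMountDir string) (string, error) {
	for {
		// The kubelet projects the secret's "teardown" key as a file once it is written.
		if b, err := os.ReadFile(filepath.Join(secretMountDir, "teardown")); err == nil {
			return string(b), nil // the timestamp written by the teardown container
		}
		select {
		case <-ctx.Done():
			return "", ctx.Err()
		case <-time.After(10 * time.Second):
		}
	}
}

func main() {
	ts, err := waitForTeardown(context.Background(), "/var/run/kubeconfig") // placeholder mount path
	if err != nil {
		panic(err)
	}
	fmt.Println("teardown signalled at", ts, "- starting final collection")
}
```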

Member

why do the monitors need to know the teardown time? Won't it be enough for them to record "cluster seems to have disappeared" and rely on test-suite failures to get folks to look at jobs where the cluster exploded before teardown?

Contributor Author

> why do the monitors need to know the teardown time? Won't it be enough for them to record "cluster seems to have disappeared" and rely on test-suite failures to get folks to look at jobs where the cluster exploded before teardown?

collection can be expensive. You may want to defer until the end. Imagine using this for must-gather so we can actually get it on all CI runs. Amazingly, we have failed to achieve even that collection in a consistent way so far.

Member

> Imagine using this for must-gather so we can actually get it on all CI runs.

When clusters die hard enough for must-gather to fail, I doubt ci-operator will be able to reliably guess the right window in which to trigger a must-gather against the toppling cluster. I think log streaming and other continuous monitoring are probably the best we can deliver in those cases.


This is still outstanding. If it's important that we signal, I'd prefer sending SIGTERM rather than messing with files.
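For comparison, the observer side of a SIGTERM-based contract is roughly this much code (the artifacts path and the `observe`/`flushArtifacts` helpers are hypothetical, not anything defined here):

```go
// Hypothetical observer skeleton for a SIGTERM-based contract: collect until
// ci-operator signals that the post steps have started, then flush and exit.
package main

import (
	"context"
	"log"
	"os/signal"
	"syscall"
)

func observe(ctx context.Context) {
	// ... watch pods, operator status, etc., until the context is cancelled ...
	<-ctx.Done()
}

func flushArtifacts(dir string) {
	// ... write everything collected so far under dir for CI to upload ...
	log.Println("artifacts flushed to", dir)
}

func main() {
	// ctx is cancelled when SIGTERM arrives.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
	defer stop()

	go observe(ctx)

	<-ctx.Done() // SIGTERM arrived: teardown is starting
	log.Println("received SIGTERM, flushing artifacts")
	flushArtifacts("/tmp/artifacts") // placeholder artifacts directory
}
```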


Suggest:

4. When the test workflow enters the post-step phase, SIGTERM is sent to each observer process.

4. When it is time for collection, the existing pod (I think it's teardown container), writes a new data entry at `.data['teardown']` into every
`secret/<job>-kubeconfig`.
The value should be a timestamp.
5. Ten minutes after `.data['teardown']` is written, the teardown container rsyncs a well known directory,


Why is this necessary? ci-operator has oversight on the control flow and handles cancellation and failure gracefully, with timeouts and grace periods as necessary.

What do you need from the feature?

  • A signal sent to the observer when the teardown phase of a job begins
  • A (configurable) grace period, starting at that signal, after which artifacts are gathered

Contributor Author

  • A (configurable) grace period, starting at that signal, after which artifacts are gathered

I wouldn't let someone configure it.
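In other words, the timing on the ci-operator side could reduce to something like the sketch below, with a fixed (non-configurable) grace period; the helper functions and the ten-minute value are stand-ins, not anything the proposal specifies:

```go
// Hypothetical ci-operator-side timing: signal the observer, then gather its
// artifacts after a fixed grace period or as soon as it exits, whichever
// comes first.
package main

import (
	"log"
	"time"
)

const gracePeriod = 10 * time.Minute // fixed by the platform, not per-observer

func signalObserver()  { log.Println("SIGTERM sent to observer") }
func gatherArtifacts() { log.Println("artifacts gathered") }

// collectFromObserver is given a channel that closes when the observer exits.
func collectFromObserver(observerDone <-chan struct{}) {
	signalObserver()
	select {
	case <-observerDone: // the observer finished on its own
	case <-time.After(gracePeriod): // or we stop waiting for it
	}
	gatherArtifacts()
}

func main() {
	done := make(chan struct{})
	go func() { time.Sleep(2 * time.Second); close(done) }() // pretend the observer exits quickly
	collectFromObserver(done)
}
```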

Member

why is this not just: ci-operator TERMs the monitor containers once the workflow wraps up entirely, or after the test steps complete (which would also work as an "I'll reap this cluster soon, you can stop watching" signal), or some such, and then uploads anything from the monitor container's mounted artifacts volume (just like the existing step logic)?


Suggest:

5. Observer pods have their artifacts gathered after a grace period, or after the process exits, whichever comes first.

are placed in some reasonable spot.

This could be optimized with a file write in the other direction, but the naive approach doesn't require it.
6. All `pod/<job>-<e2e-observer-name>` are deleted.


This is not germane to this proposal

Contributor Author

> This is not germane to this proposal

it's very germane to the example and concrete bits. I'm happy to not use this sketch if you wish to provide a different one, but as a concrete example of how this could be built, this is pretty straightforward.


This does not fit into the general model of lifecycle in the CI Operator, and it's not germane in that it does not matter to a functional and correct implementation of an observer Pod.


Outstanding, please remove.

@deads2k (Contributor, Author) commented Aug 7, 2020

I will clarify in the doc, but some answers here:

  • multiple observers per cluster

yes

  • multiple instances of one observer against a single test cluster

no

  • the container image running the observer pod to be built from the repository under test

I think no, but I'm open to it if you think it makes your life easier.

2. Your pod must be able to tolerate a kubeconfig that doesn't work. The kubeconfig may point to a cluster that never comes up,
or one that hasn't come up yet. Your pod must not choke on it (a retry loop like the sketch after this list is enough).
3. If your pod fails, it will not fail the e2e job. If you need to fail an e2e job reliably, you need something else.
4. Your pod must terminate when asked.
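The retry loop mentioned in item 2 could be as simple as the sketch below; the paths and intervals are illustrative, and the point is only that an unreachable or not-yet-populated kubeconfig is a reason to sleep and try again rather than to crash:

```go
// Hypothetical "tolerate a broken kubeconfig" loop: keep retrying until the
// cluster answers a discovery call, or give up quietly when the context ends.
package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

func waitForCluster(ctx context.Context, kubeconfigPath string) (*rest.Config, error) {
	for {
		cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfigPath)
		if err == nil {
			if client, cerr := kubernetes.NewForConfig(cfg); cerr == nil {
				if _, verr := client.Discovery().ServerVersion(); verr == nil {
					return cfg, nil // the cluster is finally reachable
				}
			}
		}
		select {
		case <-ctx.Done():
			return nil, ctx.Err() // the cluster never came up; exit without failing the job
		case <-time.After(30 * time.Second):
		}
	}
}

func main() {
	cfg, err := waitForCluster(context.Background(), "/var/run/kubeconfig/kubeconfig") // placeholder path
	if err != nil {
		fmt.Println("never got a working cluster:", err)
		return
	}
	fmt.Println("cluster reachable at", cfg.Host, "- starting to observe")
}
```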
@wking (Member) Aug 7, 2020

I think the most important requirement, which you mention above but should probably repeat here, is non-goal 2: that observer containers must not use the admin kubeconfig they receive to reach into the cluster under test and break things. This should be a read-only monitor, or at most, a "be very careful to only touch safe things" monitor.

The remaining conditions (termination, etc.) can be enforced by ci-operator, so monitors that don't handle them gracefully aren't that big a deal.

### Modify the openshift-tests command
This is easy for some developers, BUT as e2e monitor shows us, it doesn't provide enough information.
We need information from before the test command is run to find many of our problems.
Missing logs and missing intermediate resource states (intermediate operator status) are the most egregious so far.
@wking (Member) Aug 7, 2020

The vanilla e2e suite already records intermediate operator status in the "Monitor cluster while tests execute" test case. It does not seem like that big a reach to me to ask the installer to also monitor and log operator status, instead of its current "just collect this at the end if the install failed" behavior. And that would also be useful information to have in the log for installs that fail in the wild, which will not benefit from CI-specific monitoring suites.

Log preservation is unlikely to be covered by either the installer or the "Monitor cluster while tests execute" test case. I think that's a valid use-case for this enhancement. But being able to configure all clusters, CI and otherwise, to stream logs to an external logging aggregator would be nice too, and would cover both CI and real-world use-cases.
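For a sense of scale, the kind of intermediate operator-status recording being discussed is roughly the watch below; this is a sketch, not resourcewatcher's or the e2e monitor's actual code, and the kubeconfig path is a placeholder:

```go
// Hypothetical intermediate-state recorder: watch ClusterOperators with a
// dynamic client and log every transition. A real observer would persist
// these events to its artifacts directory instead of stdout.
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/var/run/kubeconfig/kubeconfig") // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	client := dynamic.NewForConfigOrDie(cfg)

	clusterOperators := schema.GroupVersionResource{Group: "config.openshift.io", Version: "v1", Resource: "clusteroperators"}
	w, err := client.Resource(clusterOperators).Watch(context.Background(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	defer w.Stop()

	for event := range w.ResultChan() {
		if u, ok := event.Object.(*unstructured.Unstructured); ok {
			log.Printf("%s ClusterOperator %s", event.Type, u.GetName())
		}
	}
}
```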

2. allow a cluster observer to impact the running test. This is an observer author failure.
3. provide ANY dimension specific data. If an observer needs this, they need to integrate differently.
4. allow multiple instances of a single cluster observer to run against one CI cluster
5. allow providing the cluster observer from a repo being built. The images are provided separately.


This is trivial to allow, we can either remove it or move it to goals


(like, the mechanism we use to ensure that the Pod can only start when the image is imported is identical to that which is used if we're waiting for it to be built)

## Proposal

The overall goal:
1. have an e2e-observer process is expected to be running before a kubeconfig exists


nit: s/is/that is/



2. Before a kubeconfig is present, create an instance of each binary from #1 and an empty `secret/<job>-kubeconfig`.
3. As soon as a kubeconfig is available (these go into a known location in setup container today), write that
`kubeconfig` into every `secret/<job>-kubeconfig` (you probably want to label them).
4. When it is time for collection, the existing pod (I think it's teardown container), issues a sig-term to the process or perhaps


we spoke about relaxing this to SIGTERM

@stevekuznetsov left a comment

/lgtm
/approve

@openshift-ci-robot added the `lgtm` label (Indicates that a PR is ready to be merged.) Oct 28, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, stevekuznetsov

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:
  • OWNERS [deads2k,stevekuznetsov]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot merged commit 81fd2ea into openshift:master Oct 28, 2020