-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CI monitor reading #10859
CI monitor reading #10859
Conversation
Please set the appropriate release note label. |
3 similar comments
Please set the appropriate release note label. |
Please set the appropriate release note label. |
Please set the appropriate release note label. |
test-focus K8sDatapathConfig.* |
test-focus K8sDatapathConfig.* |
test-me-please |
test-gke |
4d209a7
to
9d8e6fd
Compare
test-gke |
test-me-please |
9d8e6fd
to
fc5b553
Compare
test-gke |
1 similar comment
test-gke |
test-me-please |
1 similar comment
test-me-please |
test-gke |
fc5b553
to
02c2936
Compare
test-me-please |
test-gke |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two comments inline, good stuff overall
test/k8sT/DatapathConfiguration.go
Outdated
|
||
var monitorOutput []byte // Accumulate output here over time | ||
body := func() bool { | ||
monitorOutput = append(monitorOutput, monitorRes.CombineOutput().Bytes()...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
won't this end up appending the same beginning of output over and over again, as we are not consuming the bytes that were already read. This may be ok if we just want to find "bad" lines, but I think it would be enough to just monitorOutput = monitorRes.CombineOutput().Bytes()
. The same applies to all similar lines below (196, 890 in Policies.go).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. For some reason I thought that there was a .Read call on a bytes.Buffer somewhere but that doesn't seem to be the case. I've switch it out. A bit less code too! I realised that we already have a WaitUntilMatch already on CmdRes.
test/k8sT/Policies.go
Outdated
monitorOutput = append(monitorOutput, monitorRes.CombineOutput().Bytes()...) | ||
|
||
// Let the monitor get started since it is started in the background. | ||
time.Sleep(2 * time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can drop the sleep
s since we are using a timeout anyway (we can return false
if exec is not successful)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I switched this to not use WithTimeout.
e70ee15
to
13b7314
Compare
test-focus K8sDatapathConfiguration.* (https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated-Focus/66/) |
test-focus K8sDatapathConfiguration.* (https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated-Focus/66) |
test-focus K8sDatapathConfiguration.* (this passed but failed on artefact collection :/ https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated-Focus/71/) |
test-focus K8sDatapathConfiguration.* |
test-focus K8sDatapathConfig.* |
test-me-please (timeout :/ https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated/18791/) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome, thanks for picking this up!
Minor nits below around calling of the cancel function.
test/k8sT/DatapathConfiguration.go
Outdated
|
||
err := helpers.WithTimeout(body, "monitor aggregation did not send notifications", &helpers.TimeoutConfig{Timeout: helpers.HelperTimeout}) | ||
Expect(err).To(BeNil(), "Could not read monitor log") | ||
monitorCancel() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
monitorCancel won't be run if the Expect(err)
fails one line above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. I remember thinking that I had to do it this way but I can't see why now (maybe I though it was created inside the body function? clearly it isn't #codingmysteries).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed with Joe's comment, aside from that LGTM!
Exposing writing to a file allows us to be more consistent with these artifacts. Signed-off-by: Ray Bejjani <[email protected]>
We previously started monitor, ran some traffic, then stopped monitor and checked the output. This could race and sometimes missed packets, failing tests. We now accumulate the monitor output over time, and pass only when the output contents meet the test requirements, failing after a timeout has passed. Signed-off-by: Ray Bejjani <[email protected]>
cb00ba4
to
204c05b
Compare
test-me-please (timeout) |
test-gke |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, assuming Expect()
doesn't do anything weird that might cause the defer to be skipped.
test-me-please |
I tested by adding a forced fail and some prints. I see the prints on fail and pass so Expect doesn't short-circuit something to bypass defers. I'm going to merge this since. The test failures I do see are wholly unrelated (including test timeouts). |
We see flakes in tests that read from monitor. These are structured to search monitor output for specific substrings. Since monitor may take some time to start up, we can change how we read its output. The 3 tests now read this incrementally, and fail on a timeout. This should allow it to pass when monitor was simply slow to print.
I haven't seen the specific flake in a bunch of testing with this PR. Most test runs have failed, however, often in other tests. I'm fairly confident that I didn't break anything but I would prefer more clean passes. It does do better on GKE and that is where we see the flake the most, however.
@joestringer I think we once discussed this flake in a PR, so I pulled you in for review. Feel free to unassign yourself.