CI monitor reading #10859

Merged
merged 2 commits into from
Apr 17, 2020
Conversation

raybejjani
Contributor

@raybejjani raybejjani commented Apr 6, 2020

We see flakes in tests that read from monitor. These tests search the monitor output for specific substrings. Since monitor may take some time to start up, we can change how we read its output. The 3 tests now read the output incrementally and fail only on a timeout. This should allow them to pass when monitor was simply slow to print.

I haven't seen the specific flake in a bunch of testing with this PR. Most test runs have failed, however, often in other tests. I'm fairly confident that I didn't break anything but I would prefer more clean passes. It does do better on GKE and that is where we see the flake the most, however.

@joestringer I think we once discussed this flake in a PR, so I pulled you in for review. Feel free to unassign yourself.

@raybejjani raybejjani added wip area/CI-improvement Topic or proposal to improve the Continuous Integration workflow labels Apr 6, 2020
@maintainer-s-little-helper

Please set the appropriate release note label.

@raybejjani
Contributor Author

test-focus K8sDatapathConfig.*

@coveralls

coveralls commented Apr 6, 2020

Coverage Status

Coverage increased (+0.03%) to 46.803% when pulling 204c05b on raybejjani:ci-monitor-reading into e92fd24 on cilium:master.

@raybejjani
Contributor Author

test-focus K8sDatapathConfig.*

@raybejjani
Contributor Author

test-me-please

@raybejjani
Contributor Author

test-gke

@raybejjani raybejjani force-pushed the ci-monitor-reading branch from 4d209a7 to 9d8e6fd Compare April 6, 2020 16:20
@raybejjani
Contributor Author

test-gke

@raybejjani
Contributor Author

test-me-please

@raybejjani raybejjani force-pushed the ci-monitor-reading branch from 9d8e6fd to fc5b553 Compare April 6, 2020 16:52
@raybejjani
Contributor Author

test-gke

1 similar comment
@raybejjani
Contributor Author

test-gke

@raybejjani
Contributor Author

test-me-please

1 similar comment
@raybejjani
Contributor Author

test-me-please

@raybejjani
Contributor Author

test-gke

@raybejjani raybejjani force-pushed the ci-monitor-reading branch from fc5b553 to 02c2936 Compare April 7, 2020 10:41
@raybejjani
Contributor Author

test-me-please

@raybejjani
Contributor Author

test-gke

Member

@nebril nebril left a comment

Two comments inline, good stuff overall


var monitorOutput []byte // Accumulate output here over time
body := func() bool {
monitorOutput = append(monitorOutput, monitorRes.CombineOutput().Bytes()...)
Member

Won't this end up appending the same beginning of the output over and over again, since we are not consuming the bytes that were already read? This may be OK if we just want to find "bad" lines, but I think it would be enough to just do monitorOutput = monitorRes.CombineOutput().Bytes(). The same applies to all similar lines below (196, 890 in Policies.go).

Contributor Author

You're right. For some reason I thought that there was a .Read call on a bytes.Buffer somewhere, but that doesn't seem to be the case. I've switched it out. A bit less code too! I realised that we already have a WaitUntilMatch on CmdRes.

monitorOutput = append(monitorOutput, monitorRes.CombineOutput().Bytes()...)

// Let the monitor get started since it is started in the background.
time.Sleep(2 * time.Second)
Member

We can drop the sleeps since we are using a timeout anyway (we can return false if exec is not successful)

Contributor Author

I switched this to not use WithTimeout.

@raybejjani raybejjani force-pushed the ci-monitor-reading branch 2 times, most recently from e70ee15 to 13b7314 Compare April 8, 2020 12:46
@raybejjani
Contributor Author

raybejjani commented Apr 15, 2020

test-focus K8sDatapathConfiguration.* (https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated-Focus/66/)

@raybejjani
Contributor Author

raybejjani commented Apr 15, 2020

test-focus K8sDatapathConfiguration.* (https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated-Focus/66)

@raybejjani
Contributor Author

raybejjani commented Apr 15, 2020

test-focus K8sDatapathConfiguration.* (this passed but failed on artefact collection :/ https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated-Focus/71/)

@raybejjani
Contributor Author

test-focus K8sDatapathConfiguration.*

@raybejjani raybejjani requested a review from nebril April 15, 2020 14:40
@raybejjani raybejjani requested a review from joestringer April 15, 2020 14:44
@raybejjani raybejjani marked this pull request as ready for review April 15, 2020 14:47
@raybejjani raybejjani requested a review from a team as a code owner April 15, 2020 14:47
@raybejjani
Contributor Author

test-focus K8sDatapathConfig.*

@raybejjani
Contributor Author

raybejjani commented Apr 15, 2020

@raybejjani raybejjani added release-note/ci This PR makes changes to the CI. and removed dont-merge/debug-only labels Apr 15, 2020
@raybejjani raybejjani changed the title Ci monitor reading CI monitor reading Apr 15, 2020
Member

@joestringer joestringer left a comment

This is awesome, thanks for picking this up!

Minor nits below around calling of the cancel function.


err := helpers.WithTimeout(body, "monitor aggregation did not send notifications", &helpers.TimeoutConfig{Timeout: helpers.HelperTimeout})
Expect(err).To(BeNil(), "Could not read monitor log")
monitorCancel()
Member

monitorCancel won't be run if the Expect(err) fails one line above.

Contributor Author

Good catch. I remember thinking that I had to do it this way but I can't see why now (maybe I thought it was created inside the body function? Clearly it isn't #codingmysteries).

Member

@nebril nebril left a comment

Agreed with Joe's comment, aside from that LGTM!

Exposing writing to a file allows us to be more consistent with these
artifacts.

Signed-off-by: Ray Bejjani <[email protected]>

We previously started monitor, ran some traffic, then stopped monitor
and checked the output. This could race and sometimes missed packets,
failing tests. We now accumulate the monitor output over time, and pass
only when the output contents meet the test requirements, failing after
a timeout has passed.

Signed-off-by: Ray Bejjani <[email protected]>
@raybejjani
Contributor Author

raybejjani commented Apr 16, 2020

test-me-please (timeout)

@raybejjani
Contributor Author

test-gke

@raybejjani raybejjani requested a review from joestringer April 16, 2020 16:05
Member

@joestringer joestringer left a comment

LGTM, assuming Expect() doesn't do anything weird that might cause the defer to be skipped.

@raybejjani
Contributor Author

test-me-please

@raybejjani
Contributor Author

I tested by adding a forced fail and some prints. I see the prints on fail and pass so Expect doesn't short-circuit something to bypass defers.

I'm going to merge this, since the test failures I do see are wholly unrelated (including test timeouts).

@raybejjani raybejjani merged commit 9fbd820 into cilium:master Apr 17, 2020
@raybejjani raybejjani deleted the ci-monitor-reading branch April 17, 2020 10:25
Labels
area/CI-improvement Topic or proposal to improve the Continuous Integration workflow release-note/ci This PR makes changes to the CI.
4 participants