Modified node-kubelet-flaky job and image spec #17687

Merged · 1 commit · Jun 10, 2020
4 changes: 2 additions & 2 deletions config/jobs/kubernetes/sig-node/node-kubelet.yaml
@@ -177,8 +177,8 @@ periodics:
- --deployment=node
- --gcp-project-type=node-e2e-project
- --gcp-zone=us-west1-b
-  - --node-args=--image-config-file=/workspace/test-infra/jobs/e2e_node/image-config.yaml
-  - --node-test-args= --kubelet-flags="--cgroups-per-qos=true --cgroup-root=/"
+  - --node-args=--image-config-file=/workspace/test-infra/jobs/e2e_node/perf-image-config.yaml
+  - --node-test-args= --kubelet-flags="--cgroups-per-qos=true --cgroup-root=/" --server-start-timeout=420s
- --node-tests=true
- --provider=gce
- --test_args=--nodes=1 --focus="\[Flaky\]"
15 changes: 15 additions & 0 deletions jobs/e2e_node/perf-image-config.yaml
@@ -0,0 +1,15 @@
---
images:
Member:

the existing job also runs for ubuntu, can we include that here?

Contributor Author:

That is a good point... my decision-making for this was purely based on SIG Node conversations about which images to run tests on, and the COS ones are a definite yes.
Everything else that isn't COS seems to be up in the air at the moment (?).
During a previous meeting there was also a mention of organizing things so that each job only runs one image.

Right now, kubelet-flaky is running tests on this ubuntu image as well

image: ubuntu-gke-1804-1-16-v20200330 # docker 17.03

So, asking for feedback here: should this version of kubelet-flaky run ubuntu as well? And if so, should it run the same ubuntu image it has been using until now, or another?

/cc @spiffxp @MHBauer @bsdnet @vpickard

Contributor:

@alejandrox1 Let's also update the Ubuntu image, since those are the two images currently running. Once we get them green, per the meeting, we may add more.
I will start a topic in sig-node to discuss how to support multiple images in a scalable way.

Contributor:

I don't know who monitors the output of this (clearly nobody), nor do I know if it surfaces in a dashboard.

Contributor Author:

Sweet, so I'm looking at all the ubuntu images for the ubuntu-os-gke-cloud project

gcloud compute images list --project ubuntu-os-gke-cloud

and ubuntu-gke-1804-1-17-v20200605 seems to be the latest one.

As per monitoring this later on, we'll figure it out 😅
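The "latest one" above is just the image whose `vYYYYMMDD` suffix sorts highest. A minimal sketch of that selection (pure Python; the candidate list is illustrative — in practice it would come from the `gcloud compute images list` output quoted above):

```python
# Hedged sketch: pick the newest ubuntu-gke image by its vYYYYMMDD date
# suffix. Candidate names are illustrative, taken from this thread.
names = [
    "ubuntu-gke-1804-1-16-v20200330",  # image used by the existing job
    "ubuntu-gke-1804-1-17-v20200605",  # latest at the time of this PR
]

def date_suffix(name: str) -> str:
    # "ubuntu-gke-1804-1-17-v20200605" -> "20200605"
    return name.rsplit("-v", 1)[1]

latest = max(names, key=date_suffix)
print(latest)  # ubuntu-gke-1804-1-17-v20200605
```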

cos-stable1:
image: cos-81-12871-119-0 # docker 19.03.6, current latest LTS.
project: cos-cloud
machine: n1-standard-16
metadata: "user-data<test/e2e_node/jenkins/gci-init.yaml,gci-update-strategy=update_disabled"
tests:
- 'Node Performance Testing'
Member:

(this is me not fully understanding how all of this works)

Does this take precedence over the [Flaky] regex in the job? Work in combination with it?

It feels like we're overloading/misdirecting by using the term "flaky" when really this job seems to be running perf tests.

Contributor Author (@alejandrox1, Jun 9, 2020):

It works in combination with the focus on [Flaky].

And it is overloading and misdirecting.
The long version of the story is in
kubernetes/kubernetes#91263,
but the short version is that these tests were meant to run on an n1-standard-16 sized machine, see kubernetes/kubernetes#65250.

But for some reason the e2e tests running in the current version of the job were never properly modified to run on that machine and have been running since the beginning on an n1-standard-1.
As such, they have always been failing.
Then they were labeled as "flaky" in kubernetes/kubernetes#78548 with the intent of removing them from the kubelet-serial job.
The idea was to move them to their own e2e job so they didn't pollute another job.

My idea here was to explicitly specify within the image config that this whole fix is only for a handful of performance tests that were labeled as flaky because they were running on the wrong machine.
I think a better place for the tests in this job would be https://github.com/kubernetes/test-infra/blob/master/jobs/e2e_node/benchmark-config.yaml, but I wanted to take it one step at a time.
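The "works in combination" behavior described above can be sketched as an intersection: a test runs only if it matches both the job-level `--focus` regex and one of the per-image `tests` patterns from the image config (test names below are illustrative, not the real node e2e suite):

```python
import re

# Hedged sketch of how the two filters combine; this is an assumption
# about the selection logic, illustrated with made-up test names.
focus = re.compile(r"\[Flaky\]")                      # from --test_args
image_tests = [re.compile(r"Node Performance Testing")]  # from image config

candidates = [
    "[Flaky] Node Performance Testing: short-lived pods",
    "[Flaky] some other kubelet test",
    "Node Performance Testing: not tagged flaky",
]

# A test must satisfy BOTH the focus regex and an image `tests` pattern.
selected = [
    t for t in candidates
    if focus.search(t) and any(p.search(t) for p in image_tests)
]
print(selected)  # only the first candidate matches both filters
```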

Contributor:

It happens to be the only test marked flaky left in node_e2e.

I think the idea of both the Flaky regex and 'Node Performance Testing' is for this to be moved out of flaky and into its own job eventually. (It should probably be its own job already, because nothing else is backed by an n1-standard-16.)

Member:

ok, I won't block on this one as long as there's follow-up to either rename this job to something like ci-kubernetes-node-kubelet-perf, or split out a new job that does so... I think the node perf tests shouldn't be tagged as [Flaky] but instead something like [Feature:Node Performance].

We currently have no way of verifying/linking whether all e2e tests are executed by at least one CI job, so I could understand if someone wants to preemptively keep a flaky job around even if there are no flaky tests.

Contributor Author:

> ok, I won't block on this one as long as there's follow-up to either rename this job to something like ci-kubernetes-node-kubelet-perf, or split out a new job that does so... I think the node perf tests shouldn't be tagged as [Flaky] but instead something like [Feature:Node Performance]

YES to all of this! Probably split out a new job for this (or move the contents of the image config into the jobs running benchmarks, TBD with SIG Node).
And a huge +1 on [Feature:Node Performance].

ubuntu:
image: ubuntu-gke-1804-1-17-v20200605 # docker 19.03.2 / containerd 1.2.10
project: ubuntu-os-gke-cloud
machine: n1-standard-16
tests:
- 'Node Performance Testing'