Skip to content

Conversation

@pohly
Copy link
Contributor

@pohly pohly commented May 23, 2023

What type of PR is this?

/kind bug
/kind failing-test

What this PR does / why we need it:

Applying it to the entire spec included cleaning up, which makes predicting the acceptable duration harder because it includes code not owned by the test itself. It's better to specify a timeout only for the test code itself.

Which issue(s) this PR fixes:

Fixes #118175

Special notes for your reviewer:

It's not clear why this started flaking. Somehow the namespace cleanup must have gotten slower.

Does this PR introduce a user-facing change?

NONE

Applying it to the entire spec included cleaning up, which makes predicting the
acceptable duration harder because it includes code not owned by the test
itself. It's better to specify a timeout only for the test code itself.
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 23, 2023
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label May 23, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 23, 2023
@aojea
Copy link
Member

aojea commented May 23, 2023

Thanks Patrick
/lgtm

It's not clear why this started flaking. Somehow the namespace cleanup must have gotten slower.

yeah, that will be a nice small project for new contributor, instrument the e2e framework to measure these things and being able to analyze trends for regressions

@onsi did you even have request for these kind of metrics on the project?

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 23, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

DetailsGit tree hash: 646795e25facf27ce26c123e0d091f63cf6b7b43

@k8s-ci-robot k8s-ci-robot merged commit 61d455e into kubernetes:master May 23, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.28 milestone May 23, 2023
for _, t := range tests {
test := t
ginkgo.It(t.name, ginkgo.SpecTimeout(f.Timeouts.PodStart), func(ctx context.Context) {
ginkgo.It(t.name, ginkgo.NodeTimeout(f.Timeouts.PodStart), func(ctx context.Context) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the NodeTimeout does not consider the ginkgo.DeferCleanup , right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's the difference.

@onsi
Copy link
Contributor

onsi commented May 23, 2023

hey @aojea - I think this would be a valuable project for something of the scale and scope of k8s e2e. I haven't gotten many requests around it, no - but I can paint a picture for where one might start today and what could be added to Ginkgo over time.

Today - Ginkgo's JSON report format includes a series of SpecEvents for each spec. This includes things like "this node, located at this file location, began at time T and ended at time T+d" as well as "this By at this file location and with this string was called at time T".

I could easily imagine some independent code that consumes JSON reports over time and looks for patterns to identify trends. The biggest limitation right now would be how to match up nodes across different test runs. My intuition is that the combination of spec text (i.e. the concatenation of all the container texts and the It text) plus the file and line number of the node would be a decent identifier. It would not be 100% stable (the text may change or, more obviously, the line number could change) - but I could imagine some code that does a fuzzy match and can get "good enough" at matching up nodes between runs.

This would then allow you to track, statistically, what the median node runtimes are and if any are drifting and which commits correlate with the beginning of the drift.

All of this could be done today and the first iteration of this would not take very long to build at all.

From there I could imagine a few enhancements:

  • the ability to annotate a given node with a user-provided identifier that can be used to track it across test runs and is encoded as part of the node's SpecEvent.
  • graduating this into the Ginkgo CLI as a subcommand that can analyze reports to identify patterns/trends.

I don't know how new ideas/projects germinate in k8s but I'd be happy to support an effort to explore this.

k8s-ci-robot added a commit that referenced this pull request Jun 1, 2023
…0-upstream-release-1.27

Automated cherry pick of #118200: e2e: apply timeout for CSI Storage Capacity test only to node
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Flaky] CSIStorageCapacity e2e tests

4 participants