Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Go Test failed sometimes #1649

Closed
anencore94 opened this issue Aug 31, 2021 · 18 comments
Closed

Go Test failed sometimes #1649

anencore94 opened this issue Aug 31, 2021 · 18 comments

Comments

@anencore94
Copy link
Member

/kind bug

What steps did you take and what happened:
[A clear and concise description of what the bug is.]

One of github action for Go Test failed sometimes, but it succeed when retest.

I've check for some cases:

All case failed for either one of these:

    trial_controller_test.go:261: 
        Timed out after 40.000s.
        Expected
            <bool>: false
        to be true
FAIL
    experiment_controller_test.go:350: 
        Timed out after 40.000s.
        Expected
            <bool>: false
        to be true

What did you expect to happen:

If these issue occured just from the network issue, what about increasing the timeout ?
Does it make sense ? WYDT ?

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

Environment:

  • Kubeflow version (kfctl version):
  • Minikube version (minikube version):
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):
@andreyvelich
Copy link
Member

Thank you for creating this @anencore94!
Yes, this unit test is flaky.
envtest sometime can fail in the Controller test.

Any ideas how to improve this and keep the same coverage level would be appreciated!

@anencore94
Copy link
Member Author

@andreyvelich Thanks :)
If the envtest you linked failed, could you link the action log in this issue ?
So we can keep track of it, then anybody could fix it

@andreyvelich
Copy link
Member

andreyvelich commented Sep 6, 2021

Let's continue discussion: #1654 (review) for improving controller unit test here.
/priority p2

@anencore94 Thank you for taking this!
Are you sure that increasing timeout will help to avoid flaky unit tests ?

For example in this failed test: https://github.com/kubeflow/katib/runs/3523610062?check_suite_focus=true.
Trial status was updated to Metrics Unavailable, but tests were failed on the previous step, where we compare observation results.

I think to fix it, we might need to split running Experiments in the different unit tests.

For example, currently in the TestReconcileBatchJob we run 3 Experiments:

Trial run with "Failed" BatchJob
Trail with "Complete" BatchJob and Available metrics.
Trail with "Complete" BatchJob and Unavailable metrics.
Maybe we should split it in 3 different unit tests with separate test managers.

@gaocegege
Copy link
Member

Maybe we should split it in 3 different unit tests with separate test managers.

I am wondering why it works. Splitting test cases just makes it fail fast, I think.

@anencore94
Copy link
Member Author

Are you sure that increasing timeout will help to avoid flaky unit tests ?

No, I couldn't for sure as you might expected.

By the way, from your linked action result, I couldn't get this order from test results

{"level":"info","ts":1630924172.6146765,"logger":"trial-controller","msg":"Trial status changed to Failed","Trial":"default/test-trial"}
{"level":"info","ts":1630924172.6534996,"logger":"trial-controller","msg":"Trial status changed to Succeeded","Trial":"default/test-trial"}
{"level":"info","ts":1630924172.6575294,"logger":"trial-controller","msg":"Trial status changed to Metrics Unavailable","Trial":"default/test-trial"}

From the trial_controller_test.go, after trial status changed to Failed, I think the log with trial deleted should follows. But there isn't any.
Even if Supposing that "the log with trial deleted" were implemented to never logged at all, But I couldn't get why test3 was started before waiting test2 to be done and delete it from the above github workflow.

@andreyvelich
Copy link
Member

@anencore94 What do you think about splitting Experiment test cases in the different unit test functions ?
Each function will have separate envtest manager setup.

@anencore94
Copy link
Member Author

What do you think about splitting Experiment test cases in the different unit test functions ?
Each function will have separate envtest manager setup.

I think that helps to debug, but when they are splitted, we couldn't call the test function as ReconcileTest anymore?

@andreyvelich
Copy link
Member

we couldn't call the test function as ReconcileTest anymore?

What do you mean by that ?
We should have various function names for each test Experiment.
For example,
TestReconcileJobFailed(t *testing.T), TestReconcileJobAvailableMetrics(t *testing.T), TestReconcileJobUnavailableMetrics(t *testing.T).

@anencore94
Copy link
Member Author

We should have various function names for each test Experiment.
For example,
TestReconcileJobFailed(t *testing.T), TestReconcileJobAvailableMetrics(t *testing.T), TestReconcileJobUnavailableMetrics(t *testing.T).

Yes, your suggestion is right.
But the reaseon why I intended "we couldn't call the test function as ReconcileTest anymore?" was that I felt handling those 3 cases at once make sense.

However, splitting those 3 would ok.

@stale
Copy link

stale bot commented Jan 3, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Jan 3, 2022
@stale
Copy link

stale bot commented Mar 2, 2022

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

@stale stale bot closed this as completed Mar 2, 2022
@tenzen-y
Copy link
Member

/reopen

@google-oss-prow
Copy link

@tenzen-y: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@google-oss-prow google-oss-prow bot reopened this Jun 14, 2022
@tenzen-y
Copy link
Member

/lifecycle frozen

@johnugeorge
Copy link
Member

We need to inspect if we have a better solution with gingko v2

@tenzen-y
Copy link
Member

We need to inspect if we have a better solution with gingko v2

I agree with you.
We can take it after the next release.

@tenzen-y
Copy link
Member

This should be fixed by #2350
/close

Copy link

@tenzen-y: Closing this issue.

In response to this:

This should be fixed by #2350
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants