add new extra component to --wait=all to validate a healthy cluster #10424

Merged: 6 commits, Feb 16, 2021

prezha
Contributor

@prezha prezha commented Feb 9, 2021

fixes #9936
fixes #10130

i've slightly improved the validateComponentHealth functional test, and in this pr i'm also proposing to change how we check whether a pod is actually available:

currently, we rely on the pod's Running status, but according to the Pod phase documentation
(ref: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase):

The phase of a Pod is a simple, high-level summary of where the Pod is in its lifecycle. The phase is not intended to be a comprehensive rollup of observations of container or Pod state, nor is it intended to be a comprehensive state machine.

The number and meanings of Pod phase values are tightly guarded. Other than what is documented here, nothing should be assumed about Pods that have a given phase value.

Running - The Pod has been bound to a node, and all of the containers have been created. At least one container is still running, or is in the process of starting or restarting.

so Running is a prerequisite, but it is not sufficient to consider a pod operational - we should use the Pod conditions (Ready) instead (ref: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-conditions):

ContainersReady: all containers in the Pod are ready.
Initialized: all init containers have started successfully.
Ready: the Pod is able to serve requests and should be added to the load balancing pools of all matching Services.

i've changed accordingly how we wait for the api server in kubeadm.WaitForNode as well as in kverify.WaitForSystemPods - these changes should make the validateComponentHealth functional test less likely to fail, and improve overall stability
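
A minimal sketch of the difference between the two checks, assuming simplified local stand-ins for the pod status fields (the PodStatus and PodCondition types and the isRunning and isReady helpers are hypothetical and are not the minikube or client-go code):

```go
package main

import "fmt"

// PodCondition and PodStatus are simplified, hypothetical stand-ins for
// the corresponding Kubernetes API types; only the fields used here exist.
type PodCondition struct {
	Type   string // e.g. "Initialized", "ContainersReady", "Ready"
	Status string // "True", "False" or "Unknown"
}

type PodStatus struct {
	Phase      string // e.g. "Pending", "Running"
	Conditions []PodCondition
}

// isRunning is the weaker check: it only looks at the pod phase.
func isRunning(s PodStatus) bool {
	return s.Phase == "Running"
}

// isReady is the stricter check: the Ready condition must be "True".
func isReady(s PodStatus) bool {
	for _, c := range s.Conditions {
		if c.Type == "Ready" {
			return c.Status == "True"
		}
	}
	return false // no Ready condition reported yet
}

func main() {
	// A pod whose containers are created but still starting or restarting.
	starting := PodStatus{
		Phase: "Running",
		Conditions: []PodCondition{
			{Type: "Initialized", Status: "True"},
			{Type: "Ready", Status: "False"},
		},
	}
	fmt.Println("running:", isRunning(starting), "ready:", isReady(starting))
}
```

A pod like the one in main passes the phase check while failing the Ready check, which is the gap described above.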

three additional notes:

  1. depending on how it would be used, we can either reduce kverify.SystemPodsList (eg, remove currently redundant 'kube-apiserver' from the list -or- amend the kverify.*Components and remove APIServerWaitKey if SystemPodsWaitKey is defined, as it's using the SystemPodsList)
  2. if this makes sense, we can revisit how we wait for the api server in other places (currently using cmd.waitForAPIServerProcess in docker-env, and kverify.waitForAPIServerProcess in apiserver used in kubeadm.restartControlPlane)
  3. we are using healthz endpoint that is deprecated (ref: https://kubernetes.io/docs/reference/using-api/health-checks/#api-endpoints-for-health):
The Kubernetes API server provides 3 API endpoints (healthz, livez and readyz) to indicate the current status of the API server. 
The healthz endpoint is deprecated (since Kubernetes v1.16), and you should use the more specific livez and readyz endpoints instead
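
As an illustration of note 3, here is a hedged sketch of probing the more specific endpoints. It runs against a fake local server (newFakeAPIServer is a hypothetical stand-in), and probe is a hypothetical helper; a real API server would additionally need TLS and credentials, which are left out here:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// client uses a short timeout so a hung endpoint does not block the caller.
var client = &http.Client{Timeout: 2 * time.Second}

// probe reports whether GET baseURL+path answers 200 OK.
func probe(baseURL, path string) (bool, error) {
	resp, err := client.Get(baseURL + path)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK, nil
}

// newFakeAPIServer is a hypothetical stand-in for an API server that is
// alive (/livez succeeds) but not yet ready to serve (/readyz fails).
func newFakeAPIServer() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/livez", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ok")
	})
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, "not ready", http.StatusServiceUnavailable)
	})
	return httptest.NewServer(mux)
}

func main() {
	srv := newFakeAPIServer()
	defer srv.Close()
	for _, path := range []string{"/livez", "/readyz"} {
		ok, err := probe(srv.URL, path)
		fmt.Println(path, "ok:", ok, "err:", err)
	}
}
```

The point of the split is that a server can answer /livez successfully while /readyz still fails, so a wait loop that cares about serving traffic would poll /readyz.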

time metrics (all completed successfully - here are just resulting values for brevity)

before

❯ time minikube start --driver=docker
minikube start --driver=docker  9.03s user 3.60s system 14% cpu 1:27.75 total

after

❯ time minikube start --driver=docker
minikube start --driver=docker  10.54s user 4.22s system 8% cpu 2:52.82 total
❯ time minikube start --driver=docker --wait=all
minikube start --driver=docker --wait=all  10.19s user 3.95s system 7% cpu 2:58.58 total
❯ time minikube start --driver=docker --wait=none
minikube start --driver=docker --wait=none  9.66s user 3.93s system 15% cpu 1:28.81 total

as expected, the time increased for a 'plain' minikube start - explanation:
the default value for the wait flag is kverify.DefaultWaitList (APIServerWaitKey, SystemPodsWaitKey), so kubeadm.WaitForNode will trigger WaitForPodReadyByLabel for all system components (incl. apiserver) => currently, minikube start is equivalent to minikube start --wait=all

on the other hand:

  • as in this pr, --wait=all will force wait for all system components to become fully operational
  • if --wait=none, then wait does not happen (no additional wait time is added): should we change the default value for --wait to none then, or we should trigger this strict_check only if explicitly asked for (and also with --all)?

one thing may be worth mentioning: just --wait flag, w/o specifying =xxx does not work as (i've) expected - ie, it's not taking the default value (kverify.DefaultWaitList), but it's 'greedy':

❯ minikube start --driver=docker --wait --alsologtostderr -v=99
...
W0209 16:24:12.243601   26050 start_flags.go:705] The value "--alsologtostderr" is invalid for --wait flag. valid options are "apiserver,system_pods,default_sa,apps_running,node_ready,kubelet"
...

(and that could be explained with the fact that = is optional anyway...)
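
The greedy behaviour can be reproduced in isolation. The sketch below uses the Go standard library flag package as a stand-in (minikube itself goes through cobra, so this is an analogy rather than the actual code path), and parseWait is a hypothetical helper:

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// parseWait parses args with a single string flag named "wait" and returns
// its value plus any remaining arguments.
func parseWait(args []string) (string, []string, error) {
	fs := flag.NewFlagSet("start", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep parse errors out of the demo output
	wait := fs.String("wait", "apiserver,system_pods", "components to wait for")
	if err := fs.Parse(args); err != nil {
		return "", nil, err
	}
	return *wait, fs.Args(), nil
}

func main() {
	// Without "=", a string flag takes the next argument as its value,
	// even when that argument looks like another flag.
	value, rest, err := parseWait([]string{"--wait", "--alsologtostderr"})
	fmt.Printf("value=%q rest=%q err=%v\n", value, rest, err)

	// With "=", the value is unambiguous.
	value, rest, err = parseWait([]string{"--wait=all"})
	fmt.Printf("value=%q rest=%q err=%v\n", value, rest, err)
}
```

In the first call the flag swallows "--alsologtostderr" as its value, which matches the warning in the log above.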

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 9, 2021
@k8s-ci-robot
Contributor

Hi @prezha. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 9, 2021
@minikube-bot
Collaborator

Can one of the admins verify this patch?

Member

@medyagh medyagh left a comment

this is good work, and I agree. one thing to consider:
we should ensure our current "minikube start" won't add additional wait for components,
but we also want "minikube start --wait=all" to wait for every single thing possible (as proposed in this PR, we should not rely only on the pod phase but also wait for the pods to be actually running)

so if possible, please post in the PR description the time metrics before and after this PR for a "normal start"; then, if it adds any new time, we can introduce a new component to the wait flag

currently the available options for --wait are

  • Default (apiserver,system_pods)
  • False (no wait)
  • All (everything)
  • or a combination of available options "apiserver,system_pods,default_sa,apps_running,node_ready,kubelet"

we can introduce a new option (strict_check), so that --wait=all will be "apiserver,system_pods,default_sa,apps_running,node_ready,kubelet,strict_check"

note that I don't actually propose the name "strict_check"; any other name you come up with would be good.
this will allow users to choose the level of waiting they want (so if they would rather have a fast start and don't care much about the strict check, they can pass only the parameters they want)

but --wait=all should always give you ALL possible waits to verify everything
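
The options described above could be interpreted roughly as in the sketch below. This is not minikube's implementation: interpretWait is a hypothetical helper, "strict_check" is only the placeholder name floated in this comment, and the accepted spellings ("all", "none", "false", or an empty value for the default) are assumptions based on this thread:

```go
package main

import (
	"fmt"
	"strings"
)

// Component names as listed in this thread; "strict_check" is only the
// placeholder name from the review, not a confirmed option.
var allComponents = []string{
	"apiserver", "system_pods", "default_sa",
	"apps_running", "node_ready", "kubelet", "strict_check",
}

var defaultComponents = []string{"apiserver", "system_pods"}

// interpretWait maps a --wait value to the set of components to wait for.
func interpretWait(value string) (map[string]bool, error) {
	enabled := make(map[string]bool, len(allComponents))
	for _, c := range allComponents {
		enabled[c] = false
	}
	var want []string
	switch value {
	case "":
		want = defaultComponents // plain start: no extra waiting
	case "all":
		want = allComponents // everything, including the strict check
	case "none", "false":
		want = nil // no waiting at all
	default:
		want = strings.Split(value, ",")
	}
	for _, name := range want {
		name = strings.TrimSpace(name)
		if _, known := enabled[name]; !known {
			return nil, fmt.Errorf("invalid --wait component %q", name)
		}
		enabled[name] = true
	}
	return enabled, nil
}

func main() {
	for _, v := range []string{"", "all", "none", "apiserver,kubelet"} {
		enabled, err := interpretWait(v)
		fmt.Printf("--wait=%q -> %v (err: %v)\n", v, enabled, err)
	}
}
```

With this shape, the strict check only runs when it is requested by name or through all, so a plain start keeps its current timing.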

@azhao155
Contributor

azhao155 commented Feb 9, 2021

@prezha Just curious how you found this issue: did you get a clue from any log? I also debugged this before but had no luck. If you could kindly share your debugging experience, that would be a very good lesson for all!

@prezha
Contributor Author

prezha commented Feb 9, 2021

thank you @medyagh, i've added time metrics to the pr description and commented

@prezha
Contributor Author

prezha commented Feb 9, 2021

@prezha Just curious how you found this issue: did you get a clue from any log? I also debugged this before but had no luck. If you could kindly share your debugging experience, that would be a very good lesson for all!

@azhao155

in general: looked at plenty of logs :) - mostly those from failed ci/cd tasks here, but i also ran the tests locally and then examined the run logs and the containers/services themselves, tried to figure out the 'normal'/expected behaviour, then backtracked through the issue and found where it could be rooted...

here is an example of a failed job: https://github.com/kubernetes/minikube/pull/10378/checks?check_run_id=1842596262

here is the command i've run locally:

env TEST_ARGS="-minikube-start-args=--driver=docker --alsologtostderr -v -test.run TestFunctional -test.timeout=10m -test.v -timeout-multiplier=1.5 --cleanup=false" make integration

you can then ssh into individual containers to see beyond default last 25 lines of logs, or you can use something like:

out/minikube -p functional-20210209193741-102061 logs -n 10000 > ~/wip/functional-20210209193741-102061.log

(replace functional-20210209193741-102061 with the one your test above will generate)

here specifically the test failed on services that were not operational, so i then backtracked to where and when they had been started (the validateStartWithProxy functional test in our case), which also has the --wait=all flag set, so everything should be up when tested... then i worked out which services the wait flag should cover and where, and saw that we are not actually making sure that they are fully available, as the test and the wait flag expect...

i hope this was not too confusing and somewhat helpful :)

@prezha prezha requested a review from medyagh February 9, 2021 23:23
@azhao155
Contributor

Yeah, thanks @prezha for the detailed explanation, it's really helpful and I appreciate it!

@sharifelgamal
Collaborator

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 10, 2021
@kubernetes kubernetes deleted a comment from minikube-pr-bot Feb 10, 2021
@kubernetes kubernetes deleted a comment from minikube-pr-bot Feb 11, 2021
@minikube-pr-bot

kvm2 Driver
Times for minikube: 69.5s 68.5s 71.8s
Average time for minikube: 69.9s

Times for Minikube (PR 10424): 160.1s 135.9s 143.9s
Average time for Minikube (PR 10424): 146.6s

Averages Time Per Log

+--------------------------------+----------+---------------------+
|              LOG               | MINIKUBE | MINIKUBE (PR 10424) |
+--------------------------------+----------+---------------------+
| * minikube v1.17.1 on Debian   | 0.0s     | 0.0s                |
| 9.11 (kvm/amd64)               |          |                     |
| * Using the kvm2 driver based  | 0.0s     | 0.0s                |
| on user configuration          |          |                     |
| * Starting control plane node  | 0.0s     | 0.0s                |
| minikube in cluster minikube   |          |                     |
| * Creating kvm2 VM (CPUs=2,    | 42.6s    | 41.8s               |
| Memory=3700MB, Disk=20000MB)   |          |                     |
| ...                            |          |                     |
| * Preparing Kubernetes v1.20.2 | 2.4s     | 9.5s                |
| on Docker 20.10.2 ...          |          |                     |
|   - Generating certificates    | 4.9s     | 2.9s                |
| and keys ...                   |          |                     |
|   - Booting up control plane   | 16.5s    | 10.6s               |
| ...                            |          |                     |
|   - Configuring RBAC rules ... | 1.4s     | 1.2s                |
| * Verifying Kubernetes         | 1.6s     | 1.8s                |
| components...                  |          |                     |
| * Enabled addons:              | 0.7s     | 78.8s               |
| storage-provisioner,           |          |                     |
| default-storageclass           |          |                     |
| * Done! kubectl is now         | 0.0s     | 0.0s                |
| configured to use "minikube"   |          |                     |
| cluster and "default"          |          |                     |
| namespace by default           |          |                     |
+--------------------------------+----------+---------------------+

docker Driver
Times for minikube: 26.1s 26.0s 27.6s
Average time for minikube: 26.6s

Times for Minikube (PR 10424): 110.9s 95.2s 101.6s
Average time for Minikube (PR 10424): 102.6s

Averages Time Per Log

+--------------------------------+----------+---------------------+
|              LOG               | MINIKUBE | MINIKUBE (PR 10424) |
+--------------------------------+----------+---------------------+
| * minikube v1.17.1 on Debian   | 0.2s     | 0.2s                |
| 9.11 (kvm/amd64)               |          |                     |
| * Using the docker driver      | 0.1s     | 0.1s                |
| based on user configuration    |          |                     |
| * Starting control plane node  | 0.1s     | 0.1s                |
| minikube in cluster minikube   |          |                     |
| * Creating docker container    | 9.9s     | 9.8s                |
| (CPUs=2, Memory=3700MB) ...    |          |                     |
| * Preparing Kubernetes v1.20.2 | 15.1s    | 91.6s               |
| on Docker 20.10.2 ...          |          |                     |
| * Verifying Kubernetes         | 1.1s     | 0.8s                |
| components...                  |          |                     |
| * Enabled addons:              | 0.1s     | 0.1s                |
| storage-provisioner,           |          |                     |
| default-storageclass           |          |                     |
| * Done! kubectl is now         | 0.0s     | 0.0s                |
| configured to use "minikube"   |          |                     |
| cluster and "default"          |          |                     |
| namespace by default           |          |                     |
+--------------------------------+----------+---------------------+

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 12, 2021
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 12, 2021
@prezha
Contributor Author

prezha commented Feb 12, 2021

after consultation with @medya, i've updated the pr so that these strict checks (ie, waiting for pods in kverify.CorePodsList to have status ready) are only done if the --wait flag is set to all: we do not want to add additional delays to the startup time in any other case

@minikube-pr-bot

kvm2 Driver
Times for minikube: 67.8s 63.3s 65.6s
Average time for minikube: 65.6s

Times for Minikube (PR 10424): 156.6s 158.6s 155.0s
Average time for Minikube (PR 10424): 156.7s

Averages Time Per Log

+--------------------------------+----------+---------------------+
|              LOG               | MINIKUBE | MINIKUBE (PR 10424) |
+--------------------------------+----------+---------------------+
| * minikube v1.17.1 on Debian   | 0.0s     | 0.0s                |
| 9.11 (kvm/amd64)               |          |                     |
| * Using the kvm2 driver based  | 0.0s     | 0.0s                |
| on user configuration          |          |                     |
| * Starting control plane node  | 0.0s     | 0.0s                |
| minikube in cluster minikube   |          |                     |
| * Creating kvm2 VM (CPUs=2,    | 40.4s    | 42.0s               |
| Memory=3700MB, Disk=20000MB)   |          |                     |
| ...                            |          |                     |
| * Preparing Kubernetes v1.20.2 | 9.0s     | 2.3s                |
| on Docker 20.10.2 ...          |          |                     |
|   - Generating certificates    | 2.5s     | 4.7s                |
| and keys ...                   |          |                     |
|   - Booting up control plane   | 10.8s    | 15.6s               |
| ...                            |          |                     |
|   - Configuring RBAC rules ... | 1.0s     | 1.5s                |
| * Verifying Kubernetes         | 1.6s     | 1.6s                |
| components...                  |          |                     |
| * Enabled addons:              | 0.3s     | 88.9s               |
| storage-provisioner,           |          |                     |
| default-storageclass           |          |                     |
| * Done! kubectl is now         | 0.0s     | 0.0s                |
| configured to use "minikube"   |          |                     |
| cluster and "default"          |          |                     |
| namespace by default           |          |                     |
+--------------------------------+----------+---------------------+

docker Driver
Times for minikube: 27.6s 26.5s 26.4s
Average time for minikube: 26.8s

Times for Minikube (PR 10424): 114.2s 109.1s 96.7s
Average time for Minikube (PR 10424): 106.7s

Averages Time Per Log

+--------------------------------+----------+---------------------+
|              LOG               | MINIKUBE | MINIKUBE (PR 10424) |
+--------------------------------+----------+---------------------+
| * minikube v1.17.1 on Debian   | 0.2s     | 0.2s                |
| 9.11 (kvm/amd64)               |          |                     |
| * Using the docker driver      | 0.1s     | 0.1s                |
| based on user configuration    |          |                     |
| * Starting control plane node  | 0.1s     | 0.1s                |
| minikube in cluster minikube   |          |                     |
| * Creating docker container    | 10.0s    | 9.7s                |
| (CPUs=2, Memory=3700MB) ...    |          |                     |
| * Preparing Kubernetes v1.20.2 | 15.1s    | 95.7s               |
| on Docker 20.10.2 ...          |          |                     |
| * Verifying Kubernetes         | 1.3s     | 0.8s                |
| components...                  |          |                     |
| * Enabled addons:              | 0.1s     | 0.1s                |
| storage-provisioner,           |          |                     |
| default-storageclass           |          |                     |
| * Done! kubectl is now         | 0.0s     | 0.0s                |
| configured to use "minikube"   |          |                     |
| cluster and "default"          |          |                     |
| namespace by default           |          |                     |
+--------------------------------+----------+---------------------+

@minikube-pr-bot

kvm2 Driver
Times for minikube: 72.9s 70.4s 68.2s
Average time for minikube: 70.5s

Times for Minikube (PR 10424): 71.2s 65.7s 68.3s
Average time for Minikube (PR 10424): 68.4s

Averages Time Per Log

+--------------------------------+----------+---------------------+
|              LOG               | MINIKUBE | MINIKUBE (PR 10424) |
+--------------------------------+----------+---------------------+
| * minikube v1.17.1 on Debian   | 0.0s     | 0.0s                |
| 9.11 (kvm/amd64)               |          |                     |
| * Using the kvm2 driver based  | 0.0s     | 0.0s                |
| on user configuration          |          |                     |
| * Starting control plane node  | 0.0s     | 0.0s                |
| minikube in cluster minikube   |          |                     |
| * Creating kvm2 VM (CPUs=2,    | 43.5s    | 41.8s               |
| Memory=3700MB, Disk=20000MB)   |          |                     |
| ...                            |          |                     |
| * Preparing Kubernetes v1.20.2 | 9.8s     | 16.5s               |
| on Docker 20.10.2 ...          |          |                     |
|   - Generating certificates    | 2.9s     | 1.5s                |
| and keys ...                   |          |                     |
|   - Booting up control plane   | 10.8s    | 5.2s                |
| ...                            |          |                     |
|   - Configuring RBAC rules ... | 1.1s     | 0.8s                |
| * Verifying Kubernetes         | 1.7s     | 1.7s                |
| components...                  |          |                     |
| * Enabled addons:              | 0.6s     | 0.8s                |
| storage-provisioner,           |          |                     |
| default-storageclass           |          |                     |
| * Done! kubectl is now         | 0.0s     | 0.0s                |
| configured to use "minikube"   |          |                     |
| cluster and "default"          |          |                     |
| namespace by default           |          |                     |
+--------------------------------+----------+---------------------+

docker Driver
Times for minikube: 25.9s 26.3s 26.3s
Average time for minikube: 26.1s

Times for Minikube (PR 10424): 26.5s 24.3s 24.9s
Average time for Minikube (PR 10424): 25.2s

Averages Time Per Log

+--------------------------------+----------+---------------------+
|              LOG               | MINIKUBE | MINIKUBE (PR 10424) |
+--------------------------------+----------+---------------------+
| * minikube v1.17.1 on Debian   | 0.2s     | 0.2s                |
| 9.11 (kvm/amd64)               |          |                     |
| * Using the docker driver      | 0.1s     | 0.1s                |
| based on user configuration    |          |                     |
| * Starting control plane node  | 0.1s     | 0.1s                |
| minikube in cluster minikube   |          |                     |
| * Creating docker container    | 9.6s     | 9.6s                |
| (CPUs=2, Memory=3700MB) ...    |          |                     |
| * Preparing Kubernetes v1.20.2 | 15.0s    | 14.2s               |
| on Docker 20.10.2 ...          |          |                     |
| * Verifying Kubernetes         | 1.1s     | 1.0s                |
| components...                  |          |                     |
| * Enabled addons:              | 0.1s     | 0.1s                |
| storage-provisioner,           |          |                     |
| default-storageclass           |          |                     |
| * Done! kubectl is now         | 0.0s     | 0.0s                |
| configured to use "minikube"   |          |                     |
| cluster and "default"          |          |                     |
| namespace by default           |          |                     |
+--------------------------------+----------+---------------------+

Member

@medyagh medyagh left a comment

@prezha

it is possible that we do not respect the --wait flag if it is passed to an existing cluster (on a second start)

maybe the code needs to be fixed in updateExistingConfigFromFlags

// updateExistingConfigFromFlags will update the existing config from the flags - used on a second start
// skipping updating existing docker env , docker opt, InsecureRegistry, registryMirror, extra-config, apiserver-ips
func updateExistingConfigFromFlags(cmd *cobra.Command, existing *config.ClusterConfig) config.ClusterConfig { //nolint to suppress cyclomatic complexity 45 of func `updateExistingConfigFromFlags` is high (> 30)

	validateFlags(cmd, existing.Driver)

	cc := *existing

Member

@medyagh medyagh left a comment

test is still failing

I think what is missing is that the new wait component needs to be added in other places, such as the bootstrapper

func (k *Bootstrapper) WaitForNode(
... add new component here

and also maybe we need to ensure that the other route of waiting (when the cluster already exists) respects the --wait flags

https://github.com/medyagh/minikube/blob/a44e009c6d54971d208f2e2789fc8473b0bec789/pkg/minikube/bootstrapper/kubeadm/kubeadm.go#L968

or other places in the code that we do waiting after the cluster is already running

@prezha
Contributor Author

prezha commented Feb 12, 2021

@prezha

it is possible that we do not respect the --wait flag if it is passed to an existing cluster (on a second start)

maybe the code needs to be fixed in updateExistingConfigFromFlags

// updateExistingConfigFromFlags will update the existing config from the flags - used on a second start
// skipping updating existing docker env , docker opt, InsecureRegistry, registryMirror, extra-config, apiserver-ips
func updateExistingConfigFromFlags(cmd *cobra.Command, existing *config.ClusterConfig) config.ClusterConfig { //nolint to suppress cyclomatic complexity 45 of func `updateExistingConfigFromFlags` is high (> 30)

	validateFlags(cmd, existing.Driver)

	cc := *existing

@medya do you mean that, on subsequent starts, we should completely ignore the --wait flag, whatever it is set to (ie, not just if it's all)?
essentially, in updateExistingConfigFromFlags, replacing the segment:

	if cmd.Flags().Changed(waitComponents) {
		cc.VerifyComponents = interpretWaitFlag(*cmd)
	}

with:

	cc.VerifyComponents = kverify.NoComponents
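
For reference, the behaviour of the existing segment (override the stored value only when the flag was explicitly passed) can be sketched in isolation. The sketch uses the standard library flag package as a stand-in for the cobra flag set, and flagChanged and resolveWait are hypothetical helpers, not minikube functions:

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// flagChanged reports whether the named flag was explicitly set on the
// command line. flag.Visit only visits flags that have been set.
func flagChanged(fs *flag.FlagSet, name string) bool {
	changed := false
	fs.Visit(func(f *flag.Flag) {
		if f.Name == name {
			changed = true
		}
	})
	return changed
}

// resolveWait returns the wait setting for a subsequent start: the stored
// value from the existing config, unless --wait was passed explicitly.
func resolveWait(stored string, args []string) (string, error) {
	fs := flag.NewFlagSet("start", flag.ContinueOnError)
	fs.SetOutput(io.Discard)
	wait := fs.String("wait", "apiserver,system_pods", "components to wait for")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	if flagChanged(fs, "wait") {
		return *wait, nil
	}
	return stored, nil
}

func main() {
	// Second start with an explicit flag: the flag wins.
	v, err := resolveWait("apiserver,system_pods", []string{"--wait=all"})
	fmt.Println(v, err)

	// Second start without the flag: the stored config is kept.
	v, err = resolveWait("all", nil)
	fmt.Println(v, err)
}
```

Replacing the conditional with an unconditional assignment would drop both branches, including an explicit --wait=all on a second start.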

@prezha
Contributor Author

prezha commented Feb 12, 2021

test is still failing

I think what is missing is that the new wait component needs to be added in other places, such as the bootstrapper

func (k *Bootstrapper) WaitForNode(
... add new component here

thanks @medyagh, i think i've already added that in the pr:
https://github.com/kubernetes/minikube/pull/10424/files#diff-594d52daef2d962248b10339145fa1e601c0da05d12412e86e9d2f569b4d79c6

@prezha
Contributor Author

prezha commented Feb 15, 2021

i think a good place to intervene is:

if existing != nil {

here, on subsequent starts, we can simply turn off whichever checks we want...

what about restarts of components during the first run - should we still respect --wait=all if set (easier) or not (which then complicates things a bit)?

@medyagh
Member

medyagh commented Feb 15, 2021

of components during the first run - should we still respect --wait=all if set (easier) or not (which then

@prezha
we should respect --wait=all any time it is set, and make sure to wait for the maximum number of things we can wait on.

but if no flag is set, and the user only runs "minikube start", we should not add any new things to wait on.

@minikube-pr-bot

kvm2 Driver
Times for minikube: 70.0s 71.1s 69.1s
Average time for minikube: 70.1s

Times for Minikube (PR 10424): 66.4s 68.6s 66.5s
Average time for Minikube (PR 10424): 67.1s

Averages Time Per Log

+--------------------------------+----------+---------------------+
|              LOG               | MINIKUBE | MINIKUBE (PR 10424) |
+--------------------------------+----------+---------------------+
| * minikube v1.17.1 on Debian   | 0.0s     | 0.0s                |
| 9.11 (kvm/amd64)               |          |                     |
| * Using the kvm2 driver based  | 0.0s     | 0.0s                |
| on user configuration          |          |                     |
| * Starting control plane node  | 0.0s     | 0.0s                |
| minikube in cluster minikube   |          |                     |
| * Creating kvm2 VM (CPUs=2,    | 42.5s    | 40.6s               |
| Memory=3700MB, Disk=20000MB)   |          |                     |
| ...                            |          |                     |
| * Preparing Kubernetes v1.20.2 | 9.6s     | 16.5s               |
| on Docker 20.10.2 ...          |          |                     |
|   - Generating certificates    | 3.0s     | 1.3s                |
| and keys ...                   |          |                     |
|   - Booting up control plane   | 11.1s    | 5.4s                |
| ...                            |          |                     |
|   - Configuring RBAC rules ... | 1.2s     | 1.0s                |
| * Verifying Kubernetes         | 1.7s     | 1.7s                |
| components...                  |          |                     |
| * Enabled addons:              | 0.9s     | 0.8s                |
| default-storageclass,          |          |                     |
| storage-provisioner            |          |                     |
| * Done! kubectl is now         | 0.0s     | 0.0s                |
| configured to use "minikube"   |          |                     |
| cluster and "default"          |          |                     |
| namespace by default           |          |                     |
+--------------------------------+----------+---------------------+

docker Driver
Times for minikube: 29.5s 25.5s 26.7s
Average time for minikube: 27.3s

Times for Minikube (PR 10424): 27.4s 26.7s 26.1s
Average time for Minikube (PR 10424): 26.7s

Averages Time Per Log

+--------------------------------+----------+---------------------+
|              LOG               | MINIKUBE | MINIKUBE (PR 10424) |
+--------------------------------+----------+---------------------+
| * minikube v1.17.1 on Debian   | 0.2s     | 0.2s                |
| 9.11 (kvm/amd64)               |          |                     |
| * Using the docker driver      | 0.1s     | 0.1s                |
| based on user configuration    |          |                     |
| * Starting control plane node  | 0.1s     | 0.1s                |
| minikube in cluster minikube   |          |                     |
| * Creating docker container    | 11.1s    | 10.1s               |
| (CPUs=2, Memory=3700MB) ...    |          |                     |
| * Preparing Kubernetes v1.20.2 | 14.7s    | 14.9s               |
| on Docker 20.10.2 ...          |          |                     |
| * Verifying Kubernetes         | 1.0s     | 1.2s                |
| components...                  |          |                     |
| * Enabled addons:              | 0.1s     | 0.1s                |
| storage-provisioner,           |          |                     |
| default-storageclass           |          |                     |
| * Done! kubectl is now         | 0.0s     | 0.0s                |
| configured to use "minikube"   |          |                     |
| cluster and "default"          |          |                     |
| namespace by default           |          |                     |
+--------------------------------+----------+---------------------+

// it appears to be immediately Ready as are all kube-system pods
// then (after ~10sec) it realises it has some changes to apply, implying also pods restarts
// so we wait for kubelet to initialise itself...
time.Sleep(10 * time.Second)
Member

I don't think sleeping for an arbitrary 10 seconds is the right thing to do. Is there any other way to detect that kubelet is trying to initialize itself?

Contributor Author

you are right @medyagh and thank you for proposing to use retry.Expo() instead

@@ -37,16 +37,18 @@ const (
NodeReadyKey = "node_ready"
// KubeletKey is the name used in the flags for waiting for the kubelet status to be ready
KubeletKey = "kubelet"
// OperationalKey is the name used for waiting for pods in CorePodsList to be Ready
OperationalKey = "operational"
Member

how about we name this something like "extra"?

Contributor Author

sounds good - renamed
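
To make the "Ready" wording in the constant's comment concrete, the sketch below checks a pod's `Ready` condition instead of its phase. The `podCondition` and `podStatus` types are local stand-ins invented for this example so it runs without dependencies; real code would presumably use the corresponding types from `k8s.io/api/core/v1`, and `isPodReady` is a hypothetical name rather than the function added in this PR.

```go
package main

import "fmt"

// Local stand-ins (assumption) mirroring the shape of the Kubernetes Pod
// status fields; real code would use k8s.io/api/core/v1 types instead.
type podCondition struct {
	Type   string // e.g. "Initialized", "ContainersReady", "Ready"
	Status string // "True", "False" or "Unknown"
}

type podStatus struct {
	Phase      string // e.g. "Pending", "Running"
	Conditions []podCondition
}

// isPodReady reports whether the Ready condition is present and "True".
// The phase is deliberately ignored: Running alone does not imply Ready.
func isPodReady(s podStatus) bool {
	for _, c := range s.Conditions {
		if c.Type == "Ready" {
			return c.Status == "True"
		}
	}
	return false
}

func main() {
	starting := podStatus{Phase: "Running", Conditions: []podCondition{{Type: "Ready", Status: "False"}}}
	serving := podStatus{Phase: "Running", Conditions: []podCondition{{Type: "Ready", Status: "True"}}}
	fmt.Println(isPodReady(starting), isPodReady(serving))
}
```

Both example pods are in the `Running` phase, but only the second one would pass a readiness-based wait.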

@prezha prezha requested a review from medyagh February 16, 2021 22:29
@minikube-pr-bot

kvm2 Driver
Times for minikube: 72.0s 70.2s 68.8s
Average time for minikube: 70.3s

Times for Minikube (PR 10424): 64.6s 68.6s 68.7s
Average time for Minikube (PR 10424): 67.3s

Averages Time Per Log

+--------------------------------+----------+---------------------+
|              LOG               | MINIKUBE | MINIKUBE (PR 10424) |
+--------------------------------+----------+---------------------+
| * minikube v1.17.1 on Debian   | 0.0s     | 0.0s                |
| 9.11 (kvm/amd64)               |          |                     |
| * Using the kvm2 driver based  | 0.0s     | 0.0s                |
| on user configuration          |          |                     |
| * Starting control plane node  | 0.0s     | 0.0s                |
| minikube in cluster minikube   |          |                     |
| * Creating kvm2 VM (CPUs=2,    | 43.3s    | 42.3s               |
| Memory=3700MB, Disk=20000MB)   |          |                     |
| ...                            |          |                     |
| * Preparing Kubernetes v1.20.2 | 2.1s     | 8.6s                |
| on Docker 20.10.2 ...          |          |                     |
|   - Generating certificates    | 4.6s     | 3.1s                |
| and keys ...                   |          |                     |
|   - Booting up control plane   | 16.3s    | 10.1s               |
| ...                            |          |                     |
|   - Configuring RBAC rules ... | 1.5s     | 1.2s                |
| * Verifying Kubernetes         | 1.6s     | 1.5s                |
| components...                  |          |                     |
| * Enabled addons:              | 0.6s     | 0.2s                |
| storage-provisioner,           |          |                     |
| default-storageclass           |          |                     |
| * Done! kubectl is now         | 0.0s     | 0.0s                |
| configured to use "minikube"   |          |                     |
| cluster and "default"          |          |                     |
| namespace by default           |          |                     |
+--------------------------------+----------+---------------------+

docker Driver
Times for minikube: 26.9s 26.0s 27.1s
Average time for minikube: 26.6s

Times for Minikube (PR 10424): 26.2s 26.1s 28.2s
Average time for Minikube (PR 10424): 26.8s

Averages Time Per Log

+--------------------------------+----------+---------------------+
|              LOG               | MINIKUBE | MINIKUBE (PR 10424) |
+--------------------------------+----------+---------------------+
| * minikube v1.17.1 on Debian   | 0.2s     | 0.2s                |
| 9.11 (kvm/amd64)               |          |                     |
| * Using the docker driver      | 0.1s     | 0.1s                |
| based on user configuration    |          |                     |
| * Starting control plane node  | 0.1s     | 0.1s                |
| minikube in cluster minikube   |          |                     |
| * Creating docker container    | 9.8s     | 9.9s                |
| (CPUs=2, Memory=3700MB) ...    |          |                     |
| * Preparing Kubernetes v1.20.2 | 15.4s    | 15.5s               |
| on Docker 20.10.2 ...          |          |                     |
| * Verifying Kubernetes         | 0.9s     | 1.0s                |
| components...                  |          |                     |
| * Enabled addons:              | 0.1s     | 0.1s                |
| storage-provisioner,           |          |                     |
| default-storageclass           |          |                     |
| * Done! kubectl is now         | 0.0s     | 0.0s                |
| configured to use "minikube"   |          |                     |
| cluster and "default"          |          |                     |
| namespace by default           |          |                     |
+--------------------------------+----------+---------------------+

// WaitForPodReadyByLabel waits for pod with label ([key:]val) in a namespace to be in Ready condition.
// If namespace is not provided, it defaults to "kube-system".
// If label key is not provided, it will try with "component" and "k8s-app".
func WaitForPodReadyByLabel(cs *kubernetes.Clientset, label, namespace string, timeout time.Duration) error {
Member

this func is not used outside the package, make it private.

Contributor Author

done!
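
The defaults described in the doc comment for the label-based wait function can be sketched roughly as follows. Both helpers are hypothetical and written only to illustrate the documented behaviour (an optional `key:` prefix, fallback keys `component` and `k8s-app`, and the `kube-system` default namespace); the `key=value` selector format is an assumption, and this is not the code from the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// selectorsFor is a hypothetical helper: it expands a "[key:]val" label into
// the label selectors to try. Without a key it falls back to "component" and
// then "k8s-app", as the doc comment describes.
func selectorsFor(label string) []string {
	parts := strings.SplitN(label, ":", 2)
	if len(parts) == 2 {
		return []string{parts[0] + "=" + parts[1]}
	}
	return []string{"component=" + label, "k8s-app=" + label}
}

// namespaceOrDefault applies the documented "kube-system" default.
func namespaceOrDefault(ns string) string {
	if ns == "" {
		return "kube-system"
	}
	return ns
}

func main() {
	fmt.Println(selectorsFor("kube-dns"), namespaceOrDefault(""))
	fmt.Println(selectorsFor("component:etcd"), namespaceOrDefault("default"))
}
```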

@medyagh medyagh changed the title fix WaitForPod by waiting for conditions Ready instead of Running phase add new extra component to --wait=all to validate a healthy cluster Feb 16, 2021
@minikube-pr-bot

kvm2 Driver
Times for minikube: 67.8s 68.5s 70.2s
Average time for minikube: 68.8s

Times for Minikube (PR 10424): 69.3s 67.6s 70.5s
Average time for Minikube (PR 10424): 69.2s

Averages Time Per Log

+--------------------------------+----------+---------------------+
|              LOG               | MINIKUBE | MINIKUBE (PR 10424) |
+--------------------------------+----------+---------------------+
| * minikube v1.17.1 on Debian   | 0.1s     | 0.0s                |
| 9.11 (kvm/amd64)               |          |                     |
| * Using the kvm2 driver based  | 0.0s     | 0.0s                |
| on user configuration          |          |                     |
| * Starting control plane node  | 0.0s     | 0.0s                |
| minikube in cluster minikube   |          |                     |
| * Creating kvm2 VM (CPUs=2,    | 42.1s    | 42.3s               |
| Memory=3700MB, Disk=20000MB)   |          |                     |
| ...                            |          |                     |
| * Preparing Kubernetes v1.20.2 | 9.0s     | 2.3s                |
| on Docker 20.10.2 ...          |          |                     |
|   - Generating certificates    | 3.0s     | 4.3s                |
| and keys ...                   |          |                     |
|   - Booting up control plane   | 10.8s    | 15.9s               |
| ...                            |          |                     |
|   - Configuring RBAC rules ... | 1.2s     | 1.6s                |
| * Verifying Kubernetes         | 1.7s     | 1.8s                |
| components...                  |          |                     |
| * Enabled addons:              | 0.6s     | 0.5s                |
| default-storageclass,          |          |                     |
| storage-provisioner            |          |                     |
| * Done! kubectl is now         | 0.0s     | 0.0s                |
| configured to use "minikube"   |          |                     |
| cluster and "default"          |          |                     |
| namespace by default           |          |                     |
+--------------------------------+----------+---------------------+

docker Driver
Times for minikube: 26.1s 26.8s 26.5s
Average time for minikube: 26.5s

Times for Minikube (PR 10424): 27.9s 26.5s 26.4s
Average time for Minikube (PR 10424): 27.0s

Averages Time Per Log

+--------------------------------+----------+---------------------+
|              LOG               | MINIKUBE | MINIKUBE (PR 10424) |
+--------------------------------+----------+---------------------+
| * minikube v1.17.1 on Debian   | 0.2s     | 0.2s                |
| 9.11 (kvm/amd64)               |          |                     |
| * Using the docker driver      | 0.1s     | 0.1s                |
| based on user configuration    |          |                     |
| * Starting control plane node  | 0.1s     | 0.1s                |
| minikube in cluster minikube   |          |                     |
| * Creating docker container    | 9.8s     | 9.9s                |
| (CPUs=2, Memory=3700MB) ...    |          |                     |
| * Preparing Kubernetes v1.20.2 | 15.2s    | 15.3s               |
| on Docker 20.10.2 ...          |          |                     |
| * Verifying Kubernetes         | 1.0s     | 1.3s                |
| components...                  |          |                     |
| * Enabled addons:              | 0.1s     | 0.1s                |
| storage-provisioner,           |          |                     |
| default-storageclass           |          |                     |
| * Done! kubectl is now         | 0.0s     | 0.0s                |
| configured to use "minikube"   |          |                     |
| cluster and "default"          |          |                     |
| namespace by default           |          |                     |
+--------------------------------+----------+---------------------+

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 16, 2021
Member

@medyagh medyagh left a comment

thank you very much for fixing this annoying test flake @prezha I really appreciate your patience to get this fixed the right way

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: medyagh, prezha

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@medyagh medyagh merged commit d84f178 into kubernetes:master Feb 16, 2021
@prezha
Contributor Author

prezha commented Feb 16, 2021

> thank you very much for fixing this annoying test flake @prezha I really appreciate your patience to get this fixed the right way

you are most welcome @medyagh and thank you for your reviews and comments!

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Flaky TestFunctional/parallel/ComponentHealth: pending Test component health flake
7 participants