Disable Kubelet read-only port 10255 #1025

dghubble · 2018-11-05T01:35:51Z

pod-checkpointer creates an insecureClient and secureClient to the kubelet, but only makes use of the former (via 10255). Adapt localParentPods to try to use the secure client first, then the insecure client to support clusters that disable the kubelet read-only port.

Open questions:

Should we just always use the kubelet secure API (instead of having a fallback)?
Testing with the secure client shows we'll need a ClusterRole granting get on nodes/proxy. That is surprising, since pod-checkpointer also mounts the admin kubeconfig from the host (possibly unused).

Background

Today in bootkube, the kubelet read-only port (10255) is enabled and pod-checkpointer uses its /pods endpoint to get all pods (later filters to find parent pods). That all works fine.
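To make the "list all pods, then filter for parent pods" step concrete, a minimal sketch follows. It assumes the kubelet /pods response is a JSON pod list and that parent pods carry the annotation `checkpointer.alpha.coreos.com/checkpoint: "true"`; the struct and function names are illustrative only and the real checkpointer may filter differently.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// checkpointAnnotation is the annotation key assumed to mark a parent pod.
const checkpointAnnotation = "checkpointer.alpha.coreos.com/checkpoint"

// pod and podList model only the fields this sketch needs from the response.
type pod struct {
	Metadata struct {
		Name        string            `json:"name"`
		Namespace   string            `json:"namespace"`
		Annotations map[string]string `json:"annotations"`
	} `json:"metadata"`
}

type podList struct {
	Items []pod `json:"items"`
}

// parentPods decodes a pod list and returns "namespace/name" for every pod
// whose checkpoint annotation is set to "true".
func parentPods(body []byte) ([]string, error) {
	var list podList
	if err := json.Unmarshal(body, &list); err != nil {
		return nil, fmt.Errorf("decoding pod list: %w", err)
	}
	var parents []string
	for _, p := range list.Items {
		if p.Metadata.Annotations[checkpointAnnotation] == "true" {
			parents = append(parents, p.Metadata.Namespace+"/"+p.Metadata.Name)
		}
	}
	return parents, nil
}

func main() {
	// Hand-written sample response, not captured from a real kubelet.
	body := []byte(`{"items":[
	  {"metadata":{"name":"kube-apiserver-abc","namespace":"kube-system",
	    "annotations":{"checkpointer.alpha.coreos.com/checkpoint":"true"}}},
	  {"metadata":{"name":"nginx","namespace":"default"}}
	]}`)

	parents, err := parentPods(body)
	fmt.Println(parents, err)
}
```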

We've come a long way toward eliminating the read-only port. Cloud load balancers can now health check the apiserver, Prometheus can use the kubelet secure API to scrape metrics, and heapster can get metrics from the kubelet secure API too.

On clusters running the kubelet with --read-only-port=0 (disabled), the active pod-checkpointer logs:

failed to list local parent pods, assuming none are running: Get http://127.0.0.1:10255/pods/: dial tcp 127.0.0.1:10255: connect: connection refused

Interestingly, even with localParentPods calls failing on these clusters, recovery from cluster power cycling is unaffected. But I figure if we can eliminate this one last use of the read-only API, all the better.

coreosbot · 2018-11-05T01:35:52Z

Can one of the admins verify this patch?

k8s-ci-robot · 2018-11-05T01:35:54Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: dghubble

If they are not already assigned, you can assign the PR to them by writing /assign @dghubble in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

dghubble · 2018-11-05T01:44:29Z

I expect I'll need to split the checkpointer change and user-data change for tests to pass.

But manually testing on clusters with --read-only-port=0 and a test checkpointer image from this PR, I was able to resolve the error posted above after adding a ClusterRole:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pod-checkpointer
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/proxy
    verbs:
      - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: pod-checkpointer
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: pod-checkpointer
subjects:
- kind: ServiceAccount
  name: pod-checkpointer
  namespace: kube-system
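One likely reason the ClusterRole above needs nodes/proxy: the Kubernetes kubelet authorization documentation describes the kubelet mapping each request path to a subresource of nodes, with paths that have no dedicated subresource (such as /pods) falling under proxy. The sketch below mirrors that documented mapping as it stood around the time of this discussion; `kubeletAttributes` is a hypothetical helper written for illustration, not a kubelet function.

```go
package main

import (
	"fmt"
	"strings"
)

// kubeletAttributes returns the RBAC verb and resource that a request to the
// kubelet secure API would be authorized against, following the mapping in
// the Kubernetes kubelet authorization docs. Illustrative helper only.
func kubeletAttributes(method, path string) (verb, resource string) {
	switch method {
	case "GET", "HEAD":
		verb = "get"
	case "POST":
		verb = "create"
	case "PUT":
		verb = "update"
	case "PATCH":
		verb = "patch"
	case "DELETE":
		verb = "delete"
	}

	// under reports whether path is the given root or lies beneath it.
	under := func(root string) bool {
		return path == root || strings.HasPrefix(path, root+"/")
	}

	switch {
	case under("/stats"):
		resource = "nodes/stats"
	case under("/metrics"):
		resource = "nodes/metrics"
	case under("/logs"):
		resource = "nodes/log"
	case under("/spec"):
		resource = "nodes/spec"
	default:
		// Everything else, including /pods, is treated as nodes/proxy.
		resource = "nodes/proxy"
	}
	return verb, resource
}

func main() {
	verb, resource := kubeletAttributes("GET", "/pods/")
	fmt.Println(verb, resource) // prints: get nodes/proxy
}
```

Under this mapping, a GET to /pods/ is checked as `get` on `nodes/proxy`, which matches the verb and resource granted in the ClusterRole.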

rphillips · 2018-11-08T15:33:57Z

Thanks Dalton for looking into this. We just had a conversation about this exact issue. I am worried about the certificate rotation needed for the checkpointer -> kubelet secure connection. There is an edge case: if a node is down for a period of time and then comes back, the cert might have expired. Not to mention the need to maintain cert rotation on that connection.

We are definitely on board to try and remove the insecure port requirement.

dghubble · 2018-11-10T01:20:29Z

Ah right. bootkube supports kubelet TLS bootstrap, so the concern is this: if there's no apiserver (e.g. the cluster is powered off) when the certificate expires, then on startup the kubelet starts pod-checkpointer, which may not make it far enough* to start the bootstrap-apiserver that would issue the new cert so the kubelet can register. So given the choice, we prefer to at least fall back to 10255.

I can post a standalone PR for the checkpointer to try 10250, then fall back to 10255.

*I'm not sure how that can happen, since on clusters that disable the read-only port today, pod-checkpointer can't contact anything, yet control plane recovery succeeds.

dghubble · 2018-11-26T07:53:56Z

Added the ClusterRole and ClusterRoleBinding to give checkpointer permission to perform requests previously done via the kubelet read-only API. Requires #1027 and a checkpointer release before this can pass tests.

rphillips · 2018-11-26T15:54:58Z

@dghubble can you update the checkpointer hash to: 83e25e5968391b9eb342042c435d1b3eeddb2be1

dghubble · 2018-11-26T18:02:43Z

Updated, and opened #1031 with just the image change, since we might consider it separate from the feature change of disabling read-only in test / example clusters.

Release Note:

* Disable Kubelet read-only port 10255 (may affect applications)
  * Configure any added apps using the insecure API to use the Kubelet secure API (e.g. heapster, prometheus)
  * If using Kubelet certificate rotation, you may wish to leave the read-only port on to mitigate any [chance of interruption](https://github.com/kubernetes-incubator/bootkube/pull/1025#issuecomment-437038376).

rphillips · 2018-11-26T19:12:17Z

/ok-to-test

rphillips · 2018-11-26T19:19:52Z

19:17:03 Error: loading manifests: parse file /home/core/assets/manifests/pod-checkpointer-cluster-role.yaml: invalid manifest: yaml: line 7: found unexpected end of stream

* Updates pod-checkpointer to prefer the Kubelet secure API (before falling back to the Kubelet read-only API that is disabled on Typhoon clusters since poseidon/typhoon#324)
* Previously, pod-checkpointer checkpointed an initial set of pods during bootstrapping so recovery from power cycling clusters was unaffected, but logs were noisy
* kubernetes-retired/bootkube#1027
* kubernetes-retired/bootkube#1025

dghubble · 2018-11-27T04:28:24Z

Gah, verbs: ["get'] -> verbs: ["get"]

rphillips · 2018-11-27T15:21:10Z

coreosbot run e2e calico

rphillips · 2018-11-27T15:56:08Z

Documentation/network-requirements.md

@@ -24,5 +24,4 @@ The information below describes a minimum set of port allocations used by Kubern
 | TCP      | 4194        | Master & Worker Nodes          | The port of the localhost cAdvisor endpoint |
 | UDP      | 4789        | Master & Worker Nodes          | flannel overlay network - *vxlan backend* |
 | TCP      | 10250       | Master Nodes                   | Worker node Kubelet API for exec and logs.                                  |
-| TCP      | 10255       | Master & Worker Nodes          | Worker node read-only Kubelet API (Heapster).                                  |


We should leave this network requirement in but label it 'optional'. Otherwise this LGTM and tests are passing.

dghubble · 2018-11-27T23:39:27Z

Did we want to just bump the checkpointer via #1031, then rebase this to follow it? The two are somewhat independent. 🤷‍♂️ Either way works for me.

rphillips · 2018-11-29T18:08:54Z

@dghubble Thanks! We can rebase this PR now.

dghubble · 2018-11-29T19:43:19Z

Updated

rphillips · 2018-11-29T21:06:26Z

coreosbot run e2e calico

fejta-bot · 2019-04-27T06:47:03Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot · 2019-05-27T07:29:42Z

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

fejta-bot · 2019-06-26T08:21:00Z

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

k8s-ci-robot · 2019-06-26T08:21:08Z

@fejta-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot added the do-not-merge/work-in-progress label Nov 5, 2018

k8s-ci-robot requested review from aaronlevy and rphillips November 5, 2018 01:35

k8s-ci-robot added the cncf-cla: yes label Nov 5, 2018

k8s-ci-robot added the size/S label Nov 5, 2018

dghubble mentioned this pull request Nov 12, 2018

pkg/checkpoint: Try kubelet secureClient, fallback to read-only #1027

Merged

sbv-trueenergy mentioned this pull request Nov 22, 2018

Disable Kubelet read-only port 10255 poseidon/typhoon#324

Merged

dghubble changed the title ~~WIP: Disable Kubelet read-only port 10255~~ Disable Kubelet read-only port 10255 Nov 26, 2018

k8s-ci-robot added the size/M label and removed the do-not-merge/work-in-progress and size/S labels Nov 26, 2018

dghubble mentioned this pull request Nov 27, 2018

Update pod-checkpointer image to query Kubelet secure api poseidon/terraform-render-bootstrap#91

Merged

dghubble mentioned this pull request Nov 27, 2018

Update pod-checkpointer image to query Kubelet secure API poseidon/typhoon#346

Merged

rphillips reviewed Nov 27, 2018

Disable Kubelet read-only port 10255

2b7f342

* Add ClusterRole and ClusterRoleBinding to give checkpointer permission to perform requests previously done via the kubelet read-only API

dghubble mentioned this pull request Dec 18, 2018

Minor docs fix to re-run master tests #1034

Closed

k8s-ci-robot added the lifecycle/stale label Apr 27, 2019

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label May 27, 2019

k8s-ci-robot closed this Jun 26, 2019
