Skip to content
This repository has been archived by the owner on Jul 30, 2021. It is now read-only.

Disable Kubelet read-only port 10255 #1025

Closed
wants to merge 1 commit into from
Closed

Disable Kubelet read-only port 10255 #1025

wants to merge 1 commit into from

Conversation

dghubble
Copy link
Contributor

@dghubble dghubble commented Nov 5, 2018

pod-checkpointer creates an insecureClient and secureClient to the kubelet, but only makes use of the former (via 10255). Adapt localParentPods to try to use the secure client first, then the insecure client to support clusters that disable the kubelet read-only port.

Open questions:

  • What about we just always use the kubelet secure API (instead of having a fallback)?
  • Testing with the secure client, we'll need a ClusterRole with node/proxy get. Surprising since the pod-checkpointer also mounts the admin kubeconfig from the host (possibly not being used).

Background

Today in bootkube, the kubelet read-only port (10255) is enabled and pod-checkpointer uses its /pods endpoint to get all pods (later filters to find parent pods). That all works fine.

We've come a long way toward eliminating the read-only port. Cloud load balancers can now health check the apiserver, Prometheus can use the kubelet secure API to scrape metrics, and heapster can get metrics from the kubelet secure API too.

Running clusters with kubelet read-only-port=0 (disabled), the active pod-checkpointer will log:

failed to list local parent pods, assuming none are running: Get http://127.0.0.1:10255/pods/: dial tcp 127.0.0.1:10255: connect: connection refused

Interestingly, even with localParentPods calls failing on these clusters, recovery from cluster power cycling is unaffected. But I figure if we can eliminate this one last use of the read-only API, all the better.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 5, 2018
@coreosbot
Copy link

Can one of the admins verify this patch?

1 similar comment
@coreosbot
Copy link

Can one of the admins verify this patch?

@coreosbot
Copy link

Can one of the admins verify this patch?

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 5, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: dghubble

If they are not already assigned, you can assign the PR to them by writing /assign @dghubble in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Nov 5, 2018
@dghubble
Copy link
Contributor Author

dghubble commented Nov 5, 2018

I expect I'll need to split the checkpointer change and user-data change for tests to pass.

But manually testing on clusters with --read-only-port=0 and a test checkpointer image from this PR, I was able to resolve the error posted above after adding a ClusterRole:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pod-checkpointer
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/proxy
    verbs:
      - get
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: pod-checkpointer
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: pod-checkpointer
subjects:
- kind: ServiceAccount
  name: pod-checkpointer
  namespace: kube-system

@rphillips
Copy link
Contributor

Thanks Dalton for looking into this. We just had a conversation about this exact issue. I am worried about the certificate rotation needed for the checkpointer -> kubelet secure connection. There is an edge case where if a node is down for a period of time, comes back, the cert might be expired. Not to mention needing to maintain the cert rotation on that connection.

We are definitely on board to try and remove the insecure port requirement.

@dghubble
Copy link
Contributor Author

Ah right. bootkube supports using kubelet TLS bootstrap so the concern is if there's no apiserver (e.g. the cluster is powered off) when the certificate expires, then on startup, kubelet starts pod-checkpointer which may not make it far enough* to start the bootstrap-apiserver, to issue the new cert so the kubelet can register. So given the choice, we prefer to at least fallback to 10255.

I can post a standalone PR for the checkpointer to try 10250, then fallback to 10255.

*I'm not sure how that can happen, since clusters that disable read-only today, pod-checkpointer can't contact anything. Yet control plane recovery succeeds.

@dghubble dghubble changed the title WIP: Disable Kubelet read-only port 10255 Disable Kubelet read-only port 10255 Nov 26, 2018
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 26, 2018
@dghubble
Copy link
Contributor Author

Added the ClusterRole and ClusterRoleBinding to give checkpointer permission to perform requests previously done via the kubelet read-only API. Requires #1027 and a checkpointer release before this can pass tests.

@rphillips
Copy link
Contributor

@dghubble can you update the checkpointer hash to: 83e25e5968391b9eb342042c435d1b3eeddb2be1

@dghubble
Copy link
Contributor Author

Updated, and opened #1031 with just the image change, since we might consider it separate from the feature change of disabling read-only in test / example clusters.

Release Note:

* Disable Kubelet read-only port 10255 (may affect applications)
  * Configure any added apps using the insecure API to use the Kubelet secure API (e.g. heapster, prometheus)
  * If using Kubelet certificate rotation, you may wish to leave read-only port on to mitigate any [chance of interruption](https://github.com/kubernetes-incubator/bootkube/pull/1025#issuecomment-437038376).

@rphillips
Copy link
Contributor

/ok-to-test

@rphillips
Copy link
Contributor

19:17:03 Error: loading manifests: parse file /home/core/assets/manifests/pod-checkpointer-cluster-role.yaml: invalid manifest: yaml: line 7: found unexpected end of stream

dghubble added a commit to poseidon/terraform-render-bootstrap that referenced this pull request Nov 27, 2018
* Updates pod-checkpointer to prefer the Kubelet secure
API (before falling back to the Kubelet read-only API that
is disabled on Typhoon clusters since
poseidon/typhoon#324)
* Previously, pod-checkpointer checkpointed an initial set
of pods during bootstrapping so recovery from power cycling
clusters was unaffected, but logs were noisy
* kubernetes-retired/bootkube#1027
* kubernetes-retired/bootkube#1025
dghubble added a commit to poseidon/typhoon that referenced this pull request Nov 27, 2018
* Updates pod-checkpointer to prefer the Kubelet secure
API (before falling back to the Kubelet read-only API that
is disabled on Typhoon clusters since
#324)
* Previously, pod-checkpointer checkpointed an initial set
of pods during bootstrapping so recovery from power cycling
clusters was unaffected, but logs were noisy
* kubernetes-retired/bootkube#1027
* kubernetes-retired/bootkube#1025
@dghubble
Copy link
Contributor Author

Gah, verbs: ["get'] -> verbs: ["get"]

dghubble-robot pushed a commit to poseidon/terraform-onprem-kubernetes that referenced this pull request Nov 27, 2018
* Updates pod-checkpointer to prefer the Kubelet secure
API (before falling back to the Kubelet read-only API that
is disabled on Typhoon clusters since
poseidon/typhoon#324)
* Previously, pod-checkpointer checkpointed an initial set
of pods during bootstrapping so recovery from power cycling
clusters was unaffected, but logs were noisy
* kubernetes-retired/bootkube#1027
* kubernetes-retired/bootkube#1025
dghubble-robot pushed a commit to poseidon/terraform-digitalocean-kubernetes that referenced this pull request Nov 27, 2018
* Updates pod-checkpointer to prefer the Kubelet secure
API (before falling back to the Kubelet read-only API that
is disabled on Typhoon clusters since
poseidon/typhoon#324)
* Previously, pod-checkpointer checkpointed an initial set
of pods during bootstrapping so recovery from power cycling
clusters was unaffected, but logs were noisy
* kubernetes-retired/bootkube#1027
* kubernetes-retired/bootkube#1025
dghubble-robot pushed a commit to poseidon/terraform-aws-kubernetes that referenced this pull request Nov 27, 2018
* Updates pod-checkpointer to prefer the Kubelet secure
API (before falling back to the Kubelet read-only API that
is disabled on Typhoon clusters since
poseidon/typhoon#324)
* Previously, pod-checkpointer checkpointed an initial set
of pods during bootstrapping so recovery from power cycling
clusters was unaffected, but logs were noisy
* kubernetes-retired/bootkube#1027
* kubernetes-retired/bootkube#1025
dghubble-robot pushed a commit to poseidon/terraform-google-kubernetes that referenced this pull request Nov 27, 2018
* Updates pod-checkpointer to prefer the Kubelet secure
API (before falling back to the Kubelet read-only API that
is disabled on Typhoon clusters since
poseidon/typhoon#324)
* Previously, pod-checkpointer checkpointed an initial set
of pods during bootstrapping so recovery from power cycling
clusters was unaffected, but logs were noisy
* kubernetes-retired/bootkube#1027
* kubernetes-retired/bootkube#1025
@rphillips
Copy link
Contributor

coreosbot run e2e calico

@@ -24,5 +24,4 @@ The information below describes a minimum set of port allocations used by Kubern
| TCP | 4194 | Master & Worker Nodes | The port of the localhost cAdvisor endpoint |
| UDP | 4789 | Master & Worker Nodes | flannel overlay network - *vxlan backend* |
| TCP | 10250 | Master Nodes | Worker node Kubelet API for exec and logs. |
| TCP | 10255 | Master & Worker Nodes | Worker node read-only Kubelet API (Heapster). |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should leave this network requirement in but label it 'optional'. otherwise this lgtm and tests are passing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@dghubble
Copy link
Contributor Author

dghubble commented Nov 27, 2018

Did we want to add just bump the checkpointer via #1031, then rebase this to follow it? Since the two are somewhat independent. 🤷‍♂️ either way works for me

@rphillips
Copy link
Contributor

@dghubble Thanks! We can rebase this PR now.

* Add ClusterRole and ClusterRoleBinding to give checkpointer
permission to perform requests previously done via the kubelet
read-only API
@dghubble
Copy link
Contributor Author

Updated

@rphillips
Copy link
Contributor

coreosbot run e2e calico

sphw pushed a commit to m10io/tokenx-typhoon that referenced this pull request Dec 17, 2018
* Updates pod-checkpointer to prefer the Kubelet secure
API (before falling back to the Kubelet read-only API that
is disabled on Typhoon clusters since
poseidon#324)
* Previously, pod-checkpointer checkpointed an initial set
of pods during bootstrapping so recovery from power cycling
clusters was unaffected, but logs were noisy
* kubernetes-retired/bootkube#1027
* kubernetes-retired/bootkube#1025
@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 27, 2019
@fejta-bot
Copy link

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 27, 2019
@fejta-bot
Copy link

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Copy link
Contributor

@fejta-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

dghubble-robot pushed a commit to poseidon/terraform-azure-kubernetes that referenced this pull request May 25, 2020
* Updates pod-checkpointer to prefer the Kubelet secure
API (before falling back to the Kubelet read-only API that
is disabled on Typhoon clusters since
poseidon/typhoon#324)
* Previously, pod-checkpointer checkpointed an initial set
of pods during bootstrapping so recovery from power cycling
clusters was unaffected, but logs were noisy
* kubernetes-retired/bootkube#1027
* kubernetes-retired/bootkube#1025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants