hub pod unable to establish connections to k8s api-server etc on port 6443 with Cilium #3202

Ph0tonic opened this issue Aug 15, 2023 · 12 comments

@Ph0tonic (Contributor) commented Aug 15, 2023

Bug description

The default KubeSpawner is unable to spawn any user pod; it fails with a TimeoutError while attempting to create the PVC.

Expected behaviour

The hub should be able to spawn user pods.

Analysis

After some research, I identified that the problem was linked to the hub's network policy egress configuration.
Here are a few Cilium logs of dropped packets:

xx drop (Policy denied) flow 0xdd60fbdd to endpoint 0, file bpf_lxc.c line 1181, , identity 36262->kube-apiserver: 10.42.5.108:53354 -> 148.187.17.13:6443 tcp ACK
xx drop (Policy denied) flow 0xf15f4f3e to endpoint 0, file bpf_lxc.c line 1181, , identity 36262->kube-apiserver: 10.42.5.108:53354 -> 148.187.17.13:6443 tcp ACK
xx drop (Policy denied) flow 0x409d160a to endpoint 0, file bpf_lxc.c line 1181, , identity 36262->kube-apiserver: 10.42.5.108:38278 -> 148.187.17.16:6443 tcp ACK
xx drop (Policy denied) flow 0x9f34c210 to endpoint 0, file bpf_lxc.c line 1181, , identity 36262->kube-apiserver: 10.42.5.108:38278 -> 148.187.17.16:6443 tcp ACK
xx drop (Policy denied) flow 0x2a3106d0 to endpoint 0, file bpf_lxc.c line 1181, , identity 36262->kube-apiserver: 10.42.5.108:53354 -> 148.187.17.13:6443 tcp ACK
xx drop (Policy denied) flow 0xecb1ac78 to endpoint 0, file bpf_lxc.c line 1181, , identity 36262->kube-apiserver: 10.42.5.108:53354 -> 148.187.17.13:6443 tcp ACK
xx drop (Policy denied) flow 0x46c3f486 to endpoint 0, file bpf_lxc.c line 1181, , identity 36262->kube-apiserver: 10.42.5.108:57970 -> 148.187.17.16:6443 tcp SYN
xx drop (Policy denied) flow 0x4cf3b758 to endpoint 0, file bpf_lxc.c line 1181, , identity 36262->kube-apiserver: 10.42.5.108:57970 -> 148.187.17.16:6443 tcp SYN
xx drop (Policy denied) flow 0xc3f5697 to endpoint 0, file bpf_lxc.c line 1181, , identity 36262->kube-apiserver: 10.42.5.108:53354 -> 148.187.17.13:6443 tcp ACK
xx drop (Policy denied) flow 0x7dd8720c to endpoint 0, file bpf_lxc.c line 1181, , identity 36262->kube-apiserver: 10.42.5.108:53354 -> 148.187.17.13:6443 tcp ACK
xx drop (Policy denied) flow 0x73b2516d to endpoint 0, file bpf_lxc.c line 1181, , identity 36262->kube-apiserver: 10.42.5.108:57978 -> 148.187.17.17:6443 tcp SYN
xx drop (Policy denied) flow 0xdad4447 to endpoint 0, file bpf_lxc.c line 1181, , identity 36262->kube-apiserver: 10.42.5.108:57978 -> 148.187.17.17:6443 tcp SYN
xx drop (Policy denied) flow 0x86e0bb3d to endpoint 0, file bpf_lxc.c line 1181, , identity 36262->kube-apiserver: 10.42.5.108:57980 -> 148.187.17.13:6443 tcp SYN
xx drop (Policy denied) flow 0xe8379e6 to endpoint 0, file bpf_lxc.c line 1181, , identity 36262->kube-apiserver: 10.42.5.108:38278 -> 148.187.17.16:6443 tcp ACK
xx drop (Policy denied) flow 0x1452fdbf to endpoint 0, file bpf_lxc.c line 1181, , identity 36262->kube-apiserver: 10.42.5.108:57980 -> 148.187.17.13:6443 tcp SYN
xx drop (Policy denied) flow 0x333532eb to endpoint 0, file bpf_lxc.c line 1181, , identity 36262->kube-apiserver: 10.42.5.108:38278 -> 148.187.17.16:6443 tcp ACK
xx drop (Policy denied) flow 0x6506036d to endpoint 0, file bpf_lxc.c line 1181, , identity 36262->kube-apiserver: 10.42.5.108:41318 -> 148.187.17.17:6443 tcp SYN
xx drop (Policy denied) flow 0xaafea84e to endpoint 0, file bpf_lxc.c line 1181, , identity 36262->kube-apiserver: 10.42.5.108:41318 -> 148.187.17.17:6443 tcp SYN
xx drop (Policy denied) flow 0x1637f8ef to endpoint 0, file bpf_lxc.c line 1181, , identity 36262->kube-apiserver: 10.42.5.108:41332 -> 148.187.17.17:6443 tcp SYN
xx drop (Policy denied) flow 0x7f75fa32 to endpoint 0, file bpf_lxc.c line 1181, , identity 36262->kube-apiserver: 10.42.5.108:53354 -> 148.187.17.13:6443 tcp ACK
xx drop (Policy denied) flow 0x20c2f3a8 to endpoint 0, file bpf_lxc.c line 1181, , identity 36262->kube-apiserver: 10.42.5.108:41332 -> 148.187.17.17:6443 tcp SYN
xx drop (Policy denied) flow 0x2339ea83 to endpoint 0, file bpf_lxc.c line 1181, , identity 36262->kube-apiserver: 10.42.5.108:53354 -> 148.187.17.13:6443 tcp ACK

Further digging showed that the destination addresses belonged to the kube-apiserver, kube-proxy, and kube-controller-manager.

The problem lay in the egress rules rather than the ingress ones, and I managed to find a fix:

hub:
  networkPolicy:
    egress:
      - ports:
          - port: 6443

The issue is that the hub tries to access the kube-apiserver to create a PVC, but the request is blocked by the default egress configuration.
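
For clusters where network policies are managed outside the Helm chart, the override above corresponds roughly to the following standalone rule. This is a sketch: the namespace and pod labels are assumptions to adapt to your deployment.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: hub-apiserver-egress
  namespace: jupyterhub        # assumed namespace
spec:
  podSelector:
    matchLabels:
      app: jupyterhub          # assumed hub pod labels
      component: hub
  policyTypes:
    - Egress
  egress:
    # Allow outbound connections to the API-server port on any host,
    # mirroring the chart override above.
    - ports:
        - protocol: TCP
          port: 6443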

I am surprised that @vizeit did not have this issue in #3167.

Your personal set up

I am using the latest version, v3.0.0, of this Helm chart with Cilium.

Ph0tonic added the bug label on Aug 15, 2023
consideRatio changed the title from "Unable to spawn pods with cilium" to "hub pod unable to establish connections to k8s api-server etc on port 6443 with Cilium" on Aug 15, 2023
@consideRatio (Member) commented

Did you run into this in a GKE-based cluster using Cilium via GCP's Dataplane V2, or was this a cluster set up in another way?

@Ph0tonic (Contributor, Author) commented

OK, I do not think it is a GKE-based cluster. Sorry, I am not very familiar with the cluster setup, but I found that the runtime engine is containerd://1.6.15-k3s1 and Cilium is configured.

@consideRatio (Member) commented

Ah, it's a k3s-based cluster. Then I think the main issue is that network policies are enforced at all (Cilium, Calico), so access to the k8s internals ends up restricted there but not in other clusters.

@vizeit commented Aug 15, 2023

@Ph0tonic The existing core network policy takes care of kube-apiserver egress on GKE. I have been testing JupyterHub on GKE Autopilot for a few weeks now and have not seen any other issues so far. You can check the details in my post; note the K8sAPIServer entity:

https://www.vizeit.com/troubleshooting-cilium-on-gke/

@vizeit commented Aug 16, 2023

I have not installed k3s to test this, but I think changing the server port to 443 should resolve the issue without any additional policy. I am including the reference links below, followed by a sketch of the relevant setting.

[1] https://kubernetes.io/docs/concepts/security/controlling-access/#transport-security

[2] https://docs.k3s.io/cli/server#listeners
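
If you control the server, the relevant k3s setting would look roughly like this. This is a sketch based on the k3s docs linked above; /etc/rancher/k3s/config.yaml is k3s's default configuration file.

# /etc/rancher/k3s/config.yaml
# Sketch: equivalent to `k3s server --https-listen-port 443`
https-listen-port: 443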

@Ph0tonic (Contributor, Author) commented

Thanks @vizeit, I will have a look at these configurations and see if they fix my problem.

@Ph0tonic (Contributor, Author) commented

I looked at your links, but the difference between 443 and 6443 was not really clear to me. I then found kubernetes/website#30725, which clarifies it: from my understanding, 443 should be used as the exposed external port.

I see two possibilities:

  1. Add an egress rule for port 6443.
  2. Add some documentation to clarify the need for this egress rule.

@vizeit commented Aug 18, 2023

@Ph0tonic Were you able to test with port 443 to confirm that it works with the existing core network policy?

@bauerjs1 commented Oct 6, 2023

I can reproduce this problem with Cilium on a bare-metal cluster. Disabling the hub NetPol in the Helm chart is my workaround so far.

Access to the API server from pods inside the cluster goes through https://kubernetes.default:443, and I can only curl that from within the JupyterHub container if the NetPol is disabled (and only then is JH working properly).

The kubernetes.default service has a ClusterIP of 10.233.0.1. The NetPol is quite hard to read since there are many overlapping rules. However, looking at it in https://editor.networkpolicy.io/, I cannot find a rule that would allow traffic to this IP (unfortunately, I can't post the image).
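
For reference, the workaround mentioned above corresponds to these chart values. Note that this removes all of the hub's network isolation, not just the API-server restriction.

hub:
  networkPolicy:
    enabled: false   # disables the hub's NetworkPolicy entirely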

@Ph0tonic (Contributor, Author) commented Jan 26, 2024

Hi, sorry @vizeit for the late reply. I did not have the permissions to change the cluster config from 6443 to 443, so I could not test it.

The solution that works for me is the following config:

hub:
  networkPolicy:
    egress:
      - ports:
          - port: 6443

@vizeit commented Jan 26, 2024

@Ph0tonic no problem

@lahwaacz (Contributor) commented Jul 7, 2024

> Add some documentation to clarify the need for this egress rule.

Trying to clarify this:

  • By default, z2jh allows all egress traffic except private IP ranges:

    egress:
      - to:
          - ipBlock:
              cidr: 0.0.0.0/0
              except:
                - 10.0.0.0/8
                - 172.16.0.0/12
                - 192.168.0.0/16
                - 169.254.169.254/32

  • The kubernetes.default domain name used by the hub always resolves to an IP from the private ranges, e.g. 10.96.0.1. The advertised IP of the Kubernetes API endpoint may also fall within one of the private ranges; see e.g. Anatomy of the kubernetes.default.
  • Hence, an extra egress rule is needed to allow the hub to connect to the Kubernetes API.

The following egress rule mentioned by @Ph0tonic works, but it allows connections to any host on port 6443, not only the Kubernetes API:

hub:
  networkPolicy:
    egress:
      - ports:
          - port: 6443

Alternatively, a CiliumNetworkPolicy can be used to filter traffic specifically from the hub pod to the Kubernetes API:

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-access-to-api-server
  namespace: jupyterhub-test
spec:
  egress:
  - toEntities:
    - kube-apiserver
  endpointSelector:
    matchLabels:
      app: jupyterhub
      component: hub

Also note that the same policy should be added for the image-puller and user-scheduler components, for which the chart does not specify any network policy. This is especially important when you want to add a default deny-all policy for the namespace; a sketch for the user-scheduler follows.
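
For example, here is a minimal sketch of the same policy applied to the user-scheduler. The pod labels are assumptions based on the chart's usual labeling; verify them with `kubectl get pods --show-labels`, and adapt the endpointSelector similarly for the image-puller.

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-scheduler-access-to-api-server
  namespace: jupyterhub-test
spec:
  egress:
  - toEntities:
    - kube-apiserver
  endpointSelector:
    matchLabels:
      app: jupyterhub            # assumed chart labels
      component: user-scheduler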
