Adding GEP-3539: Gateway API to Expose Pods on Cluster-Internal IP Address (ClusterIP Gateway) #3608
ptrivedi wants to merge 6 commits into kubernetes-sigs:main
Conversation
Hi @ptrivedi. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. Once the patch is verified, the new status will be reflected. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
…dress (ClusterIP Gateway) Signed-off-by: Pooja Trivedi poojatrivedi@google.com
Adding this comment here for tracking a few open items resulting from the comments on the google doc here: https://docs.google.com/document/d/1N-C-dBHfyfwkKufknwKTDLAw4AP2BnJlnmx0dB-cC4U/edit?tab=t.0
* Fix missing image
* Change GEP status to Memorandum
* Make GEP navigable
* Crop trailing whitespace from images

Signed-off-by: Pooja poojatrivedi@google.com
/assign @thockin
thockin left a comment:
First: LOVE IT
The questions I keep coming back to all are around how the node-proxy knows to pay attention to THIS gateway so it can implement the clusterIP or nodePort or externalTrafficPolicy or ...
> ### EndpointSelector as Backend
>
> A Route can forward traffic to the endpoints selected via selector rules defined in EndpointSelector.
FWIW, I can imagine a path toward maybe making this a regular core feature. I am sure that it would be tricky but I don't think it's impossible.
Eg.
Define a Service with selector foo=bar. That triggers us to create a PodSelector for foo=bar. That triggers the endpoints controller(s) to do their thing. Same as we do with IP.
Interesting thought.
For starters at least, there seemed to be agreement on having a GEP for EndpointSelector as the next step.
As always, Gateway proves something is a good idea, then core steals the spotlight.
> Define a Service with selector foo=bar. That triggers us to create a PodSelector for foo=bar. That triggers the endpoints controller(s) to do their thing.
FWIW NetworkPolicies also contain selectors that need to be resolved to Pods, and we've occasionally talked about how nice it would be if the selector-to-pod mapping could be handled centrally, rather than every NP impl needing to implement that itself, often doing it redundantly on every node.
I guess in theory, we could do that with EndpointSlice even, since kube-proxy will ignore EndpointSlices that don't have a label pointing back to a Service, so we could just have another set of EndpointSlices for NetworkPolicies... (EndpointSlice has a bunch of fields that are wrong for NetworkPolicy but most of them are optional and could just be left unset...)
Though this also reminds me of my theory that EndpointSlice should have been a gRPC API rather than an object stored in etcd. The EndpointSlice controller can re-derive the entire (controller-generated) EndpointSlice state from Services and Pods at any time, and it needs to keep all that state in memory while it's running anyway. So it should just serve that information out to the controllers that need it (kube-proxy, gateways) in an efficient use-case-specific form (kind of like the original kpng idea) rather than writing it all out to etcd.
(Alternate version: move discovery.k8s.io to an aggregated apiserver that is part of the EndpointSlice controller, and have it serve EndpointSlices out of memory rather than out of etcd.)
```yaml
metadata:
  name: cluster-ip
spec:
  controllerName: "cluster-ip-controller"
```
Is this name "special" or can it be anything?
The name can be anything, but implementations must only reconcile GatewayClasses that have a controllerName they expect. Implementations must completely ignore GatewayClass objects whose controllerName does not match theirs, and not update them at all (to prevent fighting over status).
Some implementations allow configuration of this string (for example, Contour allows it so that you can run multiple instances of Contour in a cluster).
Is that the behavior we want here? In Service, it's a single object with many controllers consuming it. If I want my service exposed to the CNI, kube-proxy, service mesh, observability platform, ... do I need to make N Gateways?
See expanded question under https://github.com/kubernetes-sigs/gateway-api/pull/3608/files#r1964558745
Agree with John's question, and I think it betrays a fundamental difference in perspective. I see this idea as "Services with a better API"
Because we're using the same object that can be used in other contexts though (ie Gateway), we need a way to disambiguate, and the way we have is GatewayClass. I'd be happy to see proposals around alternatives to GatewayClass, but I haven't seen anything to date that handles the problem that implementations of Gateway API almost always need multiple-namespace access, and the only currently available thing we have that's bigger than a single namespace is cluster-wide.
```yaml
  name: example-cluster-ip-gateway
spec:
  addresses:
  - 10.12.0.15
```
How does kube-proxy (or Cilium or Antrea or ...) know which Gateways it should be capturing traffic for?
Normally that's handled by the rollup of Gateway -> GatewayClass. Implementations own GatewayClasses that specify the correct string in GatewayClass spec.controllerName. All Gateways in that GatewayClass would need to be serviced by an implementation that can fulfill this request (that is, it both has the required functionality and, in this case of requesting a static address, is actually able to assign that address). In the case that an implementation cannot fulfill this Gateway for some reason, it must be marked as not Accepted (by having an Accepted type condition in the Gateway's status with status: false).
I can't tell if you are giving me a hard time or not :)
What I meant to ask is:
Service as a built-in API is (more or less) universally implemented by on-node agents (kube-proxy, cilium or antrea, ovn, etc). If we are trying to offer a form of ClusterIP Gateway which replaces part of the Service API, how does a user express "this is a cluster IP gateway" in a portable way such that all of the implementations know "this is for me"?
If each implementation has its own controllerName, and the GatewayClass can be named anything the cluster admin wants, how does our poor beleaguered app operator know what to put in their YAML?
Today they can say:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  type: ClusterIP
  selector:
    foo: bar
  ports:
  - port: 8080
```

...and be confident that ANY cluster, regardless of which CNI, will allocate a virtual IP and route traffic.
I'd like to write a generic tool which does:
```
for each service S in `kubectl get svc -A` {
    evaluate template with S to produce an equivalent Gateway
}
```
Yeah, okay, I see the use case, but this is the problem with extensions vs. core - we left the flexibility there for implementations (for good reason), and now we don't have a way to define a default GatewayClass at all, even for specific use cases.
I think that practically, a tool like you describe would need to know the gatewayclass it was targeting, and output Gateways based on that.
We could conceivably have a convention and pick a reserved name (like cni-clusterip or something), but we've been reluctant in the past to do that, preferring the increased specificity of requiring people to specify something (even though there is a friction cost to be paid there).
(And I wasn't trying to give you a hard time - I have details get pushed out of my head all the time, so wanted to make sure this hadn't happened here. 😄 But also, I wanted to help other readers understand too)
> I think that practically, a tool like you describe would need to know the gatewayclass it was targeting,
Hence my questions about "is this name special". One answer is "thou shalt use the name 'clusterip' and the 'clusterip' is the name thou shalt use", and just hope not to collide with users. Another answer is to define a sub-space of names that users can't currently use, or are exceedingly unlikely to be using e.g. k8s.io:clusterip. This is an appropriate place to ideate, right?
Since 1.33 you can use the IPAddress object to represent a unique IP address in the cluster.
Official names look like a good idea, but I do not think we should make this exclusive. We already have "service.kubernetes.io/service-proxy-name" for Services, so it makes sense that we may consider multiple implementations of clusterIP; we can delegate our prefix to indicate that this is a Service IP --- the relation with the IPAddress object will guarantee the consistency ... IPAddress already has a reference field and a managed-by label.
I think my strawman approach is:
- gateway class prefixed with `clusterip.kubernetes.io/kube-proxy` or `clusterip.kubernetes.io/cilium`, `antrea`, `ovn-kubernetes`
- the gateway allocates the corresponding IPAddress on the cluster to avoid conflicts
```go
&networking.IPAddress{
	ObjectMeta: metav1.ObjectMeta{
		Name:   "192.168.2.2",
		Labels: map[string]string{"ipaddress.kubernetes.io/managed-by": "kube-proxy"},
	},
	Spec: networking.IPAddressSpec{
		ParentRef: &networking.ParentReference{
			Group:     "gateway.networking.k8s.io",
			Resource:  "gateway",
			Name:      "foo",
			Namespace: "bar",
		},
	},
}
```
this is even more complicated for type LoadBalancer
Currently for Services and passthrough LBs, part of the setup belongs to kube-proxy, Cilium, etc. (routing on the nodes) and part to the cloud providers. You also have the .loadBalancerClass Service API field so that you can instruct the LB controller to provision a specific kind of LB on the cloud-provider side.
Wouldn't it be best to leave the GatewayClass to be equivalent to loadBalancerClass in this case? There would need to be something that instructs kube-proxy to do the routing on its side.
> # GEP-3539: ClusterIP Gateway - Gateway API to Expose Pods on Cluster-Internal IP Address
This might have started out as "ClusterIP Gateways" but at this point it's really more like "Service-equivalent functionality via Gateway API".
> ## Goals
>
> * Define Gateway API usage to accomplish ClusterIP Service style behavior
Beyond the fact that it's not just ClusterIP, I think there are at least 3 use cases hiding in that sentence.
- "Gateway as new-and-improved Service" - Providing an API that does generally the same thing that `v1.Service` does, but in a cleaner and more orthogonally-extensible way, so that when people have feature requests like "I want `externalTrafficPolicy: Local` Services without allocating `healthCheckNodePort`s" (to pick the most recent example), they can do that without us needing to add Yet Another ServiceSpec Flag.
- "Gateway as a backend for `v1.Service`" - Providing an API that can do everything that `v1.Service` can do (even the deprecated parts and the parts we don't like), so that you can programmatically turn Services into Gateways and then the backend proxies/loadbalancers/etc would not need to look at Service objects at all.
- "MultiNetworkService" - Providing an API that lets users do `v1.Service`-equivalent things in multi-network contexts.
The GEP talks about case 2 some, but it doesn't really explain why we'd want to do that (other than via the link to Tim's KubeCon lightning talk).
```yaml
apiVersion: networking.gke.io/v1alpha1
kind: EndpointSelector
metadata:
  name: front-end-pods
```
probably want this to work the same way EndpointSlice does, where the name is not meaningful (so as to avoid conflicts), and there's a label (or something) that correlates it with its Service
> | Service feature | Values | Where configured |
> |---|---|---|
> | ipFamily | IPv4 <br /> IPv6 | Route level |
> | publishNotReadyAddresses | True <br /> False | Route or EndpointSelector level |
> | ClusterIP (headless service) | IPAddress <br /> None | GatewayClass definition for Headless Service type |
> | externalName | External name reference <br /> (e.g. DNS CNAME) | GatewayClass definition for ExternalName Service type |
- `sessionAffinity` - As noted elsewhere, this is not implemented compatibly by all service proxies. It's also not implemented by many LoadBalancers because historically we have mostly not done any e2e testing for non-GCE LoadBalancers.
- `externalIPs` - bad alternative implementation of LoadBalancers. Needed for "exactly equivalent to Service" Gateways but not wanted for "similar to Service" Gateways.
- `externalTrafficPolicy: Local` - overly-opinionated combined implementation of two separate features (preserve source IP / route traffic more efficiently). We should do this better for the "similar to Service" case.
- `publishNotReadyAddresses` - is this just an early attempt to solve the problem that was later solved better by ProxyTerminatingEndpoints?

Not mentioned here:

- `trafficDistribution` - I'm not sure what Gateway already has for topology, but this is definitely something that should be exposed generically.
I still haven't had the bandwidth to come back and give this a full, proper pass, but I did want to point out that, while this PR is currently targeting "Provisional" status, which isn't bound by Gateway API's release cycle, if you did want to look at moving this to Experimental (and thus, having something be implementable) this year, an item needs to be added to the Scoping discussion at #3760 to cover including it there. If folks don't feel there will be bandwidth to push this forward, we can concentrate on getting this into Provisional in the v1.4 timeframe, then look at Experimental for v1.5.

Yes, there will not be bandwidth to push this forward during this cycle, hence I did not add anything to the scoping discussion. Also there are open questions that need to be addressed and discussion areas to be kicked off.

@youngnick target would have to be Provisional in the v1.4 timeframe, given bandwidth constraints.
I was checking the feasibility of this disaggregation last weekend and it was not as complex as I thought.

```go
// EndpointSelectorSpec describes the desired state of an EndpointSelector.
type EndpointSelectorSpec struct {
	// labelSelector for selecting pods.
	// +optional
	LabelSelector *metav1.LabelSelector `json:"labelSelector,omitempty"`

	// ports defines the port information for the EndpointSlices.
	// This field is optional. If you leave it out, EndpointSlices will be created
	// without port information. This can be useful for L3-only routing.
	// +optional
	Ports []EndpointSelectorPort `json:"ports,omitempty"`
}
```

and when you create a Service an … The problem is with the generation of the …
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to its standard rules. Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to its standard rules. Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages PRs according to its standard rules. Please send feedback to sig-contributor-experience at kubernetes/community.

/close
@k8s-triage-robot: Closed this PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Recommend reviewing deploy preview so examples are inlined: https://deploy-preview-3608--kubernetes-sigs-gateway-api.netlify.app/geps/gep-3539/
Signed-off-by: Pooja Trivedi poojatrivedi@google.com
What type of PR is this?
/kind gep
What this PR does / why we need it:
This defines via documentation how Gateway API can be used to accomplish ClusterIP Service behavior. It also proposes DNS record format for ClusterIP Gateway, proposes an EndpointSelector resource, and briefly touches upon Gateway API usage to define LoadBalancer and NodePort behaviors.
Which issue(s) this PR fixes:
Fixes #3539
Does this PR introduce a user-facing change?: