Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PR #992 breaks ingress traffic #6157

Closed
alex-berger opened this issue May 21, 2021 · 5 comments
Closed

PR #992 breaks ingress traffic #6157

alex-berger opened this issue May 21, 2021 · 5 comments
Assignees

Comments

@alex-berger
Copy link
Contributor

alex-berger commented May 21, 2021

Bug Report

The recently merged 992 introduces a breaking change for ingress controllers running in ingress mode. It enforces all outbound request to have the l5d-dst-override set, which is a unrealistically strict requirement that breaks many ingress traffic pattern supported by ingress controllers.

What is the issue?

In our specific case we are using gloo-edge (gateway) as ingress controller, which supports non Kubernetes upstreams (e.g. AWS Lambda Functions, EC2 Instances, ...) and also communicates with other gloo services without adding the l5d-dst-override header. Note, gloo-edge has a nice linkerd plugin which automatically adds the l5d-dst-override header to all request targeting Kubernetes upstreams. However, it does not add that header for other kind of upstreams (e.g. static upstreams, EC2 upstreams, ...) and it also does not add that header for request between gloo services (e.g. envoy XDS request, which are not part of the ingress traffic but are emitted by the gateway which is meshed).

{"timestamp":"[ 995.334310s]","level":"WARN","fields":{"message":"Failed to proxy request: ingress-mode routing requires the l5d-dst-override header","client.addr":"10.176.9.124:43466"},"target":"linkerd_app_core::errors","spans":[{"orig_dst":"10.176.47.220:9977","name":"ingress"}],"threadId":"ThreadId(3)"}

Impact

Our gloo-edge ingress gateway no longer work, they are completely broken and therefore ingress traffic in our cluster does not work.

Expected Behavior

  • Outbound linkerd-proxy in ingress mode must accept requests which have no l5d-dst-override header set and use the host resp. authority header for routing decision (as fallback). This will support use-cases where
    • the ingress gateway might forward requests to external workloads (running outside Kubernetes)
    • the ingress gateway itself communicates with other services (e.g. envoy talking to XDS endpoints)
    • ...

How can it be reproduced?

  • Install linkerd edge-21.5.1 with linkerd-proxy image ghcr.io/olix0r/l2-proxy:a01b8bd2.
  • Use gloo-edge gateway meshed in ingress mode and see how it fails to get ready as it fails to contact the gloo and gloo-gateway services.

Logs, error output, etc

{"timestamp":"[ 995.334310s]","level":"WARN","fields":{"message":"Failed to proxy request: ingress-mode routing requires the l5d-dst-override header","client.addr":"10.176.9.124:43466"},"target":"linkerd_app_core::errors","spans":[{"orig_dst":"10.176.47.220:9977","name":"ingress"}],"threadId":"ThreadId(3)"}

linkerd check output

Linkerd core checks
===================

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ control plane pods are ready

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor

linkerd-webhooks-and-apisvc-tls
-------------------------------
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ can retrieve the control plane version
‼ control plane is up-to-date
    is running version 21.5.1 but the latest edge version is 21.5.2
    see https://linkerd.io/2/checks/#l5d-version-control for hints
‼ control plane and cli versions match
    control plane running edge-21.5.1 but cli running edge-21.5.2
    see https://linkerd.io/2/checks/#l5d-version-control for hints

Environment

  • Kubernetes Version: 1.19.6
  • Cluster Environment: EKS
  • Host OS: Amazon Linux 2
  • Linkerd version: edge-21.5.1 with linkerd-proxy image ghcr.io/olix0r/l2-proxy:a01b8bd2

Possible solution

Revert PR 992.

Additional context

I stumbled over this while testing a proposed fix for #6146 (comment).

@olix0r
Copy link
Member

olix0r commented May 21, 2021

Thanks for testing the edge release @alex-berger! This is helpful feedback.

We'd really like to adopt a different approach for ingresses more along the lines of nginx's service-usptream feature which allows us to run a normal proxy without ingress-mode. I've browsed through the gloo docs but I haven't been able to find anything analogous ;( Maybe there's something in there I've missed, though.

For context, by default Linkerd does discovery for every outbound connection -- that is, we discover service metadata based on the destination ip:port. Ingresses tend to implement their own load balancing, however, so connections are initiated to each endpoint and not to a service address; and communication targets an endpoint we can't associate that traffic with a Service (and therefore ServiceProfile / TrafficSplit). The above-mentioned nginx service-upstream annotation configures nginx to skip load balancing and instead initiate connections directly to the service IP address so that Linkerd can operate as normal. This is especially nice because this setting can be configured differently for each ingress resource; so it's not an all-or-nothing setting.

The ingress-mode feature changes linkerd's behavior to ignore the original address and instead do routing on each request's metadata. This is risky, however, as it's typical for ingresses to pass requests without modifying host headers, which can easily result in a traffic loop, where the proxy tries to send the request back through the ingress. So the override header is present so that the ingress has to explicitly identify a FQDN for requests to be sent to.

I think, ideally, we'd get gloo to support something like the service-upstream so that you can opt-out of gloo's load balancing behavior in favor of linkerd's routing.

In the meantime, let's get more specific about the problems we're trying to solve:

  • What Linkerd features are you relying on between gloo and the target services? Specifically, do you rely on ServiceProfiles or TrafficSplit? If the latter, have you considered using Gloo's traffic splitting features?
  • The non-Kubernetes service traffic targets you reference -- how are they encoded in requests? Do you always target services by FQDN? Are relative names used? Do you use static IP addresses in place of names?

@alex-berger
Copy link
Contributor Author

@olix0r Let me share a picture of our ingress architecture annotated with answers to your questions.

Blank diagram-1

Basically, what we are looking for is kind of a "best effort" ingress mode which has the following properties:

  • if l5d-dst-override is set it will apply all linkerd features (mTLS, ServiceProfile, TrafficSplit, ...)
  • if l5d-dst-override is not set it should forward traffic base on the dst_address of the TCP stream.

olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue May 22, 2021
In f8cc918, we started requiring the `l5d-dst-override` header. As
described in linkerd/linkerd2#6157, this breaks some ingress
configurations, especially those that may route to out-of-cluster
resources.

This change restores original-dst-addr routing for requests that do not
include the `l5d-dst-override` header.

Internally, this change centralizes the `DST_OVERRIDE_HEADER` and
related utilities to the ingress module, as no other modules should have
to know anything about it.
olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue May 22, 2021
In f8cc918, we started requiring the `l5d-dst-override` header. As
described in linkerd/linkerd2#6157, this breaks some ingress
configurations, especially those that may route to out-of-cluster
resources.

This change restores original-dst-addr routing for requests that do not
include the `l5d-dst-override` header.

Internally, this change centralizes the `DST_OVERRIDE_HEADER` and
related utilities to the ingress module, as no other modules should have
to know anything about it.
olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue May 22, 2021
In f8cc918, we started requiring the `l5d-dst-override` header. As
described in linkerd/linkerd2#6157, this breaks some ingress
configurations, especially those that may route to out-of-cluster
resources.

This change restores original-dst-addr routing for requests that do not
include the `l5d-dst-override` header.

Internally, this change centralizes the `DST_OVERRIDE_HEADER` and
related utilities to the ingress module, as no other modules should have
to know anything about it.
olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue May 24, 2021
In f8cc918, we started requiring the `l5d-dst-override` header. As
described in linkerd/linkerd2#6157, this breaks some ingress
configurations, especially those that may route to out-of-cluster
resources.

This change restores original-dst-addr routing for requests that do not
include the `l5d-dst-override` header.

Internally, this change centralizes the `DST_OVERRIDE_HEADER` and
related utilities to the ingress module, as no other modules should have
to know anything about it.
@olix0r
Copy link
Member

olix0r commented May 24, 2021

The changes from linkerd/linkerd2-proxy#1016 should be included in this week's edge release. To test these changes before then, you can use workload annotations like:

linkerd.io/inject: ingress
config.linkerd.io/proxy-image: ghcr.io/olix0r/l2-proxy
config.linkerd.io/proxy-version: a07b33d7

This proxy version should once again support routing requests that do not have a l5d-dst-override header set by honoring the request's original destination address.

@olix0r olix0r self-assigned this May 24, 2021
@alex-berger
Copy link
Contributor Author

@olix0r I tested the new proxy image a07b33d7 and can confirm that linkerd/linkerd2-proxy#1016 resolves our problems. Thank you very much for that fix, I really appreciate your help on this issue.

@alex-berger
Copy link
Contributor Author

Closing this issue, as this has been fixed in edge-21.5.3.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 16, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants