Tracing guide #568

Merged Apr 25, 2024 (3 commits)
Binary file added doc/observability/grafana_trace.png
Binary file added doc/observability/grafana_tracing_loki.png
125 changes: 125 additions & 0 deletions doc/observability/tracing.md
# Enabling tracing with a central collector

## Introduction

This guide outlines the steps to enable tracing in Istio and Kuadrant components (Authorino and Limitador), directing traces to a central collector for improved observability and troubleshooting. We'll also explore a typical troubleshooting flow using traces and logs.

## Prerequisites

- A Kubernetes cluster with Istio and Kuadrant installed.
- A trace collector (e.g., Jaeger or Tempo) configured to support [OpenTelemetry](https://opentelemetry.io/) (OTel).

## Configuration Steps

### Istio Tracing Configuration

Enable tracing in Istio by using the [Telemetry API](https://istio.io/v1.11/docs/tasks/observability/distributed-tracing/configurability/telemetry-api/).
Depending on how you installed Istio, you will also need to configure a tracing `extensionProvider` in your MeshConfig, Istio or IstioOperator resource.
The following example Telemetry and Istio resources sample 100% of requests, assuming Istio was installed with the Istio Sail Operator.

```yaml
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system
spec:
  tracing:
  - providers:
    - name: jaeger-tempo
    randomSamplingPercentage: 100
---
apiVersion: operator.istio.io/v1alpha1
kind: Istio
metadata:
  name: default
spec:
  namespace: istio-system
  values:
    meshConfig:
      defaultConfig:
        tracing: {}
      enableTracing: true
      extensionProviders:
      - name: jaeger-tempo
        opentelemetry:
          port: 4317
          service: tempo.tempo.svc.cluster.local
```

### Kuadrant Tracing Configuration

The Authorino and Limitador components have request tracing capabilities.
Here is an example configuration to enable and send traces to a central collector.
Ensure the collector is the same one that Istio sends traces to, so that the traces can be correlated later.

```yaml
apiVersion: kuadrant.io/v1beta2
kind: AuthPolicy
metadata:
  name: authpolicy
  namespace: default
spec:
  rules:
    authentication: {}
  targetRef:
    group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: mygateway
  tracing:
    endpoint: rpc://tempo.tempo.svc.cluster.local:4317
    insecure: true
---
apiVersion: kuadrant.io/v1beta2
kind: RateLimitPolicy
metadata:
  name: limitpolicy
  namespace: default
spec:
  limits: {}
  targetRef:
    group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: mygateway
  tracing:
    endpoint: rpc://tempo.tempo.svc.cluster.local:4317
    insecure: true
```

Once the changes are applied, the Authorino and Limitador components will be redeployed with tracing enabled.

**Note:**

There are [plans](https://github.com/Kuadrant/architecture/issues/48) to consolidate the tracing configuration in a single location, the Kuadrant CR.
This will eventually eliminate the need to configure tracing in both the Authorino and Limitador CRs.

**Important:**

Currently, trace IDs [do not propagate](https://github.com/envoyproxy/envoy/issues/22028) to wasm modules in Istio/Envoy, which affects trace continuity in Limitador.
This means that requests passed to Limitador will not have the relevant 'parent' trace ID in their trace information.
However, if the trace initiation point is outside of Envoy/Istio, the 'parent' trace ID will be available to Limitador and included in traces passed to the collector.
The missing 'parent' trace ID affects the ability to correlate traces from Limitador with traces from Authorino, the gateway and any other components in the path of requests.
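
As a rough illustration of initiating a trace outside of Envoy/Istio, a client could generate a W3C Trace Context `traceparent` value itself and send it as a request header. The sketch below is not part of Kuadrant: the helper name `new_traceparent` is hypothetical, and it assumes the gateway and the components behind it honour an incoming `traceparent` header.

```python
import secrets


def new_traceparent(sampled: bool = True) -> str:
    """Build a traceparent value in the version-00 layout:
    version-traceid-parentid-flags (all lowercase hex)."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    parent_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if sampled else "00"  # lowest bit marks the trace as sampled
    return f"00-{trace_id}-{parent_id}-{flags}"


# Hypothetical usage: attach the header to a request made with any HTTP client.
headers = {"traceparent": new_traceparent()}
print(headers["traceparent"])
```

The trace ID is the second field of the generated value, so it can be noted on the client side and used later to search for the trace in the collector.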

## Troubleshooting Flow Using Traces and Logs

Using a tracing interface like the Jaeger UI or Grafana, you can search for trace information by the trace ID.
You may get the trace ID from logs, or from a header in a sample request you want to troubleshoot.
You can also search for recent traces, filtering by the service you want to focus on.

Here is an example trace in the Grafana UI showing the total request time from the gateway (Istio), the time to check the current rate limit count (and update it) in Limitador and the time to check auth in Authorino:

<img src="./grafana_trace.png" alt="Trace in Grafana UI" width="800"/>

In Limitador, it is possible to enable request logging with trace IDs to get more information on requests.
A log entry will look something like this, with the `traceparent` field holding the trace ID:

```
"Request received: Request { metadata: MetadataMap { headers: {"te": "trailers", "grpc-timeout": "5000m", "content-type": "application/grpc", "traceparent": "00-4a2a933a23df267aed612f4694b32141-00f067aa0ba902b7-01", "x-envoy-internal": "true", "x-envoy-expected-rq-timeout-ms": "5000"} }, message: RateLimitRequest { domain: "default/toystore", descriptors: [RateLimitDescriptor { entries: [Entry { key: "limit.general_user__f5646550", value: "1" }, Entry { key: "metadata.filter_metadata.envoy\\.filters\\.http\\.ext_authz.identity.userid", value: "alice" }], limit: None }], hits_addend: 1 }, extensions: Extensions }"
```
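
The `traceparent` value in the log entry above follows the W3C Trace Context layout `version-traceid-parentid-flags`, so the trace ID to search for is the second field. A minimal Python sketch of pulling it out is shown below; the `TraceParent` type and `parse_traceparent` helper are hypothetical names used only for illustration, and the parser only handles the version `00` layout.

```python
from typing import NamedTuple


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(value: str) -> TraceParent:
    """Split a version-00 traceparent value into its four fields."""
    parts = value.strip().split("-")
    if len(parts) != 4:
        raise ValueError(f"expected 4 dash-separated fields, got {len(parts)}")
    version, trace_id, parent_id, flags = parts
    if len(trace_id) != 32 or len(parent_id) != 16 or len(flags) != 2:
        raise ValueError("unexpected field length in traceparent value")
    # The lowest bit of the flags byte indicates whether the trace is sampled.
    sampled = bool(int(flags, 16) & 0x01)
    return TraceParent(version, trace_id, parent_id, sampled)


# Value copied from the example log entry above.
tp = parse_traceparent("00-4a2a933a23df267aed612f4694b32141-00f067aa0ba902b7-01")
print(tp.trace_id)  # 4a2a933a23df267aed612f4694b32141
```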

If you centrally aggregate logs using something like Promtail and Loki, you can jump between trace information and the relevant logs for that service:

<img src="./grafana_tracing_loki.png" alt="Trace and logs in Grafana UI" width="800"/>

Using a combination of tracing and logs, you can visualise and troubleshoot request timing issues and drill down to specific services.
This method becomes even more powerful when combined with [metrics](https://docs.kuadrant.io/kuadrant-operator/doc/observability/metrics/) and [dashboards](https://docs.kuadrant.io/kuadrant-operator/doc/observability/dashboards/) to get a more complete picture of your users' traffic.