Add the jaeger-mixin for monitoring #1700

Closed
wants to merge 2 commits into from
80 changes: 80 additions & 0 deletions monitoring/jaeger-mixin/README.md
@@ -0,0 +1,80 @@
# Prometheus monitoring mixin for Jaeger

The Prometheus monitoring mixin for Jaeger provides a starting point for people wanting to monitor Jaeger using Prometheus, Alertmanager, and Grafana. To use it, you'll need [`jsonnet`](https://github.com/google/go-jsonnet) and [`jb` (jsonnet-bundler)](https://github.com/jsonnet-bundler/jsonnet-bundler). They can be installed using `go get`, as follows:

```console
$ go get github.com/google/go-jsonnet/cmd/jsonnet
$ go get github.com/jsonnet-bundler/jsonnet-bundler/cmd/jb
```

Your monitoring mixin can then be initialized as follows:

```console
$ jb init
$ jb install \
    github.com/jaegertracing/jaeger/monitoring/jaeger-mixin@master \
    github.com/grafana/jsonnet-libs/grafana-builder@master \
    github.com/coreos/kube-prometheus/jsonnet/kube-prometheus@master
```

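Your mixin, saved for example as `monitoring-setup.jsonnet` (the file name used by the `jsonnet` command further down), can then look like this: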

```jsonnet
local jaegerAlerts = (import 'jaeger-mixin/alerts.libsonnet').prometheusAlerts;
local jaegerDashboard = (import 'jaeger-mixin/mixin.libsonnet').grafanaDashboards;

local kp =
  (import 'kube-prometheus/kube-prometheus.libsonnet') +
  {
    _config+:: {
      namespace: 'monitoring',
    },
    grafanaDashboards+:: {
      'jaeger.json': jaegerDashboard['jaeger.json'],
    },
    prometheusAlerts+:: jaegerAlerts,
  };

{ ['00namespace-' + name + '.json']: kp.kubePrometheus[name] for name in std.objectFields(kp.kubePrometheus) } +
{ ['0prometheus-operator-' + name + '.json']: kp.prometheusOperator[name] for name in std.objectFields(kp.prometheusOperator) } +
{ ['node-exporter-' + name + '.json']: kp.nodeExporter[name] for name in std.objectFields(kp.nodeExporter) } +
{ ['kube-state-metrics-' + name + '.json']: kp.kubeStateMetrics[name] for name in std.objectFields(kp.kubeStateMetrics) } +
{ ['alertmanager-' + name + '.json']: kp.alertmanager[name] for name in std.objectFields(kp.alertmanager) } +
{ ['prometheus-' + name + '.json']: kp.prometheus[name] for name in std.objectFields(kp.prometheus) } +
{ ['prometheus-adapter-' + name + '.json']: kp.prometheusAdapter[name] for name in std.objectFields(kp.prometheusAdapter) } +
{ ['grafana-' + name + '.json']: kp.grafana[name] for name in std.objectFields(kp.grafana) }
```

The manifest files can be generated via `jsonnet` and passed directly to `kubectl`:

```console
$ jsonnet -J vendor -cm manifests/ monitoring-setup.jsonnet
$ kubectl apply -f manifests/
```
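For setups that run Prometheus without kube-prometheus, the alert rules can also be rendered on their own. The following is a minimal sketch and not part of this mixin: the file name `jaeger-rules.jsonnet` is hypothetical, and it assumes that `jb install` has placed the mixin under `vendor/`. It relies only on the `prometheusAlerts` field defined in `alerts.libsonnet` and on the standard library function `std.manifestYamlDoc`.

```jsonnet
// jaeger-rules.jsonnet (hypothetical file name, not shipped with the mixin)
local alerts = (import 'jaeger-mixin/alerts.libsonnet').prometheusAlerts;

// Render the { groups: [...] } object as a YAML document
// that Prometheus can load as a rule file.
std.manifestYamlDoc(alerts)
```

Evaluating it with `jsonnet -J vendor -S jaeger-rules.jsonnet` should print the YAML as a raw string, which can be redirected to a file and referenced from the `rule_files` section of the Prometheus configuration.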

Make sure your Prometheus setup is properly scraping the Jaeger components, either by creating a `ServiceMonitor` (and the backing `Service` objects), or via `PodMonitor` resources, like:

```console
$ kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: tracing
  namespace: monitoring
spec:
  podMetricsEndpoints:
  - interval: 5s
    targetPort: 14269
  selector:
    matchLabels:
      app: jaeger
EOF
```

This `PodMonitor` tells Prometheus to scrape port `14269` on every pod labeled `app: jaeger`. If the Jaeger Collector, Agent, and Query run in separate pods, you may need to adjust this resource or create additional `PodMonitor` resources to scrape their metrics ports.
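When the components run in separate pods, one way to avoid repeating the YAML is to generate a `PodMonitor` per component from Jsonnet. This is only a sketch: the `podMonitor` helper, the `app: jaeger-*` labels, and the agent and query ports are assumptions made for illustration (only port `14269` comes from the example above), so check them against your own deployment.

```jsonnet
// Hypothetical helper: builds one PodMonitor for a single Jaeger component.
local podMonitor(name, port, labels) = {
  apiVersion: 'monitoring.coreos.com/v1',
  kind: 'PodMonitor',
  metadata: {
    name: name,
    namespace: 'monitoring',
  },
  spec: {
    podMetricsEndpoints: [
      { interval: '5s', targetPort: port },
    ],
    selector: { matchLabels: labels },
  },
};

// Assumed labels and admin ports; verify them against your own deployment.
local components = {
  collector: { port: 14269, labels: { app: 'jaeger-collector' } },
  agent: { port: 14271, labels: { app: 'jaeger-agent' } },
  query: { port: 16687, labels: { app: 'jaeger-query' } },
};

// One output file per component, suitable for `jsonnet -m`.
{
  ['podmonitor-' + name + '.json']: podMonitor(
    'jaeger-' + name,
    components[name].port,
    components[name].labels
  )
  for name in std.objectFields(components)
}
```

Rendered with `jsonnet -m manifests/ <file>`, this produces one manifest per component that can be applied with `kubectl apply -f manifests/`, in the same way as the other generated files.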

This mixin was originally developed by [Grafana Labs](https://github.com/grafana/jsonnet-libs/tree/master/jaeger-mixin).

## Background

* For more information about monitoring mixins, see this [design doc](https://docs.google.com/document/d/1A9xvzwqnFVSOZ5fD3blKODXfsat5fg6ZhnKu9LK3lB4/edit#).
116 changes: 116 additions & 0 deletions monitoring/jaeger-mixin/alerts.libsonnet
@@ -0,0 +1,116 @@
// Builds a PromQL expression for the percentage of `metric` samples matching
// `errSelectors`, using 1m rates aggregated by instance, job, and namespace.
local percentErrs(metric, errSelectors) = '100 * sum(rate(%(metric)s{%(errSelectors)s}[1m])) by (instance, job, namespace) / sum(rate(%(metric)s[1m])) by (instance, job, namespace)' % {
  metric: metric,
  errSelectors: errSelectors,
};

// Same idea, for components that expose errors and totals as two separate metrics.
local percentErrsWithTotal(metric_errs, metric_total) = '100 * sum(rate(%(metric_errs)s[1m])) by (instance, job, namespace) / sum(rate(%(metric_total)s[1m])) by (instance, job, namespace)' % {
  metric_errs: metric_errs,
  metric_total: metric_total,
};

{
  prometheusAlerts+:: {
    groups+: [
      {
        name: 'jaeger_alerts',
        rules: [{
          alert: 'JaegerHTTPServerErrs',
          expr: percentErrsWithTotal('jaeger_agent_http_server_errors_total', 'jaeger_agent_http_server_total') + '> 1',
          'for': '15m',
          labels: {
            severity: 'warning',
          },
          annotations: {
            message: |||
              {{ $labels.job }} {{ $labels.instance }} is experiencing {{ printf "%.2f" $value }}% HTTP errors.
            |||,
          },
        }, {
          alert: 'JaegerRPCRequestsErrors',
          expr: percentErrs('jaeger_client_jaeger_rpc_http_requests', 'status_code=~"4xx|5xx"') + '> 1',
          'for': '15m',
          labels: {
            severity: 'warning',
          },
          annotations: {
            message: |||
              {{ $labels.job }} {{ $labels.instance }} is experiencing {{ printf "%.2f" $value }}% RPC HTTP errors.
            |||,
          },
        }, {
          alert: 'JaegerClientSpansDropped',
          expr: percentErrs('jaeger_reporter_spans', 'result=~"dropped|err"') + '> 1',
          'for': '15m',
          labels: {
            severity: 'warning',
          },
          annotations: {
            message: |||
              service {{ $labels.job }} {{ $labels.instance }} is dropping {{ printf "%.2f" $value }}% spans.
            |||,
          },
        }, {
          alert: 'JaegerAgentSpansDropped',
          expr: percentErrsWithTotal('jaeger_agent_reporter_batches_failures_total', 'jaeger_agent_reporter_batches_submitted_total') + '> 1',
          'for': '15m',
          labels: {
            severity: 'warning',
          },
          annotations: {
            message: |||
              agent {{ $labels.job }} {{ $labels.instance }} is dropping {{ printf "%.2f" $value }}% spans.
            |||,
          },
        }, {
          alert: 'JaegerCollectorDroppingSpans',
          expr: percentErrsWithTotal('jaeger_collector_spans_dropped_total', 'jaeger_collector_spans_received_total') + '> 1',
          'for': '15m',
          labels: {
            severity: 'warning',
          },
          annotations: {
            message: |||
              collector {{ $labels.job }} {{ $labels.instance }} is dropping {{ printf "%.2f" $value }}% spans.
            |||,
          },
        }, {
          alert: 'JaegerSamplingUpdateFailing',
          expr: percentErrs('jaeger_sampler_queries', 'result="err"') + '> 1',
          'for': '15m',
          labels: {
            severity: 'warning',
          },
          annotations: {
            message: |||
              {{ $labels.job }} {{ $labels.instance }} is failing {{ printf "%.2f" $value }}% in updating sampling policies.
            |||,
          },
        }, {
          alert: 'JaegerThrottlingUpdateFailing',
          expr: percentErrs('jaeger_throttler_updates', 'result="err"') + '> 1',
          'for': '15m',
          labels: {
            severity: 'warning',
          },
          annotations: {
            message: |||
              {{ $labels.job }} {{ $labels.instance }} is failing {{ printf "%.2f" $value }}% in updating throttling policies.
            |||,
          },
        }, {
          alert: 'JaegerQueryReqsFailing',
          expr: percentErrs('jaeger_query_requests_total', 'result="err"') + '> 1',
          'for': '15m',
          labels: {
            severity: 'warning',
          },
          annotations: {
            message: |||
              {{ $labels.job }} {{ $labels.instance }} is seeing {{ printf "%.2f" $value }}% query errors on {{ $labels.operation }}.
            |||,
          },
        }],
      },
    ],
  },
}
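To make the alert expressions in `alerts.libsonnet` easier to review, the snippet below sketches what `percentErrs` expands to for the `JaegerClientSpansDropped` rule. It copies the helper so that it can be evaluated on its own; the `expr` and `expected` locals are introduced here for illustration only.

```jsonnet
// Copy of the helper from alerts.libsonnet, so the snippet evaluates standalone.
local percentErrs(metric, errSelectors) = '100 * sum(rate(%(metric)s{%(errSelectors)s}[1m])) by (instance, job, namespace) / sum(rate(%(metric)s[1m])) by (instance, job, namespace)' % {
  metric: metric,
  errSelectors: errSelectors,
};

// Same call as in the JaegerClientSpansDropped rule.
local expr = percentErrs('jaeger_reporter_spans', 'result=~"dropped|err"') + '> 1';

// The PromQL string the rule ends up with, written out by hand.
local expected =
  '100 * sum(rate(jaeger_reporter_spans{result=~"dropped|err"}[1m])) by (instance, job, namespace)'
  + ' / sum(rate(jaeger_reporter_spans[1m])) by (instance, job, namespace)'
  + '> 1';

// std.assertEqual raises an error if the two strings differ.
assert std.assertEqual(expr, expected);

{ expr: expr }
```

The rule therefore fires when more than 1% of reported spans per instance, job, and namespace are dropped or errored, and that condition has held for the 15 minutes set in `'for'`.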
102 changes: 102 additions & 0 deletions monitoring/jaeger-mixin/dashboards.libsonnet
@@ -0,0 +1,102 @@
local g = (import 'grafana-builder/grafana.libsonnet') + {
  qpsPanelErrTotal(selectorErr, selectorTotal):: {
    local expr(selector) = 'sum(rate(' + selector + '[1m]))',

    aliasColors: {
      success: '#7EB26D',
      'error': '#E24D42',
    },
    targets: [
      {
        expr: expr(selectorErr),
        format: 'time_series',
        intervalFactor: 2,
        legendFormat: 'error',
        refId: 'A',
        step: 10,
      },
      {
        expr: expr(selectorTotal) + ' - ' + expr(selectorErr),
        format: 'time_series',
        intervalFactor: 2,
        legendFormat: 'success',
        refId: 'B',
        step: 10,
      },
    ],
  } + $.stack,
};

{
  grafanaDashboards+: {
    'jaeger.json':
      g.dashboard('Jaeger')
      .addRow(
        g.row('Services')
        .addPanel(
          g.panel('span creation rate') +
          g.qpsPanelErrTotal('jaeger_reporter_spans{result=~"dropped|err"}', 'jaeger_reporter_spans') +
          g.stack
        )
        .addPanel(
          g.panel('% spans dropped') +
          g.queryPanel('sum(rate(jaeger_reporter_spans{result=~"dropped|err"}[1m])) by (namespace) / sum(rate(jaeger_reporter_spans[1m])) by (namespace)', '{{namespace}}') +
          { yaxes: g.yaxes({ format: 'percentunit', max: 1 }) } +
          g.stack
        )
      )
      .addRow(
        g.row('Agent')
        .addPanel(
          g.panel('batch ingest rate') +
          g.qpsPanelErrTotal('jaeger_agent_reporter_batches_failures_total', 'jaeger_agent_reporter_batches_submitted_total') +
          g.stack
        )
        .addPanel(
          g.panel('% batches dropped') +
          g.queryPanel('sum(rate(jaeger_agent_reporter_batches_failures_total[1m])) by (cluster) / sum(rate(jaeger_agent_reporter_batches_submitted_total[1m])) by (cluster)', '{{cluster}}') +
          { yaxes: g.yaxes({ format: 'percentunit', max: 1 }) } +
          g.stack
        )
      )
      .addRow(
        g.row('Collector')
        .addPanel(
          g.panel('span ingest rate') +
          g.qpsPanelErrTotal('jaeger_collector_spans_dropped_total', 'jaeger_collector_spans_received_total') +
          g.stack
        )
        .addPanel(
          g.panel('% spans dropped') +
          g.queryPanel('sum(rate(jaeger_collector_spans_dropped_total[1m])) by (instance) / sum(rate(jaeger_collector_spans_received_total[1m])) by (instance)', '{{instance}}') +
          { yaxes: g.yaxes({ format: 'percentunit', max: 1 }) } +
          g.stack
        )
      )
      .addRow(
        g.row('Collector Queue')
        .addPanel(
          g.panel('span queue length') +
          g.queryPanel('jaeger_collector_queue_length', '{{instance}}') +
          g.stack
        )
        .addPanel(
          g.panel('span queue time - 95 percentile') +
          g.queryPanel('histogram_quantile(0.95, sum(rate(jaeger_collector_in_queue_latency_bucket[1m])) by (le, instance))', '{{instance}}')
        )
      )
      .addRow(
        g.row('Query')
        .addPanel(
          g.panel('qps') +
          g.qpsPanelErrTotal('jaeger_query_requests_total{result="err"}', 'jaeger_query_requests_total') +
          g.stack
        )
        .addPanel(
          g.panel('latency - 99 percentile') +
          g.queryPanel('histogram_quantile(0.99, sum(rate(jaeger_query_latency_bucket[1m])) by (le, instance))', '{{instance}}') +
          g.stack
        )
      ),
  },
}
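The `qpsPanelErrTotal` helper in `dashboards.libsonnet` plots two series per panel: the error rate, and the success rate derived as total minus errors. As a rough sketch, the snippet below reproduces only the query strings for the "Query" row, without depending on `grafana-builder`; the `errSel` and `totalSel` locals and the output field names are illustrative.

```jsonnet
// Same expression builder as the one local to qpsPanelErrTotal.
local expr(selector) = 'sum(rate(' + selector + '[1m]))';

// Selectors passed by the 'qps' panel in the 'Query' row.
local errSel = 'jaeger_query_requests_total{result="err"}';
local totalSel = 'jaeger_query_requests_total';

{
  // Target A, legend 'error'. The key is quoted because `error` is a Jsonnet keyword.
  'error': expr(errSel),
  // Target B, legend 'success': total rate minus error rate.
  success: expr(totalSel) + ' - ' + expr(errSel),
}
```

Because the success series is computed by subtraction, the total selector has to include the error series. That holds for every call in this file, where the error metric is either a label subset of the total or counted within it.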
14 changes: 14 additions & 0 deletions monitoring/jaeger-mixin/jsonnetfile.json
@@ -0,0 +1,14 @@
{
    "dependencies": [
        {
            "name": "grafana-builder",
            "source": {
                "git": {
                    "remote": "https://github.com/grafana/jsonnet-libs",
                    "subdir": "grafana-builder"
                }
            },
            "version": "master"
        }
    ]
}
2 changes: 2 additions & 0 deletions monitoring/jaeger-mixin/mixin.libsonnet
@@ -0,0 +1,2 @@
(import 'dashboards.libsonnet') +
(import 'alerts.libsonnet')
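`mixin.libsonnet` simply adds the two objects together. The toy example below sketches how that composition behaves, using stand-in objects instead of the real files: `grafanaDashboards` is declared with `+:` and stays visible, while `prometheusAlerts` is declared with `+::` and is hidden from the manifested output but still reachable by name. All values here are placeholders.

```jsonnet
// Stand-ins for dashboards.libsonnet and alerts.libsonnet (placeholder content).
local dashboards = { grafanaDashboards+: { 'jaeger.json': { title: 'Jaeger' } } };
local alerts = { prometheusAlerts+:: { groups+: [{ name: 'jaeger_alerts', rules: [] }] } };

// Same composition as mixin.libsonnet.
local mixin = dashboards + alerts;

{
  dashboards: std.objectFields(mixin.grafanaDashboards),  // ['jaeger.json']
  visibleFields: std.objectFields(mixin),  // ['grafanaDashboards']
  allFields: std.objectFieldsAll(mixin),  // ['grafanaDashboards', 'prometheusAlerts']
  alertGroups: [group.name for group in mixin.prometheusAlerts.groups],  // ['jaeger_alerts']
}
```

This is why the README example can read `.prometheusAlerts` and `.grafanaDashboards` directly from the imported files, even though the alerts field would not appear if the mixin were manifested as is.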