Add the jaeger-mixin for monitoring #1668

Merged
merged 2 commits into from
Aug 7, 2019
95 changes: 95 additions & 0 deletions monitoring/jaeger-mixin/README.md
@@ -0,0 +1,95 @@
# Prometheus monitoring mixin for Jaeger

The Prometheus monitoring mixin for Jaeger provides a starting point for people wanting to monitor Jaeger using Prometheus, Alertmanager, and Grafana. To use it, you'll need [`jsonnet`](https://github.com/google/go-jsonnet) and [`jb` (jsonnet-bundler)](https://github.com/jsonnet-bundler/jsonnet-bundler). They can be installed using `go get`, as follows:

```console
$ go get github.com/google/go-jsonnet/cmd/jsonnet
$ go get github.com/jsonnet-bundler/jsonnet-bundler/cmd/jb
```

Your monitoring mixin can then be initialized as follows:

```console
$ jb init
$ jb install \
github.com/jaegertracing/jaeger/monitoring/jaeger-mixin@master \
github.com/grafana/jsonnet-libs/grafana-builder@master \
github.com/coreos/kube-prometheus/jsonnet/kube-prometheus@master
```

In the directory where your mixin was initialized, create a new `monitoring-setup.jsonnet` file that specifies what your monitoring stack should look like. This file is yours: any customizations to Prometheus, Grafana, or Alertmanager should take place here. A simple example providing only the Jaeger dashboard for Grafana would be:

```jsonnet
local jaegerDashboard = (import 'jaeger-mixin/mixin.libsonnet').grafanaDashboards;
{ ['dashboards-jaeger.json']: jaegerDashboard['jaeger.json'] }
```

The manifest files can be generated via the `jsonnet` command below. Once the command finishes, the file `manifests/dashboards-jaeger.json` should be available and can be loaded directly into Grafana.

```console
$ jsonnet -J vendor -cm manifests/ monitoring-setup.jsonnet
```
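
The same pattern can emit the alert rules next to the dashboard. The following is a minimal sketch rather than part of the mixin itself: it assumes the mixin is vendored under `jaeger-mixin/` as in the example above, and the output file name `alerts-jaeger.json` is an arbitrary choice.

```jsonnet
// Variation on monitoring-setup.jsonnet: emit the dashboard and the alert rules.
local jaeger = import 'jaeger-mixin/mixin.libsonnet';

{
  // Grafana dashboard, as in the example above.
  'dashboards-jaeger.json': jaeger.grafanaDashboards['jaeger.json'],
  // Rule groups defined in alerts.libsonnet; the file name is an assumption.
  'alerts-jaeger.json': jaeger.prometheusAlerts,
}
```

Because `jsonnet -m` writes one file per top-level field, this should produce both files under `manifests/`. The rules file is JSON, so it is worth validating it (for example with `promtool check rules`) before pointing Prometheus at it.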

An example producing the manifests for a complete monitoring stack is located in this directory, as `monitoring-setup.example.jsonnet`. The manifests include Prometheus, Grafana, and Alertmanager managed via the Prometheus Operator for Kubernetes.

```jsonnet
local jaegerAlerts = (import 'jaeger-mixin/alerts.libsonnet').prometheusAlerts;
local jaegerDashboard = (import 'jaeger-mixin/mixin.libsonnet').grafanaDashboards;

local kp =
  (import 'kube-prometheus/kube-prometheus.libsonnet') +
  {
    _config+:: {
      namespace: 'monitoring',
    },
    grafanaDashboards+:: {
      'jaeger.json': jaegerDashboard['jaeger.json'],
    },
    prometheusAlerts+:: jaegerAlerts,
  };

{ ['00namespace-' + name + '.json']: kp.kubePrometheus[name] for name in std.objectFields(kp.kubePrometheus) } +
{ ['0prometheus-operator-' + name + '.json']: kp.prometheusOperator[name] for name in std.objectFields(kp.prometheusOperator) } +
{ ['node-exporter-' + name + '.json']: kp.nodeExporter[name] for name in std.objectFields(kp.nodeExporter) } +
{ ['kube-state-metrics-' + name + '.json']: kp.kubeStateMetrics[name] for name in std.objectFields(kp.kubeStateMetrics) } +
{ ['alertmanager-' + name + '.json']: kp.alertmanager[name] for name in std.objectFields(kp.alertmanager) } +
{ ['prometheus-' + name + '.json']: kp.prometheus[name] for name in std.objectFields(kp.prometheus) } +
{ ['prometheus-adapter-' + name + '.json']: kp.prometheusAdapter[name] for name in std.objectFields(kp.prometheusAdapter) } +
{ ['grafana-' + name + '.json']: kp.grafana[name] for name in std.objectFields(kp.grafana) }
```

The manifest files can be generated via `jsonnet` and passed directly to `kubectl`:

```console
$ jsonnet -J vendor -cm manifests/ monitoring-setup.jsonnet
$ kubectl apply -f manifests/
```

The resulting manifests include everything needed to run Prometheus, Alertmanager, and Grafana instances. Whenever a new alert rule or dashboard is needed, change your `monitoring-setup.jsonnet`, then regenerate and reapply the manifests.
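
As an illustration of adding an alert rule, the sketch below appends a site-specific rule group to the groups shipped with the mixin. The group name, alert name, threshold, and duration are hypothetical values chosen for the example; only the `jaeger_collector_queue_length` metric is taken from the mixin's dashboard.

```jsonnet
// Sketch: append a site-specific rule group to the alerts shipped with the mixin.
local jaegerAlerts = (import 'jaeger-mixin/alerts.libsonnet').prometheusAlerts;

// Hypothetical extra group; the alert name, threshold, and duration are
// illustrative and should be tuned for your installation.
local customAlerts = {
  groups+: [
    {
      name: 'jaeger_custom_alerts',
      rules: [
        {
          alert: 'JaegerCollectorQueueGrowing',
          expr: 'jaeger_collector_queue_length > 1000',
          'for': '15m',
          labels: { severity: 'warning' },
          annotations: {
            message: 'collector {{ $labels.instance }} has {{ $value }} spans queued.',
          },
        },
      ],
    },
  ],
};

// `groups+:` concatenates the two lists, so the mixin's own rules are kept.
{ 'alerts-jaeger.json': jaegerAlerts + customAlerts }
```

In the full-stack example, the same merge could be used by replacing `prometheusAlerts+:: jaegerAlerts` with `prometheusAlerts+:: jaegerAlerts + customAlerts`.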

Make sure your Prometheus setup is properly scraping the Jaeger components, either by creating a `ServiceMonitor` (and the backing `Service` objects), or via `PodMonitor` resources, like:

```console
$ kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: tracing
  namespace: monitoring
spec:
  podMetricsEndpoints:
  - interval: 5s
    targetPort: 14269
  selector:
    matchLabels:
      app: jaeger
EOF
```

This `PodMonitor` tells Prometheus to scrape port `14269` on every pod labeled `app: jaeger`. If the Jaeger Collector, Agent, and Query run in different pods, you might need to adjust this resource or create further `PodMonitor` resources to scrape metrics from their ports.
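
One way to keep several `PodMonitor` resources consistent is to generate them from Jsonnet together with the other manifests. The sketch below is illustrative only: the `podMonitor` helper, the `app.kubernetes.io/component` label, the file names, and the Agent and Query ports (`14271` and `16687`) are assumptions that should be checked against your own deployment.

```jsonnet
// Sketch: generate one PodMonitor per Jaeger component.
// `podMonitor` is a hypothetical helper, not part of the mixin.
local podMonitor(name, component, port) = {
  apiVersion: 'monitoring.coreos.com/v1',
  kind: 'PodMonitor',
  metadata: {
    name: name,
    namespace: 'monitoring',
  },
  spec: {
    podMetricsEndpoints: [
      { interval: '5s', targetPort: port },
    ],
    selector: {
      // The component label is an assumption about how the pods are labeled.
      matchLabels: { app: 'jaeger', 'app.kubernetes.io/component': component },
    },
  },
};

{
  'podmonitor-jaeger-collector.json': podMonitor('jaeger-collector', 'collector', 14269),
  // Ports below are assumed admin ports; verify them for your Jaeger version.
  'podmonitor-jaeger-agent.json': podMonitor('jaeger-agent', 'agent', 14271),
  'podmonitor-jaeger-query.json': podMonitor('jaeger-query', 'query', 16687),
}
```

Adding this object to the one returned by `monitoring-setup.jsonnet` (with `+`) would place the extra manifests in `manifests/` alongside the rest.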

This mixin was originally developed by [Grafana Labs](https://github.com/grafana/jsonnet-libs/tree/master/jaeger-mixin).

## Background

* For more information about monitoring mixins, see this [design doc](https://docs.google.com/document/d/1A9xvzwqnFVSOZ5fD3blKODXfsat5fg6ZhnKu9LK3lB4/view).
116 changes: 116 additions & 0 deletions monitoring/jaeger-mixin/alerts.libsonnet
@@ -0,0 +1,116 @@
local percentErrs(metric, errSelectors) = '100 * sum(rate(%(metric)s{%(errSelectors)s}[1m])) by (instance, job, namespace) / sum(rate(%(metric)s[1m])) by (instance, job, namespace)' % {
  metric: metric,
  errSelectors: errSelectors,
};

local percentErrsWithTotal(metric_errs, metric_total) = '100 * sum(rate(%(metric_errs)s[1m])) by (instance, job, namespace) / sum(rate(%(metric_total)s[1m])) by (instance, job, namespace)' % {
  metric_errs: metric_errs,
  metric_total: metric_total,
};

{
  prometheusAlerts+:: {
    groups+: [
      {
        name: 'jaeger_alerts',
        rules: [{
          alert: 'JaegerHTTPServerErrs',
          expr: percentErrsWithTotal('jaeger_agent_http_server_errors_total', 'jaeger_agent_http_server_total') + '> 1',
          'for': '15m',
          labels: {
            severity: 'warning',
          },
          annotations: {
            message: |||
              {{ $labels.job }} {{ $labels.instance }} is experiencing {{ printf "%.2f" $value }}% HTTP errors.
            |||,
          },
        }, {
          alert: 'JaegerRPCRequestsErrors',
          expr: percentErrs('jaeger_client_jaeger_rpc_http_requests', 'status_code=~"4xx|5xx"') + '> 1',
          'for': '15m',
          labels: {
            severity: 'warning',
          },
          annotations: {
            message: |||
              {{ $labels.job }} {{ $labels.instance }} is experiencing {{ printf "%.2f" $value }}% RPC HTTP errors.
            |||,
          },
        }, {
          alert: 'JaegerClientSpansDropped',
          expr: percentErrs('jaeger_reporter_spans', 'result=~"dropped|err"') + '> 1',
          'for': '15m',
          labels: {
            severity: 'warning',
          },
          annotations: {
            message: |||
              service {{ $labels.job }} {{ $labels.instance }} is dropping {{ printf "%.2f" $value }}% spans.
            |||,
          },
        }, {
          alert: 'JaegerAgentSpansDropped',
          expr: percentErrsWithTotal('jaeger_agent_reporter_batches_failures_total', 'jaeger_agent_reporter_batches_submitted_total') + '> 1',
          'for': '15m',
          labels: {
            severity: 'warning',
          },
          annotations: {
            message: |||
              agent {{ $labels.job }} {{ $labels.instance }} is dropping {{ printf "%.2f" $value }}% spans.
            |||,
          },
        }, {
          alert: 'JaegerCollectorDroppingSpans',
          expr: percentErrsWithTotal('jaeger_collector_spans_dropped_total', 'jaeger_collector_spans_received_total') + '> 1',
          'for': '15m',
          labels: {
            severity: 'warning',
          },
          annotations: {
            message: |||
              collector {{ $labels.job }} {{ $labels.instance }} is dropping {{ printf "%.2f" $value }}% spans.
            |||,
          },
        }, {
          alert: 'JaegerSamplingUpdateFailing',
          expr: percentErrs('jaeger_sampler_queries', 'result="err"') + '> 1',
          'for': '15m',
          labels: {
            severity: 'warning',
          },
          annotations: {
            message: |||
              {{ $labels.job }} {{ $labels.instance }} is failing {{ printf "%.2f" $value }}% in updating sampling policies.
            |||,
          },
        }, {
          alert: 'JaegerThrottlingUpdateFailing',
          expr: percentErrs('jaeger_throttler_updates', 'result="err"') + '> 1',
          'for': '15m',
          labels: {
            severity: 'warning',
          },
          annotations: {
            message: |||
              {{ $labels.job }} {{ $labels.instance }} is failing {{ printf "%.2f" $value }}% in updating throttling policies.
            |||,
          },
        }, {
          alert: 'JaegerQueryReqsFailing',
          expr: percentErrs('jaeger_query_requests_total', 'result="err"') + '> 1',
          'for': '15m',
          labels: {
            severity: 'warning',
          },
          annotations: {
            message: |||
              {{ $labels.job }} {{ $labels.instance }} is seeing {{ printf "%.2f" $value }}% query errors on {{ $labels.operation }}.
            |||,
          },
        }],
      },
    ],
  },
}
102 changes: 102 additions & 0 deletions monitoring/jaeger-mixin/dashboards.libsonnet
@@ -0,0 +1,102 @@
local g = (import 'grafana-builder/grafana.libsonnet') + {
  qpsPanelErrTotal(selectorErr, selectorTotal):: {
    local expr(selector) = 'sum(rate(' + selector + '[1m]))',

    aliasColors: {
      success: '#7EB26D',
      'error': '#E24D42',
    },
    targets: [
      {
        expr: expr(selectorErr),
        format: 'time_series',
        intervalFactor: 2,
        legendFormat: 'error',
        refId: 'A',
        step: 10,
      },
      {
        expr: expr(selectorTotal) + ' - ' + expr(selectorErr),
        format: 'time_series',
        intervalFactor: 2,
        legendFormat: 'success',
        refId: 'B',
        step: 10,
      },
    ],
  } + $.stack,
};

{
  grafanaDashboards+: {
    'jaeger.json':
      g.dashboard('Jaeger')
      .addRow(
        g.row('Services')
        .addPanel(
          g.panel('span creation rate') +
          g.qpsPanelErrTotal('jaeger_reporter_spans{result=~"dropped|err"}', 'jaeger_reporter_spans') +
          g.stack
        )
        .addPanel(
          g.panel('% spans dropped') +
          g.queryPanel('sum(rate(jaeger_reporter_spans{result=~"dropped|err"}[1m])) by (namespace) / sum(rate(jaeger_reporter_spans[1m])) by (namespace)', '{{namespace}}') +
          { yaxes: g.yaxes({ format: 'percentunit', max: 1 }) } +
          g.stack
        )
      )
      .addRow(
        g.row('Agent')
        .addPanel(
          g.panel('batch ingest rate') +
          g.qpsPanelErrTotal('jaeger_agent_reporter_batches_failures_total', 'jaeger_agent_reporter_batches_submitted_total') +
          g.stack
        )
        .addPanel(
          g.panel('% batches dropped') +
          g.queryPanel('sum(rate(jaeger_agent_reporter_batches_failures_total[1m])) by (cluster) / sum(rate(jaeger_agent_reporter_batches_submitted_total[1m])) by (cluster)', '{{cluster}}') +
          { yaxes: g.yaxes({ format: 'percentunit', max: 1 }) } +
          g.stack
        )
      )
      .addRow(
        g.row('Collector')
        .addPanel(
          g.panel('span ingest rate') +
          g.qpsPanelErrTotal('jaeger_collector_spans_dropped_total', 'jaeger_collector_spans_received_total') +
          g.stack
        )
        .addPanel(
          g.panel('% spans dropped') +
          g.queryPanel('sum(rate(jaeger_collector_spans_dropped_total[1m])) by (instance) / sum(rate(jaeger_collector_spans_received_total[1m])) by (instance)', '{{instance}}') +
          { yaxes: g.yaxes({ format: 'percentunit', max: 1 }) } +
          g.stack
        )
      )
      .addRow(
        g.row('Collector Queue')
        .addPanel(
          g.panel('span queue length') +
          g.queryPanel('jaeger_collector_queue_length', '{{instance}}') +
          g.stack
        )
        .addPanel(
          g.panel('span queue time - 95 percentile') +
          g.queryPanel('histogram_quantile(0.95, sum(rate(jaeger_collector_in_queue_latency_bucket[1m])) by (le, instance))', '{{instance}}')
        )
      )
      .addRow(
        g.row('Query')
        .addPanel(
          g.panel('qps') +
          g.qpsPanelErrTotal('jaeger_query_requests_total{result="err"}', 'jaeger_query_requests_total') +
          g.stack
        )
        .addPanel(
          g.panel('latency - 99 percentile') +
          g.queryPanel('histogram_quantile(0.99, sum(rate(jaeger_query_latency_bucket[1m])) by (le, instance))', '{{instance}}') +
          g.stack
        )
      ),
  },
}
14 changes: 14 additions & 0 deletions monitoring/jaeger-mixin/jsonnetfile.json
@@ -0,0 +1,14 @@
{
    "dependencies": [
        {
            "name": "grafana-builder",
            "source": {
                "git": {
                    "remote": "https://github.com/grafana/jsonnet-libs",
                    "subdir": "grafana-builder"
                }
            },
            "version": "master"
        }
    ]
}
2 changes: 2 additions & 0 deletions monitoring/jaeger-mixin/mixin.libsonnet
@@ -0,0 +1,2 @@
(import 'dashboards.libsonnet') +
(import 'alerts.libsonnet')
23 changes: 23 additions & 0 deletions monitoring/jaeger-mixin/monitoring-setup.example.jsonnet
@@ -0,0 +1,23 @@
local jaegerAlerts = (import 'jaeger-mixin/alerts.libsonnet').prometheusAlerts;
local jaegerDashboard = (import 'jaeger-mixin/mixin.libsonnet').grafanaDashboards;

local kp =
  (import 'kube-prometheus/kube-prometheus.libsonnet') +
  {
    _config+:: {
      namespace: 'monitoring',
    },
    grafanaDashboards+:: {
      'jaeger.json': jaegerDashboard['jaeger.json'],
    },
    prometheusAlerts+:: jaegerAlerts,
  };

{ ['00namespace-' + name + '.json']: kp.kubePrometheus[name] for name in std.objectFields(kp.kubePrometheus) } +
{ ['0prometheus-operator-' + name + '.json']: kp.prometheusOperator[name] for name in std.objectFields(kp.prometheusOperator) } +
{ ['node-exporter-' + name + '.json']: kp.nodeExporter[name] for name in std.objectFields(kp.nodeExporter) } +
{ ['kube-state-metrics-' + name + '.json']: kp.kubeStateMetrics[name] for name in std.objectFields(kp.kubeStateMetrics) } +
{ ['alertmanager-' + name + '.json']: kp.alertmanager[name] for name in std.objectFields(kp.alertmanager) } +
{ ['prometheus-' + name + '.json']: kp.prometheus[name] for name in std.objectFields(kp.prometheus) } +
{ ['prometheus-adapter-' + name + '.json']: kp.prometheusAdapter[name] for name in std.objectFields(kp.prometheusAdapter) } +
{ ['grafana-' + name + '.json']: kp.grafana[name] for name in std.objectFields(kp.grafana) }