Commit 1abde64
[DOCS] More monitoring docs
1 parent 55471ab

6 files changed, +604 -0 lines changed

Lines changed: 150 additions & 0 deletions
@@ -0,0 +1,150 @@
[role="xpack"]
[testenv="basic"]
[[es-monitoring-collectors]]
== Collectors

Collectors, as their name implies, collect things. Each collector runs once per
collection interval to obtain data from the public APIs in {es} and {xpack}
that it chooses to monitor. When the data collection is finished, the data is
handed in bulk to the <<es-monitoring-exporters,exporters>> to be sent to the
monitoring clusters. Regardless of the number of exporters, each collector
runs only once per collection interval.

There is only one collector per data type gathered. In other words, any
monitoring document that is created comes from a single collector rather than
being merged from multiple collectors. {monitoring} for {es} currently has only
a handful of collectors because the goal is to minimize overlap between them
for optimal performance.

Each collector can create zero or more monitoring documents. For example, the
`index_stats` collector collects all index statistics at the same time to
avoid many unnecessary calls.

[options="header"]
|=======================
| Collector | Data Types | Description
| Cluster Stats | `cluster_stats`
| Gathers details about the cluster state, including parts of the actual
cluster state (for example, `GET /_cluster/state`) and statistics about it
(for example, `GET /_cluster/stats`). This produces a single document type. In
versions prior to X-Pack 5.5, this was actually three separate collectors that
resulted in three separate types: `cluster_stats`, `cluster_state`, and
`cluster_info`. In 5.5 and later, all three are combined into `cluster_stats`.
+
This only runs on the _elected_ master node and the data collected
(`cluster_stats`) largely controls the UI. When this data is not present, it
indicates either a misconfiguration on the elected master node, timeouts
related to the collection of the data, or issues with storing the data. Only a
single document is produced per collection.
| Index Stats | `indices_stats`, `index_stats`
| Gathers details about the indices in the cluster, both in summary and
individually. This creates many documents that represent parts of the index
statistics output (for example, `GET /_stats`).
+
This information only needs to be collected once, so it is collected on the
_elected_ master node. The most common failure for this collector relates to an
extreme number of indices -- and therefore time to gather them -- resulting in
timeouts. One summary `indices_stats` document is produced per collection and
one `index_stats` document is produced per index, per collection.
| Index Recovery | `index_recovery`
| Gathers details about index recovery in the cluster. Index recovery
represents the assignment of _shards_ at the cluster level. If an index is not
recovered, it is not usable. This also corresponds to shard restoration via
snapshots.
+
This information only needs to be collected once, so it is collected on the
_elected_ master node. The most common failure for this collector relates to an
extreme number of shards -- and therefore time to gather them -- resulting in
timeouts. By default, this creates a single document that contains all
recoveries, which can be quite large, but it gives the most accurate picture
of recovery in the production cluster.
| Shards | `shards`
| Gathers details about all _allocated_ shards for all indices, particularly
including what node each shard is allocated to.
+
This information only needs to be collected once, so it is collected on the
_elected_ master node. Unlike most other collectors, this collector uses the
local cluster state to get the routing table, so it avoids network timeout
issues. Each shard is represented by a separate monitoring document.
| Jobs | `job_stats`
| Gathers details about all machine learning job statistics (for example,
`GET /_xpack/ml/anomaly_detectors/_stats`).
+
This information only needs to be collected once, so it is collected on the
_elected_ master node. However, for the master node to be able to perform the
collection, the master node must have `xpack.ml.enabled` set to `true` (the
default) and a license level that supports {ml}.
| Node Stats | `node_stats`
| Gathers details about the running node, such as memory utilization and CPU
usage (for example, `GET /_nodes/_local/stats`).
+
This runs on _every_ node with {monitoring} enabled. One common failure is a
timeout of the node stats request caused by too many segment files: the
collector spends so long waiting for the file system stats to be calculated
that it eventually times out. A single `node_stats` document is created per
collection. This is collected per node to help discover issues with nodes
communicating with each other, but not with the monitoring cluster (for
example, intermittent network issues or memory pressure).
|=======================

{monitoring} uses a single-threaded scheduler to run the collection of {es}
monitoring data by all of the appropriate collectors on each node. This
scheduler is managed locally by each node and its interval is controlled by
the `xpack.monitoring.collection.interval` setting, which defaults to 10
seconds (`10s`) and can be set at either the node or cluster level.
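
For example, a minimal sketch of what this might look like in
`elasticsearch.yml` (the value shown is simply the documented default):

[source,yaml]
---------------------------------------------------
# Run all enabled collectors every 10 seconds (the default).
xpack.monitoring.collection.interval: 10s
---------------------------------------------------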

Fundamentally, each collector works on the same principle. At each collection
interval, {monitoring} checks whether each collector should run and then runs
the appropriate collectors. The failure of an individual collector does not
impact any other collector.

Once collection has completed, all of the monitoring data is passed to the
exporters to route the monitoring data to the monitoring clusters.

If gaps exist in the monitoring charts in {kib}, it is typically because either
a collector failed or the monitoring cluster did not receive the data (for
example, it was being restarted). If a collector fails, a logged error should
exist on the node that attempted to perform the collection.

NOTE: Collection is currently done serially, rather than in parallel, to avoid
extra overhead on the elected master node. The downside to this approach is
that collectors might observe a different version of the cluster state within
the same collection period. In practice, this does not make a significant
difference and running the collectors in parallel would not prevent such a
possibility.

For more information about the configuration options for the collectors, see
<<monitoring-collection-settings>>.

[float]
[[es-monitoring-stack]]
=== Collecting data from across the Elastic Stack

{monitoring} in {es} also receives monitoring data from other parts of the
Elastic Stack. In this way, it serves as an unscheduled monitoring data
collector for the stack.

By default, data collection is disabled. {es} monitoring data is not collected
and all monitoring data from other sources, such as {kib}, Beats, and
Logstash, is ignored. You must set `xpack.monitoring.collection.enabled` to
`true` to enable the collection of monitoring data. See
<<monitoring-settings>>.
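
For example, a minimal sketch of enabling collection in `elasticsearch.yml`
(depending on the version, this setting may also be updatable dynamically via
the cluster settings API):

[source,yaml]
---------------------------------------------------
# Enable the collection of monitoring data for the whole stack.
xpack.monitoring.collection.enabled: true
---------------------------------------------------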

Once data is received, it is forwarded to the exporters to be routed to the
monitoring cluster like all monitoring data.

WARNING: Because this stack-level "collector" lives outside of the collection
interval of {monitoring} for {es}, it is not impacted by the
`xpack.monitoring.collection.interval` setting. Therefore, data is passed to
the exporters whenever it is received. This behavior can result in indices for
{kib}, Logstash, or Beats being created somewhat unexpectedly.

While the monitoring data is collected and processed, some production cluster
metadata is added to incoming documents. This metadata enables {kib} to link
the monitoring data to the appropriate cluster. If this linkage is unimportant
to the infrastructure that you're monitoring, it might be simpler to configure
Logstash and Beats to report monitoring data directly to the monitoring
cluster. This scenario also prevents the production cluster from adding extra
overhead related to monitoring data, which can be very useful when there are a
large number of Logstash nodes or Beats.

For more information about typical monitoring architectures, see
{xpack-ref}/how-monitoring-works.html[How Monitoring Works].

Lines changed: 172 additions & 0 deletions
@@ -0,0 +1,172 @@
[role="xpack"]
[testenv="basic"]
[[es-monitoring-exporters]]
== Exporters

The purpose of exporters is to take data collected from any Elastic Stack
source and route it to the monitoring cluster. It is possible to configure
more than one exporter, but the general and default setup is to use a single
exporter.

There are two types of exporters in {es}:

`local`::
The default exporter used by {monitoring} for {es}. This exporter routes data
back into the _same_ cluster. See <<local-exporter>>.

`http`::
The preferred exporter, which you can use to route data into any supported
{es} cluster accessible via HTTP. Production environments should always use a
separate monitoring cluster. See <<http-exporter>>.

Both exporters serve the same purpose: to set up the monitoring cluster and
route monitoring data. However, they perform these tasks in very different
ways. Even though they work differently, both exporters are capable of sending
all of the same data.

Exporters are configurable at both the node and cluster level. Cluster-wide
settings, which are updated with the
<<cluster-update-settings,`_cluster/settings` API>>, take precedence over
settings in the `elasticsearch.yml` file on each node. When you update an
exporter, it is completely replaced by the updated version of the exporter.
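
For example, a minimal sketch of configuring a single `http` exporter in
`elasticsearch.yml`; the exporter name `my_remote` and the host are
illustrative placeholders:

[source,yaml]
---------------------------------------------------
# The exporter name ("my_remote") is arbitrary but must be unique.
xpack.monitoring.exporters.my_remote:
  type: http
  host: ["http://monitoring-cluster.example.com:9200"]
---------------------------------------------------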

IMPORTANT: It is critical that all nodes share the same setup. Otherwise,
monitoring data might be routed in different ways or to different places.

When the exporters route monitoring data into the monitoring cluster, they use
`_bulk` indexing for optimal performance. All monitoring data is forwarded in
bulk to all enabled exporters on the same node. From there, the exporters
serialize the monitoring data and send a bulk request to the monitoring
cluster. There is no queuing -- in memory or persisted to disk -- so any
failure during the export results in the loss of that batch of monitoring
data. This design limits the impact on {es}, and the assumption is that the
next pass will succeed.

Routing monitoring data involves indexing it into the appropriate monitoring
indices. Once the data is indexed, it exists in a monitoring index that, by
default, is named with a daily index pattern. For {es} monitoring data, this
is an index that matches `.monitoring-es-6-*`. From there, the data lives
inside the monitoring cluster and must be curated or cleaned up as necessary.
If you do not curate the monitoring data, it eventually fills up the nodes and
the cluster might fail due to lack of disk space.

TIP: We strongly recommend that you manage the curation of indices,
particularly the monitoring indices. To do so, you can take advantage of the
<<local-exporter-cleaner,cleaner service>> or
{curator-ref-current}/index.html[Elastic Curator].
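
As a rough sketch, one relevant knob for the cleaner service is the retention
period; the setting name (`xpack.monitoring.history.duration`) and the `3d`
value below are not named in this document, so verify them against
<<monitoring-settings>>:

[source,yaml]
---------------------------------------------------
# Ask the cleaner service to delete monitoring indices older than three days.
xpack.monitoring.history.duration: 3d
---------------------------------------------------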

//TO-DO: Add information about index lifecycle management https://github.com/elastic/x-pack-elasticsearch/issues/2814

When using cluster alerts, {watcher} creates daily `.watcher_history*`
indices. These are not managed by {monitoring} and they are not curated
automatically. It is therefore critical that you curate these indices to avoid
an undesirable and unexpected increase in the number of shards and indices
and, eventually, in disk usage. If you are using a `local` exporter, you can
set the `xpack.watcher.history.cleaner_service.enabled` setting to `true` and
curate the `.watcher_history*` indices by using the
<<local-exporter-cleaner,cleaner service>>. See
<<general-notification-settings>>.
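
For example, a minimal sketch of enabling that behavior in `elasticsearch.yml`
on a cluster that uses a `local` exporter:

[source,yaml]
---------------------------------------------------
# Allow the cleaner service to also curate .watcher_history* indices.
xpack.watcher.history.cleaner_service.enabled: true
---------------------------------------------------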

There is also a disk watermark (known as the flood-stage watermark) that
protects clusters from running out of disk space. When this threshold is
crossed, it makes all indices (including monitoring indices) read-only until
the issue is fixed and a user manually makes the index writable again. While
an active monitoring index is read-only, it naturally fails to write (index)
new data and continuously logs errors that indicate the write failure. For
more information, see {ref}/disk-allocator.html[Disk-based Shard Allocation].
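
As a rough sketch, the threshold at which this protection triggers is
controlled by a cluster setting along the following lines; the setting name
(`cluster.routing.allocation.disk.watermark.flood_stage`) is not part of this
document and the `95%` value shown is simply the commonly documented default,
so treat both as illustrative:

[source,yaml]
---------------------------------------------------
# Indices with shards on a node past this watermark are marked read-only.
cluster.routing.allocation.disk.watermark.flood_stage: 95%
---------------------------------------------------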

[float]
[[es-monitoring-default-exporter]]
=== Default exporters

If a node or cluster does not explicitly define an {monitoring} exporter, the
following default exporter is used:

[source,yaml]
---------------------------------------------------
xpack.monitoring.exporters.default_local: <1>
  type: local
---------------------------------------------------
<1> The exporter name uniquely defines the exporter, but it is otherwise
unused. When you specify your own exporters, you do not need to explicitly
overwrite or reference `default_local`.

If another exporter is already defined, the default exporter is _not_ created.
When you define a new exporter, if the default exporter exists, it is
automatically removed.

[float]
[[es-monitoring-templates]]
=== Exporter templates and ingest pipelines

Before exporters can route monitoring data, they must set up certain {es}
resources. These resources include templates and ingest pipelines. The
following table lists the templates that are required before an exporter can
route monitoring data:

[options="header"]
|=======================
| Template | Purpose
| `.monitoring-alerts` | All cluster alerts for monitoring data.
| `.monitoring-beats` | All Beats monitoring data.
| `.monitoring-es` | All {es} monitoring data.
| `.monitoring-kibana` | All {kib} monitoring data.
| `.monitoring-logstash` | All Logstash monitoring data.
|=======================

The templates are ordinary {es} templates that control the default settings
and mappings for the monitoring indices.

By default, monitoring indices are created daily (for example,
`.monitoring-es-6-2017.08.26`). You can change the default date suffix for
monitoring indices with the `index.name.time_format` setting. You can use this
setting to control how frequently monitoring indices are created by a specific
`http` exporter. You cannot use this setting with `local` exporters. For more
information, see <<http-exporter-settings>>.
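
For example, a minimal sketch of switching an `http` exporter to monthly
indices; the exporter name `my_remote`, the host, and the exact date-format
value are illustrative assumptions, so check <<http-exporter-settings>> for
the supported syntax:

[source,yaml]
---------------------------------------------------
xpack.monitoring.exporters.my_remote:
  type: http
  host: ["http://monitoring-cluster.example.com:9200"]
  # Roll monitoring indices monthly instead of daily.
  index.name.time_format: YYYY.MM
---------------------------------------------------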

WARNING: Some users create their own templates that match _all_ index
patterns, which can therefore impact the monitoring indices that get created.
It is critical that you do not disable `_source` storage for the monitoring
indices. If you do, {monitoring} for {kib} does not work and you cannot
visualize monitoring data for your cluster.

The following table lists the ingest pipelines that are required before an
exporter can route monitoring data:

[options="header"]
|=======================
| Pipeline | Purpose
| `xpack_monitoring_2` | Upgrades X-Pack monitoring data coming from X-Pack
5.0 - 5.4 to be compatible with the format used in {monitoring} 5.5.
| `xpack_monitoring_6` | A placeholder pipeline that is empty.
|=======================

Exporters handle the setup of these resources before ever sending data. If
resource setup fails (for example, due to security permissions), no data is
sent and warnings are logged.

NOTE: Empty pipelines are evaluated on the coordinating node during indexing
and they are ignored without any extra effort, which inherently makes them a
safe, no-op operation.

For monitoring clusters that have disabled `node.ingest` on all nodes, it is
possible to disable the use of the ingest pipeline feature. However, doing so
defeats its purpose, which is to upgrade older monitoring data as our mappings
improve over time. Beginning in 6.0, the ingest pipeline feature is a
requirement on the monitoring cluster; you must have `node.ingest` enabled on
at least one node.

WARNING: Once any node running 5.5 or later has set up the templates and
ingest pipeline on a monitoring cluster, you must use {kib} 5.5 or later to
view all subsequent data on the monitoring cluster. The easiest way to
determine whether this update has occurred is by checking for the presence of
indices matching `.monitoring-es-6-*` (or, more concretely, the existence of
the new pipeline). Versions prior to 5.5 used `.monitoring-es-2-*`.

Each resource that is created by an {monitoring} exporter has a `version`
field, which is used to determine whether the resource should be replaced. The
`version` field value represents the latest version of {monitoring} that
changed the resource. If a resource is edited by someone or something external
to {monitoring}, those changes are lost the next time an automatic update
occurs.

include::local-export.asciidoc[]
include::http-export.asciidoc[]
