[role="xpack"]
[testenv="basic"]
[[es-monitoring-collectors]]
== Collectors

Collectors, as their name implies, collect things. Each collector runs once for
each collection interval to obtain data from the public APIs in {es} and {xpack}
that it chooses to monitor. When the data collection is finished, the data is
handed in bulk to the <<es-monitoring-exporters,exporters>> to be sent to the
monitoring clusters. Regardless of the number of exporters, each collector only
runs once per collection interval.

There is only one collector per data type gathered. In other words, each
monitoring document that is created comes from a single collector rather than
being merged from multiple collectors. {monitoring} for {es} currently has a
few collectors because the goal is to minimize overlap between them for
optimal performance.

Each collector can create zero or more monitoring documents. For example,
the `index_stats` collector collects all index statistics at the same time to
avoid many unnecessary calls.
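
The index statistics API illustrates why this batching matters: a single
request returns the cluster-wide summary as well as statistics for every
index. The following console sketch is purely illustrative; the collector
performs the equivalent work internally rather than issuing a REST request:

[source,console]
----
# One request returns the `_all` summary plus per-index statistics,
# which the `index_stats` collector gathers in a single pass.
GET /_stats
----

A similar illustrative sketch after the table below maps each collector to the
public API that roughly corresponds to it.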

[options="header"]
|=======================
| Collector | Data Types | Description
| Cluster Stats | `cluster_stats`
| Gathers details about the cluster state, including parts of
the actual cluster state (for example `GET /_cluster/state`) and statistics
about it (for example, `GET /_cluster/stats`). This produces a single document
type. In versions prior to X-Pack 5.5, this was actually three separate collectors
that resulted in three separate types: `cluster_stats`, `cluster_state`, and
`cluster_info`. In 5.5 and later, all three are combined into `cluster_stats`.
+
This only runs on the _elected_ master node and the data collected
(`cluster_stats`) largely controls the UI. When this data is not present, it
indicates either a misconfiguration on the elected master node, timeouts related
to the collection of the data, or issues with storing the data. Only a single
document is produced per collection.
| Index Stats | `indices_stats`, `index_stats`
| Gathers details about the indices in the cluster, both in summary and
individually. This creates many documents that represent parts of the index
statistics output (for example, `GET /_stats`).
+
This information only needs to be collected once, so it is collected on the
_elected_ master node. The most common failure for this collector relates to an
extreme number of indices (and therefore the time required to gather them),
resulting in timeouts. One summary `indices_stats` document is produced per
collection and one `index_stats` document is produced per index, per collection.
| Index Recovery | `index_recovery`
| Gathers details about index recovery in the cluster. Index recovery represents
the assignment of _shards_ at the cluster level. If an index is not recovered,
it is not usable. This also corresponds to shard restoration via snapshots.
+
This information only needs to be collected once, so it is collected on the
_elected_ master node. The most common failure for this collector relates to an
extreme number of shards (and therefore the time required to gather them),
resulting in timeouts. By default, this creates a single document that contains
all recoveries, which can be quite large, but it gives the most accurate picture
of recovery in the production cluster.
| Shards | `shards`
| Gathers details about all _allocated_ shards for all indices, particularly
including the node to which each shard is allocated.
+
This information only needs to be collected once, so it is collected on the
_elected_ master node. Unlike most other collectors, this one reads the routing
table from the local cluster state, so it is not subject to network timeout
issues. Each shard is represented by a separate monitoring document.
| Jobs | `job_stats`
| Gathers details about all machine learning job statistics (for example,
`GET /_xpack/ml/anomaly_detectors/_stats`).
+
This information only needs to be collected once, so it is collected on the
_elected_ master node. However, for the master node to be able to perform the
collection, the master node must have `xpack.ml.enabled` set to true (default)
and a license level that supports {ml}.
| Node Stats | `node_stats`
| Gathers details about the running node, such as memory utilization and CPU
usage (for example, `GET /_nodes/_local/stats`).
+
This runs on _every_ node with {monitoring} enabled. One common failure is a
timeout of the node stats request caused by too many segment files: the
collector spends so much time waiting for the file system statistics to be
calculated that the request eventually times out. A single `node_stats`
document is created per collection. This is collected per node to help
discover issues with nodes communicating with each other, but not with the
monitoring cluster (for example, intermittent network issues or memory pressure).
|=======================
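
The collectors gather this data through internal mechanisms, but the public
APIs below return roughly the same information and can be useful when you are
investigating what a given collector reports. This mapping is an illustrative
sketch, not a list of the exact requests the collectors issue; in particular,
the recovery and shard requests are assumptions based on the descriptions
above:

[source,console]
----
# Cluster Stats collector: cluster state plus cluster-level statistics
GET /_cluster/state
GET /_cluster/stats

# Index Stats collector: summary and per-index statistics
GET /_stats

# Index Recovery collector: recovery status for all indices (assumed equivalent)
GET /_recovery

# Shards collector: shard allocations from the routing table (assumed equivalent)
GET /_cluster/state/routing_table

# Jobs collector: machine learning job statistics
GET /_xpack/ml/anomaly_detectors/_stats

# Node Stats collector: statistics for the local node
GET /_nodes/_local/stats
----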

{monitoring} uses a single-threaded scheduler to run the collection of {es}
monitoring data by all of the appropriate collectors on each node. This
scheduler is managed locally by each node and its interval is controlled by the
`xpack.monitoring.collection.interval` setting, which defaults to 10 seconds
(`10s`) and can be set at either the node or cluster level.
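
For example, assuming a release in which this setting is dynamically
updatable, the interval can be changed at the cluster level through the
cluster settings API; otherwise, it must be set in `elasticsearch.yml` on each
node:

[source,console]
----
# Collect monitoring data every 30 seconds instead of the default 10s.
# Assumes a release in which this setting can be updated dynamically.
PUT /_cluster/settings
{
  "persistent": {
    "xpack.monitoring.collection.interval": "30s"
  }
}
----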

Fundamentally, each collector works on the same principle. On each collection
interval, each collector is checked to see whether it should run, and the
appropriate collectors then run. The failure of an individual collector does not
impact any other collector.

Once collection has completed, all of the monitoring data is passed to the
exporters, which route it to the monitoring clusters.

If gaps exist in the monitoring charts in {kib}, it is typically because either
a collector failed or the monitoring cluster did not receive the data (for
example, it was being restarted). In the event that a collector fails, a logged
error should exist on the node that attempted to perform the collection.

NOTE: Collection is currently done serially, rather than in parallel, to avoid
      extra overhead on the elected master node. The downside to this approach
      is that collectors might observe a different version of the cluster state
      within the same collection period. In practice, this does not make a
      significant difference and running the collectors in parallel would not
      prevent such a possibility.

For more information about the configuration options for the collectors, see
<<monitoring-collection-settings>>.

[float]
[[es-monitoring-stack]]
=== Collecting data from across the Elastic Stack

{monitoring} in {es} also receives monitoring data from other parts of the
Elastic Stack. In this way, it serves as an unscheduled monitoring data
collector for the stack.

By default, data collection is disabled. {es} monitoring data is not
collected and all monitoring data from other sources such as {kib}, Beats, and
Logstash is ignored. You must set `xpack.monitoring.collection.enabled` to `true`
to enable the collection of monitoring data. See <<monitoring-settings>>.
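
For example, the following console sketch enables collection cluster-wide; it
assumes a release in which this setting can be updated dynamically through the
cluster settings API:

[source,console]
----
# Enable collection of {es} monitoring data and acceptance of monitoring
# data from {kib}, Beats, and Logstash.
PUT /_cluster/settings
{
  "persistent": {
    "xpack.monitoring.collection.enabled": true
  }
}
----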

Once data is received, it is forwarded to the exporters to be routed to the
monitoring cluster, like all other monitoring data.

WARNING: Because this stack-level "collector" lives outside of the collection
interval of {monitoring} for {es}, it is not impacted by the
`xpack.monitoring.collection.interval` setting. Therefore, data is passed to the
exporters whenever it is received. This behavior can result in indices for {kib},
Logstash, or Beats being created somewhat unexpectedly.

While the monitoring data is collected and processed, some production cluster
metadata is added to incoming documents. This metadata enables {kib} to link the
monitoring data to the appropriate cluster. If this linkage is unimportant to
the infrastructure that you're monitoring, it might be simpler to configure
Logstash and Beats to report monitoring data directly to the monitoring cluster.
This scenario also prevents the production cluster from adding extra overhead
related to monitoring data, which can be very useful when there are a large
number of Logstash nodes or Beats.

For more information about typical monitoring architectures, see
{xpack-ref}/how-monitoring-works.html[How Monitoring Works].