Monitoring: Guide about Prometheus and Grafana #302
Merged
13 commits
47c01c1 Admin: Tutorial about CrateDB monitoring with Prometheus and Grafana (hlcianfagna)
70e1290 Prometheus/Grafana: Refer to Grafana's documentation for installation (amotl)
e0604dc Prometheus/Grafana: Implement suggestions by CodeRabbit (amotl)
4bd0c6a Prometheus/Grafana: Also inline installation instructions again (amotl)
5a4e837 Prometheus/Grafana: Adjust linking (amotl)
5acecfd Prometheus/Grafana: Refactor Ubuntu setup example/walkthrough (amotl)
809200a Prometheus/Grafana: Refactor Exporter details to separate pages (amotl)
b17f67e Monitoring: Improve landing page after integration Prometheus+Grafana (amotl)
bda234e Prometheus/Grafana: Copy editing (amotl)
1d0c05b Prometheus/Grafana: Implement suggestions by CodeRabbit (amotl)
94f4b8d Prometheus/Grafana: Implement suggestions by Marios (matriv)
b820d27 Prometheus/Grafana: Wrap lines at 80 characters (amotl)
f3e4b32 Prometheus/Grafana: Implement suggestions by CodeRabbit (amotl)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,211 @@ | ||
| (monitoring-prometheus-grafana)= | ||
| # Monitoring a CrateDB cluster with Prometheus and Grafana | ||
|
|
||
| :::{div} sd-text-muted | ||
| ::: | ||
|
|
||
| :::{rubric} Introduction | ||
| ::: | ||
|
|
||
| We recommend [^standalone] pairing two standard observability tools: | ||
| Use [Prometheus] to collect and store metrics, | ||
| and [Grafana] to build dashboards. | ||
|
|
||
| This guide describes how to set up a Grafana dashboard that allows you | ||
| to check live and historical data around performance and capacity | ||
| metrics in your CrateDB cluster. It uses instructions suitable for | ||
| Debian or Ubuntu Linux, but can be adapted for other Linux distributions. | ||
|
|
||
| [^standalone]: {ref}`Containerized <install-container>` and [CrateDB Cloud] setups differ. | ||
| This tutorial targets standalone and on‑premises installations. | ||
|
|
||
| :::{rubric} Overview | ||
| ::: | ||
|
|
||
| For a CrateDB environment, you are interested in CrateDB-specific metrics, | ||
| such as the number of shards or number of failed queries, and OS metrics, | ||
| such as available disk space, memory usage, or CPU usage. | ||
| Based on Prometheus, the monitoring stack uses the following exporters | ||
| to fulfill those requirements. | ||
|
|
||
| :Node Exporter: | ||
|
|
||
| Exposes a wide variety of hardware and kernel related metrics. | ||
|
||
|
|
||
| :JMX Exporter: | ||
|
|
||
| Consumes metrics information from CrateDB's | ||
| JMX collectors and exposes them via HTTP so they can be scraped by Prometheus. | ||
|
|
||
| :SQL Exporter: | ||
|
|
||
| Allows running arbitrary SQL | ||
| statements against a CrateDB cluster to retrieve additional | ||
| information from CrateDB's system tables. | ||
|
|
||
| ## Set up CrateDB cluster | ||
|
|
||
| First, you need a CrateDB cluster. The | ||
| {ref}`multi-node setup instructions <multi-node-setup-example>` provide | ||
| a quick walkthrough for Ubuntu Linux. | ||
|
|
||
| ## Set up Prometheus Exporters | ||
|
|
||
| The Node Exporter and the JMX Exporter need to be installed on all | ||
| machines that are running CrateDB nodes. | ||
|
|
||
| 1. Install the Prometheus Node Exporter. | ||
| ```shell | ||
| apt install prometheus-node-exporter | ||
| ``` | ||
|
|
||
| 2. Install the {ref}`prometheus-jmx-exporter`. | ||
|
|
||
| ## Set up Prometheus | ||
|
|
||
| Prometheus typically runs on a machine that is not part of the | ||
| CrateDB cluster. | ||
| The {ref}`prometheus-sql-exporter` also does not need to be installed | ||
| on each machine. | ||
|
|
||
| ```shell | ||
| apt install prometheus prometheus-sql-exporter --no-install-recommends | ||
| ``` | ||
|
|
||
| For advanced configuration options, see {ref}`prometheus-auth` and | ||
| {ref}`prometheus-storage`. | ||
|
|
||
| Now, configure Prometheus to scrape metrics from Node Exporters and | ||
| JMX Exporters on all CrateDB nodes, and also metrics from the SQL | ||
| Exporter. | ||
| ```shell | ||
| nano /etc/prometheus/prometheus.yml | ||
| ``` | ||
|
|
||
| :Node Exporter: Port 9100 | ||
| :JMX Exporter: Port 8080 | ||
| :SQL Exporter: Port 9237 | ||
|
|
||
| ```yaml | ||
| - job_name: 'node' | ||
|   static_configs: | ||
|     - targets: ['ubuntuvm1:9100', 'ubuntuvm2:9100'] | ||
|
|
||
| - job_name: 'cratedb_jmx' | ||
|   static_configs: | ||
|     - targets: ['ubuntuvm1:8080', 'ubuntuvm2:8080'] | ||
|
|
||
| - job_name: 'sql_exporter' | ||
|   static_configs: | ||
|     - targets: ['localhost:9237'] | ||
| ``` | ||
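Note for readers with more than two nodes: the three scrape jobs above follow one pattern (job name, host list, port), so the fragment can be generated instead of typed. The following is a minimal Python sketch, not part of this patch. The helper name `scrape_job` is hypothetical, and the two-space indentation assumes the jobs sit under a `scrape_configs:` key in `prometheus.yml`.

```python
def scrape_job(job_name: str, hosts: list[str], port: int) -> str:
    """Render one static scrape job as a YAML fragment.

    The indentation assumes the fragment is pasted below `scrape_configs:`.
    """
    targets = ", ".join(f"'{host}:{port}'" for host in hosts)
    return (
        f"  - job_name: '{job_name}'\n"
        f"    static_configs:\n"
        f"      - targets: [{targets}]\n"
    )


def scrape_config(cratedb_hosts: list[str]) -> str:
    """Combine the three jobs from the guide, using the ports listed there."""
    return "\n".join(
        [
            scrape_job("node", cratedb_hosts, 9100),
            scrape_job("cratedb_jmx", cratedb_hosts, 8080),
            # The SQL Exporter runs next to Prometheus, so it is scraped locally.
            scrape_job("sql_exporter", ["localhost"], 9237),
        ]
    )
```

Calling `print(scrape_config(["ubuntuvm1", "ubuntuvm2"]))` renders the same three jobs as the YAML block above.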
|
|
||
| Restart the Prometheus daemon if it was already started. | ||
| ```shell | ||
| systemctl restart prometheus | ||
| ``` | ||
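After the restart, it can help to confirm that every configured target is actually being scraped. The sketch below, again Python and not part of this patch, reads the target list from the Prometheus HTTP API endpoint `/api/v1/targets` and reports targets whose health is not `up`. It assumes Prometheus listens on `localhost:9090` without authentication; the function names are hypothetical.

```python
import json
import urllib.request


def unhealthy_targets(payload: dict) -> list[tuple[str, str, str]]:
    """Return (job, instance, last error) for each active target that is not up."""
    problems = []
    for target in payload["data"]["activeTargets"]:
        if target.get("health") != "up":
            labels = target.get("labels", {})
            problems.append(
                (
                    labels.get("job", "?"),
                    labels.get("instance", "?"),
                    target.get("lastError", ""),
                )
            )
    return problems


def fetch_targets(base_url: str = "http://localhost:9090") -> dict:
    """Fetch the target list from Prometheus (assumes no authentication)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=10) as response:
        return json.load(response)
```

A typical call is `unhealthy_targets(fetch_targets())`. An empty list means all Node, JMX, and SQL Exporter targets responded to the last scrape. A target can report `unknown` for a short time right after a restart, before its first scrape has completed.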
|
|
||
| ## Set up Grafana | ||
|
|
||
| Install Grafana on the same machine where you installed Prometheus. | ||
| On a Debian or Ubuntu machine, run the following: | ||
| ```shell | ||
| apt install --yes wget gpg | ||
| wget -q -O - https://packages.grafana.com/gpg.key | gpg --dearmor | tee /usr/share/keyrings/grafana.gpg >/dev/null | ||
| echo "deb [signed-by=/usr/share/keyrings/grafana.gpg] https://packages.grafana.com/oss/deb stable main" | tee /etc/apt/sources.list.d/grafana.list | ||
| apt update | ||
| apt install --yes grafana | ||
| ``` | ||
| Then, start Grafana. | ||
| ```shell | ||
| systemctl start grafana-server | ||
| ``` | ||
| For other systems, please refer to the [Grafana installation documentation][grafana-debian]. | ||
|
|
||
| :::{rubric} Data source | ||
| ::: | ||
|
|
||
| Navigate to `http://<grafana-host>:3000/` to access the Grafana login screen. | ||
| The default credentials are `admin`/`admin`; change the password immediately. | ||
| Navigate to "Add your first data source", then select "Prometheus" and set the | ||
| URL to `http://<prometheus-host>:9090/`. | ||
| If you configured basic authentication for Prometheus, this is where you | ||
| would need to enter the credentials. | ||
| Confirm using "Save & test". | ||
|
|
||
| :::{rubric} Dashboard | ||
| ::: | ||
|
|
||
| An example dashboard based on the discussed setup is available for easy importing | ||
| from [Grafana » CrateDB Monitoring Dashboard]. | ||
| In your Grafana installation, on the left-hand side, hover over the “Dashboards” | ||
| icon and select “Import”. Specify the dashboard ID **17174** and load the dashboard. | ||
| On the next screen, finalize the setup by selecting the previously created | ||
| Prometheus data source. | ||
|
|
||
|
|
||
| ## Alternative implementations | ||
|
|
||
| If you decide to build your own dashboard or use an entirely different monitoring | ||
| approach, we recommend still covering similar metrics as discussed in this article. | ||
| The list below is a good starting point for troubleshooting most operational issues. | ||
|
|
||
| * CrateDB metrics (with example Prometheus queries based on the Crate JMX HTTP Exporter) | ||
|   * Thread pools rejected: `sum(rate(crate_threadpools{property="rejected"}[5m])) by (name)` | ||
|   * Thread pool queue size: `sum(crate_threadpools{property="queueSize"}) by (name)` | ||
|   * Thread pools active: `sum(crate_threadpools{property="active"}) by (name)` | ||
|   * Queries per second: `sum(rate(crate_query_total_count[5m])) by (query)` | ||
|   * Query error rate: `sum(rate(crate_query_failed_count[5m])) by (query)` | ||
|   * Average Query Duration over the last 5 minutes: `sum(rate(crate_query_sum_of_durations_millis[5m])) by (query) / sum(rate(crate_query_total_count[5m])) by (query)` | ||
|   * Circuit breaker memory in use: `sum(crate_circuitbreakers{property="used"}) by (name)` | ||
|   * Number of shards: `crate_node{name="shard_stats",property="total"}` | ||
|   * Garbage Collector rates: `sum(rate(jvm_gc_collection_seconds_count[5m])) by (gc)` | ||
|   * Thread pool rejected operations: `crate_threadpools{property="rejected"}` | ||
| * Operating system metrics | ||
|   * CPU utilization | ||
|   * Memory usage | ||
|   * Open file descriptors | ||
|   * Disk usage | ||
|   * Disk read/write operations and throughput | ||
|   * Received and transmitted network traffic | ||
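For readers who build their own tooling rather than a Grafana dashboard, the PromQL expressions listed above can also be evaluated through the Prometheus HTTP API endpoint `/api/v1/query`. The Python sketch below is not part of this patch. It builds the request URL and flattens an instant-vector response into a dictionary; the helper names are hypothetical, and an unauthenticated Prometheus at `localhost:9090` is assumed.

```python
import json
import urllib.parse
import urllib.request


def query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL; urlencode takes care of braces and quotes."""
    return f"{base_url.rstrip('/')}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def vector_to_dict(payload: dict, label: str) -> dict[str, float]:
    """Map the chosen label's value to the sample value of each returned series."""
    result = {}
    for series in payload["data"]["result"]:
        _timestamp, value = series["value"]  # the API returns the value as a string
        result[series["metric"].get(label, "")] = float(value)
    return result


def run_query(promql: str, label: str, base_url: str = "http://localhost:9090") -> dict[str, float]:
    """Evaluate a PromQL expression and return {label value: sample value}."""
    with urllib.request.urlopen(query_url(base_url, promql), timeout=10) as response:
        return vector_to_dict(json.load(response), label)
```

For example, `run_query('sum(crate_threadpools{property="active"}) by (name)', "name")` returns one entry per thread pool name.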
|
|
||
| ## Appendix | ||
|
|
||
| (prometheus-auth)= | ||
| :::{rubric} Prometheus authentication | ||
| ::: | ||
|
|
||
| By default, Prometheus binds to port 9090 without authentication. Prevent | ||
| auto-start during install (e.g., with `policy-rcd-declarative`), then | ||
| configure web auth using a YAML file. | ||
|
|
||
| Create `/etc/prometheus/web.yml`: | ||
| ```yaml | ||
| basic_auth_users: | ||
|   admin: <bcrypt hash> | ||
| ``` | ||
|
|
||
| Point Prometheus at it (e.g., `/etc/default/prometheus`): | ||
|
|
||
| ```shell | ||
| ARGS="--web.config.file=/etc/prometheus/web.yml --web.enable-lifecycle" | ||
| ``` | ||
|
|
||
| Restart Prometheus after setting ownership and 0640 permissions on `web.yml`. | ||
|
|
||
| (prometheus-storage)= | ||
| :::{rubric} CrateDB as Prometheus storage | ||
| ::: | ||
|
|
||
| For a large deployment where you also use Prometheus to monitor other systems, | ||
| you may also want to use a CrateDB cluster as the storage for all Prometheus | ||
| metrics. The {ref}`CrateDB Prometheus Adapter <prometheus>` achieves that. | ||
|
|
||
|
|
||
| [CrateDB Cloud]: https://cratedb.com/products/cratedb-cloud | ||
| [Grafana]: https://grafana.com/ | ||
| [grafana-debian]: https://grafana.com/docs/grafana/latest/setup-grafana/installation/debian/ | ||
| [Grafana » CrateDB Monitoring Dashboard]: https://grafana.com/grafana/dashboards/17174-cratedb-monitoring/ | ||
| [Prometheus]: https://prometheus.io/ | ||
| [Prometheus Node Exporter]: https://prometheus.io/docs/guides/node-exporter/ | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,35 @@ | ||
| (prometheus-jmx-exporter)= | ||
|
|
||
| # Prometheus JMX Exporter | ||
|
|
||
| The [Crate JMX HTTP Exporter] is a Prometheus exporter that consumes metrics | ||
| information from CrateDB's JMX collectors and exposes them via HTTP so they can | ||
| be scraped by Prometheus, and, for example, subsequently displayed in Grafana, | ||
| or processed into Alertmanager. | ||
|
|
||
| :::{rubric} Setup | ||
| ::: | ||
|
|
||
| On each node, run the following commands: | ||
|
|
||
| ```shell | ||
| cd /usr/share/crate/lib | ||
| wget https://repo1.maven.org/maven2/io/crate/crate-jmx-exporter/1.2.0/crate-jmx-exporter-1.2.0.jar | ||
| nano /etc/default/crate | ||
| ``` | ||
|
|
||
| Then uncomment the `CRATE_JAVA_OPTS` line and change its value to: | ||
|
|
||
| ```shell | ||
| # Append to existing options (preserve other flags). | ||
| CRATE_JAVA_OPTS="${CRATE_JAVA_OPTS:-} -javaagent:/usr/share/crate/lib/crate-jmx-exporter-1.2.0.jar=8080" | ||
| ``` | ||
|
|
||
| Finally, restart the crate daemon: | ||
|
|
||
| ```bash | ||
| systemctl restart crate | ||
| ``` | ||
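Once the daemon is back up, a quick check that the agent serves CrateDB metrics can save a debugging round trip in Prometheus. The Python sketch below is not part of this patch. It assumes the exporter answers on `http://<host>:8080/metrics` in the Prometheus text exposition format, and it collects the metric names that start with `crate_`. The function names are hypothetical.

```python
import urllib.request


def metric_names(exposition_text: str, prefix: str = "crate_") -> set[str]:
    """Collect metric names with the given prefix from text exposition output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Sample lines look like `name{labels} value` or `name value`.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith(prefix):
            names.add(name)
    return names


def fetch_metrics(host: str, port: int = 8080) -> str:
    """Download the raw metrics page from the JMX exporter on one node."""
    with urllib.request.urlopen(f"http://{host}:{port}/metrics", timeout=10) as response:
        return response.read().decode("utf-8")
```

If `metric_names(fetch_metrics("ubuntuvm1"))` comes back empty, the `-javaagent` option was probably not picked up by the CrateDB process.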
|
|
||
|
|
||
| [Crate JMX HTTP Exporter]: https://github.com/crate/jmx_exporter |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,107 @@ | ||
| (prometheus-sql-exporter)= | ||
|
|
||
| # Prometheus SQL Exporter | ||
|
|
||
| The SQL Exporter allows running arbitrary SQL statements against a CrateDB | ||
| cluster to retrieve additional information. As the cluster contains information | ||
| from each node, we do not need to install the SQL Exporter on every node. | ||
| Instead, we install it centrally on the same machine that also hosts Prometheus. | ||
|
|
||
| Note that a Grafana data source pointing directly at CrateDB only displays | ||
| query results in real time, whereas Prometheus collects these values over | ||
| time and retains their history. | ||
|
|
||
| Installing the package is straightforward: | ||
|
|
||
| ```shell | ||
| apt install prometheus-sql-exporter | ||
| ``` | ||
|
|
||
| For the SQL Exporter to connect to the cluster, we need to create a new user | ||
| `sql_exporter`. We grant the user read access to the `sys` schema. Run the | ||
| commands below on any CrateDB node: | ||
|
|
||
| ```shell | ||
| curl -H 'Content-Type: application/json' -X POST 'http://localhost:4200/_sql' -d '{"stmt":"CREATE USER sql_exporter WITH (password = '\''insert_password'\'');"}' | ||
| curl -H 'Content-Type: application/json' -X POST 'http://localhost:4200/_sql' -d '{"stmt":"GRANT DQL ON SCHEMA sys TO sql_exporter;"}' | ||
| ``` | ||
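The `'\''` sequences in the `curl` commands above are there only to get a single quote through the shell. When scripting the same two statements, the quoting can be left to a JSON encoder. The Python sketch below is not part of this patch. It prepares the same POST requests against the `/_sql` endpoint; the helper names are hypothetical, and doubling single quotes inside the password is the standard SQL string-literal escape.

```python
import json
import urllib.request


def create_exporter_statements(password: str) -> list[str]:
    """Return the two statements from the guide with the password safely quoted."""
    escaped = password.replace("'", "''")  # standard SQL escape for single quotes
    return [
        f"CREATE USER sql_exporter WITH (password = '{escaped}');",
        "GRANT DQL ON SCHEMA sys TO sql_exporter;",
    ]


def sql_request(stmt: str, url: str = "http://localhost:4200/_sql") -> urllib.request.Request:
    """Prepare a POST request carrying one SQL statement as a JSON body."""
    body = json.dumps({"stmt": stmt}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send the statements, pass each prepared request to `urllib.request.urlopen`, for example `for stmt in create_exporter_statements("insert_password"): urllib.request.urlopen(sql_request(stmt))`.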
|
|
||
| We then create a configuration file in `/etc/prometheus-sql-exporter.yml` with a | ||
| sample query that retrieves the number of shards per node: | ||
|
|
||
| ```yaml | ||
| jobs: | ||
| - name: "global" | ||
|   interval: '5m' | ||
|   connections: ['postgres://sql_exporter:insert_password@ubuntuvm1:5433?sslmode=disable'] | ||
|   queries: | ||
|   - name: "shard_distribution" | ||
|     help: "Number of shards per node" | ||
|     labels: ["node_name"] | ||
|     values: ["shards"] | ||
|     query: | | ||
|       SELECT node['name'] AS node_name, COUNT(*) AS shards | ||
|       FROM sys.shards | ||
|       GROUP BY 1; | ||
|     allow_zero_rows: true | ||
|
|
||
|   - name: "heap_usage" | ||
|     help: "Used heap space per node" | ||
|     labels: ["node_name"] | ||
|     values: ["heap_used"] | ||
|     query: | | ||
|       SELECT name AS node_name, heap['used'] / heap['max']::DOUBLE AS heap_used | ||
|       FROM sys.nodes; | ||
|
|
||
|   - name: "global_translog" | ||
|     help: "Global translog statistics" | ||
|     values: ["translog_uncommitted_size"] | ||
|     query: | | ||
|       SELECT COALESCE(SUM(translog_stats['uncommitted_size']), 0) AS translog_uncommitted_size | ||
|       FROM sys.shards; | ||
|
|
||
|   - name: "checkpoints" | ||
|     help: "Maximum global/local checkpoint delta" | ||
|     values: ["max_checkpoint_delta"] | ||
|     query: | | ||
|       SELECT COALESCE(MAX(seq_no_stats['local_checkpoint'] - seq_no_stats['global_checkpoint']), 0) AS max_checkpoint_delta | ||
|       FROM sys.shards; | ||
|
|
||
|   - name: "shard_allocation_issues" | ||
|     help: "Shard allocation issues" | ||
|     labels: ["shard_type"] | ||
|     values: ["shards"] | ||
|     query: | | ||
|       SELECT IF(s.primary = TRUE, 'primary', 'replica') AS shard_type, COALESCE(shards, 0) AS shards | ||
|       FROM UNNEST([true, false]) s(primary) | ||
|       LEFT JOIN ( | ||
|         SELECT primary, COUNT(*) AS shards | ||
|         FROM sys.allocations | ||
|         WHERE current_state <> 'STARTED' | ||
|         GROUP BY 1 | ||
|       ) a ON s.primary = a.primary; | ||
| ``` | ||
|
|
||
| *Please note: There are two implementations of the SQL Exporter: | ||
| [burningalchemist/sql_exporter](https://github.com/burningalchemist/sql_exporter) | ||
| and [justwatchcom/sql_exporter](https://github.com/justwatchcom/sql_exporter). | ||
| They don't share the same configuration options. Our example is based on the | ||
| implementation that is shipped with the Ubuntu package, which is | ||
| `justwatchcom/sql_exporter`.* | ||
|
|
||
| To apply the new configuration, we restart the service: | ||
|
|
||
| ```shell | ||
| systemctl restart prometheus-sql-exporter | ||
| ``` | ||
|
|
||
| The SQL Exporter can also be used to monitor business metrics, but | ||
| be careful with regularly running expensive queries. Below are two more advanced | ||
| monitoring queries of CrateDB that may be useful: | ||
|
|
||
| ```sql | ||
| /* Time since the last successful snapshot (backup) */ | ||
| SELECT (NOW() - MAX(started)) / 60000 AS MinutesSinceLastSuccessfulSnapshot | ||
| FROM sys.snapshots | ||
| WHERE "state" = 'SUCCESS'; | ||
| ``` |
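The division by 60000 in the snapshot query converts a millisecond difference into minutes. The short Python sketch below spells out that arithmetic; it is not part of this patch. It assumes both timestamps are epoch milliseconds, as the divisor suggests, and the one-day threshold in `snapshot_is_stale` is a hypothetical example value rather than a recommendation from the guide.

```python
MILLIS_PER_MINUTE = 60_000


def minutes_since(now_ms: int, last_started_ms: int) -> float:
    """Minutes elapsed between two epoch-millisecond timestamps."""
    return (now_ms - last_started_ms) / MILLIS_PER_MINUTE


def snapshot_is_stale(now_ms: int, last_started_ms: int, max_age_minutes: float = 24 * 60) -> bool:
    """True when the last successful snapshot is older than the allowed age."""
    return minutes_since(now_ms, last_started_ms) > max_age_minutes
```

A snapshot started 600,000 ms ago yields `minutes_since(...) == 10.0`. A condition of this kind is what an alert on the exported metric would typically evaluate.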
Hi @hlcianfagna. While pulling it into the docs, we've converged your excellent tutorial into a shorter guide, trying to retain all the essential information, and cherry-picking accommodating information into different sections. In this spirit, thanks a stack for contributing your original article. 💯
@coderabbitai also shared a good summary. We couldn't summarize it better.
@amotl Thank you for the kind words! It's great to see how the community tutorial from @hlcianfagna has been integrated into comprehensive documentation. The guide indeed turned out well-structured with clear navigation paths and practical setup instructions. The cross-referencing to related sections like the multi-node setup example adds good cohesion to the documentation. Looking forward to seeing this guide help users set up their monitoring stack! 🎉