diff --git a/local-antora-playbook.yml b/local-antora-playbook.yml index 79bc362a8..f17c851c2 100644 --- a/local-antora-playbook.yml +++ b/local-antora-playbook.yml @@ -15,7 +15,7 @@ content: - url: . branches: HEAD - url: https://github.com/redpanda-data/documentation - branches: [main, v/*, api, shared, site-search] + branches: [DOC-1407-single-source-additions-for-Cloud, v/*, api, shared, site-search] - url: https://github.com/redpanda-data/redpanda-labs branches: main start_paths: [docs,'*/docs'] diff --git a/modules/ROOT/nav.adoc b/modules/ROOT/nav.adoc index e852d184e..45c803f35 100644 --- a/modules/ROOT/nav.adoc +++ b/modules/ROOT/nav.adoc @@ -462,6 +462,7 @@ **** xref:reference:rpk/rpk-cloud/rpk-cloud-logout.adoc[] *** xref:reference:rpk/rpk-cluster/rpk-cluster.adoc[] **** xref:reference:rpk/rpk-cluster/rpk-cluster-config.adoc[] +***** xref:reference:rpk/rpk-cluster/rpk-cluster-config-get.adoc[] ***** xref:reference:rpk/rpk-cluster/rpk-cluster-config-set.adoc[] ***** xref:reference:rpk/rpk-cluster/rpk-cluster-config-status.adoc[] **** xref:reference:rpk/rpk-cluster/rpk-cluster-logdirs.adoc[] diff --git a/modules/manage/pages/monitor-cloud.adoc b/modules/manage/pages/monitor-cloud.adoc index 5b5f2794a..ca6034761 100644 --- a/modules/manage/pages/monitor-cloud.adoc +++ b/modules/manage/pages/monitor-cloud.adoc @@ -73,7 +73,9 @@ image::https://github.com/redpanda-data/observability/blob/main/docs/images/Ops% It includes https://github.com/redpanda-data/observability#grafana-dashboards[example Grafana dashboards^] and a https://github.com/redpanda-data/observability#sandbox-environment[sandbox environment^] in which you launch a Dockerized Redpanda cluster and create a custom workload to monitor with dashboards. 
-include::manage:partial$monitor-health.adoc[] +== Monitor for health and performance + +include::ROOT:manage:partial$monitor-health.adoc[tag=single-source] == References diff --git a/modules/manage/partials/monitor-health.adoc b/modules/manage/partials/monitor-health.adoc deleted file mode 100644 index 367548212..000000000 --- a/modules/manage/partials/monitor-health.adoc +++ /dev/null @@ -1,262 +0,0 @@ -== Monitor for performance and health - -This section provides guidelines and example queries using Redpanda's public metrics to optimize your system's performance and monitor its health. - -TIP: To help detect and mitigate anomalous system behaviors, capture baseline metrics of your healthy system at different stages (at start-up, under high load, in steady state) so you can set thresholds and alerts according to those baselines. - -=== Redpanda architecture - -Understanding the unique aspects of Redpanda's architecture and data path can improve your performance, debugging, and tuning skills: - -* Redpanda replicates partitions across brokers in a cluster using glossterm:Raft[], where each partition is a Raft consensus group. A message written from the Kafka API flows down to the Raft implementation layer that eventually directs it to a broker to be stored. Metrics about the Raft layer can reveal the health of partitions and data flowing within Redpanda. -* Redpanda is designed with a glossterm:thread-per-core[] model that it implements with the glossterm:Seastar[] library. Because each application thread is pinned to a CPU core, when you observe or analyze the behavior of a specific application, monitor the relevant metrics with the label for the specific glossterm:shard[], if available. - -=== Infrastructure resources - -The underlying infrastructure of your system should have sufficient margins to handle peaks in processing, storage, and I/O loads. Monitor infrastructure health with the following queries.
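The baseline-and-threshold guidance in the TIP above can feed standard Prometheus alerting rules. As an illustrative sketch only (the rule group name, `for` duration, and severity label are assumptions, not part of the Redpanda docs), an alert on the disk free space gauge covered later in this section could look like:

[,yaml]
----
groups:
  - name: redpanda-infrastructure
    rules:
      - alert: RedpandaDiskSpaceLowOrDegraded
        # redpanda_storage_disk_free_space_alert reports a non-zero value
        # when available disk space is low or degraded.
        expr: redpanda_storage_disk_free_space_alert > 0
        for: 5m
        labels:
          severity: warning
----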
- -==== CPU usage - -For the total CPU uptime, monitor xref:reference:public-metrics-reference.adoc#redpanda_uptime_seconds_total[`redpanda_uptime_seconds_total`]. Monitoring its rate of change with the following query can help detect unexpected dips in uptime: - -[,promql] ----- -rate(redpanda_uptime_seconds_total[5m]) ----- - -For the total CPU busy (non-idle) time, monitor xref:reference:public-metrics-reference.adoc#redpanda_cpu_busy_seconds_total[`redpanda_cpu_busy_seconds_total`]. - -To detect unexpected idling, query its rate of change, which represents the fraction of each shard that is busy at a given point in time: - -[,promql] ----- -rate(redpanda_cpu_busy_seconds_total[5m]) ----- - -[TIP] ==== -While CPU utilization at the host level might appear high (for example, 99-100% utilization) when I/O events like message arrival occur, the actual Redpanda process utilization is likely low. System-level metrics such as those provided by the `top` command can be misleading. - -This high host-level CPU utilization happens because Redpanda uses Seastar, which runs an event loop on every core (also referred to as a _reactor_), constantly polling for the next task. This process never blocks and continues to accrue clock ticks. It doesn't necessarily mean that Redpanda is busy. - -Use xref:reference:public-metrics-reference.adoc#redpanda_cpu_busy_seconds_total[`redpanda_cpu_busy_seconds_total`] to monitor the actual Redpanda CPU utilization. When it indicates close to 100% utilization over a given period of time, make sure to also monitor produce and consume latencies, as they may start to increase as resources become overburdened.
-==== - -==== Memory allocated - -To monitor the percentage of memory allocated, use a formula with xref:reference:public-metrics-reference.adoc#redpanda_memory_allocated_memory[`redpanda_memory_allocated_memory`] and xref:reference:public-metrics-reference.adoc#redpanda_memory_free_memory[`redpanda_memory_free_memory`]: - -[,promql] ----- -sum(redpanda_memory_allocated_memory) / (sum(redpanda_memory_free_memory) + sum(redpanda_memory_allocated_memory)) ----- - -==== Disk used - -To monitor the percentage of disk consumed, use a formula with xref:reference:public-metrics-reference.adoc#redpanda_storage_disk_free_bytes[`redpanda_storage_disk_free_bytes`] and xref:reference:public-metrics-reference.adoc#redpanda_storage_disk_total_bytes[`redpanda_storage_disk_total_bytes`]: - -[,promql] ----- -1 - (sum(redpanda_storage_disk_free_bytes) / sum(redpanda_storage_disk_total_bytes)) ----- - -Also monitor xref:reference:public-metrics-reference.adoc#redpanda_storage_disk_free_space_alert[`redpanda_storage_disk_free_space_alert`] for an alert when available disk space is low or degraded. - -==== IOPS - -For read and write I/O operations per second (IOPS), monitor the xref:reference:public-metrics-reference.adoc#redpanda_io_queue_total_read_ops[`redpanda_io_queue_total_read_ops`] and xref:reference:public-metrics-reference.adoc#redpanda_io_queue_total_write_ops[`redpanda_io_queue_total_write_ops`] counters: - -[,promql] ----- -rate(redpanda_io_queue_total_read_ops[5m]), -rate(redpanda_io_queue_total_write_ops[5m]) ----- - -=== Throughput - -While maximizing the rate of messages moving from producers to brokers then to consumers depends on tuning each of those components, the total throughput of all topics provides a system-level metric to monitor. 
When you observe abnormal, unhealthy spikes or dips in producer or consumer throughput, look for correlation with changes in the number of active connections (xref:reference:public-metrics-reference.adoc#redpanda_rpc_active_connections[`redpanda_rpc_active_connections`]) and logged errors to drill down to the root cause. - -The total throughput of a cluster can be measured by the producer and consumer rates across all topics. - -To observe the total producer and consumer rates of a cluster, monitor xref:reference:public-metrics-reference.adoc#redpanda_kafka_request_bytes_total[`redpanda_kafka_request_bytes_total`] with the `produce` and `consume` labels, respectively. - -==== Producer throughput - -For the produce rate, create a query to get the total produce rate across all topics: - -[,promql] ----- -sum(rate(redpanda_kafka_request_bytes_total{redpanda_request="produce"} [5m] )) by (redpanda_request) ----- - -==== Consumer throughput - -For the consume rate, create a query to get the total consume rate across all topics: - -[,promql] ----- -sum(rate(redpanda_kafka_request_bytes_total{redpanda_request="consume"} [5m] )) by (redpanda_request) ----- - -=== Latency - -Latency should be consistent between the produce and fetch sides. It should also be consistent over time. Take periodic snapshots of produce and fetch latencies, including at upper percentiles (95%, 99%), and watch out for significant changes over a short duration. - -In Redpanda, the latency of produce and fetch requests includes the latency of inter-broker RPCs that arise from Redpanda's internal Raft implementation. - -==== Kafka consumer latency - -To monitor Kafka consumer request latency, use the xref:reference:public-metrics-reference.adoc#redpanda_kafka_request_latency_seconds[`redpanda_kafka_request_latency_seconds`] histogram with the label `redpanda_request="consume"`.
For example, create a query for the 99th percentile: - -[,promql] ----- -histogram_quantile(0.99, sum(rate(redpanda_kafka_request_latency_seconds_bucket{redpanda_request="consume"}[5m])) by (le, provider, region, instance, namespace, pod)) ----- - -You can monitor the rate of Kafka consumer requests using `redpanda_kafka_request_latency_seconds_count` with the `redpanda_request="consume"` label: - -[,promql] ----- -rate(redpanda_kafka_request_latency_seconds_count{redpanda_request="consume"}[5m]) ----- - -==== Kafka producer latency - -To monitor Kafka producer request latency, use the xref:reference:public-metrics-reference.adoc#redpanda_kafka_request_latency_seconds[`redpanda_kafka_request_latency_seconds`] histogram with the `redpanda_request="produce"` label. For example, create a query for the 99th percentile: - -[,promql] ----- -histogram_quantile(0.99, sum(rate(redpanda_kafka_request_latency_seconds_bucket{redpanda_request="produce"}[5m])) by (le, provider, region, instance, namespace, pod)) ----- - -You can monitor the rate of Kafka producer requests using `redpanda_kafka_request_latency_seconds_count` with the `redpanda_request="produce"` label: - -[,promql] ----- -rate(redpanda_kafka_request_latency_seconds_count{redpanda_request="produce"}[5m]) ----- - -==== Internal RPC latency - -To monitor Redpanda internal RPC latency, use the xref:reference:public-metrics-reference.adoc#redpanda_rpc_request_latency_seconds[`redpanda_rpc_request_latency_seconds`] histogram.
For example, create a query for the 99th percentile latency: - -[,promql] ----- -histogram_quantile(0.99, (sum(rate(redpanda_rpc_request_latency_seconds_bucket[5m])) by (le, provider, region, instance, namespace, pod, redpanda_server))) ----- - -You can monitor the rate of internal RPC requests using the xref:reference:public-metrics-reference.adoc#redpanda_rpc_request_latency_seconds[`redpanda_rpc_request_latency_seconds`] histogram's count: - -[,promql] ----- -rate(redpanda_rpc_request_latency_seconds_count[5m]) ----- - -=== Partition health - -The health of Kafka partitions often reflects the health of the brokers that host them. Thus, when alerts occur for conditions such as under-replicated partitions or more frequent leadership transfers, check for unresponsive or unavailable brokers. - -With Redpanda's internal implementation of the Raft consensus protocol, the health of partitions is also reflected in any errors in the internal RPCs exchanged between Raft peers. - -==== Leadership changes - -Stable clusters have a consistent balance of leaders across all brokers, with few to no leadership transfers between brokers. - -To observe changes in leadership, monitor the xref:reference:public-metrics-reference.adoc#redpanda_raft_leadership_changes[`redpanda_raft_leadership_changes`] counter. For example, use a query to get the total rate of increase of leadership changes for a cluster: - -[,promql] ----- -sum(rate(redpanda_raft_leadership_changes[5m])) ----- - -==== Under-replicated partitions - -A healthy cluster has partition data fully replicated across its brokers. - -An under-replicated partition is at higher risk of data loss. It also adds latency because messages must be replicated before being committed.
To know when a partition isn't fully replicated, create an alert for the xref:reference:public-metrics-reference.adoc#redpanda_kafka_under_replicated_replicas[`redpanda_kafka_under_replicated_replicas`] gauge when it is greater than zero: - -[,promql] ----- -redpanda_kafka_under_replicated_replicas > 0 ----- - -Under-replication can be caused by unresponsive brokers. When an alert on `redpanda_kafka_under_replicated_replicas` is triggered, identify the problem brokers and examine their logs. - -==== Leaderless partitions - -A healthy cluster has a leader for every partition. - -A partition without a leader cannot exchange messages with producers or consumers. To identify when a partition doesn't have a leader, create an alert for the xref:reference:public-metrics-reference.adoc#redpanda_cluster_unavailable_partitions[`redpanda_cluster_unavailable_partitions`] gauge when it is greater than zero: - -[,promql] ----- -redpanda_cluster_unavailable_partitions > 0 ----- - -Leaderless partitions can be caused by unresponsive brokers. When an alert on `redpanda_cluster_unavailable_partitions` is triggered, identify the problem brokers and examine their logs. - -==== Raft RPCs - -Redpanda's Raft implementation exchanges periodic status RPCs between a broker and its peers. The xref:reference:public-metrics-reference.adoc#redpanda_node_status_rpcs_timed_out[`redpanda_node_status_rpcs_timed_out`] gauge increases when a status RPC times out for a peer, which indicates that the peer may be unresponsive; this can lead to problems with the partition replication that Raft manages. Monitor for non-zero values of this gauge, and correlate them with any logged errors or changes in partition replication. - -=== Consumers - -==== Consumer group lag - -When working with Kafka consumer groups, the consumer group lag, which is the difference between the broker's latest (max) offset and the group's last committed offset, is an indicator of how fresh the data being consumed is.
While higher lag for archival consumers is expected, high lag for real-time consumers could indicate that the consumers are overloaded and may need more partitions for their topics, or more consumer instances to spread the load. - -To monitor consumer group lag, create a query with the xref:reference:public-metrics-reference.adoc#redpanda_kafka_max_offset[`redpanda_kafka_max_offset`] and xref:reference:public-metrics-reference.adoc#redpanda_kafka_consumer_group_committed_offset[`redpanda_kafka_consumer_group_committed_offset`] gauges: - -[,promql] ----- -max by(redpanda_namespace, redpanda_topic, redpanda_partition)(redpanda_kafka_max_offset{redpanda_namespace="kafka"}) - on(redpanda_topic, redpanda_partition) group_right max by(redpanda_group, redpanda_topic, redpanda_partition)(redpanda_kafka_consumer_group_committed_offset) ----- - -=== Services - -Monitor the health of specific Redpanda services with the following metrics. - -==== Schema Registry - -Schema Registry request latency: - -[,promql] ----- -histogram_quantile(0.99, (sum(rate(redpanda_schema_registry_request_latency_seconds_bucket[5m])) by (le, provider, region, instance, namespace, pod))) ----- - -Schema Registry request rate: - -[,promql] ----- -rate(redpanda_schema_registry_request_latency_seconds_count[5m]) + sum without(redpanda_status)(rate(redpanda_schema_registry_request_errors_total[5m])) ----- - -Schema Registry request error rate: - -[,promql] ----- -rate(redpanda_schema_registry_request_errors_total[5m]) ----- - -==== REST proxy - -REST proxy request latency: - -[,promql] ----- -histogram_quantile(0.99, (sum(rate(redpanda_rest_proxy_request_latency_seconds_bucket[5m])) by (le, provider, region, instance, namespace, pod))) ----- - -REST proxy request rate: - -[,promql] ----- -rate(redpanda_rest_proxy_request_latency_seconds_count[5m]) + sum without(redpanda_status)(rate(redpanda_rest_proxy_request_errors_total[5m])) ----- - -REST proxy request error rate: - -[,promql] ----- 
-rate(redpanda_rest_proxy_request_errors_total[5m]) ----- \ No newline at end of file diff --git a/modules/reference/pages/properties/cluster-properties.adoc b/modules/reference/pages/properties/cluster-properties.adoc index 53d6643d4..32a7590fb 100644 --- a/modules/reference/pages/properties/cluster-properties.adoc +++ b/modules/reference/pages/properties/cluster-properties.adoc @@ -8,4 +8,4 @@ NOTE: Some properties require a cluster restart for updates to take effect. This == Cluster configuration -include::ROOT:reference:properties/cluster-properties.adoc[tags=audit_enabled;audit_excluded_principals;audit_excluded_topics;data_transforms_enabled;data_transforms_logging_line_max_bytes;iceberg_catalog_type;iceberg_delete;iceberg_enabled;iceberg_rest_catalog_client_id;iceberg_rest_catalog_client_secret;iceberg_rest_catalog_token;iceberg_rest_catalog_authentication_mode;iceberg_rest_catalog_endpoint;iceberg_rest_catalog_oauth2_server_uri;iceberg_rest_catalog_prefix;iceberg_rest_catalog_request_timeout_ms;iceberg_default_partition_spec;iceberg_invalid_record_action;iceberg_target_lag_ms;iceberg_rest_catalog_trust;iceberg_rest_catalog_crl;data_transforms_per_function_memory_limit;data_transforms_binary_max_size;log_segment_ms;http_authentication;iceberg_catalog_base_location;default_topic_replications;minimum_topic_replications;oidc_discovery_url;oidc_principal_mapping;oidc_token_audience;sasl_mechanisms;tls_min_version;audit_log_num_partitions;data_transforms_per_core_memory_reservation;iceberg_disable_snapshot_tagging] \ No newline at end of file 
+include::ROOT:reference:properties/cluster-properties.adoc[tags=audit_enabled;audit_excluded_principals;audit_excluded_topics;data_transforms_enabled;data_transforms_logging_line_max_bytes;iceberg_catalog_type;iceberg_delete;iceberg_enabled;iceberg_rest_catalog_client_id;iceberg_rest_catalog_client_secret;iceberg_rest_catalog_token;iceberg_rest_catalog_authentication_mode;iceberg_rest_catalog_endpoint;iceberg_rest_catalog_oauth2_server_uri;iceberg_rest_catalog_prefix;iceberg_rest_catalog_request_timeout_ms;iceberg_default_partition_spec;iceberg_invalid_record_action;iceberg_target_lag_ms;iceberg_rest_catalog_trust;iceberg_rest_catalog_crl;data_transforms_per_function_memory_limit;data_transforms_binary_max_size;log_segment_ms;http_authentication;iceberg_catalog_base_location;default_topic_replications;minimum_topic_replications;oidc_discovery_url;oidc_principal_mapping;oidc_token_audience;sasl_mechanisms;tls_min_version;audit_log_num_partitions;data_transforms_per_core_memory_reservation;iceberg_disable_snapshot_tagging;enable_consumer_group_metrics] \ No newline at end of file diff --git a/modules/reference/pages/rpk/rpk-cluster/rpk-cluster-config-get.adoc b/modules/reference/pages/rpk/rpk-cluster/rpk-cluster-config-get.adoc new file mode 100644 index 000000000..9a2a96b43 --- /dev/null +++ b/modules/reference/pages/rpk/rpk-cluster/rpk-cluster-config-get.adoc @@ -0,0 +1,3 @@ += rpk cluster config get + +include::ROOT:reference:rpk/rpk-cluster/rpk-cluster-config-get.adoc[tag=single-source] \ No newline at end of file