Closed
33 commits
4022343
updated scylladb dashboard json based on changes to the exporter
algchoo Jul 14, 2023
7a6652f
updated min version in documentation and metadata
algchoo Jul 14, 2023
6282f85
changed based on logs and notification changes
google-nalin Jul 19, 2023
1487dfc
Changing display names to be more consistent with project level templ…
google-nalin Jul 24, 2023
0f4880a
Removing serviceName filter from project level alerts
google-nalin Jul 27, 2023
0c59223
Updated Velero Prometheus documentation (#586)
algchoo Jul 14, 2023
66647b9
Updated Apache HTTP documentation (#580)
algchoo Jul 14, 2023
b970604
removed note about mongodb 6.0.4, updated deployment example and addi…
algchoo Jul 20, 2023
fb1c24a
Fixed typo in the Under Replicated Partition chart title (#593)
algchoo Jul 24, 2023
74423eb
Replaced protoPayload filtering with labels because labels are indexed
google-nalin Jul 27, 2023
e9e0913
Spacing errors
google-nalin Jul 27, 2023
8d3a922
Added label filters to labelExtractors
google-nalin Jul 27, 2023
f2e719d
add playbook links to documentation section for playbook alert (#594)
stevezease Jul 24, 2023
cb71674
Updated Apache Airflow Prometheus dashboard (#595)
algchoo Jul 26, 2023
e29ecdb
add chronicle alert policy templates
shourabhpayal Aug 1, 2023
ef8a525
Finalizing string changes
google-nalin Aug 1, 2023
4c805be
Updated HAProxy Prometheus documentation (#587)
algchoo Aug 2, 2023
a8a20ed
Compose per integration metadata.yaml and ops_agent_metadata.yaml fil…
stackdriver-instrumentation-release Aug 3, 2023
7025332
Update Airflow documentation to be more precise (#603)
yqlu Aug 4, 2023
385ed2f
add related integration field for chronicle sample alert policies
shourabhpayal Aug 5, 2023
b99256f
Modify user labels to include project region and instance id separately
sowmyagiri-google Aug 3, 2023
a5a82ed
Updated MySQL Prometheus documentation (#596)
algchoo Aug 11, 2023
71e8b3e
Added v2 dashboard for ScyllaDB, updated metadata and prometheus meta…
algchoo Aug 15, 2023
e7073e3
added versioned screenshots
algchoo Aug 15, 2023
dfef53f
quota string changed to limit
google-nalin Aug 16, 2023
d8d33d9
error string updated
google-nalin Aug 17, 2023
9058550
Unlink the GPU GCE+GKE dashboard and GPU integrations
LujieDuan Aug 1, 2023
e604346
Update DCGM metric names as the names has been changed from the Ops A…
LujieDuan Aug 10, 2023
865fffd
Rename nginx-ingress dashboard folder ingress-nginx
EvanSimpson Aug 15, 2023
c9966aa
Compose per integration metadata.yaml and ops_agent_metadata.yaml fil…
stackdriver-instrumentation-release Aug 14, 2023
6831018
Compose per integration metadata.yaml and ops_agent_metadata.yaml fil…
stackdriver-instrumentation-release Aug 17, 2023
c4b9a1d
Compose per integration metadata.yaml and ops_agent_metadata.yaml fil…
stackdriver-instrumentation-release Aug 21, 2023
8bdf4d6
add playbook links to documentation section for cpu memory limit util…
stevezease Aug 22, 2023
18 changes: 18 additions & 0 deletions alerts/google-cloud-chronicle/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
# Alerts for Chronicle

### Silent Forwarder

This alert policy detects the absence of data from the Chronicle collector with collector_id = 10479925-878c-11e7-9421-10604b7cb5c1 over a 1-hour window. A silent collector generally requires further investigation and indicates an issue with the Chronicle collector.

### All silent Chronicle forwarder and logtype combinations

This alert policy fires an alert every time a Chronicle forwarder goes silent for a log type. For example, if 4 forwarders are set up supplying 5 log types each, up to 20 alerts can fire (one for each combination). Similarly, if a single Chronicle forwarder goes down, 5 alerts will be active.

### All silent Chronicle forwarder and logtype combinations except few logtypes

This alert policy is similar to the one above, except that it does not fire alerts for the excluded log types. In this template, it does not fire alerts if Chronicle forwarders stop sending logs for BIND_DNS, CS_DETECTS, or BRO_DNS.


### Forwarder buffer usage threshold

This alert policy sends alerts when any Chronicle forwarder collecting logs from pcap has mean memory buffer usage above 1% over a 1-hour window.
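
The forwarder and log type arithmetic in the README can be checked with a small sketch. The forwarder and log type names below are hypothetical placeholders; only the counts (4 forwarders, 5 log types each) come from the README example.

```python
from itertools import product


def alert_combinations(forwarders, log_types, excluded=()):
    """Return every (forwarder, log_type) pair that can raise its own alert."""
    kept = [t for t in log_types if t not in excluded]
    return list(product(forwarders, kept))


# Hypothetical names, used only to reproduce the README's counts.
forwarders = [f"forwarder-{i}" for i in range(1, 5)]
log_types = [f"LOG_TYPE_{i}" for i in range(1, 6)]

pairs = alert_combinations(forwarders, log_types)
one_forwarder_down = [p for p in pairs if p[0] == "forwarder-1"]
with_exclusion = alert_combinations(forwarders, log_types, excluded=("LOG_TYPE_1",))

print(len(pairs), len(one_forwarder_down), len(with_exclusion))
```

The `excluded` argument mirrors the idea behind the "except few logtypes" variant: excluded log types never contribute a combination.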
@@ -0,0 +1,28 @@
{
"displayName": "sample policy to detect all silent Chronicle forwarder and logtype combinations except few logtypes",
"conditions": [
{
"displayName": "chronicle forwarder and logtypes silent for 1 hour except few",
"conditionAbsent": {
"aggregations": [
{
"alignmentPeriod": "3600s",
"crossSeriesReducer": "REDUCE_MEAN",
"groupByFields": [
"resource.label.collector_id",
"resource.label.log_type"
],
"perSeriesAligner": "ALIGN_DELTA"
}
],
"duration": "3600s",
"filter": "resource.type = \"chronicle.googleapis.com/Collector\" AND resource.labels.log_type != one_of(\"BIND_DNS\", \"BRO_DNS\", \"CS_DETECTS\") AND metric.type = \"chronicle.googleapis.com/ingestion/log/record_count\"",
"trigger": {
"count": 1
}
}
}
],
"combiner": "OR",
"enabled": true
}
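
As a rough illustration of how the exclusion filter in this template is composed, the following sketch rebuilds the same filter string from a list of log types. The helper name `build_exclusion_filter` is hypothetical and not part of the repository.

```python
def build_exclusion_filter(excluded_log_types):
    """Compose a Monitoring filter that skips the given Chronicle log types."""
    quoted = ", ".join(f'"{log_type}"' for log_type in excluded_log_types)
    clauses = [
        'resource.type = "chronicle.googleapis.com/Collector"',
        f"resource.labels.log_type != one_of({quoted})",
        'metric.type = "chronicle.googleapis.com/ingestion/log/record_count"',
    ]
    return " AND ".join(clauses)


filter_text = build_exclusion_filter(["BIND_DNS", "BRO_DNS", "CS_DETECTS"])
print(filter_text)
```

Changing the list passed to the helper is the only step needed to exclude a different set of log types.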
@@ -0,0 +1,28 @@
{
"displayName": "sample policy to detect all silent Chronicle forwarder and logtype combinations",
"conditions": [
{
"displayName": "chronicle forwarder and logtypes silent for 1 hour",
"conditionAbsent": {
"aggregations": [
{
"alignmentPeriod": "3600s",
"crossSeriesReducer": "REDUCE_MEAN",
"groupByFields": [
"resource.label.collector_id",
"resource.label.log_type"
],
"perSeriesAligner": "ALIGN_DELTA"
}
],
"duration": "3600s",
"filter": "resource.type = \"chronicle.googleapis.com/Collector\" AND metric.type = \"chronicle.googleapis.com/ingestion/log/record_count\"",
"trigger": {
"count": 1
}
}
}
],
"combiner": "OR",
"enabled": true
}
@@ -0,0 +1,29 @@
{
"displayName": "sample policy to detect forwarder mean buffer used is more than 1% over a 1 hour window for input type pcap and buffer type memory",
"conditions": [
{
"displayName": "forwarder mean buffer used is more than 1% over 1 hour window",
"conditionThreshold": {
"aggregations": [
{
"alignmentPeriod": "3600s",
"crossSeriesReducer": "REDUCE_MEAN",
"groupByFields": [
"resource.label.project_id"
],
"perSeriesAligner": "ALIGN_MEAN"
}
],
"comparison": "COMPARISON_GT",
"duration": "0s",
"filter": "resource.type = \"chronicle.googleapis.com/Collector\" AND metric.type = \"chronicle.googleapis.com/forwarder/buffer_used\" AND (metric.labels.input_type = \"pcap\" AND metric.labels.buffer_type = \"memory\")",
"thresholdValue": 0.01,
"trigger": {
"count": 1
}
}
}
],
"combiner": "OR",
"enabled": true
}
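
The threshold condition above can be approximated locally as follows. This is a simplified sketch, not how Cloud Monitoring evaluates the policy: it collapses alignment and cross-series reduction into a single mean over one window, and the sample values are invented.

```python
from statistics import mean

THRESHOLD = 0.01  # thresholdValue from the template, described there as 1%


def breaches_threshold(samples, threshold=THRESHOLD):
    """Return True when the window mean is strictly greater than the threshold.

    Strict comparison mirrors COMPARISON_GT. An empty window never breaches.
    """
    if not samples:
        return False
    return mean(samples) > threshold


# Invented buffer_used samples for one 3600s window.
print(breaches_threshold([0.02, 0.03]))
print(breaches_threshold([0.01, 0.01]))
```

Because the comparison is strict, a window whose mean is exactly at the threshold does not count as a breach.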
29 changes: 29 additions & 0 deletions alerts/google-cloud-chronicle/metadata.yaml
@@ -0,0 +1,29 @@
alert_policy_templates:
-
id: silent-forwarder
description: "sample policy to detect a single silent Chronicle forwarder using collector_id filter"
version: 1
related_integrations:
- id: chronicle_security
platform: GCP
-
id: forwarder-buffer-usage-more-than-threshold-with-filters
description: "sample policy to detect forwarder mean buffer used is more than 1% over a 1 hour window for input type pcap and buffer type memory"
version: 1
related_integrations:
- id: chronicle_security
platform: GCP
-
id: all-silent-forwarder-logtype-combinations-except-few-logtypes
description: "sample policy to detect all silent Chronicle forwarder and logtype combinations except few logtypes"
version: 1
related_integrations:
- id: chronicle_security
platform: GCP
-
id: all-silent-forwarder-logtype-combinations
description: "sample policy to detect all silent Chronicle forwarder and logtype combinations"
version: 1
related_integrations:
- id: chronicle_security
platform: GCP
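
A small consistency check between `metadata.yaml` entries and template files might look like the sketch below. It assumes the naming convention `<id>.v<version>.json`, which is inferred from the file names visible in this diff (for example `silent-forwarder.v1.json`); the helper names are hypothetical, and the entries are written as plain dictionaries to avoid a YAML dependency.

```python
def template_filename(template_id, version):
    """Build the expected file name, assuming the <id>.v<version>.json pattern."""
    return f"{template_id}.v{version}.json"


def missing_templates(entries, existing_files):
    """Return expected file names that are not present in existing_files."""
    expected = [template_filename(e["id"], e["version"]) for e in entries]
    return [name for name in expected if name not in existing_files]


entries = [
    {"id": "silent-forwarder", "version": 1},
    {"id": "all-silent-forwarder-logtype-combinations", "version": 1},
]
# Only the first file name is confirmed by this diff; the set is illustrative.
existing = {"silent-forwarder.v1.json"}

print(missing_templates(entries, existing))
```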
27 changes: 27 additions & 0 deletions alerts/google-cloud-chronicle/silent-forwarder.v1.json
@@ -0,0 +1,27 @@
{
"displayName": "sample policy to detect a single silent Chronicle forwarder using collector_id filter",
"conditions": [
{
"displayName": "chronicle forwarder silent for 1 hour",
"conditionAbsent": {
"aggregations": [
{
"alignmentPeriod": "3600s",
"crossSeriesReducer": "REDUCE_MEAN",
"groupByFields": [
"resource.label.project_id"
],
"perSeriesAligner": "ALIGN_DELTA"
}
],
"duration": "3600s",
"filter": "resource.type = \"chronicle.googleapis.com/Collector\" AND resource.labels.collector_id = \"10479925-878c-11e7-9421-10604b7cb5c1\" AND metric.type = \"chronicle.googleapis.com/ingestion/log/record_count\"",
"trigger": {
"count": 1
}
}
}
],
"combiner": "OR",
"enabled": true
}
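
Because this sample policy hard-codes one collector_id, anyone adapting it has to swap in their own value. The sketch below shows one possible way to do that on a parsed copy of the policy; the function name and the replacement ID are hypothetical, and the policy dictionary is trimmed to the fields the sketch touches.

```python
import copy

SAMPLE_COLLECTOR_ID = "10479925-878c-11e7-9421-10604b7cb5c1"


def with_collector_id(policy, collector_id):
    """Return a copy of the policy whose absence filters use collector_id."""
    updated = copy.deepcopy(policy)
    for condition in updated["conditions"]:
        absent = condition.get("conditionAbsent")
        if absent and "filter" in absent:
            absent["filter"] = absent["filter"].replace(
                SAMPLE_COLLECTOR_ID, collector_id
            )
    return updated


sample_policy = {
    "conditions": [
        {
            "conditionAbsent": {
                "filter": 'resource.labels.collector_id = "'
                + SAMPLE_COLLECTOR_ID
                + '"'
            }
        }
    ]
}

updated_policy = with_collector_id(sample_policy, "my-collector-id")
print(updated_policy["conditions"][0]["conditionAbsent"]["filter"])
```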
@@ -7,7 +7,9 @@
"userLabels": {
"context": "${CONTEXT}",
"resource_type": "${RESOURCE_TYPE}",
"instance_id": "${INSTANCE_NAME}"
"project_id": "${PROJECT_ID}",
"region": "${REGION}",
"instance_id": "${INSTANCE_ID}"
},
"conditions": [
{
4 changes: 3 additions & 1 deletion alerts/google-cloud-redis/standard-instance-failover.v1.json
@@ -7,7 +7,9 @@
"userLabels": {
"context": "${CONTEXT}",
"resource_type": "${RESOURCE_TYPE}",
"instance_id": "${INSTANCE_NAME}"
"project_id": "${PROJECT_ID}",
"region": "${REGION}",
"instance_id": "${INSTANCE_ID}"
},
"conditions": [
{
@@ -7,7 +7,9 @@
"userLabels": {
"context": "${CONTEXT}",
"resource_type": "${RESOURCE_TYPE}",
"instance_id": "${INSTANCE_NAME}"
"project_id": "${PROJECT_ID}",
"region": "${REGION}",
"instance_id": "${INSTANCE_ID}"
},
"conditions": [
{
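
To illustrate how the `${...}` placeholders in `userLabels` might be filled in, here is a sketch using Python's `string.Template`, which happens to share the same placeholder syntax. How the real template system performs substitution is not shown in this diff, so treat this as an assumption; all values are invented.

```python
from string import Template

# userLabels as they appear after this change.
user_labels = {
    "context": "${CONTEXT}",
    "resource_type": "${RESOURCE_TYPE}",
    "project_id": "${PROJECT_ID}",
    "region": "${REGION}",
    "instance_id": "${INSTANCE_ID}",
}


def render_labels(labels, values):
    """Substitute ${NAME} placeholders; raises KeyError if a value is missing."""
    return {key: Template(raw).substitute(values) for key, raw in labels.items()}


# Invented values for illustration only.
values = {
    "CONTEXT": "redis",
    "RESOURCE_TYPE": "redis_instance",
    "PROJECT_ID": "my-project",
    "REGION": "us-central1",
    "INSTANCE_ID": "my-instance",
}

rendered = render_labels(user_labels, values)
print(rendered)
```

Splitting project, region, and instance into separate labels means each one can be matched on its own, which a single combined instance name would not allow.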
@@ -1,7 +1,7 @@
{
"displayName": "GKE Container - High CPU Limit Utilization (${CLUSTER_NAME} cluster)",
"documentation": {
"content": "- Containers that exceed CPU utilization limit are CPU throttled. To avoid application slowdown and unresponsiveness, keep CPU usage below the CPU utilization limit [View Documentation](https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-resource-requests-and-limits).\n- If alerts tend to be false positive or noisy, consider visiting the alert policy page and changing the threshold, the rolling (alignment) window, and the retest (duration) window. [View Documentation](https://cloud.google.com/monitoring/alerts/concepts-indepth)",
"content": "- Containers that exceed CPU utilization limit are CPU throttled. To avoid application slowdown and unresponsiveness, keep CPU usage below the CPU utilization limit [View Documentation](https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-resource-requests-and-limits).\n- If alerts tend to be false positive or noisy, consider visiting the alert policy page and changing the threshold, the rolling (alignment) window, and the retest (duration) window. [View Documentation](https://cloud.google.com/monitoring/alerts/concepts-indepth)\n- We recommend troubleshooting this incident with the [CPU Utilization interactive playbook](https://console.cloud.google.com/monitoring/dashboards/gke-troubleshooting/cpu?project=${PROJECT_ID}&f.sd_ts_playbook.cluster_name=${CLUSTER_NAME}&f.sd_ts_playbook.location=${CLUSTER_LOCATION}), which shows detailed instructions, metrics, and logs.",
"mimeType": "text/markdown"
},
"userLabels": {},
@@ -2,7 +2,7 @@
"displayName": "GKE Pod - FailedScheduling Log Event (${CLUSTER_NAME})",
"documentation": {
"content":
"- A \"FailedScheduling\" event occurs when a pending pod cannot be scheduled, This alert fires when an event with reason \"FailedSceduling\" occurs in the logs; limited to notifying once per hour.",
"- A \"FailedScheduling\" event occurs when a pending pod cannot be scheduled, This alert fires when an event with reason \"FailedSceduling\" occurs in the logs; limited to notifying once per hour.\n- We recommend troubleshooting this incident with the [Unschedulable Pods interactive playbook](https://console.cloud.google.com/monitoring/dashboards/gke-troubleshooting/unschedulable?project=${PROJECT_ID}&f.sd_ts_playbook.cluster_name=${CLUSTER_NAME}&f.sd_ts_playbook.location=${CLUSTER_LOCATION}), which shows detailed instructions, metrics, and logs.",
"mimeType": "text/markdown"
},
"userLabels": {},
@@ -1,7 +1,7 @@
{
"displayName": "GKE Container - High Memory Limit Utilization (${CLUSTER_NAME} cluster)",
"documentation": {
"content": "- Containers that exceed Memory utilization limit are terminated. To avoid Out of Memory (OOM) failures, keep memory usage below the memory utilization limit [View Documentation](https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-resource-requests-and-limits).\n- If alerts tend to be false positive or noisy, consider visiting the alert policy page and changing the threshold, the rolling (alignment) window, and the retest (duration) window. [View Documentation](https://cloud.google.com/monitoring/alerts/concepts-indepth)",
"content": "- Containers that exceed Memory utilization limit are terminated. To avoid Out of Memory (OOM) failures, keep memory usage below the memory utilization limit [View Documentation](https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-resource-requests-and-limits).\n- If alerts tend to be false positive or noisy, consider visiting the alert policy page and changing the threshold, the rolling (alignment) window, and the retest (duration) window. [View Documentation](https://cloud.google.com/monitoring/alerts/concepts-indepth)\n- We recommend troubleshooting this incident with the [Memory Utilization interactive playbook](https://console.cloud.google.com/monitoring/dashboards/gke-troubleshooting/memory?project=${PROJECT_ID}&f.sd_ts_playbook.cluster_name=${CLUSTER_NAME}&f.sd_ts_playbook.location=${CLUSTER_LOCATION}), which shows detailed instructions, metrics, and logs.",
"mimeType": "text/markdown"
},
"userLabels": {},
@@ -1,7 +1,7 @@
{
"displayName": "GKE Container - Restarts (${CLUSTER_NAME} cluster)",
"documentation": {
"content": "- Container restarts are commonly caused by memory/cpu usage issues and application failures.\n- By default, this alert notifies an incident when there is more than 1 container restart in a 5 minute window. If alerts tend to be false positive or noisy, consider visiting the alert policy page and changing the threshold, the rolling (alignment) window, and the retest (duration) window. [View Documentation](https://cloud.google.com/monitoring/alerts/concepts-indepth).",
"content": "- Container restarts are commonly caused by memory/cpu usage issues and application failures.\n- By default, this alert notifies an incident when there is more than 1 container restart in a 5 minute window. If alerts tend to be false positive or noisy, consider visiting the alert policy page and changing the threshold, the rolling (alignment) window, and the retest (duration) window. [View Documentation](https://cloud.google.com/monitoring/alerts/concepts-indepth).\n- We recommend troubleshooting this incident with the [interactive playbook](https://console.cloud.google.com/monitoring/dashboards/gke-troubleshooting/crashloop?project=${PROJECT_ID}&f.sd_ts_playbook.cluster_name=${CLUSTER_NAME}&f.sd_ts_playbook.location=${CLUSTER_LOCATION}) for restarting containers, which shows detailed instructions, metrics, and logs.",
"mimeType": "text/markdown"
},
"userLabels": {},
15 changes: 8 additions & 7 deletions alerts/google-quotas/all-adjustments-by-quota-adjuster.v1.json
@@ -1,20 +1,21 @@
{
"displayName":"All adjustments by Quota Adjuster",
"displayName":"All adjustments by quota adjuster",
"documentation":{
"content":"Log-based alerting detected a QuotaAdjuster change for limit ${log.extracted_label.limit_name} in location ${log.extracted_label.location}, which increased the quota from ${log.extracted_label.current_quota_limit} to ${log.extracted_label.new_quota_limit}.",
"content":"Log-based alerting detected a quota adjuster change for service ${log.extracted_label.service_name} quota ${log.extracted_label.limit_name} in location ${log.extracted_label.location}, which increased the limit from ${log.extracted_label.current_quota_limit} to ${log.extracted_label.new_quota_limit}.",
"mimeType":"text/markdown"
},
"userLabels":{},
"conditions":[
{
"displayName":"Log match condition",
"conditionMatchedLog":{
"filter":"log_id(\"cloudaudit.googleapis.com/system_event\")\nresource.labels.service = \"quotaadjuster.googleapis.com\"\nprotoPayload.serviceName=\"quotaadjuster.googleapis.com\"\nprotoPayload.metadata.quota_change_event.event_status=\"SUCCESS\"",
"filter":"log_id(\"cloudaudit.googleapis.com/system_event\")\nprotoPayload.methodName=\"google.cloud.quotaadjuster.v1main.QuotaAdjusterService.AutoAdjustQuota\"\nlabels.event_state=\"SUCCEEDED\"",
"labelExtractors": {
"current_quota_limit":"EXTRACT(protoPayload.metadata.quota_change_event.current_quota_limit)",
"new_quota_limit":"EXTRACT(protoPayload.metadata.quota_change_event.success_details.new_quota_limit)",
"limit_name":"EXTRACT(protoPayload.metadata.quota_change_event.limit_name)",
"location":"EXTRACT(protoPayload.metadata.quota_change_event.location)"
"current_quota_limit":"EXTRACT(protoPayload.metadata.currentQuotaLimit)",
"new_quota_limit":"EXTRACT(protoPayload.metadata.successDetails.newQuotaLimit)",
"limit_name":"EXTRACT(labels.limit)",
"service_name":"EXTRACT(labels.service)",
"location":"EXTRACT(labels.location)"
}
}
}
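
The updated `labelExtractors` read some values from indexed `labels` and others from `protoPayload.metadata`. The sketch below imitates that lookup on a plain dictionary to show which field feeds which label. It is not the Cloud Logging `EXTRACT` implementation, and the log entry is invented.

```python
def extract(entry, path):
    """Walk a dotted field path through nested dicts; None if a step is missing."""
    current = entry
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Field paths taken from the new labelExtractors in this template.
LABEL_PATHS = {
    "current_quota_limit": "protoPayload.metadata.currentQuotaLimit",
    "new_quota_limit": "protoPayload.metadata.successDetails.newQuotaLimit",
    "limit_name": "labels.limit",
    "service_name": "labels.service",
    "location": "labels.location",
}

# Invented log entry with the shape the paths above expect.
entry = {
    "labels": {
        "limit": "CPUS-per-project-region",
        "service": "compute.googleapis.com",
        "location": "us-central1",
        "event_state": "SUCCEEDED",
    },
    "protoPayload": {
        "metadata": {
            "currentQuotaLimit": "24",
            "successDetails": {"newQuotaLimit": "48"},
        }
    },
}

extracted = {name: extract(entry, path) for name, path in LABEL_PATHS.items()}
print(extracted)
```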
12 changes: 6 additions & 6 deletions alerts/google-quotas/metadata.yaml
@@ -25,26 +25,26 @@ alert_policy_templates:
version: 1
-
id: all-adjustments-by-quota-adjuster
description: "Monitor Quota Adjuster activity and the notifications channel will alert you to all adjustments."
description: "Monitor quota adjuster activity and the notifications channel will alert you to all adjustments."
version: 1
-
id: quota-adjuster-errors-and-failures
description: "Monitor quota use across projects and the notifications channel will alert you to any failures or errors encountered while automatically adjusting your quotas."
description: "Monitor quota use across projects and the notifications channel will alert you to any failures or errors encountered while adjusting your quotas."
version: 1
-
id: qa-scoped-limit-location-all-adjustments
description: "Monitor Quota Adjuster activity and the notifications channel will alert you to all adjustments."
description: "Monitor quota adjuster activity and the notifications channel will alert you to all adjustments."
version: 1
-
id: qa-scoped-limit-location-all-failures
description: "Monitor quota use across projects and the notifications channel will alert you to any failures or errors encountered while automatically adjusting your quotas."
description: "Monitor quota usage across projects and the notifications channel will alert you to any failures or errors encountered while adjusting your quotas."
version: 1
-
id: qa-scoped-limit-all-adjustments
description: "Monitor Quota Adjuster activity and the notifications channel will alert you to all adjustments."
description: "Monitor quota adjuster activity and the notifications channel will alert you to all adjustments."
version: 1
-
id: qa-scoped-limit-all-failures
description: "Monitor quota use across projects and the notifications channel will alert you to any failures or errors encountered while automatically adjusting your quotas."
description: "Monitor quota use across projects and the notifications channel will alert you to any failures or errors encountered while adjusting your quotas."
version: 1

15 changes: 8 additions & 7 deletions alerts/google-quotas/qa-scoped-limit-all-adjustments.v1.json
@@ -1,20 +1,21 @@
{
"displayName": "Quota Adjuster < ${SERVICE_TITLE} - ${LIMIT_DISPLAY_NAME}",
"displayName": "All adjustments by quota adjuster < ${SERVICE_TITLE} - ${LIMIT_DISPLAY_NAME}",
"documentation": {
"content": "Log-based alerting detected a QuotaAdjuster change for limit ${log.extracted_label.limit_name} in location ${log.extracted_label.location}, which increased the quota from ${log.extracted_label.current_quota_limit} to ${log.extracted_label.new_quota_limit}.",
"content": "Log-based alerting detected a quota adjuster change for service ${log.extracted_label.service_name} quota ${LIMIT_DISPLAY_NAME} in location ${log.extracted_label.location}, which increased the limit from ${log.extracted_label.current_quota_limit} to ${log.extracted_label.new_quota_limit}.",
"mimeType": "text/markdown"
},
"userLabels": {},
"conditions": [
{
"displayName": "Log match condition",
"conditionMatchedLog": {
"filter": "log_id(\"cloudaudit2.googleapis.com/system_event\")\nresource.labels.service=\"quotaadjuster.googleapis.com\" \nprotoPayload.serviceName=\"quotaadjuster.googleapis.com\" \nprotoPayload.metadata.quota_change_event.event_status=\"SUCCESS\" \nprotoPayload.metadata.quota_change_event.limit_name=\"${LIMIT_NAME}\"\n",
"filter": "log_id(\"cloudaudit.googleapis.com/system_event\")\nprotoPayload.methodName=\"google.cloud.quotaadjuster.v1main.QuotaAdjusterService.AutoAdjustQuota\"\nlabels.event_state=\"SUCCEEDED\"\nlabels.service=\"${SERVICE_NAME}\"\nlabels.limit=\"${LIMIT_NAME}\"\n",
"labelExtractors": {
"current_quota_limit":"EXTRACT(protoPayload.metadata.quota_change_event.current_quota_limit)",
"new_quota_limit":"EXTRACT(protoPayload.metadata.quota_change_event.success_details.new_quota_limit)",
"limit_name":"EXTRACT(protoPayload.metadata.quota_change_event.limit_name)",
"location":"EXTRACT(protoPayload.metadata.quota_change_event.location)"
"current_quota_limit":"EXTRACT(protoPayload.metadata.currentQuotaLimit)",
"new_quota_limit":"EXTRACT(protoPayload.metadata.success_Details.newQuotaLimit)",
"limit_name":"EXTRACT(labels.limit)",
"service_name":"EXTRACT(labels.service)",
"location":"EXTRACT(labels.location)"
}
}
}