Commit 2b2395d

Merge branch 'main' into k8s-node-nrql
2 parents 64501cd + f1f422c commit 2b2395d

70 files changed: +9992 -152 lines changed

@@ -0,0 +1,35 @@
+name: 5xx Server Errors
+
+description: |+
+  This alert is triggered if the customer faces 5xx server errors more than 5 times in 5 minutes.
+type: STATIC
+nrql:
+  query: "SELECT count(*) as '5xx Server Errors' from Transaction WHERE httpResponseCode LIKE '5%'"
+
+# Function used to aggregate the NRQL query value(s) for comparison to the terms.threshold (Default: SINGLE_VALUE)
+valueFunction: SINGLE_VALUE
+
+# List of Critical and Warning thresholds for the condition
+terms:
+  - priority: CRITICAL
+    # Operator used to compare against the threshold.
+    operator: ABOVE
+    # Value that triggers a violation
+    threshold: 10
+    # Time in seconds; 120 - 3600
+    thresholdDuration: 300
+    # How many data points must be in violation for the duration
+    thresholdOccurrences: ALL
+  - priority: WARNING
+    # Operator used to compare against the threshold.
+    operator: ABOVE
+    # Value that triggers a violation
+    threshold: 5
+    # Time in seconds; 120 - 3600
+    thresholdDuration: 300
+    # How many data points must be in violation for the duration
+    thresholdOccurrences: ALL
+
+# Duration after which a violation automatically closes
+# Time in seconds; 300 - 2592000 (Default: 86400 [1 day])
+violationTimeLimitSeconds: 86400
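
To make the `terms` fields concrete, here is a minimal Python sketch of how a STATIC term might be evaluated against a series of aggregated query results. It is an illustration only, not New Relic's implementation: the function name `term_violated`, the 60-second aggregation window, and the sample values are assumptions that do not appear in this commit.

```python
from typing import Sequence


def term_violated(values: Sequence[float], operator: str, threshold: float,
                  threshold_duration: int, occurrences: str = "ALL",
                  window_seconds: int = 60) -> bool:
    """Return True if the most recent data points breach the term.

    values: one aggregated query result per window, oldest first.
    window_seconds: assumed aggregation window (not taken from the files).
    """
    # Number of data points needed to cover thresholdDuration.
    needed = max(1, threshold_duration // window_seconds)
    if len(values) < needed:
        return False  # not enough data to cover the full duration
    recent = list(values)[-needed:]

    if operator == "ABOVE":
        breaches = [v > threshold for v in recent]
    elif operator == "BELOW":
        breaches = [v < threshold for v in recent]
    else:
        raise ValueError(f"unsupported operator: {operator}")

    if occurrences == "ALL":
        return all(breaches)      # every point in the duration must breach
    if occurrences == "AT_LEAST_ONCE":
        return any(breaches)      # a single breaching point is enough
    raise ValueError(f"unsupported thresholdOccurrences: {occurrences}")


# CRITICAL term from the file above: ABOVE 10 for 300 seconds, ALL points.
print(term_violated([11, 12, 13, 14, 15], "ABOVE", 10, 300))
print(term_violated([11, 12, 9, 14, 15], "ABOVE", 10, 300))
```

Under these assumptions the first call reports a violation and the second does not, because one of the five windows dips to 9 and `ALL` requires every window in the duration to breach.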
@@ -0,0 +1,35 @@
+name: CPU Usage (%)
+
+description: |+
+  This alert is triggered if CPU usage exceeds 90% for 5 minutes.
+type: STATIC
+nrql:
+  query: "SELECT latest(host.cpuPercent) AS 'CPU Used %' FROM Metric"
+
+# Function used to aggregate the NRQL query value(s) for comparison to the terms.threshold (Default: SINGLE_VALUE)
+valueFunction: SINGLE_VALUE
+
+# List of Critical and Warning thresholds for the condition
+terms:
+  - priority: CRITICAL
+    # Operator used to compare against the threshold.
+    operator: ABOVE
+    # Value that triggers a violation
+    threshold: 90
+    # Time in seconds; 120 - 3600
+    thresholdDuration: 300
+    # How many data points must be in violation for the duration
+    thresholdOccurrences: ALL
+  - priority: WARNING
+    # Operator used to compare against the threshold.
+    operator: ABOVE
+    # Value that triggers a violation
+    threshold: 80
+    # Time in seconds; 120 - 3600
+    thresholdDuration: 300
+    # How many data points must be in violation for the duration
+    thresholdOccurrences: ALL
+
+# Duration after which a violation automatically closes
+# Time in seconds; 300 - 2592000 (Default: 86400 [1 day])
+violationTimeLimitSeconds: 86400
@@ -0,0 +1,35 @@
+name: Downtime (%)
+
+description: |+
+  This alert is triggered if Downtime is more than 1% for 2 minutes.
+type: STATIC
+nrql:
+  query: "SELECT percentage(count(result), where result = 'FAILED') as 'Downtime (%)' from SyntheticCheck"
+
+# Function used to aggregate the NRQL query value(s) for comparison to the terms.threshold (Default: SINGLE_VALUE)
+valueFunction: SINGLE_VALUE
+
+# List of Critical and Warning thresholds for the condition
+terms:
+  - priority: CRITICAL
+    # Operator used to compare against the threshold.
+    operator: ABOVE
+    # Value that triggers a violation
+    threshold: 1
+    # Time in seconds; 120 - 3600
+    thresholdDuration: 120
+    # How many data points must be in violation for the duration
+    thresholdOccurrences: ALL
+  - priority: WARNING
+    # Operator used to compare against the threshold.
+    operator: ABOVE
+    # Value that triggers a violation
+    threshold: 0.5
+    # Time in seconds; 120 - 3600
+    thresholdDuration: 120
+    # How many data points must be in violation for the duration
+    thresholdOccurrences: ALL
+
+# Duration after which a violation automatically closes
+# Time in seconds; 300 - 2592000 (Default: 86400 [1 day])
+violationTimeLimitSeconds: 86400
@@ -0,0 +1,35 @@
+name: Memory Usage (%)
+
+description: |+
+  This alert is triggered if Memory usage exceeds 90% for 5 minutes.
+type: STATIC
+nrql:
+  query: "SELECT latest(host.memoryUsedPercent) as 'Memory Used %' FROM Metric"
+
+# Function used to aggregate the NRQL query value(s) for comparison to the terms.threshold (Default: SINGLE_VALUE)
+valueFunction: SINGLE_VALUE
+
+# List of Critical and Warning thresholds for the condition
+terms:
+  - priority: CRITICAL
+    # Operator used to compare against the threshold.
+    operator: ABOVE
+    # Value that triggers a violation
+    threshold: 90
+    # Time in seconds; 120 - 3600
+    thresholdDuration: 300
+    # How many data points must be in violation for the duration
+    thresholdOccurrences: ALL
+  - priority: WARNING
+    # Operator used to compare against the threshold.
+    operator: ABOVE
+    # Value that triggers a violation
+    threshold: 80
+    # Time in seconds; 120 - 3600
+    thresholdDuration: 300
+    # How many data points must be in violation for the duration
+    thresholdOccurrences: ALL
+
+# Duration after which a violation automatically closes
+# Time in seconds; 300 - 2592000 (Default: 86400 [1 day])
+violationTimeLimitSeconds: 86400
@@ -0,0 +1,27 @@
+name: High Index Utilization
+
+description: |+
+  This alert is triggered when the Index Utilization is above 90%.
+
+type: STATIC
+nrql:
+  query: "SELECT average(`aws.cloudsearch.IndexUtilization`) as 'Query' FROM Metric"
+
+# Function used to aggregate the NRQL query value(s) for comparison to the terms.threshold (Default: SINGLE_VALUE)
+valueFunction: SINGLE_VALUE
+
+# List of Critical and Warning thresholds for the condition
+terms:
+  - priority: CRITICAL
+    # Operator used to compare against the threshold.
+    operator: ABOVE
+    # Value that triggers a violation
+    threshold: 90
+    # Time in seconds; 120 - 3600
+    thresholdDuration: 300
+    # How many data points must be in violation for the duration
+    thresholdOccurrences: ALL
+
+# Duration after which a violation automatically closes
+# Time in seconds; 300 - 2592000 (Default: 86400 [1 day])
+violationTimeLimitSeconds: 86400
@@ -0,0 +1,33 @@
+name: Model Deployment Failed
+
+description: |+
+  This alert is triggered if the number of failures exceeds 20 within 10 minutes.
+type: STATIC
+nrql:
+  query: "FROM Metric SELECT sum(azure.machinelearningservices.workspaces.ModelDeployFailed) AS 'ModelDeployFailed'"
+
+# Function used to aggregate the NRQL query value(s) for comparison to the terms.threshold (Default: SINGLE_VALUE)
+valueFunction: SINGLE_VALUE
+
+# List of Critical and Warning thresholds for the condition
+terms:
+  - priority: CRITICAL
+    # Operator used to compare against the threshold.
+    operator: ABOVE
+    # Value that triggers a violation
+    threshold: 20
+    # Time in seconds; 120 - 3600
+    thresholdDuration: 600
+    # How many data points must be in violation for the duration
+    thresholdOccurrences: ALL
+
+  # Adding a Warning threshold is optional
+  - priority: WARNING
+    operator: ABOVE
+    threshold: 10
+    thresholdDuration: 600
+    thresholdOccurrences: ALL
+
+# Duration after which a violation automatically closes
+# Time in seconds; 300 - 2592000 (Default: 86400 [1 day])
+violationTimeLimitSeconds: 86400
@@ -0,0 +1,32 @@
+# Name of the alert
+name: Image Status
+
+# Description and details
+description: |+
+  This alert is triggered when the image status is inactive for 5 minutes.
+# Type of alert
+type: STATIC
+
+# NRQL query
+nrql:
+
+  query: "SELECT count(*) FROM OSImageSample where openstack.glance.image.status != 'active'"
+
+# Function used to aggregate the NRQL query value(s) for comparison to the terms.threshold (Default: SINGLE_VALUE)
+valueFunction: SINGLE_VALUE
+
+# List of Critical and Warning thresholds for the condition
+terms:
+  - priority: CRITICAL
+    # Operator used to compare against the threshold.
+    operator: ABOVE
+    # Value that triggers a violation
+    threshold: 1
+    # Time in seconds; 120 - 3600
+    thresholdDuration: 300
+    # How many data points must be in violation for the duration
+    thresholdOccurrences: ALL
+
+# Duration after which a violation automatically closes
+# Time in seconds; 300 - 2592000 (Default: 86400 [1 day])
+violationTimeLimitSeconds: 86400
@@ -0,0 +1,40 @@
+# Name of the alert
+name: Memory Usage Percent
+
+# Description and details
+description: |+
+  This alert is triggered when the memory usage exceeds 90% for 5 minutes.
+# Type of alert
+type: STATIC
+
+# NRQL query
+nrql:
+
+  query: "SELECT average(`memoryUsedBytes`/`memoryTotalBytes`*100) FROM SystemSample facet entityName"
+
+# Function used to aggregate the NRQL query value(s) for comparison to the terms.threshold (Default: SINGLE_VALUE)
+valueFunction: SINGLE_VALUE
+
+# List of Critical and Warning thresholds for the condition
+terms:
+  - priority: CRITICAL
+    # Operator used to compare against the threshold.
+    operator: ABOVE
+    # Value that triggers a violation
+    threshold: 90
+    # Time in seconds; 120 - 3600
+    thresholdDuration: 300
+    # How many data points must be in violation for the duration
+    thresholdOccurrences: ALL
+  - priority: WARNING
+    # Operator used to compare against the threshold.
+    operator: ABOVE
+    # Value that triggers a violation
+    threshold: 85
+    # Time in seconds; 120 - 3600, must be a multiple of 60 for Baseline conditions
+    thresholdDuration: 300
+    # How many data points must be in violation for the duration
+    thresholdOccurrences: ALL
+# Duration after which a violation automatically closes
+# Time in seconds; 300 - 2592000 (Default: 86400 [1 day])
+violationTimeLimitSeconds: 86400
@@ -0,0 +1,32 @@
+# Name of the alert
+name: Server Free Memory Low(%)
+
+# Description and details
+description: |+
+  This alert is triggered when server free memory is less than 10% for 5 minutes.
+# Type of alert
+type: STATIC
+
+# NRQL query
+nrql:
+
+  query: "SELECT average(`memoryFreeBytes`/`memoryTotalBytes`*100) FROM SystemSample facet entityName"
+
+# Function used to aggregate the NRQL query value(s) for comparison to the terms.threshold (Default: SINGLE_VALUE)
+valueFunction: SINGLE_VALUE
+
+# List of Critical and Warning thresholds for the condition
+terms:
+  - priority: CRITICAL
+    # Operator used to compare against the threshold.
+    operator: BELOW
+    # Value that triggers a violation
+    threshold: 10
+    # Time in seconds; 120 - 3600
+    thresholdDuration: 300
+    # How many data points must be in violation for the duration
+    thresholdOccurrences: ALL
+
+# Duration after which a violation automatically closes
+# Time in seconds; 300 - 2592000 (Default: 86400 [1 day])
+violationTimeLimitSeconds: 86400
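
The two `SystemSample` conditions above compute a percentage from byte counts and compare it with opposite operators. The following Python sketch works through that arithmetic for one hypothetical host. The helper `percent`, the 16 GiB host size, and the assumption that used plus free bytes equal total bytes are illustrative and not taken from this commit.

```python
def percent(part_bytes: int, total_bytes: int) -> float:
    """Mirror the NRQL expression `part / total * 100`."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return part_bytes / total_bytes * 100


# Hypothetical host: 16 GiB total, 1 GiB free.
total = 16 * 1024**3
free = 1 * 1024**3
used = total - free  # assumption: used + free == total

free_pct = percent(free, total)   # memoryFreeBytes / memoryTotalBytes * 100
used_pct = percent(used, total)   # memoryUsedBytes / memoryTotalBytes * 100

# "Server Free Memory Low(%)": CRITICAL when free memory is BELOW 10.
free_memory_critical = free_pct < 10
# "Memory Usage Percent": CRITICAL when used memory is ABOVE 90.
memory_usage_critical = used_pct > 90

print(free_pct, used_pct, free_memory_critical, memory_usage_critical)
```

Under the stated assumption the two CRITICAL terms flag the same host, since free below 10% implies used above 90%. On real hosts the two byte counts may not sum to the total, so the conditions are not guaranteed to be interchangeable.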
@@ -0,0 +1,35 @@
+name: 4xx client errors
+
+description: |+
+  This alert is triggered if the customer faces 4xx errors more than 5 times in 5 minutes.
+type: STATIC
+nrql:
+  query: "SELECT count(*) from Transaction WHERE httpResponseCode LIKE '4%'"
+
+# Function used to aggregate the NRQL query value(s) for comparison to the terms.threshold (Default: SINGLE_VALUE)
+valueFunction: SINGLE_VALUE
+
+# List of Critical and Warning thresholds for the condition
+terms:
+  - priority: CRITICAL
+    # Operator used to compare against the threshold.
+    operator: ABOVE
+    # Value that triggers a violation
+    threshold: 5
+    # Time in seconds; 120 - 3600
+    thresholdDuration: 300
+    # How many data points must be in violation for the duration
+    thresholdOccurrences: ALL
+  - priority: WARNING
+    # Operator used to compare against the threshold.
+    operator: ABOVE
+    # Value that triggers a violation
+    threshold: 3
+    # Time in seconds; 120 - 3600
+    thresholdDuration: 300
+    # How many data points must be in violation for the duration
+    thresholdOccurrences: ALL
+
+# Duration after which a violation automatically closes
+# Time in seconds; 300 - 2592000 (Default: 86400 [1 day])
+violationTimeLimitSeconds: 86400

alert-policies/stripe/downtime.yml

@@ -0,0 +1,35 @@
+name: Downtime (%)
+
+description: |+
+  This alert is triggered if Downtime is more than 1% for 2 minutes.
+type: STATIC
+nrql:
+  query: "SELECT percentage(count(result), where result = 'FAILED') as 'Downtime (%)' from SyntheticCheck"
+
+# Function used to aggregate the NRQL query value(s) for comparison to the terms.threshold (Default: SINGLE_VALUE)
+valueFunction: SINGLE_VALUE
+
+# List of Critical and Warning thresholds for the condition
+terms:
+  - priority: CRITICAL
+    # Operator used to compare against the threshold.
+    operator: ABOVE
+    # Value that triggers a violation
+    threshold: 1
+    # Time in seconds; 120 - 3600
+    thresholdDuration: 120
+    # How many data points must be in violation for the duration
+    thresholdOccurrences: ALL
+  - priority: WARNING
+    # Operator used to compare against the threshold.
+    operator: ABOVE
+    # Value that triggers a violation
+    threshold: 0.5
+    # Time in seconds; 120 - 3600
+    thresholdDuration: 120
+    # How many data points must be in violation for the duration
+    thresholdOccurrences: ALL
+
+# Duration after which a violation automatically closes
+# Time in seconds; 300 - 2592000 (Default: 86400 [1 day])
+violationTimeLimitSeconds: 86400
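
The comments in these files state numeric ranges for `thresholdDuration` (120 - 3600 seconds) and `violationTimeLimitSeconds` (300 - 2592000 seconds). A small pre-merge check could enforce them. The sketch below is one possible approach in Python and is not part of this commit: `validate_condition` is a hypothetical helper, and the condition is written as a plain dict so the example needs no YAML library.

```python
from typing import Any, Dict, List


def validate_condition(cond: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems: List[str] = []

    for term in cond.get("terms", []):
        label = term.get("priority", "<no priority>")
        duration = term.get("thresholdDuration")
        # Range from the "Time in seconds; 120 - 3600" comment.
        if not isinstance(duration, int) or not 120 <= duration <= 3600:
            problems.append(
                f"{label}: thresholdDuration {duration!r} outside 120-3600")

    # Range and default from the "300 - 2592000 (Default: 86400)" comment.
    limit = cond.get("violationTimeLimitSeconds", 86400)
    if not isinstance(limit, int) or not 300 <= limit <= 2592000:
        problems.append(
            f"violationTimeLimitSeconds {limit!r} outside 300-2592000")

    return problems


# Dict form of the Downtime (%) condition shown above.
downtime = {
    "name": "Downtime (%)",
    "type": "STATIC",
    "terms": [
        {"priority": "CRITICAL", "operator": "ABOVE", "threshold": 1,
         "thresholdDuration": 120, "thresholdOccurrences": "ALL"},
        {"priority": "WARNING", "operator": "ABOVE", "threshold": 0.5,
         "thresholdDuration": 120, "thresholdOccurrences": "ALL"},
    ],
    "violationTimeLimitSeconds": 86400,
}

print(validate_condition(downtime))
```

In a real pipeline the dict would come from parsing each `.yml` file, and the check would fail the build when the returned list is non-empty.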
