Commit 2667222

Merge branch 'main' into Java-RMI-Installation-upgrade
2 parents f61138a + 3d80446 commit 2667222

122 files changed: +14509 -1495 lines changed

@@ -0,0 +1,35 @@
+name: 5xx Server Errors
+
+description: |+
+  This alert is triggered if the customer faces 5xx server errors more than 5 times in 5 minutes.
+type: STATIC
+nrql:
+  query: "SELECT count(*) as '5xx Server Errors' from Transaction WHERE httpResponseCode LIKE '5%'"
+
+# Function used to aggregate the NRQL query value(s) for comparison to the terms.threshold (Default: SINGLE_VALUE)
+valueFunction: SINGLE_VALUE
+
+# List of Critical and Warning thresholds for the condition
+terms:
+  - priority: CRITICAL
+    # Operator used to compare against the threshold.
+    operator: ABOVE
+    # Value that triggers a violation
+    threshold: 10
+    # Time in seconds; 120 - 3600
+    thresholdDuration: 300
+    # How many data points must be in violation for the duration
+    thresholdOccurrences: ALL
+  - priority: WARNING
+    # Operator used to compare against the threshold.
+    operator: ABOVE
+    # Value that triggers a violation
+    threshold: 5
+    # Time in seconds; 120 - 3600
+    thresholdDuration: 300
+    # How many data points must be in violation for the duration
+    thresholdOccurrences: ALL
+
+# Duration after which a violation automatically closes
+# Time in seconds; 300 - 2592000 (Default: 86400 [1 day])
+violationTimeLimitSeconds: 86400

@@ -0,0 +1,35 @@
+name: CPU Usage (%)
+
+description: |+
+  This alert is triggered if CPU usage exceeds 90% for 5 minutes.
+type: STATIC
+nrql:
+  query: "SELECT latest(host.cpuPercent) AS 'CPU Used %' FROM Metric"
+
+# Function used to aggregate the NRQL query value(s) for comparison to the terms.threshold (Default: SINGLE_VALUE)
+valueFunction: SINGLE_VALUE
+
+# List of Critical and Warning thresholds for the condition
+terms:
+  - priority: CRITICAL
+    # Operator used to compare against the threshold.
+    operator: ABOVE
+    # Value that triggers a violation
+    threshold: 90
+    # Time in seconds; 120 - 3600
+    thresholdDuration: 300
+    # How many data points must be in violation for the duration
+    thresholdOccurrences: ALL
+  - priority: WARNING
+    # Operator used to compare against the threshold.
+    operator: ABOVE
+    # Value that triggers a violation
+    threshold: 80
+    # Time in seconds; 120 - 3600
+    thresholdDuration: 300
+    # How many data points must be in violation for the duration
+    thresholdOccurrences: ALL
+
+# Duration after which a violation automatically closes
+# Time in seconds; 300 - 2592000 (Default: 86400 [1 day])
+violationTimeLimitSeconds: 86400

@@ -0,0 +1,35 @@
+name: Downtime (%)
+
+description: |+
+  This alert is triggered if Downtime is more than 1% for 2 minutes.
+type: STATIC
+nrql:
+  query: "SELECT percentage(count(result), where result = 'FAILED') as 'Downtime (%)' from SyntheticCheck"
+
+# Function used to aggregate the NRQL query value(s) for comparison to the terms.threshold (Default: SINGLE_VALUE)
+valueFunction: SINGLE_VALUE
+
+# List of Critical and Warning thresholds for the condition
+terms:
+  - priority: CRITICAL
+    # Operator used to compare against the threshold.
+    operator: ABOVE
+    # Value that triggers a violation
+    threshold: 1
+    # Time in seconds; 120 - 3600
+    thresholdDuration: 120
+    # How many data points must be in violation for the duration
+    thresholdOccurrences: ALL
+  - priority: WARNING
+    # Operator used to compare against the threshold.
+    operator: ABOVE
+    # Value that triggers a violation
+    threshold: 0.5
+    # Time in seconds; 120 - 3600
+    thresholdDuration: 120
+    # How many data points must be in violation for the duration
+    thresholdOccurrences: ALL
+
+# Duration after which a violation automatically closes
+# Time in seconds; 300 - 2592000 (Default: 86400 [1 day])
+violationTimeLimitSeconds: 86400

@@ -0,0 +1,35 @@
+name: Memory Usage (%)
+
+description: |+
+  This alert is triggered if Memory usage exceeds 90% for 5 minutes.
+type: STATIC
+nrql:
+  query: "SELECT latest(host.memoryUsedPercent) as 'Memory Used %' FROM Metric"
+
+# Function used to aggregate the NRQL query value(s) for comparison to the terms.threshold (Default: SINGLE_VALUE)
+valueFunction: SINGLE_VALUE
+
+# List of Critical and Warning thresholds for the condition
+terms:
+  - priority: CRITICAL
+    # Operator used to compare against the threshold.
+    operator: ABOVE
+    # Value that triggers a violation
+    threshold: 90
+    # Time in seconds; 120 - 3600
+    thresholdDuration: 300
+    # How many data points must be in violation for the duration
+    thresholdOccurrences: ALL
+  - priority: WARNING
+    # Operator used to compare against the threshold.
+    operator: ABOVE
+    # Value that triggers a violation
+    threshold: 80
+    # Time in seconds; 120 - 3600
+    thresholdDuration: 300
+    # How many data points must be in violation for the duration
+    thresholdOccurrences: ALL
+
+# Duration after which a violation automatically closes
+# Time in seconds; 300 - 2592000 (Default: 86400 [1 day])
+violationTimeLimitSeconds: 86400

@@ -0,0 +1,27 @@
+name: High Capacity Utilization
+
+description: |+
+  This alert is triggered when the Capacity Utilization is above 90%.
+
+type: STATIC
+nrql:
+  query: "SELECT average(`aws.appstream.CapacityUtilization`) as 'Query' FROM Metric"
+
+# Function used to aggregate the NRQL query value(s) for comparison to the terms.threshold (Default: SINGLE_VALUE)
+valueFunction: SINGLE_VALUE
+
+# List of Critical and Warning thresholds for the condition
+terms:
+  - priority: CRITICAL
+    # Operator used to compare against the threshold.
+    operator: ABOVE
+    # Value that triggers a violation
+    threshold: 90
+    # Time in seconds; 120 - 3600
+    thresholdDuration: 300
+    # How many data points must be in violation for the duration
+    thresholdOccurrences: ALL
+
+# Duration after which a violation automatically closes
+# Time in seconds; 300 - 2592000 (Default: 86400 [1 day])
+violationTimeLimitSeconds: 86400

@@ -0,0 +1,27 @@
+name: High Insufficient Capacity Errors
+
+description: |+
+  This alert is triggered when Insufficient Capacity Errors are above 10 in 10 minutes.
+
+type: STATIC
+nrql:
+  query: "SELECT count(`aws.appstream.InsufficientCapacityError`) as 'Query' FROM Metric"
+
+# Function used to aggregate the NRQL query value(s) for comparison to the terms.threshold (Default: SINGLE_VALUE)
+valueFunction: SINGLE_VALUE
+
+# List of Critical and Warning thresholds for the condition
+terms:
+  - priority: CRITICAL
+    # Operator used to compare against the threshold.
+    operator: ABOVE
+    # Value that triggers a violation
+    threshold: 100
+    # Time in seconds; 120 - 3600
+    thresholdDuration: 600
+    # How many data points must be in violation for the duration
+    thresholdOccurrences: ALL
+
+# Duration after which a violation automatically closes
+# Time in seconds; 300 - 2592000 (Default: 86400 [1 day])
+violationTimeLimitSeconds: 86400

@@ -0,0 +1,27 @@
+name: High Index Utilization
+
+description: |+
+  This alert is triggered when the Index Utilization is above 90%.
+
+type: STATIC
+nrql:
+  query: "SELECT average(`aws.cloudsearch.IndexUtilization`) as 'Query' FROM Metric"
+
+# Function used to aggregate the NRQL query value(s) for comparison to the terms.threshold (Default: SINGLE_VALUE)
+valueFunction: SINGLE_VALUE
+
+# List of Critical and Warning thresholds for the condition
+terms:
+  - priority: CRITICAL
+    # Operator used to compare against the threshold.
+    operator: ABOVE
+    # Value that triggers a violation
+    threshold: 90
+    # Time in seconds; 120 - 3600
+    thresholdDuration: 300
+    # How many data points must be in violation for the duration
+    thresholdOccurrences: ALL
+
+# Duration after which a violation automatically closes
+# Time in seconds; 300 - 2592000 (Default: 86400 [1 day])
+violationTimeLimitSeconds: 86400

@@ -0,0 +1,33 @@
+name: Model Deployment Failed
+
+description: |+
+  This alert is triggered if the number of failures exceeds 20 within 10 minutes.
+type: STATIC
+nrql:
+  query: "FROM Metric SELECT sum(azure.machinelearningservices.workspaces.ModelDeployFailed) AS 'ModelDeployFailed'"
+
+# Function used to aggregate the NRQL query value(s) for comparison to the terms.threshold (Default: SINGLE_VALUE)
+valueFunction: SINGLE_VALUE
+
+# List of Critical and Warning thresholds for the condition
+terms:
+  - priority: CRITICAL
+    # Operator used to compare against the threshold.
+    operator: ABOVE
+    # Value that triggers a violation
+    threshold: 20
+    # Time in seconds; 120 - 3600
+    thresholdDuration: 600
+    # How many data points must be in violation for the duration
+    thresholdOccurrences: ALL
+
+  # Adding a Warning threshold is optional
+  - priority: WARNING
+    operator: ABOVE
+    threshold: 10
+    thresholdDuration: 600
+    thresholdOccurrences: ALL
+
+# Duration after which a violation automatically closes
+# Time in seconds; 300 - 2592000 (Default: 86400 [1 day])
+violationTimeLimitSeconds: 86400

alert-policies/f5/f5-node-offline.yml

+19
@@ -0,0 +1,19 @@
+name: F5 Node Offline
+description: |+
+  This alert fires when an F5 Node has an availability state = 'offline' for at least 10 minutes.
+type: STATIC
+nrql:
+  query: "FROM F5BigIpNodeSample SELECT count(*) FACET reportingEndpoint, displayName WHERE node.availabilityState = 0"
+valueFunction: SINGLE_VALUE
+terms:
+  - priority: CRITICAL
+    operator: ABOVE
+    threshold: 0
+    thresholdDuration: 600
+    thresholdOccurrences: ALL
+signal:
+  aggregationDelay: 120
+  aggregationMethod: EVENT_FLOW
+  aggregationWindow: 60
+
+violationTimeLimitSeconds: 259200

@@ -0,0 +1,19 @@
+name: F5 Pool Member Offline
+description: |+
+  This alert fires when an F5 Pool Member has an availability state = 'offline' for at least 10 minutes.
+type: STATIC
+nrql:
+  query: "FROM F5BigIpPoolMemberSample SELECT count(*) FACET aparse(url, '%//*'), poolName, displayName WHERE member.availabilityState = 0"
+valueFunction: SINGLE_VALUE
+terms:
+  - priority: CRITICAL
+    operator: ABOVE
+    threshold: 0
+    thresholdDuration: 600
+    thresholdOccurrences: ALL
+signal:
+  aggregationDelay: 120
+  aggregationMethod: EVENT_FLOW
+  aggregationWindow: 60
+
+violationTimeLimitSeconds: 259200

alert-policies/f5/f5-pool-offline.yml

+19
@@ -0,0 +1,19 @@
+name: F5 Pool Offline
+description: |+
+  This alert fires when an F5 Pool has an availability state = 'offline' for at least 10 minutes.
+type: STATIC
+nrql:
+  query: "FROM F5BigIpPoolSample SELECT count(*) FACET reportingEndpoint, displayName WHERE pool.availabilityState = 0"
+valueFunction: SINGLE_VALUE
+terms:
+  - priority: CRITICAL
+    operator: ABOVE
+    threshold: 0
+    thresholdDuration: 600
+    thresholdOccurrences: ALL
+signal:
+  aggregationDelay: 120
+  aggregationMethod: EVENT_FLOW
+  aggregationWindow: 60
+
+violationTimeLimitSeconds: 259200

@@ -0,0 +1,19 @@
+name: F5 Virtual Server Offline
+description: |+
+  This alert fires when an F5 Virtual Server has an availability state = 'offline' for at least 10 minutes.
+type: STATIC
+nrql:
+  query: "FROM F5BigIpVirtualServerSample SELECT count(*) FACET reportingEndpoint, displayName WHERE virtualserver.availabilityState = 0"
+valueFunction: SINGLE_VALUE
+terms:
+  - priority: CRITICAL
+    operator: ABOVE
+    threshold: 0
+    thresholdDuration: 600
+    thresholdOccurrences: ALL
+signal:
+  aggregationDelay: 120
+  aggregationMethod: EVENT_FLOW
+  aggregationWindow: 60
+
+violationTimeLimitSeconds: 259200
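
The four F5 conditions share the same `terms` and `signal` settings. As a reading aid, the fragment below repeats those settings with annotations. The values are copied from the F5 files in this commit; the comments are an interpretation of how New Relic streaming alerts are generally understood to treat these fields, not text from the commit, and should be checked against the alert condition documentation.

```yaml
# Annotated copy of the settings shared by the F5 conditions.
# Values come from the commit; comments are an interpretation (assumption).
terms:
  - priority: CRITICAL
    operator: ABOVE
    threshold: 0                 # count(*) > 0: at least one offline sample in a window
    thresholdDuration: 600       # seconds; 600 / 60 = 10 consecutive aggregation windows
    thresholdOccurrences: ALL    # every window within the duration must breach
signal:
  aggregationWindow: 60          # seconds of data grouped into one evaluated data point
  aggregationMethod: EVENT_FLOW  # windows close based on arriving event timestamps, not a wall-clock timer
  aggregationDelay: 120          # seconds to wait for late-arriving data before a window is evaluated
```

Read this way, the "for at least 10 minutes" in each F5 description corresponds to ten 60-second windows that all contain at least one offline sample for the same facet.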

@@ -0,0 +1,32 @@
+# Name of the alert
+name: Image Status
+
+# Description and details
+description: |+
+  This alert is triggered when the image status is inactive for 5 minutes.
+# Type of alert
+type: STATIC
+
+# NRQL query
+nrql:
+
+  query: "SELECT count(*) FROM OSImageSample where openstack.glance.image.status != 'active'"
+
+# Function used to aggregate the NRQL query value(s) for comparison to the terms.threshold (Default: SINGLE_VALUE)
+valueFunction: SINGLE_VALUE
+
+# List of Critical and Warning thresholds for the condition
+terms:
+  - priority: CRITICAL
+    # Operator used to compare against the threshold.
+    operator: ABOVE
+    # Value that triggers a violation
+    threshold: 1
+    # Time in seconds; 120 - 3600
+    thresholdDuration: 300
+    # How many data points must be in violation for the duration
+    thresholdOccurrences: ALL
+
+# Duration after which a violation automatically closes
+# Time in seconds; 300 - 2592000 (Default: 86400 [1 day])
+violationTimeLimitSeconds: 86400
