NVIDIA DCGM - New AI Quickstart
RamanaReddy8801 committed Oct 31, 2023
1 parent f793b3e commit 15ecf0b
Showing 8 changed files with 350 additions and 0 deletions.
27 changes: 27 additions & 0 deletions alert-policies/nvidia-dcgm/HighTemperature.yml
@@ -0,0 +1,27 @@
name: High GPU Temperature

description: |+
  This alert is triggered when the NVIDIA GPU temperature stays above 90°C for 5 minutes.
type: STATIC
nrql:
  query: "SELECT latest(DCGM_FI_DEV_GPU_TEMP) AS 'gpu temperature' FROM Metric WHERE metricName LIKE 'DCGM_FI_DEV_GPU_TEMP'"

# Function used to aggregate the NRQL query value(s) for comparison to the terms.threshold (Default: SINGLE_VALUE)
valueFunction: SINGLE_VALUE

# List of Critical and Warning thresholds for the condition
terms:
  - priority: CRITICAL
    # Operator used to compare against the threshold.
    operator: ABOVE
    # Value that triggers a violation
    threshold: 90
    # Time in seconds; 120 - 3600
    thresholdDuration: 300
    # How many data points must be in violation for the duration
    thresholdOccurrences: ALL

# Duration after which a violation automatically closes
# Time in seconds; 300 - 2592000 (Default: 86400 [1 day])
violationTimeLimitSeconds: 86400
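As a reading aid for the terms block above, here is a minimal Python sketch of how `operator: ABOVE` combined with `thresholdOccurrences: ALL` can be understood: a violation opens only when every data point in the `thresholdDuration` window exceeds the threshold. The function name `all_points_violate` and the sample readings are hypothetical, and the one-point-per-minute spacing is an assumption about the aggregation window, not something this commit configures.

```python
from typing import Sequence


def all_points_violate(samples: Sequence[float], threshold: float) -> bool:
    """Return True when every sample in the window is strictly above threshold.

    Models operator: ABOVE together with thresholdOccurrences: ALL.
    An empty window never violates in this sketch.
    """
    return len(samples) > 0 and all(value > threshold for value in samples)


# Hypothetical readings, one per minute (assumed), covering thresholdDuration: 300.
sustained_heat = [91.0, 92.5, 93.0, 91.5, 94.0]
brief_spike = [85.0, 95.0, 96.0, 88.0, 86.0]

print(all_points_violate(sustained_heat, threshold=90))  # True
print(all_points_violate(brief_spike, threshold=90))     # False
```

Under this reading, a short spike past the threshold does not page anyone; only a sustained excursion across the full window does.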
27 changes: 27 additions & 0 deletions alert-policies/nvidia-dcgm/XidError.yml
@@ -0,0 +1,27 @@
name: XID Error

description: |+
  This alert is triggered when the latest reported XID error value stays above 3 for 5 minutes.
type: STATIC
nrql:
  query: "SELECT latest(DCGM_FI_DEV_XID_ERRORS) AS 'errors' FROM Metric WHERE metricName LIKE 'DCGM_FI_DEV_XID_ERRORS'"

# Function used to aggregate the NRQL query value(s) for comparison to the terms.threshold (Default: SINGLE_VALUE)
valueFunction: SINGLE_VALUE

# List of Critical and Warning thresholds for the condition
terms:
  - priority: CRITICAL
    # Operator used to compare against the threshold.
    operator: ABOVE
    # Value that triggers a violation
    threshold: 3
    # Time in seconds; 120 - 3600
    thresholdDuration: 300
    # How many data points must be in violation for the duration
    thresholdOccurrences: ALL

# Duration after which a violation automatically closes
# Time in seconds; 300 - 2592000 (Default: 86400 [1 day])
violationTimeLimitSeconds: 86400
246 changes: 246 additions & 0 deletions dashboards/nvidia-dcgm/nvidia-dcgm.json
@@ -0,0 +1,246 @@
{
"name": "NVIDIA",
"description": null,
"pages": [
{
"name": "Overview",
"description": null,
"widgets": [
{
"title": "",
"layout": {
"column": 1,
"row": 1,
"width": 2,
"height": 2
},
"linkedEntityGuids": null,
"visualization": {
"id": "viz.markdown"
},
"rawConfiguration": {
"text": "![NVIDIA DCGM](https://assets.nvidiagrid.net/ngc/logos/DCGM.png)"
}
},
{
"title": "GPU Temperature",
"layout": {
"column": 3,
"row": 1,
"width": 4,
"height": 3
},
"linkedEntityGuids": null,
"visualization": {
"id": "viz.area"
},
"rawConfiguration": {
"facet": {
"showOtherSeries": false
},
"legend": {
"enabled": true
},
"nrqlQueries": [
{
"accountIds": [],
"query": "SELECT latest(DCGM_FI_DEV_GPU_TEMP) AS 'gpu temperature' FROM Metric WHERE metricName LIKE 'DCGM_FI_DEV_GPU_TEMP' TIMESERIES"
}
],
"platformOptions": {
"ignoreTimeRange": false
},
"units": {
"unit": "CELSIUS"
}
}
},
{
"title": "Power usage (W)",
"layout": {
"column": 7,
"row": 1,
"width": 3,
"height": 3
},
"linkedEntityGuids": null,
"visualization": {
"id": "viz.billboard"
},
"rawConfiguration": {
"facet": {
"showOtherSeries": false
},
"nrqlQueries": [
{
"accountIds": [],
"query": "SELECT average(DCGM_FI_DEV_POWER_USAGE) AS 'usage' FROM Metric WHERE metricName LIKE 'DCGM_FI_DEV_POWER_USAGE'"
}
],
"platformOptions": {
"ignoreTimeRange": false
}
}
},
{
"title": "Total NVLink bandwidth",
"layout": {
"column": 10,
"row": 1,
"width": 3,
"height": 3
},
"linkedEntityGuids": null,
"visualization": {
"id": "viz.area"
},
"rawConfiguration": {
"facet": {
"showOtherSeries": false
},
"legend": {
"enabled": true
},
"nrqlQueries": [
{
"accountIds": [],
"query": "SELECT latest(DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL) AS 'nvlink bandwidth' FROM Metric WHERE metricName LIKE 'DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL' TIMESERIES"
}
],
"platformOptions": {
"ignoreTimeRange": false
}
}
},
{
"title": "",
"layout": {
"column": 1,
"row": 3,
"width": 2,
"height": 2
},
"linkedEntityGuids": null,
"visualization": {
"id": "viz.markdown"
},
"rawConfiguration": {
"text": "**About**\n\nInstrument your application with New Relic - [Add Data](https://one.newrelic.com).\n\nInstrument NVIDIA DCGM with New Relic using the [documentation](https://docs.newrelic.com/).\n\n[Please rate this dashboard](https://docs.google.com/forms/d/e/1FAIpQLSclR38J8WbbB2J1tHnllKUkzWZkJhf4SrJGyavpMd4t82NjnQ/viewform?usp=pp_url&entry.1615922415=nvidia-dcgm) here and let us know how we can improve it for you."
}
},
{
"title": "Clocks (MHz)",
"layout": {
"column": 3,
"row": 4,
"width": 5,
"height": 3
},
"linkedEntityGuids": null,
"visualization": {
"id": "viz.area"
},
"rawConfiguration": {
"facet": {
"showOtherSeries": false
},
"legend": {
"enabled": true
},
"nrqlQueries": [
{
"accountIds": [],
"query": "SELECT latest(DCGM_FI_DEV_MEM_CLOCK) AS 'MEM Clock', latest(DCGM_FI_DEV_SM_CLOCK) AS 'SM Clock' FROM Metric TIMESERIES"
}
],
"platformOptions": {
"ignoreTimeRange": false
}
}
},
{
"title": "Framebuffer memory (MiB)",
"layout": {
"column": 8,
"row": 4,
"width": 3,
"height": 3
},
"linkedEntityGuids": null,
"visualization": {
"id": "viz.billboard"
},
"rawConfiguration": {
"facet": {
"showOtherSeries": false
},
"nrqlQueries": [
{
"accountIds": [],
"query": "SELECT latest(DCGM_FI_DEV_FB_FREE) AS 'Free', latest(DCGM_FI_DEV_FB_USED) AS 'Used' FROM Metric"
}
],
"platformOptions": {
"ignoreTimeRange": false
}
}
},
{
"title": "XID errors",
"layout": {
"column": 11,
"row": 4,
"width": 2,
"height": 3
},
"linkedEntityGuids": null,
"visualization": {
"id": "viz.billboard"
},
"rawConfiguration": {
"facet": {
"showOtherSeries": false
},
"nrqlQueries": [
{
"accountIds": [],
"query": "SELECT latest(DCGM_FI_DEV_XID_ERRORS) AS 'errors' FROM Metric WHERE metricName like 'DCGM_FI_DEV_XID_ERRORS'"
}
],
"platformOptions": {
"ignoreTimeRange": false
}
}
},
{
"title": "GPU utilisation",
"layout": {
"column": 1,
"row": 5,
"width": 2,
"height": 2
},
"linkedEntityGuids": null,
"visualization": {
"id": "viz.billboard"
},
"rawConfiguration": {
"facet": {
"showOtherSeries": false
},
"nrqlQueries": [
{
"accountIds": [],
"query": "SELECT average(DCGM_FI_DEV_GPU_UTIL) AS 'gpu utilisation' FROM Metric WHERE metricName LIKE 'DCGM_FI_DEV_GPU_UTIL'"
}
],
"platformOptions": {
"ignoreTimeRange": false
}
}
}
]
}
],
"variables": []
}
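The nine widget layouts in this dashboard can be sanity-checked with a short script. The sketch below is in Python and assumes a 12-column grid with 1-indexed `column` and `row` values. The tuples are transcribed from the `layout` objects in the JSON; the labels "logo" and "About" are placeholders for the two untitled markdown widgets, and `layout_problems` is a hypothetical helper, not part of any New Relic tooling.

```python
from itertools import combinations

GRID_COLUMNS = 12  # assumption: dashboard grid width

# (label, column, row, width, height) transcribed from the widget layouts.
WIDGETS = [
    ("logo", 1, 1, 2, 2),
    ("GPU Temperature", 3, 1, 4, 3),
    ("Power usage", 7, 1, 3, 3),
    ("Total nvlink bandwidth", 10, 1, 3, 3),
    ("About", 1, 3, 2, 2),
    ("Clocks", 3, 4, 5, 3),
    ("Framebuffer", 8, 4, 3, 3),
    ("XID errors", 11, 4, 2, 3),
    ("GPU utilisation", 1, 5, 2, 2),
]


def cells(column, row, width, height):
    """Return the set of (column, row) grid cells a widget occupies."""
    return {
        (c, r)
        for c in range(column, column + width)
        for r in range(row, row + height)
    }


def layout_problems(widgets, grid_columns=GRID_COLUMNS):
    """List widgets that leave the grid or overlap another widget."""
    problems = []
    for label, column, row, width, height in widgets:
        if column < 1 or column + width - 1 > grid_columns:
            problems.append(f"{label} exceeds the grid")
    for (a, *layout_a), (b, *layout_b) in combinations(widgets, 2):
        if cells(*layout_a) & cells(*layout_b):
            problems.append(f"{a} overlaps {b}")
    return problems


print(layout_problems(WIDGETS))  # []
```

An empty list means every widget stays within the assumed grid and no two widgets share a cell.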
Binary file added dashboards/nvidia-dcgm/nvidia-dcgm01.png
13 changes: 13 additions & 0 deletions data-sources/nvidia-dcgm/config.yml
@@ -0,0 +1,13 @@
id: nvidia-dcgm
displayName: NVIDIA DCGM
description: |
  Monitor and analyze your NVIDIA DCGM infrastructure with New Relic.
install:
  primary:
    link:
      url: https://docs.newrelic.com/docs/infrastructure/host-integrations/host-integrations-list/nvidia-dcgm-integration/
icon: logo.png
keywords:
  - NVIDIA DCGM
  - dcgm
  - gpu
Binary file added data-sources/nvidia-dcgm/logo.png
37 changes: 37 additions & 0 deletions quickstarts/nvidia-dcgm/config.yml
@@ -0,0 +1,37 @@
slug: nvidia-dcgm
description: |
  ## Why monitor NVIDIA DCGM?
  Monitoring NVIDIA DCGM is essential for maintaining the health and efficiency of your GPU infrastructure in a data center. It helps with performance optimization, fault detection, resource management, energy efficiency, and overall data center health, while also aiding in troubleshooting, security, and compliance.
  ## Comprehensive monitoring quickstart for NVIDIA DCGM
  New Relic provides comprehensive monitoring of the GPU infrastructure in your data center. This setup lets you monitor GPU performance and health while using New Relic for data visualization, alerting, and analysis.
  ## What’s included in this quickstart?
  The New Relic NVIDIA DCGM monitoring quickstart provides quality out-of-the-box reporting:
  - Dashboards (power usage, GPU utilisation, clocks, etc.)
  - Alerts for NVIDIA DCGM (GPU temperature, XID errors)
summary: |
  Monitor and analyze your NVIDIA DCGM infrastructure with New Relic.
icon: logo.png
level: New Relic
authors:
  - New Relic
  - Ramana Reddy
title: NVIDIA DCGM
documentation:
  - name: NVIDIA DCGM integration documentation
    description: |
      Monitor and instrument your NVIDIA DCGM with New Relic to gain deep insights into your GPU performance.
    url: https://docs.newrelic.com/docs/infrastructure/host-integrations/host-integrations-list/nvidia-dcgm-integration/
keywords:
  - NVIDIA DCGM
  - dcgm
  - gpu
dataSourceIds:
  - nvidia-dcgm
dashboards:
  - nvidia-dcgm
alertPolicies:
  - nvidia-dcgm
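The `dataSourceIds`, `dashboards`, and `alertPolicies` entries in this config refer to components by ID. The Python sketch below checks that each ID has a matching directory among the eight files in this commit. The mapping from each key to a top-level directory (`data-sources`, `dashboards`, `alert-policies`) is an assumption inferred from the paths in the diff, and `missing_references` is a hypothetical helper rather than part of the repository's own validation.

```python
from pathlib import PurePosixPath

# File paths listed in this commit.
CHANGED_FILES = [
    "alert-policies/nvidia-dcgm/HighTemperature.yml",
    "alert-policies/nvidia-dcgm/XidError.yml",
    "dashboards/nvidia-dcgm/nvidia-dcgm.json",
    "dashboards/nvidia-dcgm/nvidia-dcgm01.png",
    "data-sources/nvidia-dcgm/config.yml",
    "data-sources/nvidia-dcgm/logo.png",
    "quickstarts/nvidia-dcgm/config.yml",
    "quickstarts/nvidia-dcgm/logo.png",
]

# Reference keys from quickstarts/nvidia-dcgm/config.yml, each paired with
# the top-level directory it is assumed to point into.
REFERENCES = {
    "dataSourceIds": ("data-sources", ["nvidia-dcgm"]),
    "dashboards": ("dashboards", ["nvidia-dcgm"]),
    "alertPolicies": ("alert-policies", ["nvidia-dcgm"]),
}


def missing_references(references, changed_files):
    """Return 'key: id' strings for IDs with no matching directory."""
    present = {PurePosixPath(path).parts[:2] for path in changed_files}
    missing = []
    for key, (root, ids) in references.items():
        for component_id in ids:
            if (root, component_id) not in present:
                missing.append(f"{key}: {component_id}")
    return missing


print(missing_references(REFERENCES, CHANGED_FILES))  # []
```

An empty result indicates that, under the assumed directory mapping, every component the quickstart references is present in the commit.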
Binary file added quickstarts/nvidia-dcgm/logo.png
