What's included?
dashboards
1
NVIDIA DCGM quickstart contains 1 dashboard. These interactive visualizations let you easily explore your data, understand context, and resolve problems faster.
NVIDIA-DCGM
alerts
2
NVIDIA DCGM observability quickstart contains 2 alerts. These alerts detect changes in key performance metrics. Integrate these alerts with your favorite tools (like Slack, PagerDuty, etc.) and New Relic will let you know when something needs your attention.
High GPU Temperature
This alert is triggered when the NVIDIA GPU temperature is above 90%.
XID Error
This alert is triggered when the XID error metric stays higher than 3 for 5 minutes.
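The two alert conditions above follow a common "value above a threshold, optionally sustained for a window" pattern. The sketch below illustrates that pattern only; it is not New Relic's alert engine. The names `ThresholdRule` and `is_triggered` are hypothetical, and the thresholds (90 and 3 for 5 minutes) are copied from the descriptions above without assuming their units.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

# A sample is (timestamp in seconds, metric value). Hypothetical shape for illustration.
Sample = Tuple[float, float]


@dataclass(frozen=True)
class ThresholdRule:
    """Hypothetical rule: fire when the value stays above `threshold` for `duration_s`."""
    name: str
    threshold: float
    duration_s: float  # 0 means "evaluate only samples taken exactly at `now`"


def is_triggered(rule: ThresholdRule, samples: Sequence[Sample], now: float) -> bool:
    """Return True if the trailing window has data and every value exceeds the threshold."""
    window = [value for ts, value in samples
              if now - rule.duration_s <= ts <= now]
    # An empty window never fires: no data is not treated as a violation.
    return bool(window) and all(value > rule.threshold for value in window)


# Assumed encodings of the two alerts described above.
HIGH_GPU_TEMPERATURE = ThresholdRule("High GPU Temperature", threshold=90, duration_s=0)
XID_ERROR = ThresholdRule("XID Error", threshold=3, duration_s=5 * 60)
```

For example, `is_triggered(XID_ERROR, samples, now)` returns `True` only when every sample from the last five minutes is above 3; a single lower reading inside the window prevents the alert from firing.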
documentation
1
NVIDIA DCGM observability quickstart contains 1 documentation reference. This is how you'll get your data into New Relic.
Why monitor NVIDIA DCGM?
Monitoring NVIDIA DCGM is essential for maintaining the health and efficiency of the GPU infrastructure in your data center. It supports performance optimization, fault detection, resource management, and energy efficiency, and it aids troubleshooting, security, and compliance.
Comprehensive monitoring quickstart for NVIDIA DCGM
New Relic provides comprehensive monitoring of the GPU infrastructure in your data center. This setup lets you monitor GPU performance and health while using New Relic for data visualization, alerting, and analysis.
What’s included in this quickstart?
New Relic NVIDIA DCGM monitoring quickstart provides quality out-of-the-box reporting:
- Dashboards (power usage, GPU utilisation, clocks, etc.)
- Alerts for NVIDIA DCGM (GPU temperature, Xid error)