Why monitor NVIDIA DCGM?
Monitoring NVIDIA DCGM is essential for maintaining the health and efficiency of the GPU infrastructure in your data center. It supports performance optimization, fault detection, resource management, and energy efficiency, and it aids troubleshooting, security, and compliance.
Comprehensive monitoring quickstart for NVIDIA DCGM
New Relic provides comprehensive monitoring of the GPU infrastructure in your data center. This setup lets you monitor GPU performance and health while using New Relic for data visualization, alerting, and analysis.
What’s included in this quickstart?
The New Relic NVIDIA DCGM monitoring quickstart provides out-of-the-box reporting:
- Dashboards (power usage, GPU utilisation, clocks, etc.)
- Alerts for NVIDIA DCGM (GPU temperature, Xid errors)
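
To make the alert categories above more concrete, here is a minimal Python sketch of the kind of check a GPU temperature or Xid alert performs. It is an illustration only, not the quickstart's actual alert definition. It assumes the metrics arrive in Prometheus text format under the field names used by NVIDIA's dcgm-exporter (DCGM_FI_DEV_GPU_TEMP, DCGM_FI_DEV_XID_ERRORS). The sample values, the 85 C threshold, and the helper names parse_samples and find_alerts are hypothetical.

```python
import re

# Made-up sample in Prometheus text format; real output comes from your exporter.
SAMPLE = """\
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0"} 62
DCGM_FI_DEV_GPU_TEMP{gpu="1"} 91
# TYPE DCGM_FI_DEV_XID_ERRORS gauge
DCGM_FI_DEV_XID_ERRORS{gpu="0"} 0
DCGM_FI_DEV_XID_ERRORS{gpu="1"} 79
"""

# Matches lines of the form: name{labels} value
METRIC_LINE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')
GPU_LABEL = re.compile(r'gpu="([^"]*)"')

TEMP_LIMIT_C = 85.0  # hypothetical threshold; tune for your hardware


def parse_samples(text):
    """Return (metric_name, gpu_id, value) tuples, skipping comments and blanks."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = METRIC_LINE.match(line)
        if match is None:
            continue
        name, labels, value = match.groups()
        gpu = GPU_LABEL.search(labels)
        samples.append((name, gpu.group(1) if gpu else "?", float(value)))
    return samples


def find_alerts(samples, temp_limit=TEMP_LIMIT_C):
    """Flag GPUs above the temperature limit or reporting a nonzero Xid code."""
    alerts = []
    for name, gpu, value in samples:
        if name == "DCGM_FI_DEV_GPU_TEMP" and value > temp_limit:
            alerts.append(f"GPU {gpu}: temperature {value:.0f} C exceeds {temp_limit:.0f} C")
        elif name == "DCGM_FI_DEV_XID_ERRORS" and value != 0:
            alerts.append(f"GPU {gpu}: Xid error {int(value)} reported")
    return alerts


if __name__ == "__main__":
    for alert in find_alerts(parse_samples(SAMPLE)):
        print(alert)
```

In practice the quickstart's alerts evaluate these conditions inside New Relic, so a script like this is useful mainly for understanding what the conditions test or for checking exporter output by hand.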