NVIDIA vGPU Monitoring using vROps

NVIDIA virtual GPU (vGPU) software enables every virtual machine to get the benefits of a GPU, just like a physical desktop. Because work typically performed by the CPU is offloaded to the GPU, users get a much better experience and more users can be supported per server.

NVIDIA vGPU is the only GPU virtualization solution that provides end-to-end management and monitoring for real-time insight into GPU performance. To monitor it in vROps, NVIDIA provides a dedicated management pack: the NVIDIA Virtual GPU Management Pack for vRealize Operations.

NVIDIA Virtual GPU Management Pack for VMware vRealize Operations enables you to use a VMware vRealize Operations cluster to monitor the performance of NVIDIA physical GPUs and virtual GPUs.

The Management Pack provides physical and virtual views with powerful insights across your entire GPU-enabled virtual infrastructure. The NVIDIA management pack is standalone software that uses the VMware vROps framework for data collection and data visualization.

In this post, we will take a high-level look at how to configure and use the management pack, along with its metrics and their usage in a production environment.

To use the management pack, first install it into vROps. Installation is the same as for any other management pack.

Once installed, add an account for the NVIDIA Adapter and enter the required parameters to configure the adapter.

You just need to integrate it with the vCenter Server that contains your GPU-enabled clusters.

Note: To collect data from hosts in VMware vCenter that are running NVIDIA GPUs and the NVIDIA vGPU Manager, the account used for integration of the NVIDIA vGPU adapter requires the CIM interaction privilege.

Once configured, wait for the status to turn green. At this point, GPU metrics will start flowing in.

Once data starts flowing in, and you have GPU-enabled clusters and VMs with vGPUs assigned and powered on, four new object types are created.

The GPU object type appears when you have GPU-enabled clusters, and the vGPU object types appear when you have VMs using vGPUs.

The GPU object types created have a direct association with the parent vCenter adapter resources; however, you won't find any GPU metrics at the host, VM, or cluster level.

You will have to write a supermetric that fetches the value from the child GPU object and assigns it to the parent vCenter object resource.
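As an illustrative sketch, a supermetric that rolls the GPU utilization of the child GPU objects up to the parent host object might look like the formula below. The adapter kind and attribute key shown here are assumptions; check the exact names in your environment under the metric picker when building the supermetric:

```
max(${adapterkind=NVIDIA, resourcekind=GPU, attribute=Utilization|GPU Utilization, depth=1})
```

After saving the supermetric, enable it in the active policy and assign it to the Host System object type so the rolled-up value appears at the host level.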

Also, if you are using an allocation model and want to check the number of vGPUs assigned to (or remaining for) the VMs in a GPU-enabled cluster, note that the management pack has an issue: it is unable to report vGPUs assigned to powered-off VMs. I have already raised this issue with NVIDIA and hope it is resolved in a future release of the management pack.

For now, we can work around this by creating custom groups and assigning custom properties to vGPU-enabled VMs.
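To sketch the idea of counting assigned and remaining vGPUs including powered-off VMs: the VM inventory below is made up, and in practice it would be gathered from vCenter (for example with PowerCLI or pyVmomi, reading each VM's configured vGPU profile from its shared PCI device, which is present in the VM configuration even when the VM is powered off). The per-profile capacities are also assumptions and depend on your GPU boards and profile sizes:

```python
from collections import Counter

# Hypothetical inventory pulled from vCenter: every VM's configured vGPU
# profile, regardless of power state (the profile lives in the VM's
# configuration, so powered-off VMs are counted too).
vms = [
    {"name": "vdi-01", "profile": "grid_p40-1q", "powered_on": True},
    {"name": "vdi-02", "profile": "grid_p40-1q", "powered_on": False},
    {"name": "vdi-03", "profile": "grid_p40-2q", "powered_on": True},
]

# Assumed total vGPU slots per profile across the cluster's GPU boards.
# Real capacity depends on board type and profile mixing rules.
capacity = {"grid_p40-1q": 24, "grid_p40-2q": 12}

def vgpu_allocation(vms, capacity):
    """Return assigned/remaining vGPU counts per profile,
    counting powered-off VMs as well."""
    assigned = Counter(vm["profile"] for vm in vms)
    return {profile: {"assigned": assigned.get(profile, 0),
                      "remaining": total - assigned.get(profile, 0)}
            for profile, total in capacity.items()}

print(vgpu_allocation(vms, capacity))
```

The computed values can then be pushed back to vROps as custom properties on the custom group, so the allocation picture stays accurate even while VMs are powered off.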

The management pack also ships with 5 out-of-the-box (OOB) dashboards for NVIDIA:

  • NVIDIA Environment Overview
  • NVIDIA Host Summary
  • NVIDIA GPU Summary
  • NVIDIA vGPU Summary
  • NVIDIA Application Summary


Happy Learning!
