Nvidia vGPU Passthrough from ESXi to a Citrix CVAD EUC Environment

by Ray Davis, CTA

Hello, I wanted to share my experience setting up NVIDIA vGPU passthrough from ESXi to a Citrix CVAD EUC environment. This was my first experience with it, and we had help from Patrick Coble, Rody Kossen, and my work colleague and Citrix Engineer Mike Niegos. We worked on getting a three-node cluster ready for vGPU for 7 users running GPU-intensive, CUDA-heavy applications for the Data Science team. This guide covers the main steps of the deployment process with ESXi, NVIDIA, and Citrix CVAD 1912 CU1. It is a collection of screenshots and commands used to prepare the ESXi hosts, along with the settings and steps needed for vGPU to work with the following cards:

New – V100S
5120 CUDA Cores
640 Tensor Cores
32GB RAM

Current – 2080TI
4352 CUDA Cores
11GB RAM

Prerequisites 

1. Ensure the ESXi host has a GPU installed. (Hardware -> Graphics)

2. A login for the NVIDIA Licensing Portal. (The POC for the purchase should have received an email to set up this and other accounts.)

3. NVIDIA License Entitlement

  1. GRID license (there are many types and levels based on how much GPU each user requires; licensing is counted per CCU/VM depending on your allocation type)
  2. SUMS (Support, Updates, and Maintenance Subscription). This is what entitles you to support along with updates to the NVIDIA software.
  3. https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/solutions/resources/documents1/Virtual-GPU-Packaging-and-Licensing-Guide.pdf
  4. https://docs.nvidia.com/grid/11.0/grid-vgpu-user-guide/index.html#vgpu-types-tesla-v100-pcie-32gb (Deployed GPU)


4. A VM for the Licensing Server Role

  1. 1-2 vCPUs and 2-4 GB of RAM are recommended when starting out with fewer than 50-100 machines.
  2. Java Installed OpenJDK or Java SE JRE
  3. https://docs.nvidia.com/grid/ls/latest/grid-license-server-user-guide/index.html


5. Download the Driver for the VM

  1. Download the GRID software pack, which is a ZIP file containing multiple components (the host VIB and the guest OS drivers).

6. Download the VIB for your specific version of ESX

  1. https://kb.vmware.com/s/article/2143832  (ESXi Build-to-Version Lookup)
  2. https://www.vmware.com/resources/compatibility/detail.php?deviceCategory=sptg&productid=43793
  3. https://nvid.nvidia.com

7. Host BIOS Settings

  a. Dell vXRail  https://infohub.delltechnologies.com/l/vdi-deployment-guide-vmware-horizon-7-for-vxrail-and-vsan-ready-nodes-1/installing-and-configuring-nvidia-gpu
  b. Boot the host and press “F2” when prompted. You will have around 5-10 seconds before the OS starts loading.

c. Select “System BIOS”:

d. Select “Integrated Devices”:

e. Ensure SR-IOV Global Enable is Enabled (Should be by Default):

f. Change the Memory Mapped I/O Base to 12TB (56TB is Default):

g. Select Back and Finish to Reboot the System



8. Enable SSH and ESXi Shell

  a. Select each item and hit Start.

b. Depending on your policies for your hosts, you will have to set this after each reboot.


NVIDIA License Server Install

1. Prerequisites

2. Install Java (Download Latest JRE or OpenJDK)

  a. Select “Install”

b. Install Progress

3. Set JAVA_HOME Path

1. Open up the System Info using the About or Control panel, then select “Environment Variables.”

2. Enter “JAVA_HOME” as the variable name.

3. Browse to the JRE installation folder to set the variable value.
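As an alternative to the GUI, you can set the variable from an elevated command prompt. This is only a sketch; the JRE path below is an example, so point it at wherever your JRE or OpenJDK actually installed:

setx /M JAVA_HOME "C:\Program Files\Java\jre1.8.0_281"

Open a new command prompt and run echo %JAVA_HOME% to confirm the value is set.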

4. NVIDIA Licensing Download

  a. Log in to the NVIDIA Portal and go to Downloads.
  b. Select Download on any version.

c. Select the Latest License Server Version.

d. Copy the installer to the License Server.

e. Ensure Java is Installed, or you will receive this error.

f. The error you will receive if you haven’t set JAVA_HOME

g. Installer Loading:

h. Select “Next”

i. Select “I accept the terms…” and Select “Next”

j. Select “I accept the terms…” and Select “Next”

k. Select “Next”

l. Select “I accept the terms…” and Select “Next”

m. Select “Install”

n. Install Progress

o. Install Progress

p. Install Progress

Configuring the NVIDIA License Server

1.  We recommend creating a shortcut in the Public Start Menu so that there is a link to this management console. There is no shortcut by default, and it can be hard for others to identify the server role.

2.  http://localhost:8080/licserver 

3.  Check the MAC address of the server (it will be needed in the NVIDIA Portal for license allocation). A couple of quick ways to find it are shown below.
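These are standard Windows commands you can run on the license server itself; note which NIC you record, since the license will be bound to that MAC:

ipconfig /all
getmac /v /fo list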

4.  Enter the Hostname (Case Sensitive) and enter a Description, then enter the MAC Address. If you have another license server, you will want to enter its Hostname and MAC Address too.

5.  Select the License Type and enter the Quantity, and Select “Add”

6.  Repeat the process above for each license type that is needed on this license server.

7.  Select “Create License Server”

8.  You may see this error if your portal login expired before the allocation completed.

9.  Select the new License Server.

10.  Select the “Download” button.

11.  License File Downloaded

Applying the License File on the NVIDIA License Server

1.  Log into the License Server.

2.  Select “License Management” and Select “Browse,” then select “Upload”

3.  Licenses successfully applied.

Deployment Process – ESX Host

1.  Place the host in Maintenance Mode so that no VMs are running and you will be able to reboot it.

2.  Enable SSH and Shell on the Host

3.  Use WinSCP (or SCP) to copy the VIB file to a datastore.

4.  Install VIB


a. esxcli software vib install -v /vmfs/volumes/5ef5d45f-8d3669c0-32dd-e4434be284e0/NVIDIA-VMware-450.80-1OEM.670.0.0.8169922.x86_64.vib


If you make a mistake, run the command below to remove the VIB:

esxcli software vib remove --vibname=NVIDIA-kepler-VMware_ESXi_6.0_Host_Drive

https://kb.vmware.com/s/article/2033434 (NVIDIA VIB Install Overview)

https://kb.vmware.com/s/article/2075500 (VIB Install Namespace Errors)


b. Validate the VIB Install

vmkload_mod -l | grep nvidia 

https://docs.vmware.com/en/VMware-Horizon-7/7.1/com.vmware.horizon-view.linuxdesktops.doc/GUID-AA333E98-0AA4-419B-8676-8B2C6F89CAF7.html (VIB Install)

https://www.dell.com/support/kbdoc/en-us/541606 Problem Installing

https://kb.vmware.com/s/article/2075500 (VIB Install Namespace Error)
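Another quick check we can use here (the exact VIB name in the output will vary with the driver version you installed) is to confirm the VIB itself is registered on the host:

esxcli software vib list | grep -i NVIDIA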

c. Ensure nvidia-smi launches.

d. Ensure the GPU is in vGPU mode (it should be by default).

5. Disable ECC on the GPU's memory (determine whether this is needed for your host/vendor).

a. https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html

b. https://infohub.delltechnologies.com/l/vdi-deployment-guide-vmware-horizon-7-for-vxrail-and-vsan-ready-nodes-1/installing-and-configuring-nvidia-gpu 

nvidia-smi -e 0

nvidia-smi -q | grep -i ECC
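On hosts with more than one GPU, the standard nvidia-smi options also let you check and change ECC per device (a sketch; GPU index 0 is only an example):

nvidia-smi --query-gpu=index,name,ecc.mode.current --format=csv
nvidia-smi -i 0 -e 0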

6. Reboot the Host.

a. Upon rebooting the host, you should see the nvidia-init service starting during boot.

b. Post Reboot verification that the vGPU ECC is disabled.

c. nvidia-smi -q | grep -i ECC

Deployment Process ESX Host – ESX PCI Settings

1.  Ensure there are no Passthrough devices under Hardware -> PCI Devices

2.  You can select “All PCI Devices” and see that the card shows, but without the VIB you will not see the model number or status, only the PCI address.

3.  Check the device under Hardware -> Graphics Devices; the card shows, but without the VIB you will not see its name.

4.  Change the graphics type to Shared Direct (the default is Shared). This is a known issue with VxRail; see the Dell KB below and the command-line alternative after this list.

5.  https://www.dell.com/support/kbdoc/en-us/529445

6.  Post VIB Install
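If the vSphere client option is affected by the VxRail issue above, the same setting can also be changed from the ESXi shell. This is a sketch using the standard esxcli graphics namespace; a reboot (or at least an Xorg restart) is needed for the change to take effect:

esxcli graphics host get
esxcli graphics host set --default-type SharedPassthru
/etc/init.d/xorg restart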

Deployment Process – vCenter Cluster

1.  Starting with release 6.7 U1, vMotion with vGPU and suspend and resume with vGPU are supported on suitable GPUs as listed in Supported NVIDIA GPUs and Validated Server Platforms.

2.  With GPUs allocated to VMs, it is important to adjust the DRS settings so that the VMs are not vMotioned while they are running, as this can impact the users' GPU performance. (Again, determine whether this is supported from the link above.)

3.  Set DRS to Partially Automated so that placement decisions are only made at the initial VM power-on, and the VM stays on that host until it is manually moved. This means we should monitor these hosts to ensure the vGPU workloads stay balanced based on GPU limitations and usage.

4.  Right-click the cluster and select “Properties”

5.  Select “DRS Automation” and Select the Edit Icon on the right side.

Deployment Process VM – vGPU Allocation

1.  https://docs.vmware.com/en/VMware-vSphere/6.7/com.vmware.vsphere.vm_admin.doc/GUID-C597DC2A-FE28-4243-8F40-9F8061C7A663.html (Overview vGPU Assignment Guide)

2.  Right-click the VM and select “Edit Settings.”

3.  Select “Add New Device” and Select “Shared PCI Device”

4.  NVIDIA GRID vGPU should be selected by default as a sharable PCI device.

Profile suffixes: Q = Quadro, for CAD/HPC/AI workloads; C = compute (no graphics), for AI etc.; A = vApps, for SBC workloads.

5.  Select your vGPU Profile

6.  No need to edit the video card settings once the GPU has been set as a Shared GPU.
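Once the VM is powered on with the profile attached, you can confirm the allocation from the ESXi shell (a quick sketch; this assumes the NVIDIA vGPU manager VIB is already installed on the host):

nvidia-smi
nvidia-smi vgpu

The VM should appear in the nvidia-smi process list with its assigned profile, and nvidia-smi vgpu lists the active vGPU instances on the host.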

Deployment Process VM – NVIDIA Driver

1.  Install the NVIDIA driver first on the target machine (the Citrix VDA will install HDX 3D Pro if it detects a GPU when the VDA is deployed after this install).

a. Ensure you can RDP or VNC to the machine because once this driver is installed, you will not be able to use the virtualization console anymore.

b. Make sure it is the GRID driver for the specified OS and Machine type too.

c. Extraction Progress

d.  System Check Progress

e.  Select “Agree and Continue”

f.  Select “Next”

g.  Install Progress

h.  Select “Restart Now” after verifying all three components installed successfully. Once the VM reboots, you will no longer be able to see it from the console, so RDP to it or connect to it using your broker.

i.  Below you can see the NVIDIA card along with the actual profile that is assigned to the VM (a quick command-line check is also shown below).
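A quick command-line check from inside the guest (a sketch; this is where older GRID drivers place nvidia-smi.exe, while newer drivers also put it in C:\Windows\System32, so adjust the path as needed):

"C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe" -q

The output should show the assigned vGPU profile as the product name, along with the driver version and frame buffer.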

j.  It is recommended to have HDX Monitor and/or Remote Display Analyzer to help track GPU usage, policy application, and overall session performance, and to ensure the GPU is meeting the needs of the assigned profile.

1. https://cis.citrix.com/hdx/download/ 

2. https://rdanalyzer.com/downloads/ 

k. Another way is to use a simple 3D program to verify the GPU is working.

1. https://www.lego.com/en-us/ldd 

Lego Files https://www.mybricks4u.com/brickmodels.html 

2. https://www.freecadweb.org

3. https://librecad.org 

CAD Files https://grabcad.com/library 

l. Virtual GPU Types Reference

1. https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#virtual-gpu-types-grid-reference

2. This breaks down the frame buffer for each shared vGPU profile selection available to the guest OS.

m. Now I will assign a different GPU size.

n. You can see my stats here.

o. You will see the total memory showing 4095 MB (I was told this is a bug and to check the detailed information in the NVidia Control panel).

p. This is the part where we just did some basic testing.

q. Citrix VDA install (this will be updated once I go through it); the information here is based on the collaboration and help we received during the setup.

r. Then install the VDA; HDX 3D Pro is enabled automatically (since version 1909, I believe).

1. It will see the vGPU and know how to use it, as we stated above.

2. Below are some links around the HDX 3D Pro:

3. HDX 3D deployment considerations: https://support.citrix.com/article/CTX131385


s. VDA versions 7.16 and newer

1. VDA will detect the presence of supported GPU drivers automatically at runtime and leverage GPU for graphics rendering and acceleration if available. 

2. Citrix optimizations can be enabled via Citrix policy “Optimize for 3D graphics workload.”

3. (Note: There is no special installation mode for HDX 3D Pro.)

4. Refer to product documentation for additional information: GPU acceleration for Windows single-session OS/ Multi-session OS https://docs.citrix.com/en-us/citrix-virtual-apps-desktops/1912-ltsr/graphics/hdx-3d-pro/gpu-acceleration-server.html


t. This is a great quick matrix from Rody Kossen & Patrick Coble:

Another reference to this:
Q = Quadro, for CAD/HPC/AI workloads
C = compute (no graphics), for AI etc.
A = vApps, for SBC workloads like XenApp


u. Citrix Policies

1. Policy for office workers:

2. For actively changing regions

3. Quality on high

4. Enable hardware encoding

v. CAD:

1. For the entire screen

2. Optimize for 3D workloads enabled

3. Quality: Build to Lossless

4. Allow Visually lossless: Enabled

5. Hardware encoding enabled


Another way is to use a simple 3D program to verify the GPU is working.

a. https://www.lego.com/en-us/ldd 

b. https://www.onshape.com/en/products/free 

c. CAD Files https://grabcad.com/library


Deployment Process VM – Citrix Policies

1.  These are some recommended policies to be created or added to these profiled users.

2.  Policy for office workers: 

a. For actively changing regions

b. Quality on high

c. Enable hardware encoding

3. CAD Users

a. For the entire screen

b. Optimize for 3D workloads enabled

c. Quality: Build to Lossless

d. Allow Visually lossless: Enabled

e. Hardware encoding enabled


Supporting Links

GPU Licensing Modes
https://docs.nvidia.com/grid/11.0/grid-vgpu-user-guide/index.html#vgpu-types-tesla-v100-pcie-32gb


Guide on how the GPU Scheduler will share CUDA Cores between sessions.
https://docs.nvidia.com/grid/11.0/grid-vgpu-user-guide/index.html#changing-vgpu-scheduling-policy


Overview Blog for ESX and vXRail
https://vpickle.wordpress.com/2018/08/29/nvidia-gpus-on-vxrail-vsphere-6-5/
https://infohub.delltechnologies.com/l/vdi-deployment-guide-vmware-horizon-7-for-vxrail-and-vsan-ready-nodes-1/installing-and-configuring-nvidia-gpu
https://www.dell.com/support/kbdoc/en-us/529445


How to Install NVIDIA License Server
https://docs.nvidia.com/grid/ls/latest/grid-license-server-user-guide/index.html#installing-nvidia-grid-license-server-windows
Requires Java

1. Port 7070 must be open to enable remote clients to check licenses out from the server.

2. Port 8080 should remain closed externally so that the management interface is available only through a web browser running locally on the license server host.
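If Windows Firewall is enabled on the license server VM, a rule along these lines (a sketch; adjust the rule name and scope to your standards) allows license checkouts on 7070 while leaving 8080 alone:

netsh advfirewall firewall add rule name="NVIDIA License Server 7070" dir=in action=allow protocol=TCP localport=7070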

Troubleshooting Commands

GPU Shown as having no GPU Memory
https://docs.nvidia.com/grid/latest/grid-vgpu-release-notes-vmware-vsphere/index.html#bug-2644794-vmware-vcenter-shows-no-gpu-memory


1. Stop all running VM instances on the host.

2. Stop the Xorg service.

[root@esxi:~] /etc/init.d/xorg stop


3. Stop nv-hostengine.

[root@esxi:~] nv-hostengine -t


4. Wait for 1 second to allow nv-hostengine to stop.

5. Start nv-hostengine.

[root@esxi:~] nv-hostengine -d


6. Start the Xorg service.

[root@esxi:~] /etc/init.d/xorg start

vGPU ESX 6.7 Overview
https://gridforums.nvidia.com/default/topic/9941/nvidia-virtual-gpu-drivers/esxi-6-7-tesla-v100-430-27-not-working/


HDX3DPro Overview
https://docs.citrix.com/en-us/citrix-virtual-apps-desktops/1912-ltsr/graphics/hdx-3d-pro.html


Disabling ECC Memory Guide
https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#disabling-ecc-memory


Disabling ECC Memory Discussion
https://gridforums.nvidia.com/default/topic/9941/nvidia-virtual-gpu-drivers/esxi-6-7-tesla-v100-430-27-not-working/


Good post on Xorg failing to start and where to find those log errors after installing the VIB. Run dmesg | grep NVIDIA to show the related log entries.
https://gridforums.nvidia.com/default/topic/9941/nvidia-virtual-gpu-drivers/esxi-6-7-tesla-v100-430-27-not-working/


Current Graphics Cards (V100S)
https://images.nvidia.com/content/technologies/volta/pdf/volta-v100-datasheet-update-us-1165301-r5.pdf


Existing User Graphics Cards
https://www.nvidia.com/en-us/geforce/graphics-cards/rtx-2080-ti/


Cuda
https://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/index.html

Check TCC mode (we still need to figure this out; it might not apply here).

Why you might need this:
https://docs.nvidia.com/gameworks/content/developertools/desktop/tesla_compute_cluster.htm

“The TCC (Tesla Compute Cluster) driver is a Windows driver that supports CUDA C/C++ applications. The driver enables remote desktop services and reduces the CUDA kernel launch overhead on Windows. Note that the TCC driver disables graphics on the Tesla products.”

https://subscription.packtpub.com/book/programming/9781788996242/app01/app01sec02/wddm-tcc-mode-in-windows

nvidia-smi -g {GPU_ID} -dm {0|1}   (0 = WDDM, 1 = TCC)
We can identify whether TCC mode is enabled by using nvidia-smi. The following screenshot shows the GPU operating mode in TCC: looking to the right of the GPU name in the first column, we can confirm that TCC mode is enabled.
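A quick way to see which driver model each GPU is currently using on Windows (a sketch using standard nvidia-smi query fields):

nvidia-smi --query-gpu=index,name,driver_model.current --format=csv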

Testing the CUDA

The steps below are run from the CLI on the VM:

1. To create the deviceQuery.exe file, go to the (default) directory C:\ProgramData\NVIDIA Corporation\CUDA Samples\v11.1\Utilities\deviceQuery.

2. Open the deviceQuery sample from that folder; Visual Studio will open, and you then need to build (compile) the code.

3. Once the build finishes, browse to the output location below from the CLI:

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v11.1\bin\win64\debug

4. Run the deviceQuery.exe file that the build produced.

5. Now run bandwidthTest.exe.
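Putting those steps together, the test run from a command prompt looks roughly like this (using the default sample paths from above):

cd "C:\ProgramData\NVIDIA Corporation\CUDA Samples\v11.1\bin\win64\debug"
deviceQuery.exe
bandwidthTest.exe

Both tools should finish with Result = PASS if the vGPU and CUDA runtime are working.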

Another test we did was to change the NVIDIA profile on the VM to 32Q and repeat the test to see if we could get more CUDA cores.

We were able to get more, but that profile lowered the power available to the vGPU. So, at this time, it really depends on what you need and your requirements.


Updating the Drivers on ESXi and Windows when updates are made available

1. Start by verifying that the latest driver supports our GPUs (V100) on the version of vSphere we are running (currently 6.7) by going to the link below:
https://docs.nvidia.com/grid/latest/grid-vgpu-release-notes-vmware-vsphere/index.html


2. If the release notes show the driver is not validated and supported with our system, go to the link below to view previous versions. Start with the most recent prior version and work backwards until you find a supported one. https://docs.nvidia.com/grid/


3. When you go to the NVIDIA software downloads page, you will notice that you don't see previous versions. This is because the page opens on the “Featured” tab; click the “Available” tab and you will see the other versions.

4. Download the files and move them to “\\ServerName\citrix\Software Repository\MFA Software\DataScience\GPU Drivers”

5. Once you have a driver that is verified and supported for our environment, manually migrate the VMs off the first host or power them down. Next, on the host, go to Configure > System > Services and start the ESXi Shell and SSH services.

6. Once no VMs are powered on, put the host into maintenance mode.

esxcli system maintenanceMode set --enable true
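You can confirm the host actually entered maintenance mode before continuing (it should report Enabled):

esxcli system maintenanceMode get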

7. With the host in maintenance mode, open WinSCP and copy the new VIB file to the storage location shown below for each host. This is the location I used; however, you should be able to use any location.

8. In SecureCRT run the command to verify the nvidia module is loaded.
vmkload_mod -l | grep nvidia

9. Next, run the command below to find the name of the VIB you will use to remove it.
esxcli software vib list | grep NVIDIA

10. You can now run the command to uninstall the previous VIB
esxcli software vib remove --vibname=(name from step 9)
esxcli software vib remove --vibname=NVIDIA-VMware_ESXi_6.7_Host_Driver

11. Now you can install the new VIB file.
esxcli software vib install -v (storage location for the host from step 7)/(new VIB filename)
esxcli software vib install -v /vmfs/volumes/5ef5d45f-8d3669c0-32dd-e4434be284e0/NVIDIA-VMware-418.165.01-1OEM.670.0.0.8169922.x86_64.vib

**If this step fails, please see the troubleshooting section below.

12. After the new VIB is installed, run the command from step 8 to verify the NVIDIA module is loaded.
vmkload_mod -l | grep nvidia

13. If the module is loaded, then run the command below and verify that the output shows the new driver version.
nvidia-smi

**If this step fails, please see the troubleshooting section below.

14. If the output shows the updated version, reboot the host. When it comes back up, remove it from maintenance mode, move the VMs back to it, and power them on.

15. RDP into the devices that have a GPU on the host you updated, copy the NVIDIA driver you placed in “\\Server\citrix\Software Repository\MFA Software\DataScience\GPU Drivers” to the device, and perform the “Express Installation.” Do not reset settings or remove profiles unless there is a specific reason to do so.


Troubleshooting and Known Issues
(continued from steps 11 and 13 above)

PROBLEMS UNINSTALLING A VIB FILE

If, while uninstalling a VIB, you receive the message “Cannot remove module nvidia: module symbols in use,” you can try the steps below:

  1. With the host in maintenance mode, connect to it with SecureCRT and run the commands below.
    1. /etc/init.d/xorg stop
    2. vmkload_mod -u nvidia
  2. Now reboot the host and run vmkload_mod -l | grep nvidia to verify the module is not loaded.
  3. Try to uninstall  the VIB.
  4. If the VIB uninstalls, run esxcli software vib list | grep NVIDIA and verify the previous VIB you were uninstalling is no longer listed (this is not all in alphabetical order, check the whole list).
  5. If the VIB file is removed, you can now run the command to install the new VIB.
  6. Once the new VIB is installed, run esxcfg-module -e nvidia to re-enable the NVIDIA module.
  7. Reboot the host and then reconnect and run the nvidia-smi to verify the new VIB version is updated.
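Put together, the workaround above looks roughly like this (a sketch; substitute your own datastore path and the VIB names reported by the list command):

/etc/init.d/xorg stop
vmkload_mod -u nvidia
reboot
(after the host is back up)
vmkload_mod -l | grep nvidia
esxcli software vib list | grep NVIDIA
esxcli software vib remove --vibname=(old VIB name)
esxcli software vib install -v /vmfs/volumes/(datastore)/(new VIB filename)
esxcfg-module -e nvidia
reboot
nvidia-smi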


NVML ERRORS

If you receive any NVML error messages, they are related to the VIB; please check the items below:

  1. Verify the VIB / driver version is compatible with the version of vSphere we are on and that it has been validated and is supported for the V100 video cards.
  2. If the VIB / Driver does say it is validated and supported, try to download and extract the file again and replace the previous file and try again to make sure the first file wasn’t corrupted.
  3. If you continue to have issues, find the next VIB / Driver version below the one you are trying and see if that works. Continue until you have a working version. If no working version higher than the one you are replacing can be found, contact NVIDIA or Vxrail support.

Thank you, Patrick Coble and Rody Kossen, for the quick help as well.
