NVIDIA vGPU with A16 #10076

meisenst-dnd · 2024-12-10T12:48:51Z

meisenst-dnd
Dec 10, 2024

Hi all,

I'm trying to enable NVIDIA vGPU with an A16 card, which is essentially multiple A2 chips on a single board (and, therefore, driver-compatible with the A2). With SR-IOV enabled, you can use up to 68 individual profiles on the card at the same time.

CloudStack has built-in support for the A2, but it doesn't seem to recognize the A16 as multiple A2s. I've set the offering to be deployed with the A2-1B profile; however, the virtual devices list A16 profiles instead:

root@[redacted]:/sys/class/mdev_bus/000:ce:00.4/mdev_support_types/pci-709# cat name
NVIDIA A16-1B
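
To see every profile name a host advertises rather than checking one directory at a time, something like the following Python sketch could be used. It assumes the standard kernel sysfs layout for mediated devices (`/sys/class/mdev_bus/<device>/mdev_supported_types/<type>/name`, which is how the kernel documentation spells the directory). The function name `list_mdev_types` is made up for illustration and is not part of CloudStack or the NVIDIA tooling.

```python
from pathlib import Path


def list_mdev_types(root="/sys/class/mdev_bus"):
    """Return (device, type_id, name) tuples for every mdev type found under root.

    Assumes the usual kernel layout:
    <root>/<pci-address>/mdev_supported_types/<type-id>/name
    """
    results = []
    base = Path(root)
    if not base.is_dir():
        # No mdev-capable devices (or vGPU host driver not loaded).
        return results
    for device in sorted(base.iterdir()):
        types_dir = device / "mdev_supported_types"
        if not types_dir.is_dir():
            continue
        for type_dir in sorted(types_dir.iterdir()):
            name_file = type_dir / "name"
            if name_file.is_file():
                results.append(
                    (device.name, type_dir.name, name_file.read_text().strip())
                )
    return results


if __name__ == "__main__":
    for device, type_id, name in list_mdev_types():
        print(device, type_id, name)
```

On a host like the one above, this would be expected to list A16 profile names rather than A2 ones, which matches what CloudStack fails to find.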

The management log confirms that CloudStack doesn't see the appropriate card present in any of the hosts (there are 6 hosts, all with an A16 in them):

2024-12-10 12:39:27,184 DEBUG [c.c.a.m.a.i.FirstFitRoutingAllocator] (API-Job-Executor-1:[ctx-3d64c55e, job-882, ctx-e8fb6f86, FirstFitRoutingAllocator]) (logid:4e08e7e3) Adding host [{"name":"[Redacted]","uuid":"13ac1a87-d02a-4356-a4fa-584804de1849"}] to avoid set, because this host does not have required GPU devices available.

The devices are listed in lspci as such:

d4:02.3 3D controller [0302]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
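
The line above shows that the A2 and A16 share one description string and one PCI ID. As a rough illustration, the Python sketch below pulls the vendor and device IDs out of an `lspci -nn` style line. The helper name `parse_lspci_line` and the regular expression are my own, and whether CloudStack matches GPUs by ID or by name is not something this thread confirms.

```python
import re

# Matches lines in the `lspci -nn` format shown above:
# <slot> <class name> [<class id>]: <description> [<vendor>:<device>] (rev xx)
LSPCI_RE = re.compile(
    r"^(?P<slot>\S+)\s+(?P<class_name>.+?)\s+\[(?P<class_id>[0-9a-f]{4})\]:\s+"
    r"(?P<description>.+?)\s+\[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\]"
)


def parse_lspci_line(line):
    """Return a dict of fields from one `lspci -nn` line, or None if it does not match."""
    match = LSPCI_RE.match(line.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = (
        "d4:02.3 3D controller [0302]: NVIDIA Corporation GA107GL "
        "[A2 / A16] [10de:25b6] (rev a1)"
    )
    info = parse_lspci_line(sample)
    print(info["vendor"], info["device"], info["description"])
```

For the sample line this yields vendor `10de` and device `25b6`, with the description still reading `GA107GL [A2 / A16]`, so the PCI ID alone does not tell the two cards apart.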

Has anyone successfully used an A16 with CloudStack? If not, can support for the A16 be added?

Thanks,
Mark

rajujith · 2024-12-11T04:50:25Z

rajujith
Dec 11, 2024
Collaborator

@meisenst-dnd Although I haven't used this specific GPU card, I just want to give you a heads-up that the supported GPUs you see in the compute offerings work only with XenServer. I assume you are using XenServer. I do see 'NVIDIA RTX A2' in the list of supported GPUs. Can the hypervisor report A16 as 2 x A2 to CloudStack? Can you check the table below to see if the GPU is discovered?

select * from host_gpu_groups

Can you run a force reconnect on the XenServer host from CloudStack and check the management server logs to see if the GPUs are being discovered? You can find them in the log with a line matching 'Startup request from directly connected host'.
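
If it helps, here is a small Python sketch for pulling the relevant lines out of the management server log after the reconnect. It looks for the 'Startup request from directly connected host' text mentioned above and for the allocator message quoted earlier in the thread. The log path is an assumption (the usual default location on a package install), and `find_gpu_log_lines` is just an illustrative name.

```python
# Substrings taken from this thread; adjust if your log wording differs.
DEFAULT_MARKERS = (
    "Startup request from directly connected host",
    "does not have required GPU devices available",
)

# Assumed default location of the management server log; verify on your system.
DEFAULT_LOG = "/var/log/cloudstack/management/management-server.log"


def find_gpu_log_lines(lines, markers=DEFAULT_MARKERS):
    """Return the lines (without trailing newline) that contain any marker."""
    return [line.rstrip("\n") for line in lines if any(m in line for m in markers)]


if __name__ == "__main__":
    with open(DEFAULT_LOG, encoding="utf-8", errors="replace") as handle:
        for hit in find_gpu_log_lines(handle):
            print(hit)
```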

3 replies

meisenst-dnd Dec 11, 2024
Author

> @meisenst-dnd Although I haven't used this specific GPU card, I just want to give you a heads-up that the supported GPUs you see in the compute offerings work only with XenServer. I assume you are using XenServer. I do see 'NVIDIA RTX A2' in the list of supported GPUs. Can the hypervisor report A16 as 2 x A2 to CloudStack? Can you check the table below to see if the GPU is discovered?

They only work with XenServer? This isn't mentioned anywhere, and is going to impact what I'm trying to do in a huge way (we are running KVM).

In fact, why is this option even showing up in a zone that is exclusively set up for KVM in the first place, if this is the case? Seems counterproductive, no?

In any case -- if the OS is reporting A16 (which it is), CloudStack would have to inherently understand that this is, in fact, the same as an A2. From a driver perspective, a virtual interface of an A16 is the same thing as an A2. This is something that would have to be changed in code.

If this is, indeed, a XenServer-only function, I will have to find another way to do vGPU, so this may all very well be a moot point.

In any case, there is nothing in host_gpu_groups, which I would expect if this is a XenServer-only function, unless this is supposed to be queried through libvirt. Is there something I missed, or should these be detected if they are available through vfio (assuming that anything not specifically in the code can be detected at all, because the A16 isn't mentioned anywhere as far as I know)?

zap51 Dec 12, 2024

> They only work with XenServer? This isn't mentioned anywhere, and is going to impact what I'm trying to do in a huge way (we are running KVM).

> In fact, why is this option even showing up in a zone that is exclusively set up for KVM in the first place, if this is the case? Seems counterproductive, no?

Maybe the docs need some improvements.

I've not tested with vGPUs, but take a look at this ML to see if that helps in your case.

Regards,
Jayanth

rajujith Dec 12, 2024
Collaborator

@meisenst-dnd GPU support as a first-class feature is a backlog item for CloudStack. A talk was given at CCC in Madrid last month regarding some of the issues and a feature proposal. The recording will be available on the Apache CloudStack YouTube channel soon.

In the meantime, you can try some workarounds to enable GPU with CloudStack and KVM. I am sharing my notes, which should help, but I haven't tested these steps.

https://gist.github.com/rajujith/4cc3f17379b63e86f73b041f2be75528
https://gist.github.com/rajujith/f3b3854ed77f2cab8dc4fb5e3ee260c4

References:
https://lab.piszki.pl/cloudstack-kvm-and-running-vm-with-vgpu/
https://www.shapeblue.com/cloudstack-feature-first-look-enable-sending-of-arbitrary-configuration-data-to-vms/
#3839

Answer selected by meisenst-dnd

btzq · 2024-12-12T08:01:43Z

btzq
Dec 12, 2024

Don't we need an enterprise license with NVIDIA to use their vGPU? 🤔

We managed to get GPU working via passthrough with Ubuntu as the hypervisor and NVIDIA L4s. However, this is done manually by the admin, outside of CloudStack.

The admin has to manually assign the GPU to the specified guest VM. After that it works, but the GPU, VM, and host are then effectively married together. We can't shut the VM down and restart it on another hypervisor with the same GPU, because the GPU serial number is not the same.

meisenst-dnd · 2024-12-12T12:19:59Z

meisenst-dnd
Dec 12, 2024
Author

Yes, licenses are required. We have them as we previously used this functionality with VMware.

I missed the bit in the docs where XenServer is specifically mentioned. That's on me.

Thanks, folks. I will look for another way.
