Assign GPUs to containers via /dev/dri/by-path/ instead #1578

Open
opened 2025-11-20 05:12:41 -05:00 by saavagebueno · 7 comments
Owner

Originally created by @danielwsky on GitHub (Jul 26, 2025).

Originally assigned to: @MickLesk on GitHub.

📌 Task summary

Refactor the way cards and render nodes are assigned to containers (and maybe VMs) through the /dev/dri/by-path/ directory

📋 Task details

I've recently been encountering issues with container initialization due to the absence of "card0" in /dev/dri. The specific error I'm encountering is:

TASK ERROR: Device /dev/dri/card0 does not exist

The root cause of this issue seems to be Proxmox's inconsistent assignment of the GPU, alternating between card0 and card1. After some troubleshooting, I discovered that the /dev/dri/by-path/ directory, which references GPUs by their PCI address, provides a more reliable solution. By adjusting the container configuration to search for the hardware location instead of relying on the card index (card0 or card1), I managed to resolve the issue.
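To see which card index each PCI address currently maps to, the entries under /dev/dri/by-path/ can be listed together with the node they resolve to. This is only a minimal diagnostic sketch: the function name `list_dri_by_path` is made up for illustration, and it assumes `readlink -f` is available (it is on a stock Debian/Proxmox host).

```shell
# List every pci-* entry in a by-path directory together with the device
# node it resolves to. The directory is a parameter (default:
# /dev/dri/by-path) so the function can also be tried on a scratch directory.
list_dri_by_path() {
    dir="${1:-/dev/dri/by-path}"
    for link in "$dir"/pci-*; do
        # Skip the unexpanded glob when nothing matches.
        [ -e "$link" ] || continue
        printf '%s -> %s\n' "${link##*/}" "$(readlink -f "$link")"
    done
}
```

Running `list_dri_by_path` on the host before and after a reboot should show whether the same PCI address moves between card0 and card1.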

Image

Given this understanding, I propose that the Proxmox VM and container creation script be improved to support GPU identification by PCI address. This would enhance reliability in configuring GPU passthrough, especially in environments with multiple GPUs.

Additionally, it would be highly beneficial if the script allowed users to select which GPU they want to use during the creation of a new VM or container. This would provide extra flexibility and control, particularly on systems equipped with several graphics units.

Please consider these suggestions to enhance the robustness and functionality of the current script:

PCI Address Identification: Change the method of GPU identification to utilize the PCI address found in /dev/dri/by-path/.
User-Selectable GPU: Include an option in the script that allows users to specify which GPU should be used when creating a VM or container.

These improvements would not only resolve the current issue but also increase the flexibility and reliability of the VM and container creation process on Proxmox.
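As a rough sketch of the two suggestions above (not taken from the actual community scripts), a creation script could enumerate the PCI addresses that have a `-card` entry and let the user pick one by number. The function names `list_gpu_pci_addrs` and `select_gpu_pci_addr` are hypothetical, and the sketch assumes entries are named `pci-<address>-card` as in the example path in this issue.

```shell
# Print the PCI addresses that have a "-card" entry in a by-path
# directory, one per line, e.g. "0000:01:00.0".
list_gpu_pci_addrs() {
    dir="${1:-/dev/dri/by-path}"
    for entry in "$dir"/pci-*-card; do
        [ -e "$entry" ] || continue
        name="${entry##*/}"      # pci-0000:01:00.0-card
        name="${name#pci-}"      # 0000:01:00.0-card
        printf '%s\n' "${name%-card}"
    done
}

# Print the N-th address (1-based) from that list; fail if the choice is
# not a positive integer or is out of range.
select_gpu_pci_addr() {
    choice="$1"
    dir="${2:-/dev/dri/by-path}"
    case "$choice" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$choice" -ge 1 ] || return 1
    addr=$(list_gpu_pci_addrs "$dir" | sed -n "${choice}p")
    [ -n "$addr" ] || return 1
    printf '%s\n' "$addr"
}
```

A script could show the output of `list_gpu_pci_addrs` as a numbered menu and pass the answer to `select_gpu_pci_addr`; how that would plug into the existing dialogs is left open here.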

Author
Owner

@webmogul1 commented on GitHub (Aug 10, 2025):

@danielwsky Are you using plex by any chance?
I can't get it to work if I use /dev/dri/by-path.

Edit: I guess I found the problem: https://github.com/tteck/Proxmox/discussions/3235#discussioncomment-9929434

Author
Owner

@danielwsky commented on GitHub (Aug 11, 2025):

> @danielwsky Are you using plex by any chance? I can't get it to work if I use /dev/dri/by-path.
>
> Edit: I guess I found the problem: tteck/Proxmox#3235 (comment)

I'm using Jellyfin and Ollama. Did you use the whole PCI address? E.g. /dev/dri/by-path/pci-0000:xx:00.0-card
You should be able to see your device's PCI address by running cd /dev/dri/by-path/ and then ls.

If you have more than one GPU, I would suggest checking lspci | grep VGA or something like that.
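To connect the two commands: on machines that only have PCI domain 0, plain `lspci` prints addresses without the domain (`01:00.0`), while the by-path names include it (`0000:01:00.0`); `lspci -D` always prints the domain. A small sketch of the mapping, using a hypothetical helper name `by_path_for_pci`:

```shell
# Build the expected by-path names for a PCI address as printed by lspci.
# Accepts both the short form ("01:00.0") and the full form with domain
# ("0000:01:00.0"); a missing domain is assumed to be 0000.
by_path_for_pci() {
    addr="$1"
    case "$addr" in
        ????:??:??.?) ;;                      # already has a domain
        ??:??.?)      addr="0000:$addr" ;;    # prepend the default domain
        *)            return 1 ;;
    esac
    printf '/dev/dri/by-path/pci-%s-card\n'   "$addr"
    printf '/dev/dri/by-path/pci-%s-render\n' "$addr"
}
```

The function only builds the names; whether those entries actually exist still has to be checked with ls.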

Author
Owner

@webmogul1 commented on GitHub (Aug 11, 2025):

Yes, I did use the whole pci address.

This works.
Image

┌─(root@PLEX - 10.0.1.80)
└───[/dev/dri] # l
total 0
crw-rw---- 1 root video  226,   0 Aug 11 12:36 card0
crw-rw---- 1 root render 226, 128 Aug 11 12:36 renderD128

This does not work
Image

┌─(root@PLEX - 10.0.1.80)
└───[/dev/dri] # l
total 0
drwxr-xr-x 2 root root 80 Aug 11 13:31 by-path

┌─(root@PLEX - 10.0.1.80)
└───[/dev/dri/by-path] # l
total 0
crw-rw---- 1 root video  226,   0 Aug 11 13:31 pci-0000:01:00.0-card
crw-rw---- 1 root render 226, 128 Aug 11 13:31 pci-0000:01:00.0-render
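In the two listings above, the failing container has the nodes only under /dev/dri/by-path/, with no card0 or renderD128. If the application only probes the conventional names (an assumption, not confirmed in this thread), one possible middle ground is to resolve the by-path entries on the host and pass the resolved nodes instead. The sketch below uses a hypothetical helper `dev_lines_for_pci`; the `devN:` key is meant to mirror the Proxmox LXC device passthrough entries, and any extra options (gid, mode) are deliberately left out and would need checking.

```shell
# Print "devN: <path>" lines for one GPU, using the node (for example
# /dev/dri/card1) that each by-path entry currently resolves to.
# The "devN:" output format is an assumption for illustration.
dev_lines_for_pci() {
    addr="$1"
    dir="${2:-/dev/dri/by-path}"
    n=0
    for kind in card render; do
        link="$dir/pci-$addr-$kind"
        [ -e "$link" ] || continue
        printf 'dev%s: %s\n' "$n" "$(readlink -f "$link")"
        n=$((n + 1))
    done
}
```

Because the resolved path can change between boots, the lines would have to be regenerated before each container start, so this only moves the problem unless something re-runs the resolution.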
Author
Owner

@MickLesk commented on GitHub (Aug 11, 2025):

I have the same issue with some LXCs from the dev repo; that's the reason why I haven't merged this feature into live.

Author
Owner

@danielwsky commented on GitHub (Aug 11, 2025):

> Yes, I did use the whole pci address.
>
> This works. Image
>
> ┌─(root@PLEX - 10.0.1.80)
> └───[/dev/dri] # l
> total 0
> crw-rw---- 1 root video  226,   0 Aug 11 12:36 card0
> crw-rw---- 1 root render 226, 128 Aug 11 12:36 renderD128
>
> This does not work Image
>
> ┌─(root@PLEX - 10.0.1.80)
> └───[/dev/dri] # l
> total 0
> drwxr-xr-x 2 root root 80 Aug 11 13:31 by-path
>
> ┌─(root@PLEX - 10.0.1.80)
> └───[/dev/dri/by-path] # l
> total 0
> crw-rw---- 1 root video  226,   0 Aug 11 13:31 pci-0000:01:00.0-card
> crw-rw---- 1 root render 226, 128 Aug 11 13:31 pci-0000:01:00.0-render

That's weird. I don't have two GPUs, but I had an RTX 2060 installed before.
Here's my setup:
Machinist X99 E5-RS9
Intel Xeon E5-2683 v4
4x16GB RAM
Intel Arc A770 16GB

Image Image

I've enabled Resizable BAR too.
Actually, the card0<>card1 flip started when I installed the Arc GPU and enabled ReBAR. After two weeks of troubleshooting, I found this "by-path" workaround.
I guess installing new PCI devices messes with PCI addresses.

Can you show your lspci | grep VGA result?
I would try installing Plex on my machine, but I have no storage left.

@MickLesk, could this have something to do with groups and access permissions? (I'm literally shooting in the dark, but Linux problems are usually about access configs)

Author
Owner

@webmogul1 commented on GitHub (Aug 11, 2025):

┌─(root@PLEX - 10.0.1.80)
└───[~] # lspci | grep VGA
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 164e (rev d8)
Author
Owner

@FibreFoX commented on GitHub (Nov 19, 2025):

@danielwsky After upgrading my system's hardware, I stumbled upon a similar problem, though it is not related to the community scripts but to Proxmox in general.

You wrote "The root cause of this issue seems to be Proxmox's inconsistent assignment of the GPU, alternating between card0 and card1", which matches the behaviour I am experiencing.

Maybe instead of solving this via /dev/dri/by-path, you could tame the kernel a bit, as described in an Arch Linux forum post (https://bbs.archlinux.org/viewtopic.php?id=287936) and on a random website (https://blog.lightwo.net/fix-gpu-identifier-randomly-setting-to-card0-or-card1-linux.html) my search engine came up with.

The reason for this behaviour (as far as I understand) might be a component called "simpledrm", which is loaded by the kernel itself at boot. It can be disabled via kernel boot parameters, as explained at https://pve.proxmox.com/wiki/Host_Bootloader#sysboot_edit_kernel_cmdline
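Before changing boot parameters, it may help to check which driver is actually behind each card index. The following is a minimal sketch with a hypothetical function name, `list_drm_card_drivers`. It assumes the usual sysfs layout in which /sys/class/drm/cardN/device/driver is a symlink to the bound driver.

```shell
# For every cardN under a DRM sysfs class directory, print the card name
# and the name of the kernel driver bound to its parent device.
list_drm_card_drivers() {
    sys="${1:-/sys/class/drm}"
    for card in "$sys"/card[0-9]*; do
        # Skip connector entries such as card0-HDMI-A-1.
        case "${card##*/}" in *-*) continue ;; esac
        [ -e "$card/device/driver" ] || continue
        drv=$(readlink -f "$card/device/driver")
        printf '%s %s\n' "${card##*/}" "${drv##*/}"
    done
}
```

If one of the cards turns out to be backed by a generic framebuffer driver rather than the GPU driver, that would be consistent with the simpledrm explanation, though the thread itself does not confirm it.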

When using GRUB

  • edit the GRUB configuration on the Proxmox host: vi /etc/default/grub
  • add to the end of the GRUB_CMDLINE_LINUX entry:
GRUB_CMDLINE_LINUX="(...) initcall_blacklist=simpledrm_platform_driver_init"

When using systemd-boot

  • edit the kernel command line file on the Proxmox host: vi /etc/kernel/cmdline
  • add to the end of the parameters:
 initcall_blacklist=simpledrm_platform_driver_init

Then remember to generate the actual boot configuration:

proxmox-boot-tool refresh

After a restart of the system, you should be able to use /dev/dri/card0 "as normal" without it changing randomly.
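For the systemd-boot case, the edit can also be scripted so that the parameter is not added twice. This is a sketch only: `add_kernel_param` is a hypothetical helper, it assumes the file holds a single line of parameters, and it does not handle the GRUB file, which has a different format.

```shell
# Append a kernel parameter to a single-line cmdline file unless it is
# already present. Prints "added" or "present".
add_kernel_param() {
    file="$1"
    param="$2"
    current=$(cat "$file") || return 1
    case " $current " in
        *" $param "*) echo present; return 0 ;;
    esac
    # Avoid a leading space when the file was empty.
    printf '%s\n' "${current:+$current }$param" > "$file" && echo added
}
```

It would be called as `add_kernel_param /etc/kernel/cmdline initcall_blacklist=simpledrm_platform_driver_init`, followed by the proxmox-boot-tool refresh step above. After the reboot, /proc/cmdline should contain the parameter.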

But there is a downside to this workaround: you will no longer see some of the boot text after selecting the kernel (because the SimpleDRM driver is not loaded), though you can still view it via dmesg.

I hope this helps 😺

Reference: SVI/ProxmoxVE#1578