Hardware
| Component | Detail |
|---|---|
| Host | Intel NUC, Meteor Lake (Core Ultra 7 165H) |
| Proxmox | VE 9.1.6, kernel 6.17.13-2-pve |
| eGPU | Blackmagic eGPU Pro — AMD Radeon RX Vega 56, 8GB HBM2 |
| GPU PCIe ID | 1002:687f (GPU) / 1002:aaf8 (HDMI audio) |
| TB controller | Intel JHL7440 TB3 Bridge (Titan Ridge) — TB3 device on TB4 host |
| Target VM | VM 100 "ollama", Ubuntu 24.04, 6 vCPU, 16GB RAM |
1 IOMMU Verification
Before touching anything else, verify IOMMU groups. You need the GPU and its audio function in isolated groups — groups they don't share with other devices you'd want to keep on the host.
find /sys/kernel/iommu_groups/ -type l | sort -V
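The raw find output is just symlink paths. A small helper makes group membership easier to scan (a sketch, not part of the original setup; the function name and the optional test-only root argument are invented):

```shell
#!/bin/bash
# Print "group N: BDF" for every device, sorted by group number.
# Optional argument: an alternate sysfs root (useful for testing).
list_iommu_groups() {
  local root="${1:-/sys}"
  local link group bdf
  for link in "$root"/kernel/iommu_groups/*/devices/*; do
    [ -e "$link" ] || continue
    group=$(basename "$(dirname "$(dirname "$link")")")
    bdf=$(basename "$link")
    echo "group $group: $bdf"
  done | sort -V
}
list_iommu_groups "$@"
```

Devices that share a group number must be passed through together or left on the host together.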
Without eGPU connected
| Group | BDF | Device | Status |
|---|---|---|---|
| 4 | 00:07.0 | TB4 PCIe Root Port #0 [8086:7ec4] | ✓ Isolated |
| 5 | 00:07.2 | TB4 PCIe Root Port #2 [8086:7ec6] | ✓ Isolated |
| 9 | 00:0d.0/2/3 | TB4 USB ctrl + NHI #0 + NHI #1 | Grouped — fine |
After hot-plugging the eGPU
| Group | BDF | Device |
|---|---|---|
| 19 | 2c:00.0 | Intel JHL7440 TB3 Bridge |
| 20–22 | 2d:01-04 | TB3 downstream bridges |
| 23 | 2e:00.0 | AMD Vega 10 PCIe Bridge |
| 24 | 2f:00.0 | AMD Vega 10 PCIe Bridge |
| 25 | 30:00.0 | Radeon RX Vega 56 — ✓ Isolated ← pass through |
| 26 | 30:00.1 | Vega 56 HDMI Audio — ✓ Isolated ← pass through |
| 27 | 31:00.0 | Intel JHL7540 TB3 USB Controller |
2 The Cable Problem
# What you see with a USB-C-only cable:
usb 3-3: Product: eGPU Pro
usb 3-3: Manufacturer: Blackmagic Design
usb 3-3: USB disconnect, device number 3 # immediately disconnects and reconnects
Why
USB-C is a connector standard, not a protocol. A USB-C cable can carry USB 3.x, DisplayPort, Power Delivery, or Thunderbolt, but Thunderbolt signaling only works over a cable physically built and certified for it. Thunderbolt tunnels PCIe over its own protocol layer; a plain USB-C cable has no Thunderbolt signaling capability, so the eGPU's management controller enumerates as an ordinary USB device, and the PCIe tunnel that would expose the GPU never opens.
This failure is especially confusing because lsusb does show "Blackmagic Design eGPU Pro" — it looks like progress. It isn't. The GPU will never appear without Thunderbolt signaling.
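A quick way to tell the two situations apart from a shell: a live Thunderbolt link creates device entries like 1-1 under the thunderbolt bus, while a USB-C-only cable never does. A sketch (the helper name is invented; entries ending in "-0" are the host's own routers and are excluded):

```shell
#!/bin/bash
# Return success if any Thunderbolt *device* (not a "-0" host router) exists.
# Optional argument: alternate sysfs directory, for testing.
tb_link_present() {
  local root="${1:-/sys/bus/thunderbolt/devices}"
  ls "$root" 2>/dev/null | grep -q '^[0-9]-[1-9]'
}
if tb_link_present; then
  echo "Thunderbolt link up; check lspci and authorization next"
else
  echo "No TB device visible; likely a USB-C-only cable"
fi
```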
3 Thunderbolt Authorization
With a proper TB cable, the device shows up in dmesg:
thunderbolt 1-1: new device found, vendor=0x4 device=0xa153
thunderbolt 1-1: Blackmagic Design eGPU Pro
But lspci still shows no GPU. The TB device is visible at the thunderbolt sysfs layer, but the PCIe tunnel hasn't opened.
Why
Linux's Thunderbolt security model requires explicit OS-level authorization before establishing a PCIe tunnel. The kernel sees the device but deliberately holds the tunnel closed. This is security level "user" (SL1), the default on most systems: every new device must be approved from userspace.
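The active security level is exposed in sysfs. This sketch wraps the read so it degrades gracefully (the helper name is invented, and the default domain0 path is an assumption; the domain index may differ on other hosts):

```shell
#!/bin/bash
# Print the Thunderbolt security level; "user" corresponds to SL1.
# Optional argument: explicit path to a "security" attribute, for testing.
tb_security_level() {
  local f="${1:-/sys/bus/thunderbolt/domains/domain0/security}"
  if [ -r "$f" ]; then cat "$f"; else echo "unknown"; fi
}
tb_security_level
```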
# Check authorization state
cat /sys/bus/thunderbolt/devices/1-1/authorized
# 0 = unauthorized
# Authorize manually (for testing)
echo 1 > /sys/bus/thunderbolt/devices/1-1/authorized
# Wait ~3 seconds — GPU enumerates in lspci
This manual step works for initial testing. Automated authorization is handled by the udev rule and boot service in Phase 5.
4 vfio-pci Configuration
vfio-pci claims PCIe devices for userspace passthrough. When a device is bound to vfio-pci instead of its native driver, QEMU/KVM can present it directly to a VM guest.
/etc/modprobe.d/vfio.conf
options vfio-pci ids=1002:687f,1002:aaf8
This claims both the GPU and the HDMI audio function automatically when they enumerate. Both should be claimed together: they are two functions of the same physical PCIe device, and leaving the audio function bound to a host driver while the GPU is in the guest invites reset and driver conflicts.
/etc/modules-load.d/vfio.conf
vendor-reset
vfio
vfio_iommu_type1
vfio_pci
Order matters: vendor-reset must load before vfio_pci. See Phase 7.
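If you prefer not to rely on list order, modprobe can express the same constraint declaratively with a softdep (a sketch; the filename is arbitrary, and this assumes the modules are loaded via modprobe, which systemd-modules-load uses):

```
# /etc/modprobe.d/vendor-reset-order.conf (filename is arbitrary)
# Load vendor-reset before vfio-pci so its reset hooks register first
softdep vfio-pci pre: vendor-reset
```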
update-initramfs -u -k all
5 The Boot-With-eGPU-Connected Problem
Why
Thunderbolt PCIe tunneling requires userspace authorization. At boot, the TB4 controller enumerates the connected device, reserves bus numbers for the expected PCIe hierarchy, and waits. Various early boot components stall on this — QEMU device scanning, kernel PCI enumeration timeouts, or systemd waiting for expected devices. The root cause is always the same: PCIe device expected at boot, authorization required but not yet available.
The solution is four components: a udev rule for hot-plug, a boot service for devices already connected at startup, a pre-shutdown service to clean up before reboot, and the scripts they call. All three connection scenarios — hot-plug, boot with GPU connected, and reboot with GPU connected — are fully handled.
Component 1 — udev Rule for Hot-Plug
/etc/udev/rules.d/99-egpu-vfio.rules
ACTION=="add", SUBSYSTEM=="thunderbolt", ATTR{vendor_name}=="Blackmagic Design", RUN+="/usr/local/bin/egpu-attach.sh %k"
%k passes the kernel device name (e.g., 1-1) as an argument to the script.
Component 2 — Attach Script
/usr/local/bin/egpu-attach.sh
#!/bin/bash
LOGFILE=/var/log/egpu-attach.log
echo "[$(date)] eGPU attach triggered: $1" >> "$LOGFILE"

# Authorize the Thunderbolt device (udev passes the kernel name as $1)
AUTH_PATH="/sys/bus/thunderbolt/devices/${1}/authorized"
if [ -f "$AUTH_PATH" ]; then
  echo 1 > "$AUTH_PATH"
  echo "[$(date)] Authorized $1" >> "$LOGFILE"
fi

# Give the PCIe tunnel a moment to enumerate
sleep 3

# Ensure the GPU and HDMI audio function are bound to vfio-pci
for vendor_dev in "1002:687f" "1002:aaf8"; do
  vendor=$(echo "$vendor_dev" | cut -d: -f1)
  device=$(echo "$vendor_dev" | cut -d: -f2)
  for syspath in /sys/bus/pci/devices/*/; do
    v=$(cat "${syspath}vendor" 2>/dev/null | sed 's/0x//')
    d=$(cat "${syspath}device" 2>/dev/null | sed 's/0x//')
    if [ "$v" = "$vendor" ] && [ "$d" = "$device" ]; then
      bdf=$(basename "$syspath")
      drv=$(readlink "${syspath}driver" 2>/dev/null | xargs basename 2>/dev/null)
      if [ "$drv" != "vfio-pci" ]; then
        echo "$bdf" > "/sys/bus/pci/devices/${bdf}/driver/unbind" 2>/dev/null
        echo "vfio-pci" > "/sys/bus/pci/devices/${bdf}/driver_override"
        echo "$bdf" > /sys/bus/pci/drivers/vfio-pci/bind 2>/dev/null
        echo "[$(date)] Bound $bdf to vfio-pci" >> "$LOGFILE"
      fi
    fi
  done
done

echo "[$(date)] eGPU attach complete" >> "$LOGFILE"
The explicit bind loop is mostly redundant — options vfio-pci ids= claims devices automatically. The script's primary job is TB authorization and logging.
chmod +x /usr/local/bin/egpu-attach.sh
Component 3 — Boot Authorization Service
Handles the case where the eGPU is already connected when Proxmox boots.
/etc/systemd/system/egpu-boot-auth.service
[Unit]
Description=Blackmagic eGPU Pro - authorize devices connected at boot
After=multi-user.target
Wants=multi-user.target
[Service]
Type=oneshot
ExecStart=/usr/local/bin/egpu-boot-auth.sh
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
/usr/local/bin/egpu-boot-auth.sh
#!/bin/bash
LOGFILE=/var/log/egpu-attach.log
echo "[$(date)] eGPU boot-auth: scanning for connected devices" >> "$LOGFILE"

for dev in /sys/bus/thunderbolt/devices/*/; do
  name=$(cat "${dev}device_name" 2>/dev/null)
  auth=$(cat "${dev}authorized" 2>/dev/null)
  if [ "$name" = "eGPU Pro" ] && [ "$auth" = "0" ]; then
    echo 1 > "${dev}authorized" 2>/dev/null
    echo "[$(date)] Boot-authorized $(basename "$dev")" >> "$LOGFILE"
    sleep 5
    echo "[$(date)] Boot-auth complete for $(basename "$dev")" >> "$LOGFILE"
  fi
done
This runs after multi-user.target, so systemd is far enough along that the PCIe enumeration triggered by TB authorization won't stall the boot.
Component 4 — Pre-Shutdown De-authorization Service
Most guides omit this. Without it, the next reboot hangs because the eGPU is still PCIe-active when the boot sequence starts. The shutdown service de-authorizes the TB device so the next boot sees no PCIe device to wait for.
/etc/systemd/system/egpu-shutdown.service
[Unit]
Description=Blackmagic eGPU Pro - de-authorize before shutdown/reboot
DefaultDependencies=no
Before=shutdown.target reboot.target halt.target
After=sysinit.target
[Service]
Type=oneshot
ExecStart=/bin/true
ExecStop=/usr/local/bin/egpu-detach.sh
RemainAfterExit=yes
TimeoutStopSec=15
[Install]
WantedBy=multi-user.target
/usr/local/bin/egpu-detach.sh
#!/bin/bash
LOGFILE=/var/log/egpu-attach.log
echo "[$(date)] eGPU detach triggered (pre-shutdown)" >> "$LOGFILE"

# Unbind the GPU and HDMI audio function from vfio-pci
for vendor_dev in "1002:687f" "1002:aaf8"; do
  vendor=$(echo "$vendor_dev" | cut -d: -f1)
  device=$(echo "$vendor_dev" | cut -d: -f2)
  for syspath in /sys/bus/pci/devices/*/; do
    v=$(cat "${syspath}vendor" 2>/dev/null | sed 's/0x//')
    d=$(cat "${syspath}device" 2>/dev/null | sed 's/0x//')
    if [ "$v" = "$vendor" ] && [ "$d" = "$device" ]; then
      bdf=$(basename "$syspath")
      echo "$bdf" > "/sys/bus/pci/devices/${bdf}/driver/unbind" 2>/dev/null
      echo "[$(date)] Unbound $bdf from vfio-pci" >> "$LOGFILE"
    fi
  done
done

# De-authorize the TB device so the next boot sees no pending PCIe tunnel
for dev in /sys/bus/thunderbolt/devices/*/; do
  name=$(cat "${dev}device_name" 2>/dev/null)
  auth=$(cat "${dev}authorized" 2>/dev/null)
  if [ "$name" = "eGPU Pro" ] && [ "$auth" = "1" ]; then
    echo 0 > "${dev}authorized" 2>/dev/null
    echo "[$(date)] De-authorized $(basename "$dev")" >> "$LOGFILE"
  fi
done

echo "[$(date)] eGPU detach complete" >> "$LOGFILE"
Enable Everything
chmod +x /usr/local/bin/egpu-detach.sh
systemctl daemon-reload
systemctl enable egpu-boot-auth.service egpu-shutdown.service
systemctl start egpu-shutdown.service
udevadm control --reload-rules
6 The NIC Renaming Problem
Why
Linux predictable network interface naming derives names partly from the device's PCI path. The Intel I226-LM NIC has a fixed physical slot, but its bus number shifts when the TB eGPU introduces additional PCIe bridges into the topology. Before eGPU: bus 86. After eGPU: bus 88. Same hardware, same MAC, different name.
Fix: pin the NIC by MAC address
# Get MAC address (run with eGPU connected)
ip link show enp88s0
# example: link/ether 88:ae:dd:72:13:25
/etc/udev/rules.d/10-stable-nic.rules
SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="88:ae:dd:72:13:25", NAME="lan0"
/etc/network/interfaces — update to use lan0:
iface lan0 inet manual
auto vmbr0
iface vmbr0 inet static
address 192.168.6.69/22
gateway 192.168.4.1
bridge-ports lan0
bridge-stp off
bridge-fd 0
udevadm control --reload-rules && udevadm trigger
After this, the NIC is always named lan0, regardless of eGPU state or PCI bus numbering.
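When debugging from a console before the rule is in place, it helps to resolve the current interface name from the MAC directly. A sketch (the function name is invented; the optional root argument exists only for testing):

```shell
#!/bin/bash
# Print the current name of the interface that owns a given MAC address.
iface_by_mac() {
  local mac="$1" root="${2:-/sys/class/net}"
  local d
  for d in "$root"/*; do
    [ -r "$d/address" ] || continue
    if [ "$(cat "$d/address")" = "$mac" ]; then
      basename "$d"
      return 0
    fi
  done
  return 1
}
iface_by_mac 88:ae:dd:72:13:25 || echo "no interface with that MAC"
```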
7 AMD Vega Reset Bug and vendor-reset
error writing '1' to '/sys/bus/pci/devices/0000:30:00.0/reset': Inappropriate ioctl for device
failed to reset PCI device '0000:30:00.0', but trying to continue
org.freedesktop.DBus.Error.NoReply: Did not receive a reply.
Why
QEMU/vfio needs a working reset mechanism before it can hand a PCIe device to a VM guest in a clean state. AMD Vega 10 (the architecture behind Vega 56) does not implement Function Level Reset (FLR) in hardware. The kernel's reset attempt fails with ENOTTY ("Inappropriate ioctl for device"), and QEMU crashes or hangs the host.
Fix: vendor-reset — a DKMS kernel module by Geoffrey McRae (gnif) that implements custom reset sequences for AMD GPUs lacking FLR. For Vega 10, it uses AMD BACO (Bus Active, Chip Off) power sequencing to properly reset the GPU state, hooking into the kernel's PCIe reset path.
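Since kernel 5.15 the PCI core exposes the reset mechanisms it will attempt per device in a reset_method sysfs attribute; once vendor-reset has registered its hook for the Vega, that file should report device_specific. A small read helper (a sketch; the function name and the testing-only root argument are invented):

```shell
#!/bin/bash
# Show the reset methods the kernel will try for a PCI device (kernel >= 5.15).
pci_reset_methods() {
  local bdf="$1" root="${2:-/sys/bus/pci/devices}"
  local f="$root/$bdf/reset_method"
  if [ -r "$f" ]; then cat "$f"; else echo "reset_method not exposed"; fi
}
pci_reset_methods 0000:30:00.0
```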
Building vendor-reset on Kernel 6.17
The upstream repo hasn't been updated for kernel 6.x. Two patches are required.
Full build sequence
apt-get install -y pve-headers-$(uname -r) dkms git build-essential
cd /usr/src
git clone https://github.com/gnif/vendor-reset.git
cd vendor-reset
echo 'ccflags-y += -Wno-missing-prototypes' >> src/Makefile
sed -i 's|asm/unaligned.h|linux/unaligned.h|g' $(grep -rl 'asm/unaligned.h' .)
dkms install .
Secure Boot
If Secure Boot is enabled (common on NUC hardware), enroll a Machine Owner Key:
mokutil --import /var/lib/dkms/mok.pub
# Enter a one-time password, then reboot
# At blue UEFI screen: Enroll MOK → Continue → Yes → enter password → Reboot
Verify
lsmod | grep vendor_reset
# vendor_reset 32768 0
8 VM Configuration
qm set 100 -machine q35
qm set 100 -cpu host
qm set 100 -hostpci0 0000:30:00.0,pcie=1,rombar=1
qm set 100 -hostpci1 0000:30:00.1,pcie=1,rombar=0
machine: q35
Required for PCIe passthrough. The default i440fx machine type does not support PCIe slots.
cpu: host
Exposes actual host CPU feature flags to the guest. GPU drivers often require specific CPU features.
pcie=1
Use PCIe semantics for host-mapped device slots, not legacy PCI.
rombar=0 on audio
The HDMI audio function exposes no option ROM, so there is nothing for a ROM BAR to map. Setting rombar=1 causes guest boot issues.
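For reference, the qm set commands above end up as lines roughly like these in /etc/pve/qemu-server/100.conf (an illustrative fragment; the rest of the file depends on the VM's other settings):

```
machine: q35
cpu: host
hostpci0: 0000:30:00.0,pcie=1,rombar=1
hostpci1: 0000:30:00.1,pcie=1,rombar=0
```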
9 ROCm Deprecation and the Vulkan Solution
Everything works at the hardware and kernel level. This phase involves a completely different kind of failure: the software stack targeting the GPU is broken by an upstream deprecation decision.
Guest setup (Ubuntu 24.04)
The base Ubuntu 24.04 kernel doesn't include amdgpu. Install it:
apt-get install -y linux-modules-extra-$(uname -r)
apt-get install -y linux-firmware
modprobe amdgpu
ls /dev/dri/ # card0 renderD128
# Persist across reboots
echo 'amdgpu' > /etc/modules-load.d/amdgpu.conf
The ROCm problem
AMD Vega 10 (gfx900) was removed from ROCm in version 6.0 (released early 2024). Ollama 0.18.x bundles ROCm 7.2. The bundled libhipblas, librocblas, and libamdhip64 no longer contain pre-compiled GPU kernels for gfx900. When the runner initializes and attempts to load kernels for the detected GPU architecture, it segfaults because the kernel code for gfx900 is absent.
| ROCm Version | Vega 10 (gfx900) Status |
|---|---|
| ROCm 5.x | ✓ Full support |
| ROCm 6.0 | Deprecated, removed from compiled libraries |
| ROCm 7.x | Not present — segfault on init |
| Ollama 0.18+ | Bundles ROCm 7.x — affected |
The Vulkan solution
Ollama includes a Vulkan GPU compute backend alongside ROCm. Vulkan uses Mesa's RADV driver, which is community-maintained and supports all GCN/RDNA generations including Vega 10, independent of AMD's ROCm lifecycle. Vulkan compute ships its shaders as SPIR-V, which the driver compiles for the target GPU at runtime, so there is no pre-compiled kernel library to go missing.
# Install Vulkan support in the guest
apt-get install -y vulkan-tools libvulkan1 mesa-vulkan-drivers
# Verify GPU is visible
vulkaninfo 2>&1 | grep deviceName
# AMD Radeon RX Vega (RADV VEGA10)
# Enable Vulkan backend in Ollama
mkdir -p /etc/systemd/system/ollama.service.d
cat > /etc/systemd/system/ollama.service.d/rocm.conf << 'EOF'
[Service]
Environment=OLLAMA_VULKAN=1
EOF
systemctl daemon-reload && systemctl restart ollama
# ldconfig for Ollama's bundled libraries
echo '/usr/local/lib/ollama/rocm' > /etc/ld.so.conf.d/ollama-rocm.conf
echo '/usr/local/lib/ollama' > /etc/ld.so.conf.d/ollama.conf
ldconfig
Verify Ollama is using the GPU
journalctl -u ollama | grep -E "Vulkan|inference compute|vram"
# Expected output:
inference compute library=Vulkan name=Vulkan0
description="AMD Radeon RX Vega (RADV VEGA10)"
type=discrete total="8.0 GiB" available="8.0 GiB"
Verification Commands
Host-side
# GPU bound to vfio-pci
lspci -nnk | grep -A2 "1002:687f\|1002:aaf8"
# expected: Kernel driver in use: vfio-pci
# TB authorization state
cat /sys/bus/thunderbolt/devices/1-1/authorized
# expected: 1
# Watch attach/detach log
tail -f /var/log/egpu-attach.log
# vendor-reset loaded
lsmod | grep vendor_reset
# IOMMU groups after hot-plug
find /sys/kernel/iommu_groups/ -type l | sort -V | grep "2[cdef]:\|3[01]:"
Guest-side
# GPU visible
lspci | grep AMD
# DRM devices
ls /dev/dri/
# Vulkan
vulkaninfo 2>&1 | grep deviceName
# Ollama GPU detection
journalctl -u ollama | grep -E "vram|Vulkan|inference compute"
Troubleshooting
eGPU not appearing in lspci
1. Check the log: tail /var/log/egpu-attach.log
2. Verify TB device: ls /sys/bus/thunderbolt/devices/
3. If no TB device: USB-C-only cable — replace with certified TB3/TB4
4. Manual auth: echo 1 > /sys/bus/thunderbolt/devices/1-1/authorized
Boot hang with eGPU connected
This should not happen in normal operation — egpu-boot-auth.service handles boot-with-GPU-connected and egpu-shutdown.service handles clean reboot. If you hit this anyway:
1. Check shutdown service log: journalctl -u egpu-shutdown --since yesterday
2. Verify both services are active: systemctl is-active egpu-shutdown.service egpu-boot-auth.service
3. Check unit files for correct ExecStart=/bin/true + ExecStop pattern in egpu-shutdown
4. Emergency recovery if services are broken: boot without eGPU, hot-plug after boot, then debug the service
Network down after reboot with eGPU
NIC renamed due to PCI bus shift. Temporary fix: ip link set enp88s0 master vmbr0. Permanent fix: apply the udev stable naming rule in Phase 6.
VM start fails — destructive transaction error
Wrong service unit structure. The egpu-shutdown.service must have ExecStart=/bin/true, RemainAfterExit=yes, real work in ExecStop, and WantedBy=multi-user.target — not shutdown.target.
ROCm segfaults / total_vram=0
Expected with Vega 10 on Ollama 0.18+. ROCm 7.x dropped gfx900. Use Vulkan: add OLLAMA_VULKAN=1 to the systemd override, install mesa-vulkan-drivers in the guest. Do not try HSA_OVERRIDE_GFX_VERSION — it will not fix this.
vendor-reset fails to build
On kernel 6.x, both patches are required: the missing-prototypes flag and the unaligned.h header replacement. If Secure Boot is enabled, the module will silently fail to load — check dmesg | grep vendor and enroll the MOK key.
amdgpu module not found in guest
apt-get install -y linux-modules-extra-$(uname -r) linux-firmware then modprobe amdgpu.
Known Issues / Deprecated
ROCm 7.x: Vega 10 (gfx900) support removed. AMD dropped Vega 10 from ROCm in version 6.0. Ollama 0.18+ bundles ROCm 7.x. Any workflow relying on ROCm for Vega 56 inference is broken on current Ollama. The fix is the Vulkan backend, not a downgrade of Ollama.
vendor-reset: not updated for kernel 6.x. The gnif/vendor-reset repository has not been updated for Linux kernel 6.x API changes. The two patches in Phase 7 are required. If those patches stop working against a future kernel, the reset mechanism will fail and VM start will crash the host.
Thunderbolt PCIe tunneling and boot ordering
All three connection scenarios are fully supported by the systemd service stack in Phase 5:
- Hot-plug — udev fires `egpu-attach.sh` and authorizes the TB tunnel
- Boot with GPU connected — `egpu-boot-auth.service` authorizes after `multi-user.target`
- Reboot with GPU connected — `egpu-shutdown.service` de-authorizes before shutdown; `egpu-boot-auth.service` re-authorizes on the next boot
Thunderbolt PCIe tunneling is fundamentally dynamic — authorization must happen in userspace after boot. The services above close this gap. A future kernel with automated TB authorization policy could simplify the implementation, but the current stack covers every realistic connection scenario.
Audio function reset warning
The HDMI audio function (30:00.1) cannot be reset between VM starts. If it accumulates bad state, the only recovery is a full de-authorization/re-authorization cycle of the TB device. In practice this has not caused issues.
Secure Boot and DKMS
DKMS modules require MOK enrollment on Secure Boot systems. Proxmox kernel updates trigger automatic DKMS rebuilds, signed with the same enrolled key — re-enrollment not required unless the key is rotated.
Changelog
| Date | Entry |
|---|---|
| 2026-03-18 | Initial implementation: TB4 authorization, vfio-pci, boot/shutdown services, vendor-reset (patched for kernel 6.17), NIC stable naming, VM 100 GPU passthrough, Vulkan backend for Ollama (ROCm 7.x drops gfx900). Achieved 85 tok/s on llama3.2. |