Hardware
| Component | Detail |
|---|---|
| Host | Intel NUC, Meteor Lake (Core Ultra 7 165H) |
| Proxmox | VE 9.1.6, kernel 6.17.13-2-pve |
| eGPU | Blackmagic eGPU Pro — AMD Radeon RX Vega 56, 8GB HBM2 |
| GPU PCIe ID | 1002:687f (GPU) / 1002:aaf8 (HDMI audio) |
| TB controller | Intel JHL7440 TB3 Bridge (Titan Ridge) — TB3 device on TB4 host |
| Target VM | VM 100 "ollama", Ubuntu 24.04, 6 vCPU, 16GB RAM |
1 IOMMU Verification
Before touching anything else, verify IOMMU groups. You need the GPU and its audio function in isolated groups — groups they don't share with other devices you'd want to keep on the host.
find /sys/kernel/iommu_groups/ -type l | sort -V
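The raw find output is just symlink paths. A small helper makes group membership easier to scan (a sketch, not part of the original setup; the function name and the optional test-only root argument are invented):

```shell
#!/bin/bash
# Print "group N: BDF" for every device, sorted by group number.
# Optional argument: an alternate sysfs root (useful for testing).
list_iommu_groups() {
  local root="${1:-/sys}"
  local link group bdf
  for link in "$root"/kernel/iommu_groups/*/devices/*; do
    [ -e "$link" ] || continue
    group=$(basename "$(dirname "$(dirname "$link")")")
    bdf=$(basename "$link")
    echo "group $group: $bdf"
  done | sort -V
}
list_iommu_groups "$@"
```

Devices that share a group number must be passed through together or left on the host together.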
Without eGPU connected
| Group | BDF | Device | Status |
|---|---|---|---|
| 4 | 00:07.0 | TB4 PCIe Root Port #0 [8086:7ec4] | ✓ Isolated |
| 5 | 00:07.2 | TB4 PCIe Root Port #2 [8086:7ec6] | ✓ Isolated |
| 9 | 00:0d.0/2/3 | TB4 USB ctrl + NHI #0 + NHI #1 | Grouped — fine |
After hot-plugging the eGPU
| Group | BDF | Device |
|---|---|---|
| 19 | 2c:00.0 | Intel JHL7440 TB3 Bridge |
| 20–22 | 2d:01-04 | TB3 downstream bridges |
| 23 | 2e:00.0 | AMD Vega 10 PCIe Bridge |
| 24 | 2f:00.0 | AMD Vega 10 PCIe Bridge |
| 25 | 30:00.0 | Radeon RX Vega 56 — ✓ Isolated ← pass through |
| 26 | 30:00.1 | Vega 56 HDMI Audio — ✓ Isolated ← pass through |
| 27 | 31:00.0 | Intel JHL7540 TB3 USB Controller |
2 The Cable Problem
# What you see with a USB-C-only cable:
usb 3-3: Product: eGPU Pro
usb 3-3: Manufacturer: Blackmagic Design
usb 3-3: USB disconnect, device number 3 # immediately disconnects and reconnects
Why
USB-C is a connector standard, not a protocol. A USB-C cable can carry USB 3.x, DisplayPort, Power Delivery, or Thunderbolt, but Thunderbolt signaling only works over a cable physically built and certified for it. Thunderbolt tunnels PCIe over its own protocol layer; a plain USB-C cable has no Thunderbolt signaling capability, so the eGPU's management controller enumerates as an ordinary USB device, and the PCIe tunnel that would expose the GPU never opens.
This failure is especially confusing because lsusb does show "Blackmagic Design eGPU Pro" — it looks like progress. It isn't. The GPU will never appear without Thunderbolt signaling.
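A quick way to tell the two situations apart from a shell: a live Thunderbolt link creates device entries like 1-1 under the thunderbolt bus, while a USB-C-only cable never does. A sketch (the helper name is invented; entries ending in "-0" are the host's own routers and are excluded):

```shell
#!/bin/bash
# Return success if any Thunderbolt *device* (not a "-0" host router) exists.
# Optional argument: alternate sysfs directory, for testing.
tb_link_present() {
  local root="${1:-/sys/bus/thunderbolt/devices}"
  ls "$root" 2>/dev/null | grep -q '^[0-9]-[1-9]'
}
if tb_link_present; then
  echo "Thunderbolt link up; check lspci and authorization next"
else
  echo "No TB device visible; likely a USB-C-only cable"
fi
```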
3 Thunderbolt Authorization
With a proper TB cable, the device shows up in dmesg:
thunderbolt 1-1: new device found, vendor=0x4 device=0xa153
thunderbolt 1-1: Blackmagic Design eGPU Pro
But lspci still shows no GPU. The TB device is visible at the thunderbolt sysfs layer, but the PCIe tunnel hasn't opened.
Why
Linux's Thunderbolt security model requires explicit OS-level authorization before establishing a PCIe tunnel. The kernel sees the device but deliberately holds the tunnel closed. This is security level "user" (SL1), the default on most systems: every new device must be approved from userspace.
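The active security level is exposed in sysfs. This sketch wraps the read so it degrades gracefully (the helper name is invented, and the default domain0 path is an assumption; the domain index may differ on other hosts):

```shell
#!/bin/bash
# Print the Thunderbolt security level; "user" corresponds to SL1.
# Optional argument: explicit path to a "security" attribute, for testing.
tb_security_level() {
  local f="${1:-/sys/bus/thunderbolt/domains/domain0/security}"
  if [ -r "$f" ]; then cat "$f"; else echo "unknown"; fi
}
tb_security_level
```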
# Check authorization state
cat /sys/bus/thunderbolt/devices/1-1/authorized
# 0 = unauthorized
# Authorize manually (for testing)
echo 1 > /sys/bus/thunderbolt/devices/1-1/authorized
# Wait ~3 seconds — GPU enumerates in lspci
This manual step works for initial testing. Automated authorization is handled by the udev rule and boot service in Phase 5.
4 vfio-pci Configuration
vfio-pci claims PCIe devices for userspace passthrough. When a device is bound to vfio-pci instead of its native driver, QEMU/KVM can present it directly to a VM guest.
/etc/modprobe.d/vfio.conf
options vfio-pci ids=1002:687f,1002:aaf8
This claims both the GPU and the HDMI audio function automatically when they enumerate. Both should be claimed together: they are two functions of the same physical PCIe device, and leaving the audio function bound to a host driver while the GPU is in the guest invites reset and driver conflicts.
/etc/modules-load.d/vfio.conf
vendor-reset
vfio
vfio_iommu_type1
vfio_pci
Order matters: vendor-reset must load before vfio_pci. See Phase 7.
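If you prefer not to rely on list order, modprobe can express the same constraint declaratively with a softdep (a sketch; the filename is arbitrary, and this assumes the modules are loaded via modprobe, which systemd-modules-load uses):

```
# /etc/modprobe.d/vendor-reset-order.conf (filename is arbitrary)
# Load vendor-reset before vfio-pci so its reset hooks register first
softdep vfio-pci pre: vendor-reset
```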
update-initramfs -u -k all
5 The Boot-With-eGPU-Connected Problem
Why
Thunderbolt PCIe tunneling requires userspace authorization. At boot, the TB4 controller enumerates the connected device, reserves bus numbers for the expected PCIe hierarchy, and waits. Various early boot components stall on this — QEMU device scanning, kernel PCI enumeration timeouts, or systemd waiting for expected devices. The root cause is always the same: PCIe device expected at boot, authorization required but not yet available.
The solution is four components: a udev rule for hot-plug, a boot service for devices already connected at startup, a pre-shutdown service to clean up before reboot, and the scripts they call. All three connection scenarios — hot-plug, boot with GPU connected, and reboot with GPU connected — are fully handled.
Component 1 — udev Rule for Hot-Plug
/etc/udev/rules.d/99-egpu-vfio.rules
ACTION=="add", SUBSYSTEM=="thunderbolt", ATTR{vendor_name}=="Blackmagic Design", RUN+="/usr/local/bin/egpu-attach.sh %k"
%k passes the kernel device name (e.g., 1-1) as an argument to the script.
Component 2 — Attach Script
/usr/local/bin/egpu-attach.sh
#!/bin/bash
LOGFILE=/var/log/egpu-attach.log
echo "[$(date)] eGPU attach triggered: $1" >> "$LOGFILE"

# Authorize the Thunderbolt device (udev passes the kernel name as $1)
AUTH_PATH="/sys/bus/thunderbolt/devices/${1}/authorized"
if [ -f "$AUTH_PATH" ]; then
  echo 1 > "$AUTH_PATH"
  echo "[$(date)] Authorized $1" >> "$LOGFILE"
fi

# Give the PCIe tunnel a moment to enumerate
sleep 3

# Ensure the GPU and HDMI audio function are bound to vfio-pci
for vendor_dev in "1002:687f" "1002:aaf8"; do
  vendor=$(echo "$vendor_dev" | cut -d: -f1)
  device=$(echo "$vendor_dev" | cut -d: -f2)
  for syspath in /sys/bus/pci/devices/*/; do
    v=$(cat "${syspath}vendor" 2>/dev/null | sed 's/0x//')
    d=$(cat "${syspath}device" 2>/dev/null | sed 's/0x//')
    if [ "$v" = "$vendor" ] && [ "$d" = "$device" ]; then
      bdf=$(basename "$syspath")
      drv=$(readlink "${syspath}driver" 2>/dev/null | xargs basename 2>/dev/null)
      if [ "$drv" != "vfio-pci" ]; then
        echo "$bdf" > "/sys/bus/pci/devices/${bdf}/driver/unbind" 2>/dev/null
        echo "vfio-pci" > "/sys/bus/pci/devices/${bdf}/driver_override"
        echo "$bdf" > /sys/bus/pci/drivers/vfio-pci/bind 2>/dev/null
        echo "[$(date)] Bound $bdf to vfio-pci" >> "$LOGFILE"
      fi
    fi
  done
done

echo "[$(date)] eGPU attach complete" >> "$LOGFILE"
The explicit bind loop is mostly redundant — options vfio-pci ids= claims devices automatically. The script's primary job is TB authorization and logging.
chmod +x /usr/local/bin/egpu-attach.sh
Component 3 — Boot Authorization Service
Handles the case where the eGPU is already connected when Proxmox boots.
/etc/systemd/system/egpu-boot-auth.service
[Unit]
Description=Blackmagic eGPU Pro - authorize devices connected at boot
After=multi-user.target
Wants=multi-user.target
[Service]
Type=oneshot
ExecStart=/usr/local/bin/egpu-boot-auth.sh
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
/usr/local/bin/egpu-boot-auth.sh
#!/bin/bash
LOGFILE=/var/log/egpu-attach.log
echo "[$(date)] eGPU boot-auth: scanning for connected devices" >> "$LOGFILE"

for dev in /sys/bus/thunderbolt/devices/*/; do
  name=$(cat "${dev}device_name" 2>/dev/null)
  auth=$(cat "${dev}authorized" 2>/dev/null)
  if [ "$name" = "eGPU Pro" ] && [ "$auth" = "0" ]; then
    echo 1 > "${dev}authorized" 2>/dev/null
    echo "[$(date)] Boot-authorized $(basename "$dev")" >> "$LOGFILE"
    sleep 5
    echo "[$(date)] Boot-auth complete for $(basename "$dev")" >> "$LOGFILE"
  fi
done
This runs after multi-user.target, so systemd is far enough along that the PCIe enumeration triggered by TB authorization won't stall the boot.
Component 4 — Pre-Shutdown De-authorization Service
Most guides omit this. Without it, the next reboot hangs because the eGPU is still PCIe-active when the boot sequence starts. The shutdown service de-authorizes the TB device so the next boot sees no PCIe device to wait for.
/etc/systemd/system/egpu-shutdown.service
[Unit]
Description=Blackmagic eGPU Pro - de-authorize before shutdown/reboot
DefaultDependencies=no
Before=shutdown.target reboot.target halt.target
After=sysinit.target
[Service]
Type=oneshot
ExecStart=/bin/true
ExecStop=/usr/local/bin/egpu-detach.sh
RemainAfterExit=yes
TimeoutStopSec=15
[Install]
WantedBy=multi-user.target
/usr/local/bin/egpu-detach.sh
#!/bin/bash
LOGFILE=/var/log/egpu-attach.log
echo "[$(date)] eGPU detach triggered (pre-shutdown)" >> "$LOGFILE"

# Unbind the GPU and HDMI audio function from vfio-pci
for vendor_dev in "1002:687f" "1002:aaf8"; do
  vendor=$(echo "$vendor_dev" | cut -d: -f1)
  device=$(echo "$vendor_dev" | cut -d: -f2)
  for syspath in /sys/bus/pci/devices/*/; do
    v=$(cat "${syspath}vendor" 2>/dev/null | sed 's/0x//')
    d=$(cat "${syspath}device" 2>/dev/null | sed 's/0x//')
    if [ "$v" = "$vendor" ] && [ "$d" = "$device" ]; then
      bdf=$(basename "$syspath")
      echo "$bdf" > "/sys/bus/pci/devices/${bdf}/driver/unbind" 2>/dev/null
      echo "[$(date)] Unbound $bdf from vfio-pci" >> "$LOGFILE"
    fi
  done
done

# De-authorize the TB device so the next boot sees no pending PCIe tunnel
for dev in /sys/bus/thunderbolt/devices/*/; do
  name=$(cat "${dev}device_name" 2>/dev/null)
  auth=$(cat "${dev}authorized" 2>/dev/null)
  if [ "$name" = "eGPU Pro" ] && [ "$auth" = "1" ]; then
    echo 0 > "${dev}authorized" 2>/dev/null
    echo "[$(date)] De-authorized $(basename "$dev")" >> "$LOGFILE"
  fi
done

echo "[$(date)] eGPU detach complete" >> "$LOGFILE"
Enable Everything
chmod +x /usr/local/bin/egpu-detach.sh
systemctl daemon-reload
systemctl enable egpu-boot-auth.service egpu-shutdown.service
systemctl start egpu-shutdown.service
udevadm control --reload-rules
6 The NIC Renaming Problem
Why
Linux predictable network interface naming derives names partly from the device's PCI path. The Intel I226-LM NIC has a fixed physical slot, but its bus number shifts when the TB eGPU introduces additional PCIe bridges into the topology. Before eGPU: bus 86. After eGPU: bus 88. Same hardware, same MAC, different name.
Fix: pin the NIC by MAC address
# Get MAC address (run with eGPU connected)
ip link show enp88s0
# example: link/ether 88:ae:dd:72:13:25
/etc/udev/rules.d/10-stable-nic.rules
SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="88:ae:dd:72:13:25", NAME="lan0"
/etc/network/interfaces — update to use lan0:
iface lan0 inet manual
auto vmbr0
iface vmbr0 inet static
address 192.168.6.69/22
gateway 192.168.4.1
bridge-ports lan0
bridge-stp off
bridge-fd 0
udevadm control --reload-rules && udevadm trigger
After this, the NIC is always named lan0, regardless of eGPU state or PCI bus numbering.
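When debugging from a console before the rule is in place, it helps to resolve the current interface name from the MAC directly. A sketch (the function name is invented; the optional root argument exists only for testing):

```shell
#!/bin/bash
# Print the current name of the interface that owns a given MAC address.
iface_by_mac() {
  local mac="$1" root="${2:-/sys/class/net}"
  local d
  for d in "$root"/*; do
    [ -r "$d/address" ] || continue
    if [ "$(cat "$d/address")" = "$mac" ]; then
      basename "$d"
      return 0
    fi
  done
  return 1
}
iface_by_mac 88:ae:dd:72:13:25 || echo "no interface with that MAC"
```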
7 AMD Vega Reset Bug and vendor-reset
error writing '1' to '/sys/bus/pci/devices/0000:30:00.0/reset': Inappropriate ioctl for device
failed to reset PCI device '0000:30:00.0', but trying to continue
org.freedesktop.DBus.Error.NoReply: Did not receive a reply.
Why
QEMU/vfio needs a working reset mechanism before it can hand a PCIe device to a VM guest in a clean state. AMD Vega 10 (the architecture behind Vega 56) does not implement Function Level Reset (FLR) in hardware. The kernel's reset attempt fails with ENOTTY ("Inappropriate ioctl for device"), and QEMU crashes or hangs the host.
Fix: vendor-reset — a DKMS kernel module by Geoffrey McRae (gnif) that implements custom reset sequences for AMD GPUs lacking FLR. For Vega 10, it uses AMD BACO (Bus Active, Chip Off) power sequencing to properly reset the GPU state, hooking into the kernel's PCIe reset path.
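Since kernel 5.15 the PCI core exposes the reset mechanisms it will attempt per device in a reset_method sysfs attribute; once vendor-reset has registered its hook for the Vega, that file should report device_specific. A small read helper (a sketch; the function name and the testing-only root argument are invented):

```shell
#!/bin/bash
# Show the reset methods the kernel will try for a PCI device (kernel >= 5.15).
pci_reset_methods() {
  local bdf="$1" root="${2:-/sys/bus/pci/devices}"
  local f="$root/$bdf/reset_method"
  if [ -r "$f" ]; then cat "$f"; else echo "reset_method not exposed"; fi
}
pci_reset_methods 0000:30:00.0
```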
Building vendor-reset on Kernel 6.17
The upstream repo hasn't been updated for kernel 6.x. Two patches are required.
Full build sequence
apt-get install -y pve-headers-$(uname -r) dkms git build-essential
cd /usr/src
git clone https://github.com/gnif/vendor-reset.git
cd vendor-reset
echo 'ccflags-y += -Wno-missing-prototypes' >> src/Makefile
sed -i 's|asm/unaligned.h|linux/unaligned.h|g' $(grep -rl 'asm/unaligned.h' .)
dkms install .
Secure Boot
If Secure Boot is enabled (common on NUC hardware), enroll a Machine Owner Key:
mokutil --import /var/lib/dkms/mok.pub
# Enter a one-time password, then reboot
# At blue UEFI screen: Enroll MOK → Continue → Yes → enter password → Reboot
Verify
lsmod | grep vendor_reset
# vendor_reset 32768 0
8 VM Configuration
qm set 100 -machine q35
qm set 100 -cpu host
qm set 100 -hostpci0 0000:30:00.0,pcie=1,rombar=1
qm set 100 -hostpci1 0000:30:00.1,pcie=1,rombar=0
machine: q35
Required for PCIe passthrough. The default i440fx machine type does not support PCIe slots.
cpu: host
Exposes actual host CPU feature flags to the guest. GPU drivers often require specific CPU features.
pcie=1
Use PCIe semantics for host-mapped device slots, not legacy PCI.
rombar=0 on audio
The HDMI audio function exposes no option ROM, so there is nothing for a ROM BAR to map. Setting rombar=1 causes guest boot issues.
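For reference, the qm set commands above end up as lines roughly like these in /etc/pve/qemu-server/100.conf (an illustrative fragment; the rest of the file depends on the VM's other settings):

```
machine: q35
cpu: host
hostpci0: 0000:30:00.0,pcie=1,rombar=1
hostpci1: 0000:30:00.1,pcie=1,rombar=0
```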
9 ROCm Deprecation and the Vulkan Solution
Everything works at the hardware and kernel level. This phase involves a completely different kind of failure: the software stack targeting the GPU is broken by an upstream deprecation decision.
Guest setup (Ubuntu 24.04)
The base Ubuntu 24.04 kernel doesn't include amdgpu. Install it:
apt-get install -y linux-modules-extra-$(uname -r)
apt-get install -y linux-firmware
modprobe amdgpu
ls /dev/dri/ # card0 renderD128
# Persist across reboots
echo 'amdgpu' > /etc/modules-load.d/amdgpu.conf
The ROCm problem
AMD Vega 10 (gfx900) was removed from ROCm in version 6.0 (released early 2024). Ollama 0.18.x bundles ROCm 7.2. The bundled libhipblas, librocblas, and libamdhip64 no longer contain pre-compiled GPU kernels for gfx900. When the runner initializes and attempts to load kernels for the detected GPU architecture, it segfaults because the kernel code for gfx900 is absent.
| ROCm Version | Vega 10 (gfx900) Status |
|---|---|
| ROCm 5.x | ✓ Full support |
| ROCm 6.0 | Deprecated, removed from compiled libraries |
| ROCm 7.x | Not present — segfault on init |
| Ollama 0.18+ | Bundles ROCm 7.x — affected |
The Vulkan solution
Ollama includes a Vulkan GPU compute backend alongside ROCm. Vulkan uses Mesa's RADV driver, which is community-maintained and supports all GCN/RDNA generations including Vega 10, independent of AMD's ROCm lifecycle. Vulkan compute ships its shaders as SPIR-V, which the driver compiles for the target GPU at runtime, so there is no pre-compiled kernel library to go missing.
# Install Vulkan support in the guest
apt-get install -y vulkan-tools libvulkan1 mesa-vulkan-drivers
# Verify GPU is visible
vulkaninfo 2>&1 | grep deviceName
# AMD Radeon RX Vega (RADV VEGA10)
# Enable Vulkan backend in Ollama
mkdir -p /etc/systemd/system/ollama.service.d
cat > /etc/systemd/system/ollama.service.d/rocm.conf << 'EOF'
[Service]
Environment=OLLAMA_VULKAN=1
EOF
systemctl daemon-reload && systemctl restart ollama
# ldconfig for Ollama's bundled libraries
echo '/usr/local/lib/ollama/rocm' > /etc/ld.so.conf.d/ollama-rocm.conf
echo '/usr/local/lib/ollama' > /etc/ld.so.conf.d/ollama.conf
ldconfig
Verify Ollama is using the GPU
journalctl -u ollama | grep -E "Vulkan|inference compute|vram"
# Expected output:
inference compute library=Vulkan name=Vulkan0
description="AMD Radeon RX Vega (RADV VEGA10)"
type=discrete total="8.0 GiB" available="8.0 GiB"
Verification Commands
Host-side
# GPU bound to vfio-pci
lspci -nnk | grep -A2 "1002:687f\|1002:aaf8"
# expected: Kernel driver in use: vfio-pci
# TB authorization state
cat /sys/bus/thunderbolt/devices/1-1/authorized
# expected: 1
# Watch attach/detach log
tail -f /var/log/egpu-attach.log
# vendor-reset loaded
lsmod | grep vendor_reset
# IOMMU groups after hot-plug
find /sys/kernel/iommu_groups/ -type l | sort -V | grep "2[cdef]:\|3[01]:"
Guest-side
# GPU visible
lspci | grep AMD
# DRM devices
ls /dev/dri/
# Vulkan
vulkaninfo 2>&1 | grep deviceName
# Ollama GPU detection
journalctl -u ollama | grep -E "vram|Vulkan|inference compute"
Troubleshooting
eGPU not appearing in lspci
1. Check the log: tail /var/log/egpu-attach.log
2. Verify TB device: ls /sys/bus/thunderbolt/devices/
3. If no TB device: USB-C-only cable — replace with certified TB3/TB4
4. Manual auth: echo 1 > /sys/bus/thunderbolt/devices/1-1/authorized
Boot hang with eGPU connected
This should not happen in normal operation — egpu-boot-auth.service handles boot-with-GPU-connected and egpu-shutdown.service handles clean reboot. If you hit this anyway:
1. Check shutdown service log: journalctl -u egpu-shutdown --since yesterday
2. Verify both services are active: systemctl is-active egpu-shutdown.service egpu-boot-auth.service
3. Check unit files for correct ExecStart=/bin/true + ExecStop pattern in egpu-shutdown
4. Emergency recovery if services are broken: boot without eGPU, hot-plug after boot, then debug the service
Network down after reboot with eGPU
NIC renamed due to PCI bus shift. Temporary fix: ip link set enp88s0 master vmbr0. Permanent fix: apply the udev stable naming rule in Phase 6.
VM start fails — destructive transaction error
Wrong service unit structure. The egpu-shutdown.service must have ExecStart=/bin/true, RemainAfterExit=yes, real work in ExecStop, and WantedBy=multi-user.target — not shutdown.target.
ROCm segfaults / total_vram=0
Expected with Vega 10 on Ollama 0.18+. ROCm 7.x dropped gfx900. Use Vulkan: add OLLAMA_VULKAN=1 to the systemd override, install mesa-vulkan-drivers in the guest. Do not try HSA_OVERRIDE_GFX_VERSION — it will not fix this.
vendor-reset fails to build
On kernel 6.x, both patches are required: the missing-prototypes flag and the unaligned.h header replacement. If Secure Boot is enabled, the module will silently fail to load — check dmesg | grep vendor and enroll the MOK key.
amdgpu module not found in guest
apt-get install -y linux-modules-extra-$(uname -r) linux-firmware then modprobe amdgpu.
Known Issues / Deprecated
ROCm 7.x: Vega 10 (gfx900) support removed. AMD dropped Vega 10 from ROCm in version 6.0. Ollama 0.18+ bundles ROCm 7.x. Any workflow relying on ROCm for Vega 56 inference is broken on current Ollama. The fix is the Vulkan backend, not a downgrade of Ollama.
vendor-reset: not updated for kernel 6.x. The gnif/vendor-reset repository has not been updated for Linux kernel 6.x API changes. The two patches in Phase 7 are required. If those patches stop working against a future kernel, the reset mechanism will fail and VM start will crash the host.
Thunderbolt PCIe tunneling and boot ordering
All three connection scenarios are fully supported by the systemd service stack in Phase 5:
- Hot-plug — udev fires `egpu-attach.sh` and authorizes the TB tunnel
- Boot with GPU connected — `egpu-boot-auth.service` authorizes after `multi-user.target`
- Reboot with GPU connected — `egpu-shutdown.service` de-authorizes before shutdown; `egpu-boot-auth.service` re-authorizes on the next boot
Thunderbolt PCIe tunneling is fundamentally dynamic — authorization must happen in userspace after boot. The services above close this gap. A future kernel with automated TB authorization policy could simplify the implementation, but the current stack covers every realistic connection scenario.
Audio function reset warning
The HDMI audio function (30:00.1) cannot be reset between VM starts. If it accumulates bad state, the only recovery is a full de-authorization/re-authorization cycle of the TB device. In practice this has not caused issues.
Secure Boot and DKMS
DKMS modules require MOK enrollment on Secure Boot systems. Proxmox kernel updates trigger automatic DKMS rebuilds, signed with the same enrolled key — re-enrollment not required unless the key is rotated.
Changelog
| Date | Entry |
|---|---|
| 2026-03-18 | Initial implementation: TB4 authorization, vfio-pci, boot/shutdown services, vendor-reset (patched for kernel 6.17), NIC stable naming, VM 100 GPU passthrough, Vulkan backend for Ollama (ROCm 7.x drops gfx900). Achieved 85 tok/s on llama3.2. |