Local LLM + RAG Pipeline on Ubuntu 24.04

★ Overview

This guide walks through deploying Ollama on an Ubuntu 24.04 VM, exposing it to the local network, adding Open WebUI as a chat frontend, standing up Chroma as a vector database, and building a RAG ingestion pipeline that indexes files from a Mac. No data leaves your network at any point.

Setting	Value
OS	Ubuntu 24.04.4 LTS
Host	Proxmox VE guest
vCPU / RAM	6 vCPU · 16 GB
Disk	100 GB
SSH alias	ollama
Inference	Phase 1: CPU-only

Install Ollama
Expose the API to the Network
Install Docker
Deploy Chroma
Deploy Open WebUI
Pull the Embedding Model
RAG Ingestion Pipeline
Architecture Overview
Troubleshooting
Phase 2 — GPU Passthrough

1 Install Ollama

ssh ollama
curl -fsSL https://ollama.com/install.sh | sh

The installer handles the full setup:

Installs binaries to /usr/local/
Creates an ollama system user and adds it to the render and video groups
Registers and starts ollama.service via systemd
Prints a warning if no GPU is detected — CPU-only mode is fully functional

Verify the installation

ollama --version
sudo systemctl status ollama --no-pager

Pull a model

ollama pull llama3.2

Test inference

curl http://localhost:11434/api/generate \
  -d '{"model":"llama3.2","prompt":"Hello","stream":false}'

2 Expose the API to the Network

By default Ollama binds to 127.0.0.1:11434 (loopback only). To reach it from other machines on the network, override the bind address by setting OLLAMA_HOST in a systemd drop-in:

sudo mkdir -p /etc/systemd/system/ollama.service.d
cat <<EOF | sudo tee /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
EOF

sudo systemctl daemon-reload
sudo systemctl restart ollama

Verify

sudo ss -tlnp | grep 11434
# Expected: LISTEN 0  4096  *:11434  *:*

Test from another machine

curl http://{VM_IP}:11434/
# Returns: Ollama is running

3 Install Docker

curl -fsSL https://get.docker.com | sudo sh
sudo usermod -aG docker {username}

Log out and back in (or run newgrp docker) for group membership to take effect.

4 Deploy Chroma

Chroma stores vector embeddings for the RAG pipeline. Run it as a persistent Docker container:

docker run -d \
  --name chroma \
  --restart always \
  -p 8000:8000 \
  -v chroma-data:/chroma/chroma \
  chromadb/chroma

The volume only preserves data if its mount target matches the directory the image actually writes to. Recent Chroma 1.x images are documented as persisting to /data rather than /chroma/chroma, so confirm the path for the image tag in use. If collections disappear after the container is recreated, the mount path is the first thing to check.

Heartbeat check

curl http://{VM_IP}:8000/api/v2/heartbeat
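
The same check can be scripted. The sketch below builds URLs under the /api/v2 prefix and calls the heartbeat endpoint with the standard library only. The helper names (chroma_url, collections_url, heartbeat) are hypothetical, and the trailing "collections" path segment is an assumption about the v2 layout rather than something this guide specifies.

```python
import json
import urllib.request


def chroma_url(host: str, *parts: str, port: int = 8000) -> str:
    """Build a URL under Chroma's /api/v2 prefix from path segments."""
    path = "/".join(part.strip("/") for part in parts)
    return f"http://{host}:{port}/api/v2/{path}"


def collections_url(host: str, tenant: str = "default_tenant",
                    database: str = "default_database", port: int = 8000) -> str:
    """URL of the collections endpoint (the "collections" suffix is an assumption)."""
    return chroma_url(host, "tenants", tenant, "databases", database,
                      "collections", port=port)


def heartbeat(host: str, port: int = 8000, timeout: float = 5.0) -> dict:
    """Call the heartbeat endpoint and return the decoded JSON body."""
    url = chroma_url(host, "heartbeat", port=port)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Calling heartbeat("{VM_IP}") with the VM's real address should return a small JSON object if the container is up. A connection error usually means the port mapping or the container itself needs attention.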

5 Deploy Open WebUI

Open WebUI is a ChatGPT-style frontend that connects to Ollama for inference and Chroma for RAG.

docker run -d \
  --name open-webui \
  --restart always \
  --add-host=host.docker.internal:host-gateway \
  -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -e VECTOR_DB=chroma \
  -e CHROMA_HTTP_HOST=host.docker.internal \
  -e CHROMA_HTTP_PORT=8000 \
  -e RAG_EMBEDDING_ENGINE=ollama \
  -e RAG_EMBEDDING_MODEL=nomic-embed-text \
  -e RAG_OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main

Access the interface at http://{VM_IP}:3000. On first visit, create a local admin account — no external accounts or cloud services required.

Verify Ollama connectivity from inside the container

docker exec open-webui curl -s http://host.docker.internal:11434/api/tags

6 Pull the Embedding Model

RAG requires a separate embedding model. nomic-embed-text is recommended: fast, accurate, and efficient on CPU.

ollama pull nomic-embed-text

What embedding models do

Embedding models convert text into high-dimensional vectors that represent semantic meaning. They have no concept of conversation; they only transform text into numbers. nomic-embed-text runs in the background during both ingestion and RAG queries, and you never interact with it directly.
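
To make "vectors that represent semantic meaning" concrete: two texts are considered related when their vectors point in similar directions. The sketch below computes cosine similarity, one common measure of that. The function name is illustrative, the vectors are toy values rather than real nomic-embed-text output, and the distance metric Chroma actually applies depends on how the collection is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


# Toy 2-dimensional "embeddings"; real ones have hundreds of dimensions.
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
```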

RAG query flow

Your question
  → nomic-embed-text → query vector
  → Chroma similarity search → relevant file chunks
  → llama3.2 (question + chunks as context) → answer
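
The last step of the flow, combining the question with the retrieved chunks, can be sketched as follows. Open WebUI assembles its own prompt internally; the template wording and the function names here are assumptions chosen for illustration. Only the model, prompt, and stream fields of the /api/generate request come from the earlier inference test.

```python
def build_rag_prompt(question: str, chunks: list[str]) -> str:
    """Place numbered context chunks ahead of the user's question."""
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )


def generate_payload(question: str, chunks: list[str],
                     model: str = "llama3.2") -> dict:
    """Build the JSON body for a non-streaming /api/generate request."""
    return {
        "model": model,
        "prompt": build_rag_prompt(question, chunks),
        "stream": False,
    }
```

The resulting dictionary can be posted to http://{VM_IP}:11434/api/generate in the same way as the curl test in section 1.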

7 RAG Ingestion Pipeline

The ingestion script runs on the source machine (Mac), walks a directory, extracts and chunks text, embeds via Ollama, and upserts into Chroma.

Install dependencies

pip3 install --break-system-packages requests pypdf python-docx

How it works

Walks /Users/{username} recursively (or a specified subdirectory)
Skips non-content directories — package caches, build artifacts, media libraries, IDE caches
Extracts text based on file type
Chunks text into segments of roughly 800 characters, with a 150-character overlap between consecutive segments
Sends all chunks for a file to Ollama in a single batch embed request
Upserts into Chroma with metadata: source path, chunk index, modified time, extension
On each full run: removes orphaned chunks (files deleted from disk) before indexing
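
The chunking step above can be sketched as a sliding window. This is a minimal illustration assuming plain character-based splitting with the 800/150 figures from the list; the actual script may split on other boundaries, and the function name is hypothetical.

```python
def chunk_text(text: str, size: int = 800, overlap: int = 150) -> list[str]:
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk.strip():  # skip windows that contain only whitespace
            chunks.append(chunk)
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

The overlap means a sentence that straddles a boundary appears whole in at least one chunk, at the cost of storing some text twice.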

Supported file types

Category	Extensions
Docs	`.txt`, `.md`, `.rst`, `.org`, `.tex`
Code	`.py`, `.js`, `.ts`, `.swift`, `.go`, `.rs`, `.c`, `.cpp`, `.sh`, `.sql`, and more
Config	`.json`, `.yaml`, `.toml`, `.ini`, `.csv`, `.xml`
Web	`.html`, `.css`, `.scss`, `.svelte`, `.vue`
Documents	`.pdf` (pypdf), `.docx` (python-docx)
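
Extraction by file type could be dispatched on the extension, roughly as below. This is a sketch, not the script's actual code: the extension set is abbreviated, the function name is hypothetical, and the PDF and DOCX branches assume the pypdf and python-docx packages installed above (imported lazily so that plain-text indexing works without them).

```python
from pathlib import Path
from typing import Optional

# Abbreviated; the table above lists the full set of plain-text extensions.
TEXT_EXTENSIONS = {".txt", ".md", ".rst", ".py", ".js", ".json", ".yaml", ".html"}


def extract_text(path: Path) -> Optional[str]:
    """Return the text content of a supported file, or None if unsupported."""
    ext = path.suffix.lower()
    if ext in TEXT_EXTENSIONS:
        # Ignore undecodable bytes rather than aborting the whole run.
        return path.read_text(encoding="utf-8", errors="ignore")
    if ext == ".pdf":
        from pypdf import PdfReader
        reader = PdfReader(str(path))
        # extract_text() can yield nothing for image-only pages.
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    if ext == ".docx":
        from docx import Document
        return "\n".join(p.text for p in Document(str(path)).paragraphs)
    return None
```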

Skipped directories

Version control: .git
Package caches: node_modules, go/pkg, .cargo, .gem, .gradle, .m2, .npm, .yarn
Build output: dist, build, DerivedData, target, .next
Apple media/system: Music/Music/Media.localized, Photos Library.photoslibrary, Library/Caches, Library/Developer, Library/Containers
IDE: Xcode, iOS DeviceSupport
Misc: __pycache__, venv, .venv, tmp, temp
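
One way to skip these directories without descending into them is to prune os.walk in place. The sketch below assumes a single SKIP_DIRS set holding both bare directory names and multi-part paths relative to the root. The entries shown are a subset of the list above, and the matching rules are an assumption about how the script behaves.

```python
import os

# Subset of the skip list; bare names match at any depth,
# slash-separated entries match relative to the walk root.
SKIP_DIRS = {".git", "node_modules", "__pycache__", "venv", ".venv",
             "dist", "build", "target", "Library/Caches"}


def iter_files(root: str, skip_dirs: set = SKIP_DIRS):
    """Yield file paths under root, never entering skipped directories."""
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        kept = []
        for name in dirnames:
            rel_name = name if rel == "." else os.path.join(rel, name)
            rel_name = rel_name.replace(os.sep, "/")
            if name in skip_dirs or rel_name in skip_dirs:
                continue
            kept.append(name)
        # Assigning to the slice tells os.walk not to descend into pruned dirs.
        dirnames[:] = kept
        for filename in filenames:
            yield os.path.join(dirpath, filename)
```

Pruning at walk time matters more than filtering afterwards: a filter still pays the cost of listing every file inside node_modules or a photo library.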

Run the index

# Full index (background, recommended for large home directories)
nohup python3 ~/rag-ingest.py > ~/rag-ingest.log 2>&1 &

# Monitor progress
tail -f ~/rag-ingest.log

# Check chunk count
python3 ~/rag-ingest.py --stats

CLI flags

Flag	Description
`--path /some/dir`	Index a specific directory instead of the default
`--stats`	Show total chunk count in Chroma
`--delete /path/to/file`	Remove a specific file's chunks from the index
`--cleanup`	Remove orphaned chunks only, no indexing
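
For reference, the flags in the table map naturally onto argparse. The parser below is a sketch that mirrors the table. The help strings, the None default for --path, and the function name are assumptions, not taken from the script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Command-line interface mirroring the flags in the table above."""
    parser = argparse.ArgumentParser(
        prog="rag-ingest.py",
        description="Index local files into Chroma using Ollama embeddings.",
    )
    parser.add_argument("--path", metavar="DIR", default=None,
                        help="index this directory instead of the default")
    parser.add_argument("--stats", action="store_true",
                        help="show total chunk count in Chroma")
    parser.add_argument("--delete", metavar="FILE", default=None,
                        help="remove a specific file's chunks from the index")
    parser.add_argument("--cleanup", action="store_true",
                        help="remove orphaned chunks only, no indexing")
    return parser
```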

Idempotency and cleanup

Chunk IDs are derived from a SHA-256 hash of the file path plus the chunk index. Re-running the script overwrites existing chunks — it never creates duplicates.
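
A deterministic ID of this kind might be computed as below. The exact way the path and index are combined before hashing (here, joined with a colon) is an assumption. Any stable encoding gives the same idempotency property, as long as it never changes between runs.

```python
import hashlib


def chunk_id(path: str, index: int) -> str:
    """Derive a stable ID from the file path and the chunk's position."""
    key = f"{path}:{index}"  # separator is an assumption, not the script's
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```

Because the ID depends only on the path and index, re-indexing an edited file replaces its chunks in place. A file that shrinks leaves its higher-numbered chunks behind unless the script deletes them, which is worth checking in the implementation.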

On every full run, the script first scans all source paths stored in Chroma and removes any whose files no longer exist on disk. This keeps the index accurate without manual intervention.
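
The orphan check reduces to a set difference between indexed source paths and paths that still exist. The sketch below assumes the source paths have already been read out of Chroma's metadata. The function name is hypothetical, and the exists parameter is there only so the logic can be exercised without touching the disk.

```python
import os
from typing import Callable, Iterable


def find_orphans(indexed_sources: Iterable[str],
                 exists: Callable[[str], bool] = os.path.exists) -> list[str]:
    """Return indexed source paths whose files are no longer on disk."""
    # A set removes duplicates: each file contributes one path per chunk.
    return sorted({src for src in indexed_sources if not exists(src)})
```

Each returned path would then have its chunks deleted from the collection before indexing begins.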

Removing sensitive files

Files can be removed from the index at any time without re-indexing everything:

python3 ~/rag-ingest.py --delete /Users/{username}/path/to/sensitive/file

8 Architecture Overview

All components communicate over the local network. The source machine handles text extraction and chunking; the VM runs all inference and stores vectors.

Data flow

Source machine (/Users/{username})
└── rag-ingest.py
    ├── text extraction + chunking
    ├── → Ollama:{VM_IP}:11434 (nomic-embed-text → vectors)
    └── → Chroma:{VM_IP}:8000 (upsert chunks + vectors)

Browser
└── Open WebUI → http://{VM_IP}:3000
    ├── → Ollama (host.docker.internal:11434): chat inference
    ├── → Ollama (host.docker.internal:11434): query embedding
    └── → Chroma (host.docker.internal:8000): vector similarity search

All traffic stays on the local network.
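
The two requests the source machine sends can be sketched as payload builders. The embed body follows Ollama's /api/embed endpoint, which accepts a list under input. The upsert body uses the ids, embeddings, documents, and metadatas fields of Chroma's upsert call. The metadata key names, the ID scheme, and the function names are assumptions made for illustration.

```python
import hashlib
import os


def embed_request(chunks: list[str], model: str = "nomic-embed-text") -> dict:
    """Body for one batch request to Ollama's /api/embed."""
    return {"model": model, "input": chunks}


def build_upsert_payload(path: str, chunks: list[str],
                         embeddings: list[list[float]], mtime: float) -> dict:
    """Body for a Chroma upsert: one record per chunk of a single file."""
    if len(chunks) != len(embeddings):
        raise ValueError("need exactly one embedding per chunk")
    ext = os.path.splitext(path)[1].lower()
    ids = [
        hashlib.sha256(f"{path}:{i}".encode("utf-8")).hexdigest()
        for i in range(len(chunks))
    ]
    # Metadata key names below are hypothetical.
    metadatas = [
        {"source": path, "chunk_index": i, "modified": mtime, "extension": ext}
        for i in range(len(chunks))
    ]
    return {"ids": ids, "embeddings": embeddings,
            "documents": chunks, "metadatas": metadatas}
```

Sending all chunks of a file in one embed request keeps the number of round trips to the VM at two per file, one to Ollama and one to Chroma.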

↯ Troubleshooting

Symptom	Cause	Fix
"does not support chat" in Open WebUI	`nomic-embed-text` selected as chat model	Switch to a chat model (e.g. `llama3.2`) in the model dropdown
"Connect a model" on Open WebUI load	`OLLAMA_BASE_URL` points at `http://127.0.0.1`, which inside the container refers to the container itself, not the VM	Use `http://host.docker.internal:11434`
Embed timeouts during ingest	CPU embedding is slow; Ollama busy	Script retries 3x with backoff; reduce file sizes or add `--path` to index in batches
Chroma 410 Gone errors	Client is calling legacy API paths; current Chroma serves the v2 API under different paths	Update to `/api/v2/tenants/default_tenant/databases/default_database/...`
VM IP not in ARP table after boot	DHCP lease held by router, not Proxmox	Ping-sweep the subnet: `for i in $(seq 1 254); do ping -c1 -W1 192.168.X.$i &>/dev/null & done; wait; arp -an`
Ingest stuck on one directory for hours	Package cache or media library in path	Kill the process, add the directory to `SKIP_DIRS` in the script, restart
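
The "retries 3x with backoff" behavior mentioned for embed timeouts could look like the helper below. This is a generic sketch assuming exponential delays of 1 s and 2 s between three attempts. The script's actual delays and the exceptions it catches are not documented here, and the function name is hypothetical.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(fn: Callable[[], T], attempts: int = 3, base_delay: float = 1.0,
          exceptions: Tuple[Type[BaseException], ...] = (Exception,),
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn until it succeeds, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except exceptions:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")  # loop always returns or raises
```

Wrapping only the embed call, rather than the whole per-file routine, avoids re-extracting and re-chunking a large file each time Ollama is slow to respond.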

2 Phase 2 — GPU Passthrough

✓ Changelog

2026-03-18 Initial setup and RAG pipeline documented