★ Overview
This guide walks through deploying Ollama on an Ubuntu 24.04 VM, exposing it to the local network, adding Open WebUI as a chat frontend, standing up Chroma as a vector database, and building a RAG ingestion pipeline that indexes files from a Mac. No data leaves your network at any point.
1 Install Ollama
ssh ollama
curl -fsSL https://ollama.com/install.sh | sh
The installer handles the full setup:
- Installs binaries to /usr/local/
- Creates an ollama system user and adds it to the render and video groups
- Registers and starts ollama.service via systemd
- Prints a warning if no GPU is detected; CPU-only mode is fully functional
Verify the installation
ollama --version
sudo systemctl status ollama --no-pager
Pull a model
ollama pull llama3.2
Test inference
curl http://localhost:11434/api/generate \
-d '{"model":"llama3.2","prompt":"Hello","stream":false}'
2 Expose the API to the Network
By default Ollama binds to 127.0.0.1:11434 (loopback only). To reach it from other machines on the network, override via a systemd drop-in:
sudo mkdir -p /etc/systemd/system/ollama.service.d
cat <<EOF | sudo tee /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
Verify
sudo ss -tlnp | grep 11434
# Expected: LISTEN 0 4096 *:11434 *:*
Test from another machine
curl http://{VM_IP}:11434/
# Returns: Ollama is running
3 Install Docker
curl -fsSL https://get.docker.com | sudo sh
sudo usermod -aG docker {username}
Log out and back in (or run newgrp docker) for group membership to take effect.
4 Deploy Chroma
Chroma stores vector embeddings for the RAG pipeline. Run it as a persistent Docker container:
docker run -d \
--name chroma \
--restart always \
-p 8000:8000 \
-v chroma-data:/chroma/chroma \
chromadb/chroma
Heartbeat check
curl http://{VM_IP}:8000/api/v2/heartbeat
5 Deploy Open WebUI
Open WebUI is a ChatGPT-style frontend that connects to Ollama for inference and Chroma for RAG.
docker run -d \
--name open-webui \
--restart always \
--add-host=host.docker.internal:host-gateway \
-p 3000:8080 \
-e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
-e VECTOR_DB=chroma \
-e CHROMA_HTTP_HOST=host.docker.internal \
-e CHROMA_HTTP_PORT=8000 \
-e RAG_EMBEDDING_ENGINE=ollama \
-e RAG_EMBEDDING_MODEL=nomic-embed-text \
-e RAG_OLLAMA_BASE_URL=http://host.docker.internal:11434 \
-v open-webui:/app/backend/data \
ghcr.io/open-webui/open-webui:main
Access the interface at http://{VM_IP}:3000. On first visit, create a local admin account — no external accounts or cloud services required.
Verify Ollama connectivity from inside the container
docker exec open-webui curl -s http://host.docker.internal:11434/api/tags
6 Pull the Embedding Model
RAG requires a separate embedding model. nomic-embed-text is recommended: fast, accurate, and efficient on CPU.
ollama pull nomic-embed-text
What embedding models do
Embedding models convert text into high-dimensional vectors that represent semantic meaning. They have no concept of conversation — they only transform text into numbers. nomic-embed-text runs silently in the background during RAG queries. You never interact with it directly.
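To make "vectors that represent semantic meaning" concrete: a similarity search compares vectors numerically, and cosine similarity is one common measure. The sketch below uses invented three-dimensional vectors (real embedding vectors have hundreds of dimensions), and `cosine_similarity` is an illustrative helper, not part of Ollama or Chroma. Chroma's configured distance metric may differ.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors, invented for illustration only.
query = [0.9, 0.1, 0.0]
chunk_about_same_topic = [0.8, 0.2, 0.1]
chunk_about_other_topic = [0.0, 0.1, 0.9]

scores = {
    "same topic": cosine_similarity(query, chunk_about_same_topic),
    "other topic": cosine_similarity(query, chunk_about_other_topic),
}
```

Here the "same topic" score comes out much higher than the "other topic" score, which is the property a vector database relies on when it ranks chunks against a query.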
RAG query flow
User question → nomic-embed-text → query vector
→ Chroma similarity search → relevant file chunks
→ llama3.2 (question + chunks as context) → answer
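The final step of the flow, handing "question + chunks as context" to the chat model, comes down to assembling a prompt. The following is a minimal sketch under assumptions: Open WebUI uses its own prompt template, and `build_rag_prompt` and its wording are hypothetical.

```python
def build_rag_prompt(question: str, chunks: list[str]) -> str:
    """Combine retrieved chunks and the user's question into one prompt."""
    # Number each chunk so the pieces of context stay distinguishable.
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question.strip()}"
    )
```

The resulting string would be sent to the chat model as the prompt, so the model answers from the retrieved file chunks rather than from its training data alone.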
7 RAG Ingestion Pipeline
The ingestion script runs on the source machine (Mac), walks a directory, extracts and chunks text, embeds via Ollama, and upserts into Chroma.
Install dependencies
pip3 install --break-system-packages requests pypdf python-docx
How it works
- Walks /Users/{username} recursively (or a specified subdirectory)
- Skips non-content directories: package caches, build artifacts, media libraries, IDE caches
- Extracts text based on file type
- Chunks text into ~800 character segments with 150 character overlap
- Sends all chunks for a file to Ollama in a single batch embed request
- Upserts into Chroma with metadata: source path, chunk index, modified time, extension
- On each full run: removes orphaned chunks (files deleted from disk) before indexing
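The chunking step in the list above can be sketched as a fixed-size sliding window. This is an illustration, not the script's actual code: `chunk_text` is a hypothetical name, and the real script may cut at slightly different boundaries (hence the "~800").

```python
def chunk_text(text: str, size: int = 800, overlap: int = 150) -> list[str]:
    """Split text into windows of `size` characters.

    Consecutive windows share `overlap` characters, so a sentence that
    straddles a boundary still appears whole in at least one chunk.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With the defaults, a 2,000-character file yields three chunks starting at offsets 0, 650, and 1300. The overlap costs some extra storage but reduces the chance that a relevant passage is split across two chunks.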
Supported file types
| Category | Extensions |
|---|---|
| Docs | .txt, .md, .rst, .org, .tex |
| Code | .py, .js, .ts, .swift, .go, .rs, .c, .cpp, .sh, .sql, and more |
| Config | .json, .yaml, .toml, .ini, .csv, .xml |
| Web | .html, .css, .scss, .svelte, .vue |
| Documents | .pdf (pypdf), .docx (python-docx) |
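Extraction by file type can be sketched as a dispatch on the extension. This is an assumption about structure, not the script itself: `extract_text` and `PLAIN_TEXT_EXTENSIONS` are hypothetical names, and the extension set is only a subset of the table above. The `pypdf` and `python-docx` imports are done lazily so they are only needed when such a file is encountered.

```python
from pathlib import Path
from typing import Optional

# Illustrative subset of the plain-text extensions listed in the table.
PLAIN_TEXT_EXTENSIONS = {".txt", ".md", ".rst", ".py", ".json", ".yaml", ".html"}

def extract_text(path: Path) -> Optional[str]:
    """Return the text of a supported file, or None for unsupported types."""
    ext = path.suffix.lower()
    if ext in PLAIN_TEXT_EXTENSIONS:
        # errors="replace" keeps going when a file is not valid UTF-8.
        return path.read_text(encoding="utf-8", errors="replace")
    if ext == ".pdf":
        from pypdf import PdfReader  # only needed for PDFs
        reader = PdfReader(str(path))
        # extract_text() can return an empty result for image-only pages.
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    if ext == ".docx":
        from docx import Document  # provided by the python-docx package
        return "\n".join(p.text for p in Document(str(path)).paragraphs)
    return None
```

Returning `None` for unknown extensions lets the caller skip binary files without raising an error.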
Skipped directories
- Version control: .git
- Package caches: node_modules, go/pkg, .cargo, .gem, .gradle, .m2, .npm, .yarn
- Build output: dist, build, DerivedData, target, .next
- Apple media/system: Music/Music/Media.localized, Photos Library.photoslibrary, Library/Caches, Library/Developer, Library/Containers
- IDE: Xcode, iOS DeviceSupport
- Misc: __pycache__, venv, .venv, tmp, temp
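One way to skip these directories is to prune them during the walk itself, so their contents are never visited. The sketch below assumes a `SKIP_DIRS` set of single directory names (an illustrative subset) and a hypothetical `iter_files` helper; the real script may match differently.

```python
import os
from pathlib import Path
from typing import Iterator

# Illustrative subset; single directory names only.
SKIP_DIRS = {".git", "node_modules", "__pycache__", "venv", ".venv", "dist", "build"}

def iter_files(root: Path, skip_dirs: set[str] = SKIP_DIRS) -> Iterator[Path]:
    """Yield every file under root, never descending into skipped directories."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place tells os.walk not to descend into them.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            yield Path(dirpath) / name
```

Entries with more than one path component, such as go/pkg or Library/Caches, would need a check against the relative path rather than the bare directory name.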
Run the index
# Full index (background, recommended for large home directories)
nohup python3 ~/rag-ingest.py > ~/rag-ingest.log 2>&1 &
# Monitor progress
tail -f ~/rag-ingest.log
# Check chunk count
python3 ~/rag-ingest.py --stats
CLI flags
| Flag | Description |
|---|---|
| --path /some/dir | Index a specific directory instead of the default |
| --stats | Show total chunk count in Chroma |
| --delete /path/to/file | Remove a specific file's chunks from the index |
| --cleanup | Remove orphaned chunks only, no indexing |
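For readers writing their own version of the script, the flags above map naturally onto `argparse`. This is a sketch of one possible definition; `build_parser` and the help strings are assumptions, not the script's actual source.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Define the four flags described in the table."""
    parser = argparse.ArgumentParser(description="Index local files into Chroma.")
    parser.add_argument("--path", metavar="DIR",
                        help="index this directory instead of the default")
    parser.add_argument("--stats", action="store_true",
                        help="show total chunk count and exit")
    parser.add_argument("--delete", metavar="FILE",
                        help="remove one file's chunks from the index")
    parser.add_argument("--cleanup", action="store_true",
                        help="remove orphaned chunks only, no indexing")
    return parser

# Example: parse an explicit argument list instead of sys.argv.
args = build_parser().parse_args(["--path", "/some/dir"])
```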
Idempotency and cleanup
Chunk IDs are derived from a SHA-256 hash of the file path plus the chunk index. Re-running the script overwrites existing chunks — it never creates duplicates.
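A deterministic ID of this kind can be sketched as follows. The separator and the function name `chunk_id` are assumptions; what matters is that the same path and index always hash to the same ID, so an upsert replaces the earlier chunk.

```python
import hashlib

def chunk_id(file_path: str, chunk_index: int) -> str:
    """Derive a stable ID from the file path and the chunk's position."""
    # The ":" separator and UTF-8 encoding are assumptions; any fixed
    # scheme works as long as it never changes between runs.
    key = f"{file_path}:{chunk_index}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()
```

Because the ID depends on the path and not on the content, editing a file and re-indexing it overwrites the old chunks in place. If the edited file has fewer chunks than before, the leftover higher-index chunks would need to be deleted separately.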
On every full run, the script first scans all source paths stored in Chroma and removes any whose files no longer exist on disk. This keeps the index accurate without manual intervention.
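The orphan check reduces to comparing the source paths stored in the index against the filesystem. A minimal sketch, assuming the stored paths have already been fetched from Chroma into a list; `find_orphans` is a hypothetical helper, and the `exists` parameter is there only so the logic can be exercised without touching the disk.

```python
import os
from typing import Callable, Iterable

def find_orphans(
    indexed_paths: Iterable[str],
    exists: Callable[[str], bool] = os.path.exists,
) -> list[str]:
    """Return indexed source paths whose files are no longer on disk."""
    # set() removes repeats: every chunk of a file carries the same source path.
    return sorted(p for p in set(indexed_paths) if not exists(p))
```

Each returned path would then be passed to the same deletion routine that backs the --delete flag.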
Removing sensitive files
Files can be removed from the index at any time without re-indexing everything:
python3 ~/rag-ingest.py --delete /Users/{username}/path/to/sensitive/file
8 Architecture Overview
All components communicate over the local network. The source machine handles text extraction and chunking; the VM runs all inference and stores vectors.
↯ Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| "does not support chat" in Open WebUI | nomic-embed-text selected as chat model | Switch to a chat model (e.g. llama3.2) in the model dropdown |
| "Connect a model" on Open WebUI load | OLLAMA_BASE_URL=http://127.0.0.1 points at the container itself, not the host | Use http://host.docker.internal:11434 |
| Embed timeouts during ingest | CPU embedding is slow; Ollama busy | Script retries 3x with backoff; reduce file sizes or add --path to index in batches |
| Chroma 410 Gone errors | Chroma v2 uses different API paths | Update to /api/v2/tenants/default_tenant/databases/default_database/... |
| VM IP not in ARP table after boot | DHCP lease held by router, not Proxmox | Ping-sweep the subnet: for i in $(seq 1 254); do ping -c1 -W1 192.168.X.$i &>/dev/null & done; wait; arp -an |
| Ingest stuck on one directory for hours | Package cache or media library in path | Kill the process, add the directory to SKIP_DIRS in the script, restart |
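The "retries 3x with backoff" behavior mentioned for embed timeouts can be sketched as a small wrapper. This is an assumption about the approach, not the script's code: `with_retries`, the three-attempt reading of "3x", and the 2-second base delay are all illustrative.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(); on failure wait base_delay, then 2x, 4x, ... and try again."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

In an ingest script, `call` would wrap the embed request for one file, for example `with_retries(lambda: embed(chunks))`, where `embed` is whatever function posts the chunks to Ollama.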