## Installed on This Stack

| Model | Size | Type | Notes |
|---|---|---|---|
| llama3.2:latest | 2.0 GB | chat | Meta's compact baseline — fast, general-purpose |
| llama3.1:70b | ~40 GB | chat | Full-scale 70B — needs 48GB+ RAM or GPU VRAM |
| codellama:7b | 3.8 GB | code | Code completion and explanation, any language |
| mistral:7b | 4.4 GB | chat | Fast and capable — strong all-rounder at 7B scale |
| deepseek-coder:6.7b | 3.8 GB | code | Chinese open-source code model — fully local, no callbacks |
| nomic-embed-text | 274 MB | embed | Text embeddings only — powers the RAG pipeline |
## RAM Requirements at a Glance

| Model Size | RAM Needed (Q4) | Notes |
|---|---|---|
| 3B | ~4 GB | Runs on almost anything |
| 7B | ~6–8 GB | Comfortable on 8GB+ |
| 13B | ~10–12 GB | 16GB recommended |
| 34B | ~24–28 GB | 32GB+ needed |
| 70B | ~38–42 GB | 48GB+ recommended |
llama3.1:70b on CPU: Without enough RAM, the model memory-maps to disk and runs at 1–2 tok/s — technically functional, not practical for chat. With 48GB RAM and the Vega 56 via Vulkan, it loads fully and runs at usable speeds. Use smaller models (llama3.1, mistral:7b) for interactive sessions where speed matters.
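The table above follows a rough rule of thumb: a Q4-quantized model needs on the order of half a gigabyte per billion parameters, plus overhead for the KV cache and runtime. A minimal sketch of that heuristic — the constants (0.55 GB/B, 2 GB overhead) are ballpark assumptions, not values measured from Ollama:

```python
def q4_ram_estimate_gb(params_billion: float) -> float:
    """Rough RAM estimate for a Q4-quantized model.

    Assumes ~0.55 GB per billion parameters (about 4.5 effective
    bits per weight) plus ~2 GB for KV cache and runtime overhead.
    Ballpark figures only; real usage varies with context length.
    """
    return round(params_billion * 0.55 + 2.0, 1)

for p in (3, 7, 13, 34, 70):
    print(f"{p}B -> ~{q4_ram_estimate_gb(p)} GB")
```

The 70B estimate lands around 40 GB, consistent with the table; mixture-of-experts models like mixtral break the heuristic because all experts must be resident even though only some fire per token.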
## Catalog — Chat & Instruction

| Model | Params | Size | Best For |
|---|---|---|---|
| llama3.2 | 3B | 2.0 GB | Fast general chat, edge/embedded use |
| llama3.2:1b | 1B | 1.3 GB | Minimum viable — very fast, basic quality |
| llama3.1 | 8B | 4.7 GB | Solid general-purpose, strong instruction following |
| llama3.1:70b | 70B | 40 GB | Highest quality open model in its class |
| mistral | 7B | 4.1 GB | Punches above its weight, Apache 2.0 licensed |
| mistral-nemo | 12B | 7.1 GB | Mistral + Nvidia collaboration, strong context handling |
| mixtral | 8x7B MoE | 26 GB | Mixture-of-experts — high quality, memory-hungry |
| gemma2 | 9B | 5.4 GB | Google's open model — competitive at 9B |
| gemma2:2b | 2B | 1.6 GB | Smallest Gemma, good for lightweight tasks |
| phi4 | 14B | 8.9 GB | Microsoft's small model with outsized reasoning ability |
| phi4-mini | 3.8B | 2.5 GB | Ultra-efficient, strong at math and structured tasks |
| qwen2.5 | 7B | 4.7 GB | Alibaba — multilingual, strong coding |
| qwen2.5:72b | 72B | 47 GB | Top-tier open model, competitive with GPT-4 class |
| command-r | 35B | 20 GB | Cohere — RAG-optimized, strong retrieval tasks |
| command-r-plus | 104B | 59 GB | Cohere flagship — excellent at tool use and RAG |
## Catalog — Code Models

| Model | Params | Size | Best For |
|---|---|---|---|
| codellama | 7B | 3.8 GB | Meta's code model — multi-language, solid baseline |
| codellama:13b | 13B | 7.4 GB | Better quality code, still manageable size |
| codellama:34b | 34B | 19 GB | Best CodeLlama tier — approaches GPT-4 on code |
| deepseek-coder | 6.7B | 3.8 GB | Strong code model, fully local — see note below |
| deepseek-coder:33b | 33B | 19 GB | Competes with GPT-4 on HumanEval benchmarks |
| deepseek-coder-v2 | 16B | 8.9 GB | V2 architecture — significant improvement over v1 |
| codegemma | 7B | 5.0 GB | Google's code model, Apache 2.0 |
| starcoder2 | 15B | 9.1 GB | BigCode open model, trained on 600+ languages |
| qwen2.5-coder | 7B | 4.7 GB | Strong recent code model from Alibaba |
| qwen2.5-coder:32b | 32B | 19 GB | One of the best open-source code models available |
## Catalog — Reasoning & Math

| Model | Params | Size | Best For |
|---|---|---|---|
| deepseek-r1 | 7B | 4.7 GB | DeepSeek reasoning model — chain-of-thought, math |
| deepseek-r1:14b | 14B | 9.0 GB | Better reasoning, runs on 16GB |
| deepseek-r1:70b | 70B | 43 GB | Strong reasoning — needs 48GB+ |
| phi4 | 14B | 8.9 GB | Microsoft reasoning-focused — punches well above its size |
| qwq | 32B | 20 GB | Alibaba QwQ — strong at math, long reasoning chains |
## Catalog — Embedding Models

Embedding models convert text to vectors. They cannot generate responses. They are the backbone of any RAG pipeline — use them to index documents and search by semantic similarity.

| Model | Size | Notes |
|---|---|---|
| nomic-embed-text | 274 MB | Fast local embeddings — powers this RAG stack |
| nomic-embed-text:v1.5 | 274 MB | Updated version, same size, better retrieval performance |
| mxbai-embed-large | 669 MB | Higher quality embeddings, slower |
| snowflake-arctic-embed | 137 MB | Strong retrieval performance, compact |
| all-minilm | 46 MB | Smallest viable option — lower quality at extremes |
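The semantic-similarity search these models enable reduces to comparing vectors by cosine similarity. A toy sketch of that step — the 4-dimensional vectors and chunk names here are made up for illustration, standing in for the high-dimensional embeddings a real model like nomic-embed-text returns:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical pre-computed embeddings for indexed document chunks.
chunks = {
    "backup script":   [0.9, 0.1, 0.0, 0.1],
    "gpu passthrough": [0.1, 0.8, 0.3, 0.0],
    "docker compose":  [0.2, 0.1, 0.9, 0.1],
}

# Pretend-embedding of the user query "how do backups run?"
query = [0.85, 0.15, 0.05, 0.1]

# Rank chunks by similarity to the query: the core of vector search.
best = max(chunks, key=lambda name: cosine_similarity(query, chunks[name]))
print(best)  # -> backup script
```

In the real pipeline, Chroma stores the chunk vectors and performs this nearest-neighbor ranking; the top chunks are then injected into the chat model's prompt.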
## Catalog — Vision & Multimodal

| Model | Params | Size | Best For |
|---|---|---|---|
| llava | 7B | 4.5 GB | Image + text — describe images, answer visual questions |
| llava:13b | 13B | 8.0 GB | Better visual quality, same interface |
| llava-phi3 | 3.8B | 2.9 GB | Compact multimodal, good speed |
| moondream | 1.8B | 1.7 GB | Lightweight image model — edge/embedded use |
| bakllava | 7B | 4.5 GB | LLaVA variant with Mistral backbone |
## Model Notes

### DeepSeek-Coder — Privacy and Trust
Short version: safe to use locally. DeepSeek is a Chinese AI lab. The concern some raise relates to their hosted API product — not the open-weight models. The model weights distributed via Ollama are static files. There is no telemetry, no runtime network connection, and no callback to DeepSeek's infrastructure. The model runs entirely offline once pulled. MIT/Apache 2.0 licensed.
### nomic-embed-text — Embedding vs. Chat
This model is embedding-only. It takes text and returns a vector — it cannot generate responses. That's what powers the RAG pipeline: convert a query to a vector, find the closest matching document chunks in Chroma, inject those chunks into the prompt context for the chat model.
If you select it as a chat model in Open WebUI, you'll get a "does not support chat" error. This is expected. It's not broken — it just serves a different purpose.
### llama3.1:70b — Practical Expectations
On this stack with the Vega 56 GPU (8GB VRAM) via Vulkan, Ollama offloads as many layers as VRAM allows and runs the remainder on CPU. With 48GB system RAM, the full model fits in memory. Expect 20–40 tok/s depending on context length. Without GPU, CPU-only drops to 1–5 tok/s — technically functional, not suitable for interactive chat.
For fast interactive sessions: use llama3.1 (8B) or mistral:7b. Reserve 70B for tasks where output quality matters more than speed — long-form writing, complex analysis, summarization.
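The partial-offload arithmetic behind that tradeoff can be sketched in a few lines. The layer count (80 for a Llama 70B) is accurate, but treating all layers as equal-sized and reserving a fixed buffer for the KV cache are simplifying assumptions, not how Ollama's allocator actually works:

```python
def layers_on_gpu(model_gb: float, n_layers: int,
                  vram_gb: float, reserve_gb: float = 1.0) -> int:
    """Estimate how many transformer layers fit in VRAM.

    Assumes layers are roughly equal in size and reserves
    reserve_gb for KV cache and framework buffers (both
    simplifications of real allocator behavior).
    """
    per_layer_gb = model_gb / n_layers
    usable = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable / per_layer_gb))

# Q4 70B: ~40 GB of weights across 80 layers, 8 GB VRAM (Vega 56).
print(layers_on_gpu(40, 80, 8))
```

With only a modest fraction of the 80 layers on the GPU, most of each forward pass still runs on the CPU, which is why system RAM and memory bandwidth, not VRAM, dominate 70B throughput on this stack.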
CLI Reference
ollama list
ollama pull mistral:7b
ollama run mistral:7b
ollama rm codellama:7b
ollama show llama3.2
curl http://localhost:11434/api/ps | python3 -m json.tool
tail -f /tmp/llama70b-pull.log