📑 model reference

# Ollama Model Catalog

Curated reference covering installed models and notable options from the Ollama library — sizes, RAM requirements, and what each model is actually good for.

## Installed on This Stack

| Model | Size | Type | Notes |
|---|---|---|---|
| llama3.2:latest | 2.0 GB | chat | Meta's compact baseline — fast, general-purpose |
| llama3.1:70b | ~40 GB | chat | Full-scale 70B — needs 48GB+ RAM or GPU VRAM |
| codellama:7b | 3.8 GB | code | Code completion and explanation, any language |
| mistral:7b | 4.4 GB | chat | Fast and capable — strong all-rounder at 7B scale |
| deepseek-coder:6.7b | 3.8 GB | code | Chinese open-source code model — fully local, no callbacks |
| nomic-embed-text | 274 MB | embed | Text embeddings only — powers the RAG pipeline |

## RAM Requirements at a Glance

| Model Size | RAM Needed (Q4) | Notes |
|---|---|---|
| 3B | ~4 GB | Runs on almost anything |
| 7B | ~6–8 GB | Comfortable on 8GB+ |
| 13B | ~10–12 GB | 16GB recommended |
| 34B | ~24–28 GB | 32GB+ needed |
| 70B | ~38–42 GB | 48GB+ recommended |
llama3.1:70b on CPU: Without enough RAM, the model memory-maps to disk and runs at 1–2 tok/s — technically functional, not practical for chat. With 48GB RAM and the Vega 56 via Vulkan, it loads fully and runs at usable speeds. Use smaller models (llama3.1, mistral:7b) for interactive sessions where speed matters.
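The table values follow a simple rule of thumb: Q4 quantization stores roughly half a byte per parameter, plus headroom for the KV cache, activations, and the runtime. A minimal sketch of that estimate (the 0.5 bytes/parameter figure and the flat 2 GB overhead are rough assumptions, not Ollama internals; real overhead grows with context length):

```python
def q4_ram_gb(params_billions, overhead_gb=2.0):
    """Rough RAM estimate for a Q4-quantized model.

    Assumes ~0.5 bytes per parameter for the weights, plus a flat
    allowance for KV cache, activations, and the runtime itself.
    """
    weights_gb = params_billions * 1e9 * 0.5 / (1024 ** 3)
    return round(weights_gb + overhead_gb, 1)

# A 7B model lands near the low end of the table's ~6-8 GB range;
# a 70B model lands in the same ballpark as the table's ~38-42 GB.
print(q4_ram_gb(7), q4_ram_gb(70))
```

Treat the output as a floor, not a ceiling: long contexts and concurrent loaded models push actual usage above it.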

## Catalog — Chat & Instruction

| Model | Params | Size | Best For |
|---|---|---|---|
| llama3.2 | 3B | 2.0 GB | Fast general chat, edge/embedded use |
| llama3.2:1b | 1B | 1.3 GB | Minimum viable — very fast, basic quality |
| llama3.1 | 8B | 4.7 GB | Solid general-purpose, strong instruction following |
| llama3.1:70b | 70B | 40 GB | Highest quality open model at this class |
| mistral | 7B | 4.1 GB | Punches above its weight, Apache 2.0 licensed |
| mistral-nemo | 12B | 7.1 GB | Mistral + Nvidia collaboration, strong context handling |
| mixtral | 8x7B MoE | 26 GB | Mixture-of-experts — high quality, memory-hungry |
| gemma2 | 9B | 5.4 GB | Google's open model — competitive at 9B |
| gemma2:2b | 2B | 1.6 GB | Smallest Gemma, good for lightweight tasks |
| phi4 | 14B | 8.9 GB | Microsoft's small model with outsized reasoning ability |
| phi4-mini | 3.8B | 2.5 GB | Ultra-efficient, strong at math and structured tasks |
| qwen2.5 | 7B | 4.7 GB | Alibaba — multilingual, strong coding |
| qwen2.5:72b | 72B | 47 GB | Top-tier open model, competitive with GPT-4 class |
| command-r | 35B | 20 GB | Cohere — RAG-optimized, strong retrieval tasks |
| command-r-plus | 104B | 59 GB | Cohere flagship — excellent at tool use and RAG |

## Catalog — Code Models

| Model | Params | Size | Best For |
|---|---|---|---|
| codellama | 7B | 3.8 GB | Meta's code model — multi-language, solid baseline |
| codellama:13b | 13B | 7.4 GB | Better quality code, still manageable size |
| codellama:34b | 34B | 19 GB | Best CodeLlama tier — approaches GPT-4 on code |
| deepseek-coder | 6.7B | 3.8 GB | Strong code model, fully local — see note below |
| deepseek-coder:33b | 33B | 19 GB | Competes with GPT-4 on HumanEval benchmarks |
| deepseek-coder-v2 | 16B | 8.9 GB | V2 architecture — significant improvement over v1 |
| codegemma | 7B | 5.0 GB | Google's code model, Apache 2.0 |
| starcoder2 | 15B | 9.1 GB | BigCode open model, trained on 600+ languages |
| qwen2.5-coder | 7B | 4.7 GB | Strong recent code model from Alibaba |
| qwen2.5-coder:32b | 32B | 19 GB | One of the best open-source code models available |

## Catalog — Reasoning & Math

| Model | Params | Size | Best For |
|---|---|---|---|
| deepseek-r1 | 7B | 4.7 GB | DeepSeek reasoning model — chain-of-thought, math |
| deepseek-r1:14b | 14B | 9.0 GB | Better reasoning, runs on 16GB |
| deepseek-r1:70b | 70B | 43 GB | Strong reasoning — needs 48GB+ |
| phi4 | 14B | 8.9 GB | Microsoft reasoning-focused — punches well above its size |
| qwq | 32B | 20 GB | Alibaba QwQ — strong at math, long reasoning chains |

## Catalog — Embedding Models

Embedding models convert text to vectors. They cannot generate responses. They are the backbone of any RAG pipeline — use them to index documents and search by semantic similarity.

| Model | Size | Notes |
|---|---|---|
| nomic-embed-text | 274 MB | Fast local embeddings — powers this RAG stack |
| nomic-embed-text:v1.5 | 274 MB | Updated version, same size, better retrieval performance |
| mxbai-embed-large | 669 MB | Higher quality embeddings, slower |
| snowflake-arctic-embed | 137 MB | Strong retrieval performance, compact |
| all-minilm | 46 MB | Smallest viable option — lower quality at extremes |
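All of these are served through Ollama's embeddings endpoint (`POST /api/embeddings`), which takes a model name and a prompt and returns a single vector. A minimal stdlib-only sketch, assuming the default `localhost:11434` address:

```python
import json
import urllib.request

OLLAMA = "http://localhost:11434"  # default Ollama address

def embed_payload(text, model="nomic-embed-text"):
    # JSON body for POST /api/embeddings.
    return json.dumps({"model": model, "prompt": text}).encode()

def embed(text, model="nomic-embed-text"):
    # Returns the embedding vector (a list of floats) for `text`.
    req = urllib.request.Request(
        f"{OLLAMA}/api/embeddings",
        data=embed_payload(text, model),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]
```

Swapping models is just a matter of changing the `model` string, but remember that vectors from different embedding models are not comparable: re-index your documents if you switch.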

## Catalog — Vision & Multimodal

| Model | Params | Size | Best For |
|---|---|---|---|
| llava | 7B | 4.5 GB | Image + text — describe images, answer visual questions |
| llava:13b | 13B | 8.0 GB | Better visual quality, same interface |
| llava-phi3 | 3.8B | 2.9 GB | Compact multimodal, good speed |
| moondream | 1.8B | 1.7 GB | Lightweight image model — edge/embedded use |
| bakllava | 7B | 4.5 GB | LLaVA variant with Mistral backbone |

## Model Notes

### DeepSeek-Coder — Privacy and Trust

Short version: safe to use locally. DeepSeek is a Chinese AI lab. The concern some raise relates to their hosted API product — not the open-weight models. The model weights distributed via Ollama are static files. There is no telemetry, no runtime network connection, and no callback to DeepSeek's infrastructure. The model runs entirely offline once pulled. MIT/Apache 2.0 licensed.

### nomic-embed-text — Embedding vs. Chat

This model is embedding-only. It takes text and returns a vector — it cannot generate responses. That's what powers the RAG pipeline: convert a query to a vector, find the closest matching document chunks in Chroma, inject those chunks into the prompt context for the chat model.
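The "closest matching" step is plain vector math — typically cosine similarity. A minimal sketch of the ranking step (the example vectors below are made up for illustration; real ones come from nomic-embed-text, and in this stack Chroma does the indexing and search for you):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(y * y for y in b))
    return dot / (mag_a * mag_b)

def top_k(query_vec, chunks, k=3):
    # chunks: list of (text, vector) pairs.
    # Returns the k chunk texts most similar to the query vector.
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

The texts returned by `top_k` are what get injected into the chat model's prompt context.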

If you select it as a chat model in Open WebUI, you'll get a "does not support chat" error. This is expected. It's not broken — it just serves a different purpose.

### llama3.1:70b — Practical Expectations

On this stack with the Vega 56 GPU (8GB VRAM) via Vulkan, Ollama offloads as many layers as VRAM allows and runs the remainder on CPU. With 48GB system RAM, the full model fits in memory. Expect 20–40 tok/s depending on context length. Without GPU, CPU-only drops to 1–5 tok/s — technically functional, not suitable for interactive chat.
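Rather than trusting estimates, you can measure tok/s on your own hardware: a non-streaming `POST /api/generate` response includes `eval_count` (tokens generated) and `eval_duration` (in nanoseconds). A stdlib-only sketch, assuming the default `localhost:11434` address:

```python
import json
import urllib.request

def tok_per_s(eval_count, eval_duration_ns):
    # eval_duration is reported by Ollama in nanoseconds.
    return eval_count * 1e9 / eval_duration_ns

def measure_tps(model, prompt="Explain RAG in one paragraph."):
    # One non-streaming generate call; the reply carries timing stats.
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        stats = json.load(resp)
    return tok_per_s(stats["eval_count"], stats["eval_duration"])
```

Run it once per model after a warm-up call (the first request pays the model-load cost and will skew the numbers).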

For fast interactive sessions: use llama3.1 (8B) or mistral:7b. Reserve 70B for tasks where output quality matters more than speed — long-form writing, complex analysis, summarization.


## CLI Reference

```bash
# List installed models
ollama list

# Pull a model
ollama pull mistral:7b

# Run a model interactively
ollama run mistral:7b

# Remove a model
ollama rm codellama:7b

# Show model metadata
ollama show llama3.2

# Check which models are currently loaded in memory
curl http://localhost:11434/api/ps | python3 -m json.tool

# Monitor a pull in progress
tail -f /tmp/llama70b-pull.log
```