## Installed on This Stack

| Model | Size | Type | Notes |
|---|---|---|---|
| llama3.2:latest | 2.0 GB | chat | Meta's compact baseline — fast, general-purpose |
| llama3.1:70b | ~40 GB | chat | Full-scale 70B — needs 48GB+ RAM or GPU VRAM |
| codellama:7b | 3.8 GB | code | Code completion and explanation, any language |
| mistral:7b | 4.4 GB | chat | Fast and capable — strong all-rounder at 7B scale |
| deepseek-coder:6.7b | 3.8 GB | code | Chinese open-source code model — fully local, no callbacks |
| nomic-embed-text | 274 MB | embed | Text embeddings only — powers the RAG pipeline |
## RAM Requirements at a Glance

| Model Size | RAM Needed (Q4) | Notes |
|---|---|---|
| 3B | ~4 GB | Runs on almost anything |
| 7B | ~6–8 GB | Comfortable on 8GB+ |
| 13B | ~10–12 GB | 16GB recommended |
| 34B | ~24–28 GB | 32GB+ needed |
| 70B | ~38–42 GB | 48GB+ recommended |
llama3.1:70b on CPU: Without enough RAM, the model memory-maps to disk and runs at 1–2 tok/s — technically functional, not practical for chat. With 48GB RAM and the Vega 56 via Vulkan, it loads fully and runs at usable speeds. Use smaller models (llama3.1, mistral:7b) for interactive sessions where speed matters.
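The table above follows a rough rule of thumb: a Q4-quantized model needs on the order of half a gigabyte per billion parameters, plus overhead for the KV cache and runtime. A minimal sketch of that heuristic — the constants (0.55 GB/B, 2 GB overhead) are ballpark assumptions, not values measured from Ollama:

```python
def q4_ram_estimate_gb(params_billion: float) -> float:
    """Rough RAM estimate for a Q4-quantized model.

    Assumes ~0.55 GB per billion parameters (about 4.5 effective
    bits per weight) plus ~2 GB for KV cache and runtime overhead.
    Ballpark figures only; real usage varies with context length.
    """
    return round(params_billion * 0.55 + 2.0, 1)

for p in (3, 7, 13, 34, 70):
    print(f"{p}B -> ~{q4_ram_estimate_gb(p)} GB")
```

The 70B estimate lands around 40 GB, consistent with the table; mixture-of-experts models like mixtral break the heuristic because all experts must be resident even though only some fire per token.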
## Catalog — Chat & Instruction

| Model | Params | Size | Best For |
|---|---|---|---|
| llama3.2 | 3B | 2.0 GB | Fast general chat, edge/embedded use |
| llama3.2:1b | 1B | 1.3 GB | Minimum viable — very fast, basic quality |
| llama3.1 | 8B | 4.7 GB | Solid general-purpose, strong instruction following |
| llama3.1:70b | 70B | 40 GB | Highest quality open model in its class |
| mistral | 7B | 4.1 GB | Punches above its weight, Apache 2.0 licensed |
| mistral-nemo | 12B | 7.1 GB | Mistral + Nvidia collaboration, strong context handling |
| mixtral | 8x7B MoE | 26 GB | Mixture-of-experts — high quality, memory-hungry |
| gemma2 | 9B | 5.4 GB | Google's open model — competitive at 9B |
| gemma2:2b | 2B | 1.6 GB | Smallest Gemma, good for lightweight tasks |
| phi4 | 14B | 8.9 GB | Microsoft's small model with outsized reasoning ability |
| phi4-mini | 3.8B | 2.5 GB | Ultra-efficient, strong at math and structured tasks |
| qwen2.5 | 7B | 4.7 GB | Alibaba — multilingual, strong coding |
| qwen2.5:72b | 72B | 47 GB | Top-tier open model, competitive with GPT-4 class |
| command-r | 35B | 20 GB | Cohere — RAG-optimized, strong retrieval tasks |
| command-r-plus | 104B | 59 GB | Cohere flagship — excellent at tool use and RAG |
## Catalog — Code Models

| Model | Params | Size | Best For |
|---|---|---|---|
| codellama | 7B | 3.8 GB | Meta's code model — multi-language, solid baseline |
| codellama:13b | 13B | 7.4 GB | Better quality code, still manageable size |
| codellama:34b | 34B | 19 GB | Best CodeLlama tier — approaches GPT-4 on code |
| deepseek-coder | 6.7B | 3.8 GB | Strong code model, fully local — see note below |
| deepseek-coder:33b | 33B | 19 GB | Competes with GPT-4 on HumanEval benchmarks |
| deepseek-coder-v2 | 16B | 8.9 GB | V2 architecture — significant improvement over v1 |
| codegemma | 7B | 5.0 GB | Google's code model, Apache 2.0 |
| starcoder2 | 15B | 9.1 GB | BigCode open model, trained on 600+ languages |
| qwen2.5-coder | 7B | 4.7 GB | Strong recent code model from Alibaba |
| qwen2.5-coder:32b | 32B | 19 GB | One of the best open-source code models available |
## Catalog — Reasoning & Math

| Model | Params | Size | Best For |
|---|---|---|---|
| deepseek-r1 | 7B | 4.7 GB | DeepSeek reasoning model — chain-of-thought, math |
| deepseek-r1:14b | 14B | 9.0 GB | Better reasoning, runs on 16GB |
| deepseek-r1:70b | 70B | 43 GB | Strong reasoning — needs 48GB+ |
| phi4 | 14B | 8.9 GB | Microsoft reasoning-focused — punches well above its size |
| qwq | 32B | 20 GB | Alibaba QwQ — strong at math, long reasoning chains |
## Catalog — Embedding Models

Embedding models convert text to vectors. They cannot generate responses. They are the backbone of any RAG pipeline — use them to index documents and search by semantic similarity.

| Model | Size | Notes |
|---|---|---|
| nomic-embed-text | 274 MB | Fast local embeddings — powers this RAG stack |
| nomic-embed-text:v1.5 | 274 MB | Updated version, same size, better retrieval performance |
| mxbai-embed-large | 669 MB | Higher quality embeddings, slower |
| snowflake-arctic-embed | 137 MB | Strong retrieval performance, compact |
| all-minilm | 46 MB | Smallest viable option — lower quality at extremes |
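The semantic-similarity search these models enable reduces to comparing vectors by cosine similarity. A toy sketch of that step — the 4-dimensional vectors and chunk names here are made up for illustration, standing in for the high-dimensional embeddings a real model like nomic-embed-text returns:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical pre-computed embeddings for indexed document chunks.
chunks = {
    "backup script":   [0.9, 0.1, 0.0, 0.1],
    "gpu passthrough": [0.1, 0.8, 0.3, 0.0],
    "docker compose":  [0.2, 0.1, 0.9, 0.1],
}

# Pretend-embedding of the user query "how do backups run?"
query = [0.85, 0.15, 0.05, 0.1]

# Rank chunks by similarity to the query: the core of vector search.
best = max(chunks, key=lambda name: cosine_similarity(query, chunks[name]))
print(best)  # -> backup script
```

In the real pipeline, Chroma stores the chunk vectors and performs this nearest-neighbor ranking; the top chunks are then injected into the chat model's prompt.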
## Catalog — Vision & Multimodal

| Model | Params | Size | Best For |
|---|---|---|---|
| llava | 7B | 4.5 GB | Image + text — describe images, answer visual questions |
| llava:13b | 13B | 8.0 GB | Better visual quality, same interface |
| llava-phi3 | 3.8B | 2.9 GB | Compact multimodal, good speed |
| moondream | 1.8B | 1.7 GB | Lightweight image model — edge/embedded use |
| bakllava | 7B | 4.5 GB | LLaVA variant with Mistral backbone |
## Model Notes

### DeepSeek-Coder — Privacy and Trust
Short version: safe to use locally. DeepSeek is a Chinese AI lab. The concern some raise relates to their hosted API product — not the open-weight models. The model weights distributed via Ollama are static files. There is no telemetry, no runtime network connection, and no callback to DeepSeek's infrastructure. The model runs entirely offline once pulled. MIT/Apache 2.0 licensed.
### nomic-embed-text — Embedding vs. Chat
This model is embedding-only. It takes text and returns a vector — it cannot generate responses. That's what powers the RAG pipeline: convert a query to a vector, find the closest matching document chunks in Chroma, inject those chunks into the prompt context for the chat model.
If you select it as a chat model in Open WebUI, you'll get a "does not support chat" error. This is expected. It's not broken — it just serves a different purpose.
### llama3.1:70b — Practical Expectations
On this stack with the Vega 56 GPU (8GB VRAM) via Vulkan, Ollama offloads as many layers as VRAM allows and runs the remainder on CPU. With 48GB system RAM, the full model fits in memory. Expect 20–40 tok/s depending on context length. Without GPU, CPU-only drops to 1–5 tok/s — technically functional, not suitable for interactive chat.
For fast interactive sessions: use llama3.1 (8B) or mistral:7b. Reserve 70B for tasks where output quality matters more than speed — long-form writing, complex analysis, summarization.
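The partial-offload arithmetic behind that tradeoff can be sketched in a few lines. The layer count (80 for a Llama 70B) is accurate, but treating all layers as equal-sized and reserving a fixed buffer for the KV cache are simplifying assumptions, not how Ollama's allocator actually works:

```python
def layers_on_gpu(model_gb: float, n_layers: int,
                  vram_gb: float, reserve_gb: float = 1.0) -> int:
    """Estimate how many transformer layers fit in VRAM.

    Assumes layers are roughly equal in size and reserves
    reserve_gb for KV cache and framework buffers (both
    simplifications of real allocator behavior).
    """
    per_layer_gb = model_gb / n_layers
    usable = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable / per_layer_gb))

# Q4 70B: ~40 GB of weights across 80 layers, 8 GB VRAM (Vega 56).
print(layers_on_gpu(40, 80, 8))
```

With only a modest fraction of the 80 layers on the GPU, most of each forward pass still runs on the CPU, which is why system RAM and memory bandwidth, not VRAM, dominate 70B throughput on this stack.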
CLI Reference
ollama list
ollama pull mistral:7b
ollama run mistral:7b
ollama rm codellama:7b
ollama show llama3.2
curl http://localhost:11434/api/ps | python3 -m json.tool
tail -f /tmp/llama70b-pull.log