Running AI models on your own infrastructure instead of calling cloud APIs gives you three things that no hosted service can: complete data privacy, predictable costs, and the freedom to choose any model. The trade-off is that you need the right hardware and a basic understanding of how large language models use memory.
This guide covers the practical side of self-hosting: what hardware you actually need, which models are worth running locally in 2026, and how the costs compare to cloud APIs. If you want a hands-on tutorial with Docker Compose and DeployHQ, see our step-by-step guide to self-hosting Open WebUI and Ollama on a VPS.
Why self-host AI models?
Data stays on your servers
When you call the OpenAI or Anthropic API, every prompt and response passes through their servers. For most use cases that is fine — but if you work with customer PII, medical records, legal documents, or proprietary code, sending that data to a third party may violate compliance requirements or internal security policies.
Self-hosted models process everything locally. The data never leaves your network.
Predictable, fixed costs
Cloud API pricing scales with usage. A team of 20 developers using GPT-4o for code review can easily spend $500–2,000/month in API fees. A self-hosted 8B model on a $50/month VPS handles unlimited requests at a fixed cost — there are no per-token charges.
Full control over models and behaviour
You choose which model to run, how it is configured, and what system prompts it uses. You can fine-tune models on your own data, swap models without changing application code, and run multiple models side-by-side for different tasks.
Understanding the hardware: why VRAM matters most
The single most important factor in self-hosting AI is GPU VRAM (video memory). A language model must be loaded into memory before it can generate text. If it fits entirely in VRAM, you get fast inference (30–50 tokens/second). If it overflows to system RAM, inference drops to 1–5 tokens/second — unusable for interactive chat.
The rule of thumb: you need roughly 0.5 GB of VRAM per billion parameters when using 4-bit quantisation (the standard for self-hosting).
Hardware tiers
| Tier | RAM | GPU | Models you can run | Monthly VPS cost |
|---|---|---|---|---|
| Starter | 8 GB | None (CPU only) | 1B–3B models (Llama 3.2 1B, Phi-3 Mini) | $5–15 |
| Developer | 16 GB | None or 8 GB VRAM | 7B–8B models (Llama 3.1 8B, Mistral 7B) | $25–50 |
| Professional | 32 GB | 16–24 GB VRAM | 13B–30B models (Qwen 2.5 14B, CodeLlama 34B) | $80–200 |
| Enterprise | 64 GB+ | 48 GB+ VRAM (or multi-GPU) | 70B+ models (Llama 3.1 70B, DeepSeek V3) | $300+ |
Not sure if your hardware is enough? Use Can I Run AI? to check whether a specific model will fit on your machine before downloading anything. It estimates VRAM and RAM requirements based on your hardware specs and the model's size.
CPU-only is viable for small models. A 7B model quantised to 4-bit runs on 8 GB of system RAM at ~5 tokens/second. That is slow for chat but acceptable for batch processing, summarisation, or code review where latency is less critical.
Quantisation: the key to fitting models in less memory
Full-precision models use 16 bits per parameter (FP16). A 70B model at FP16 needs ~140 GB — far more than any single consumer GPU. Quantisation reduces precision to 4 or 5 bits with minimal quality loss:
| Quantisation | Memory per 1B params | 7B model | 70B model | Quality impact |
|---|---|---|---|---|
| FP16 (full) | ~2 GB | ~14 GB | ~140 GB | Baseline |
| Q8 (8-bit) | ~1 GB | ~7 GB | ~70 GB | Negligible |
| Q5_K_M (5-bit) | ~0.65 GB | ~4.5 GB | ~45 GB | Very minor |
| Q4_K_M (4-bit) | ~0.5 GB | ~3.5 GB | ~35 GB | Minor on most benchmarks |
Q4_K_M is the sweet spot for self-hosting: it fits models in roughly a quarter of the full-precision memory while retaining 95%+ of benchmark performance. Ollama uses this quantisation level by default.
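The memory figures in the table follow directly from bits per parameter. A minimal sketch of the arithmetic (real model files add a few percent of overhead for embeddings and metadata, and you still need headroom for the KV cache, so treat these as lower bounds):

```python
def model_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Rough model-weight memory: parameter count times bits per
    parameter, converted to gigabytes. Ignores file-format overhead
    (a few percent) and KV-cache headroom."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

print(round(model_memory_gb(70, 16)))      # FP16 70B -> 140
print(round(model_memory_gb(7, 4), 1))     # Q4 7B    -> 3.5
```

This is where the 0.5 GB per billion parameters rule of thumb comes from: 4 bits is half a byte per parameter.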
Best models for self-hosting in 2026
The open-source model landscape moves fast. Here are the current leaders by use case:
General purpose
| Model | Parameters | Min VRAM | Strengths |
|---|---|---|---|
| Llama 3.2 3B | 3B | 4 GB RAM (CPU) | Fast, lightweight, good for simple tasks |
| Llama 3.1 8B | 8B | 8 GB | Best quality/speed ratio for most use cases |
| Qwen 2.5 14B | 14B | 12 GB | Strong reasoning, excellent multilingual support |
| Llama 3.1 70B | 70B | 40 GB | Near-GPT-4 quality, requires serious hardware |
Code generation
| Model | Parameters | Min VRAM | Strengths |
|---|---|---|---|
| DeepSeek Coder V2 | 16B | 12 GB | Top coding benchmarks, excellent at refactoring |
| Qwen 2.5 Coder 7B | 7B | 8 GB | Strong code completion, fits on consumer hardware |
| CodeLlama 34B | 34B | 24 GB | Large context window, good at complex codebases |
Reasoning and analysis
| Model | Parameters | Min VRAM | Strengths |
|---|---|---|---|
| DeepSeek R1 | 70B | 40 GB | Chain-of-thought reasoning, MIT licensed |
| Qwen 3.5 | 32B | 24 GB | Highest GPQA scores among open models |
| GLM-5 | 40B active | 24 GB | Strong across all benchmarks, MIT licensed |
Frontier models (enterprise hardware only)
Models like DeepSeek V3.2 (671B MoE, 37B active), Kimi K2.5 (1T MoE, 32B active), and GLM-5 (744B total) compete with GPT-4o and Claude on benchmarks. They require multi-GPU setups (8x H200 or similar) and are realistic only for organisations with dedicated ML infrastructure.
Runtime tools for self-hosting
You do not interact with model weights directly. A runtime tool loads the model, handles quantisation, and serves an API. Here are the main options:
| Tool | Best for | GPU support | API compatibility |
|---|---|---|---|
| Ollama | Simplicity, single-server | NVIDIA, AMD, Apple Silicon | OpenAI-compatible |
| vLLM | High-throughput production | NVIDIA, AMD | OpenAI-compatible |
| llama.cpp | Maximum hardware flexibility | NVIDIA, AMD, Apple, CPU | Custom + OpenAI-compatible |
| LocalAI | Drop-in OpenAI replacement | NVIDIA, AMD, CPU | OpenAI-compatible |
| TGI | HuggingFace ecosystem | NVIDIA | Custom |
Ollama is the easiest starting point. It handles model downloading, quantisation, and serving in a single binary. Combined with Open WebUI, it provides a full ChatGPT-like interface.
For a complete walkthrough of setting up Ollama + Open WebUI with Docker Compose, Nginx, and TLS, see our guide: How to Self-Host Your Own AI Chat Interface on a VPS.
Cost comparison: self-hosted vs. cloud APIs
Here is a realistic cost comparison for a team of 10 developers using AI for code review and chat, processing roughly 2 million tokens per day:
| | Self-hosted (Llama 3.1 8B) | OpenAI GPT-4o | Anthropic Claude Sonnet |
|---|---|---|---|
| Monthly compute | $50 (VPS with 16 GB RAM) | ~$600 (at $2.50/1M input + $10/1M output) | ~$540 (at $3/1M input + $15/1M output) |
| Quality | Good for most tasks | Excellent | Excellent |
| Privacy | Full — data stays local | Data processed by OpenAI | Data processed by Anthropic |
| Generation speed | ~10–20 tokens/sec (CPU) | ~50–80 tokens/sec | ~50–80 tokens/sec |
| Scaling cost | Fixed | Linear with usage | Linear with usage |
The break-even point is roughly 500K tokens/day. Below that, cloud APIs are simpler and cheaper. Above that, self-hosting saves money every month — and the savings grow as usage increases.
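The break-even arithmetic is easy to sanity-check yourself. A sketch with illustrative defaults (the GPT-4o prices from the table and an assumed 90/10 input/output token mix — the exact crossover point depends heavily on that mix):

```python
def monthly_api_cost(tokens_per_day: float, input_share: float = 0.9,
                     in_price: float = 2.50, out_price: float = 10.00) -> float:
    """Estimated monthly cloud-API cost in USD. Prices are per 1M tokens;
    input_share is the assumed fraction of tokens that are input."""
    daily = tokens_per_day / 1e6 * (input_share * in_price
                                    + (1 - input_share) * out_price)
    return daily * 30

def break_even_tokens_per_day(vps_cost: float = 50.0, **kw) -> float:
    """Daily token volume at which a fixed-cost VPS matches API spend."""
    cost_per_token_month = monthly_api_cost(1, **kw)
    return vps_cost / cost_per_token_month
```

With these assumptions a $50/month VPS breaks even at roughly 510K tokens/day; a more output-heavy workload pushes the crossover lower, which is why "roughly 500K tokens/day" is a rule of thumb rather than a fixed threshold.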
For teams that need both privacy and quality, a hybrid approach works well: run a local model for routine tasks (code review, summarisation, drafting) and call cloud APIs only for complex reasoning tasks. Open WebUI supports this natively — you can configure both local Ollama models and cloud API keys in the same interface.
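At the application level, hybrid routing can be as simple as a lookup on task type. A hypothetical sketch — the model names and task categories are illustrative, not part of any Open WebUI API:

```python
# Hypothetical routing rule for a hybrid setup: cheap local model for
# routine work, cloud model only for tasks needing deeper reasoning.
LOCAL_MODEL = "llama3.1:8b"    # served by Ollama on your own VPS
CLOUD_MODEL = "claude-sonnet"  # placeholder name for a cloud model
ROUTINE_TASKS = {"code_review", "summarise", "draft"}

def pick_model(task_type: str) -> str:
    """Return the model to use for a given task type."""
    return LOCAL_MODEL if task_type in ROUTINE_TASKS else CLOUD_MODEL
```

Because both endpoints speak the same OpenAI-style API, the rest of the application does not need to know which model was chosen.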
Deploying and managing self-hosted AI with DeployHQ
Once your AI stack is running, you need a way to manage configuration changes, model updates, and Nginx rules without SSH-ing into the server every time.
DeployHQ automates this by deploying from a Git repository to your VPS via SSH. Push a change to your repo (updated docker-compose.yml, new Nginx config, model pull script) and DeployHQ handles the rest.
Key DeployHQ features for AI deployments:
- SSH commands run after each deploy — restart Docker containers, pull new models
- Config files inject `.env` secrets without committing them to Git
- Build pipelines run build steps before deploying
- Automatic deploys on every push to your main branch
For the full setup walkthrough with Docker Compose files and deploy scripts, see our Open WebUI + Ollama VPS guide.
Security best practices
Self-hosting gives you control, but also responsibility:
- Network isolation: bind model APIs to `127.0.0.1` — never expose Ollama or vLLM directly to the internet
- Reverse proxy with TLS: use Nginx or Caddy to terminate HTTPS in front of your model API
- Access control: Open WebUI supports user accounts with role-based access; disable public signup
- Update regularly: model runtimes (Ollama, vLLM) receive frequent security patches
- Monitor resource usage: a runaway inference request can exhaust RAM; set memory limits in Docker
- Protect API keys: if bridging to cloud APIs, use environment variables, never hardcode keys
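The network-isolation and memory-limit points can be expressed directly in Docker Compose. A sketch only — the image tag and the 12 GB figure are illustrative, so size the limit to your model and host:

```yaml
services:
  ollama:
    image: ollama/ollama
    ports:
      - "127.0.0.1:11434:11434"   # bind to loopback only, never 0.0.0.0
    deploy:
      resources:
        limits:
          memory: 12g             # cap RAM so a runaway request cannot exhaust the host
```

With the port bound to loopback, only processes on the VPS (such as your Nginx reverse proxy) can reach the model API.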
Related guides
- How to Self-Host Your Own AI Chat Interface on a VPS with Open WebUI and Ollama — hands-on Docker Compose tutorial with DeployHQ
- How to Install DeepSeek on Your Cloud Server with Ollama LLM — DeepSeek-specific deployment
- Running Generative AI Models with Ollama and Open WebUI Using DeployHQ — alternative deployment approach
- What Is Docker? A Beginner's Guide to Containerisation and Deployment — Docker fundamentals
If you have questions or need help, reach out at support@deployhq.com or on Twitter/X.