Self-Hosting AI Models: Hardware Requirements, Model Selection, and Deployment Guide

Running AI models on your own infrastructure instead of calling cloud APIs gives you three things that no hosted service can: complete data privacy, predictable costs, and the freedom to choose any model. The trade-off is that you need the right hardware and a basic understanding of how large language models use memory.

This guide covers the practical side of self-hosting: what hardware you actually need, which models are worth running locally in 2026, and how the costs compare to cloud APIs. If you want a hands-on tutorial with Docker Compose and DeployHQ, see our step-by-step guide to self-hosting Open WebUI and Ollama on a VPS.

Why self-host AI models?

Data stays on your servers

When you call the OpenAI or Anthropic API, every prompt and response passes through their servers. For most use cases that is fine — but if you work with customer PII, medical records, legal documents, or proprietary code, sending that data to a third party may violate compliance requirements or internal security policies.

Self-hosted models process everything locally. The data never leaves your network.

Predictable, fixed costs

Cloud API pricing scales with usage. A team of 20 developers using GPT-4o for code review can easily spend $500–2,000/month in API fees. A self-hosted 8B model on a $50/month VPS handles unlimited requests at a fixed cost — the model does not meter tokens.

Full control over models and behaviour

You choose which model to run, how it is configured, and what system prompts it uses. You can fine-tune models on your own data, swap models without changing application code, and run multiple models side-by-side for different tasks.

Understanding the hardware: why VRAM matters most

The single most important factor in self-hosting AI is GPU VRAM (video memory). A language model must be loaded into memory before it can generate text. If it fits entirely in VRAM, you get fast inference (30–50 tokens/second). If it overflows to system RAM, inference drops to 1–5 tokens/second — unusable for interactive chat.

The rule of thumb: you need roughly 0.5 GB of VRAM per billion parameters when using 4-bit quantisation (the standard for self-hosting).

Hardware tiers

| Tier | RAM | GPU | Models you can run | Monthly VPS cost |
|------|-----|-----|--------------------|------------------|
| Starter | 8 GB | None (CPU only) | 1B–3B models (Llama 3.2 1B, Phi-3 Mini) | $5–15 |
| Developer | 16 GB | None or 8 GB VRAM | 7B–8B models (Llama 3.1 8B, Mistral 7B) | $25–50 |
| Professional | 32 GB | 16–24 GB VRAM | 13B–30B models (Qwen 2.5 14B, CodeLlama 34B) | $80–200 |
| Enterprise | 64 GB+ | 48 GB+ VRAM (or multi-GPU) | 70B+ models (Llama 3.1 70B, DeepSeek V3) | $300+ |

Not sure if your hardware is enough? Use Can I Run AI? to check whether a specific model will fit on your machine before downloading anything. It estimates VRAM and RAM requirements based on your hardware specs and the model's size.

CPU-only is viable for small models. A 7B model quantised to 4-bit runs on 8 GB of system RAM at ~5 tokens/second. That is slow for chat but acceptable for batch processing, summarisation, or code review where latency is less critical.

Quantisation: the key to fitting models in less memory

Full-precision models use 16 bits per parameter (FP16). A 70B model at FP16 needs ~140 GB — far more than any single consumer GPU. Quantisation reduces precision to 4 or 5 bits with minimal quality loss:

| Quantisation | Memory per 1B params | 7B model | 70B model | Quality impact |
|--------------|----------------------|----------|-----------|----------------|
| FP16 (full) | ~2 GB | ~14 GB | ~140 GB | Baseline |
| Q8 (8-bit) | ~1 GB | ~7 GB | ~70 GB | Negligible |
| Q5_K_M (5-bit) | ~0.65 GB | ~4.5 GB | ~45 GB | Very minor |
| Q4_K_M (4-bit) | ~0.5 GB | ~3.5 GB | ~35 GB | Minor on most benchmarks |

Q4_K_M is the sweet spot for self-hosting: it fits models in roughly a quarter of the full-precision memory while retaining 95%+ of benchmark performance. Ollama uses this quantisation level by default.
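The figures in the table above are plain bits-per-parameter arithmetic, so you can estimate any model/quantisation combination yourself. A minimal sketch (weights only — KV cache and runtime buffers add some overhead on top):

```python
def model_memory_gb(params_billion: float, bits: int) -> float:
    """Approximate memory for the weights alone:
    parameters x bits-per-parameter, ignoring KV cache and runtime overhead."""
    bytes_per_param = bits / 8
    return params_billion * bytes_per_param  # 1B params at 8 bits ~= 1 GB

# Reproduce the table rows for a 7B and a 70B model
for name, bits in [("FP16", 16), ("Q8", 8), ("Q4_K_M", 4)]:
    print(f"{name}: 7B -> {model_memory_gb(7, bits):.1f} GB, "
          f"70B -> {model_memory_gb(70, bits):.1f} GB")
```

This is why an 8B model at Q4_K_M (~4 GB) fits comfortably on an 8 GB GPU, while the same model at FP16 (~16 GB) does not.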

Best models for self-hosting in 2026

The open-source model landscape moves fast. Here are the current leaders by use case:

General purpose

| Model | Parameters | Min VRAM | Strengths |
|-------|------------|----------|-----------|
| Llama 3.2 3B | 3B | 4 GB RAM (CPU) | Fast, lightweight, good for simple tasks |
| Llama 3.1 8B | 8B | 8 GB | Best quality/speed ratio for most use cases |
| Qwen 2.5 14B | 14B | 12 GB | Strong reasoning, excellent multilingual support |
| Llama 3.1 70B | 70B | 40 GB | Near-GPT-4 quality, requires serious hardware |

Code generation

| Model | Parameters | Min VRAM | Strengths |
|-------|------------|----------|-----------|
| DeepSeek Coder V2 | 16B | 12 GB | Top coding benchmarks, excellent at refactoring |
| Qwen 2.5 Coder 7B | 7B | 8 GB | Strong code completion, fits on consumer hardware |
| CodeLlama 34B | 34B | 24 GB | Large context window, good at complex codebases |

Reasoning and analysis

| Model | Parameters | Min VRAM | Strengths |
|-------|------------|----------|-----------|
| DeepSeek R1 | 70B | 40 GB | Chain-of-thought reasoning, MIT licensed |
| Qwen 3.5 | 32B | 24 GB | Highest GPQA scores among open models |
| GLM-5 | 40B active | 24 GB | Strong across all benchmarks, MIT licensed |

Frontier models (enterprise hardware only)

Models like DeepSeek V3.2 (671B MoE, 37B active), Kimi K2.5 (1T MoE, 32B active), and GLM-5 (744B total) compete with GPT-4o and Claude on benchmarks. They require multi-GPU setups (8x H200 or similar) and are realistic only for organisations with dedicated ML infrastructure.

Runtime tools for self-hosting

You do not interact with model weights directly. A runtime tool loads the model, handles quantisation, and serves an API. Here are the main options:

| Tool | Best for | GPU support | API compatibility |
|------|----------|-------------|-------------------|
| Ollama | Simplicity, single-server | NVIDIA, AMD, Apple Silicon | OpenAI-compatible |
| vLLM | High-throughput production | NVIDIA, AMD | OpenAI-compatible |
| llama.cpp | Maximum hardware flexibility | NVIDIA, AMD, Apple, CPU | Custom + OpenAI-compatible |
| LocalAI | Drop-in OpenAI replacement | NVIDIA, AMD, CPU | OpenAI-compatible |
| TGI | HuggingFace ecosystem | NVIDIA | Custom |

Ollama is the easiest starting point. It handles model downloading, quantisation, and serving in a single binary. Combined with Open WebUI, it provides a full ChatGPT-like interface.
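Because Ollama exposes an OpenAI-compatible endpoint (by default at `http://localhost:11434/v1`), any OpenAI-style client can talk to it without code changes. A stdlib-only sketch — the model name here is an example of whatever you have pulled locally:

```python
import json
import urllib.request

# Ollama's OpenAI-compatible chat endpoint (default local install)
OLLAMA_URL = "http://127.0.0.1:11434/v1/chat/completions"

def build_payload(prompt: str, model: str = "llama3.1:8b") -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def ask(prompt: str, model: str = "llama3.1:8b") -> str:
    """Send a prompt to a locally running Ollama server and return the reply."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Inspect the request body without needing a running server
print(json.dumps(build_payload("Explain VRAM in one sentence."), indent=2))
```

Swapping `OLLAMA_URL` for a cloud provider's endpoint (plus an `Authorization` header) is all it takes to point the same code at a hosted API.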

For a complete walkthrough of setting up Ollama + Open WebUI with Docker Compose, Nginx, and TLS, see our guide: How to Self-Host Your Own AI Chat Interface on a VPS.

Cost comparison: self-hosted vs. cloud APIs

Here is a realistic cost comparison for a team of 10 developers using AI for code review and chat, processing roughly 2 million tokens per day:

| | Self-hosted (Llama 3.1 8B) | OpenAI GPT-4o | Anthropic Claude Sonnet |
|---|----------------------------|---------------|-------------------------|
| Monthly compute | $50 (VPS with 16 GB RAM) | ~$600 (at $2.50/1M input + $10/1M output) | ~$540 (at $3/1M input + $15/1M output) |
| Quality | Good for most tasks | Excellent | Excellent |
| Privacy | Full — data stays local | Data processed by OpenAI | Data processed by Anthropic |
| Generation speed | ~10–20 tokens/sec (CPU) | ~50–80 tokens/sec | ~50–80 tokens/sec |
| Scaling cost | Fixed | Linear with usage | Linear with usage |

The break-even point is roughly 500K tokens/day. Below that, cloud APIs are simpler and cheaper. Above that, self-hosting saves money every month — and the savings grow as usage increases.
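You can run the break-even arithmetic for your own volumes. A sketch using the GPT-4o list prices from the table; the 50/50 input/output split and the $3.50 blended rate are illustrative assumptions:

```python
def api_cost_per_month(tokens_in_per_day: float, tokens_out_per_day: float,
                       in_rate: float, out_rate: float, days: int = 30) -> float:
    """Monthly cloud API bill, with rates in $ per 1M tokens."""
    daily = (tokens_in_per_day * in_rate + tokens_out_per_day * out_rate) / 1_000_000
    return daily * days

def break_even_tokens_per_day(vps_monthly: float, blended_rate: float) -> float:
    """Daily token volume at which a fixed-cost VPS matches the API bill
    (blended_rate: average $ per 1M tokens across input and output)."""
    return vps_monthly / 30 / blended_rate * 1_000_000

# 2M tokens/day split evenly, at $2.50/1M input and $10/1M output
print(api_cost_per_month(1_000_000, 1_000_000, 2.50, 10.00))  # 375.0
# A $50/month VPS against an assumed $3.50/1M blended rate
print(break_even_tokens_per_day(50, 3.50))  # ~476K tokens/day
```

The exact bill depends heavily on the input/output split, since output tokens cost several times more than input tokens.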

For teams that need both privacy and quality, a hybrid approach works well: run a local model for routine tasks (code review, summarisation, drafting) and call cloud APIs only for complex reasoning tasks. Open WebUI supports this natively — you can configure both local Ollama models and cloud API keys in the same interface.
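One way to implement that hybrid split is a thin router that picks a backend per task type; since both Ollama and the cloud providers speak the OpenAI chat format, only the base URL, key, and model name change. The task categories here are hypothetical examples:

```python
# Hypothetical routing table: routine work stays local, hard reasoning goes to the cloud
LOCAL_TASKS = {"code_review", "summarise", "draft"}

BACKENDS = {
    "local": {"base_url": "http://127.0.0.1:11434/v1", "model": "llama3.1:8b"},
    "cloud": {"base_url": "https://api.openai.com/v1", "model": "gpt-4o"},
}

def route(task_type: str) -> dict:
    """Return the backend config for a task: local for routine tasks, cloud otherwise."""
    return BACKENDS["local" if task_type in LOCAL_TASKS else "cloud"]
```

Because the request format is identical on both sides, the routing decision stays out of your application logic entirely.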

Deploying and managing self-hosted AI with DeployHQ

Once your AI stack is running, you need a way to manage configuration changes, model updates, and Nginx rules without SSH-ing into the server every time.

DeployHQ automates this by deploying from a Git repository to your VPS via SSH. Push a change to your repo (updated docker-compose.yml, new Nginx config, model pull script) and DeployHQ handles the rest.

Key DeployHQ features for AI deployments:

  • SSH commands run after each deploy — restart Docker containers, pull new models
  • Config files inject .env secrets without committing them to Git
  • Build pipelines run build steps before deploying
  • Automatic deploys on every push to your main branch

For the full setup walkthrough with Docker Compose files and deploy scripts, see our Open WebUI + Ollama VPS guide.

Security best practices

Self-hosting gives you control, but also responsibility:

  • Network isolation: bind model APIs to 127.0.0.1 — never expose Ollama or vLLM directly to the internet
  • Reverse proxy with TLS: use Nginx or Caddy to terminate HTTPS in front of your model API
  • Access control: Open WebUI supports user accounts with role-based access; disable public signup
  • Update regularly: model runtimes (Ollama, vLLM) receive frequent security patches
  • Monitor resource usage: a runaway inference request can exhaust RAM; set memory limits in Docker
  • Protect API keys: if bridging to cloud APIs, use environment variables, never hardcode keys
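As a sketch of the first two points, an Nginx server block that terminates TLS and proxies to Open WebUI listening only on localhost. The hostname, certificate paths, and port 3000 are illustrative assumptions — substitute your own values:

```nginx
server {
    listen 443 ssl;
    server_name ai.example.com;                    # placeholder hostname

    ssl_certificate     /etc/letsencrypt/live/ai.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/ai.example.com/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:3000;          # Open WebUI, bound to localhost only
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Upgrade $http_upgrade;    # WebSocket support for streamed replies
        proxy_set_header Connection "upgrade";
    }
}
```

The model runtime itself (Ollama, vLLM) never gets a public listener — only the authenticated web interface is reachable from outside.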

If you have questions or need help, reach out at support@deployhq.com or on Twitter/X.