Self-hosting AI in 2026 is genuinely viable for the first time. Open-weights models are good enough, inference tooling is mature, and the hardware story is no longer "buy four H100s and hope." This guide walks through what's worth running yourself, what hardware you need, and where you should keep paying for hosted instead.

We'll be specific about hardware, specific about cost, and specific about where each path falls apart.

When to self-host (and when not to)

Self-hosting wins when at least one of these is true:

  • Your data can't leave your network. Healthcare, finance, defense, legal: anything with strict data-residency requirements.
  • You're processing massive batch volumes. Above ~5–10M tokens/day, hosted economics get rough.
  • You're tuning models on proprietary data. You need the weights, not just an API.
  • Latency demands a GPU close to the user. Edge cases like voice and real-time agents.

Hosted wins when:

  • You're below 1M tokens/day. Hosted is cheaper, full stop (back-of-envelope math after this list).
  • You need frontier-tier quality. No self-hostable model matches GPT-5.5 or Claude on hard tasks.
  • Your team can't operate GPU infrastructure. This is honest, not a slight. Self-hosting GPU compute is its own job.
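
To make the token thresholds above concrete, here is the back-of-envelope math. A minimal sketch: the hosted price, power draw, and amortization window are illustrative assumptions, not quotes, so plug in your own numbers.

```python
# Break-even sketch: hosted API vs. the Tier 1 box described below.
# Every constant here is an assumption for illustration; substitute real quotes.
HOSTED_PER_M_TOKENS = 1.00   # assumed blended $/1M tokens for a hosted 70B-class model
BOX_COST = 3500              # Tier 1 full-system cost (see the tiers below)
AMORTIZE_MONTHS = 24         # assumed hardware lifetime
POWER_PER_MONTH = 100        # midpoint of the $80-$120/month power estimate below

box_monthly = BOX_COST / AMORTIZE_MONTHS + POWER_PER_MONTH  # ~$246/month

for tokens_per_day in (1e6, 5e6, 10e6):
    hosted_monthly = tokens_per_day * 30 / 1e6 * HOSTED_PER_M_TOKENS
    winner = "hosted" if hosted_monthly < box_monthly else "self-hosted"
    print(f"{tokens_per_day / 1e6:.0f}M tokens/day: "
          f"hosted ${hosted_monthly:.0f}/mo vs box ${box_monthly:.0f}/mo -> {winner}")
```

At these assumed prices the crossover lands around 8M tokens/day, which is why the 5–10M band is where hosted economics start to get rough.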

If you're reading this and saying "we're somewhere in the middle," that's most teams. The hybrid approach below is built for you.

The minimum hardware tiers

Three concrete tiers, each with the model classes and inference engines that fit it.

Tier 1 — single consumer GPU (RTX 4090 / 5090)

  • What runs well: Llama 4 70B at 4-bit (Q4_K_M), small fine-tunes, image generation models (SDXL, Flux Schnell).
  • Inference engine: Ollama for ease, llama.cpp for control, ComfyUI/Forge for image.
  • What it costs: ~$2k for the GPU, ~$3.5k for the full system. ~$80–$120/month in power at moderate use.
  • What you can ship on this: A team Slack bot, a content drafting workflow, a research assistant for a few users, an image generation pipeline.
  • What it can't do: Serve more than 5–10 concurrent users on a 70B, run a 405B at all, fine-tune anything serious.

Tier 2 — dual H100 server

  • What runs well: Llama 4 70B full precision, 405B at heavy quantization, all popular open coding models, multi-user inference workloads.
  • Inference engine: vLLM for serving, TensorRT-LLM for last-mile speed, PEFT/Axolotl for fine-tuning.
  • What it costs: ~$60k–$80k for the GPUs, $80k+ for the full server. Cloud rental ~$5–$8/hour for 2x H100.
  • What you can ship on this: A 50–200 user internal AI platform, a moderate-scale agent product, batch document processing at 50–100M tokens/day.
  • What it can't do: Run Llama 4 405B at full precision, serve thousands of concurrent users.

Tier 3 — 8x H100 node (or equivalent MI300X)

  • What runs well: Llama 4 405B at full precision, large-scale fine-tuning, high-concurrency serving.
  • Inference engine: vLLM in tensor-parallel mode; NVLink interconnect is required at this scale.
  • What it costs: $300k+ to own outright. ~$25–$35/hour to rent on Lambda/CoreWeave/Together.
  • What you can ship on this: An enterprise-grade self-hosted alternative to GPT-5/Claude, large team fine-tunes, frontier-adjacent applications.
  • What it can't do: Match the very best frontier hosted models on the hardest tasks.

The model layer — what to actually run

For most teams, the model picks are simpler than the literature suggests.

General-purpose chat / agent / drafting

  • Llama 4 70B-Instruct — the default. Runs anywhere, quality is good enough for 80% of internal AI workloads, license is permissive.
  • Llama 4 405B-Instruct — the upgrade when 70B isn't enough and you have the hardware.
  • Mistral Large 3 — strong multilingual alternative; its licensing differs from Llama's, so read the terms before you commit.
  • Qwen 3 series — best Chinese-language performance; top-tier multilingual generally.

Coding-specific

  • DeepSeek Coder V3 — the best self-hostable coding model. It significantly outperforms Llama 4 70B on code, though it still trails Claude on real-world tasks.
  • Qwen 3 Coder 32B — strong fast alternative when DeepSeek Coder is too heavy.

Don't bother running general-purpose models on coding workloads when DeepSeek Coder is sitting there for free.

Image generation

  • Flux.1 — the current-generation winner for general work.
  • Stable Diffusion 3.5 Large — when you need open-weights with broader fine-tune ecosystem.
  • Flux Schnell — when you need speed (distilled to run in 4 steps).

Voice

  • Whisper Large v3 Turbo — best open-source ASR (serving sketch after this list).
  • Coqui XTTS v2 or F5-TTS — best open TTS, both fast enough for streaming.
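
If you self-host the ASR side, faster-whisper is a common way to run Whisper in production. A minimal sketch; the package and API are real, but treat the large-v3-turbo model name as an assumption to verify against your faster-whisper version, and the audio file is a placeholder.

```python
# Minimal self-hosted transcription sketch (pip install faster-whisper).
# "large-v3-turbo" assumes the turbo conversion is available to your version;
# fall back to "large-v3" if it is not.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")
segments, info = model.transcribe("episode_042.wav")  # placeholder file
for seg in segments:
    print(f"[{seg.start:.1f}s -> {seg.end:.1f}s] {seg.text}")
```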

Embeddings

  • bge-m3 or nomic-embed-v2 — both strong, both open. Pick one and standardize (loading sketch below).
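
If you do run embeddings yourself, bge-m3 loads as a standard sentence-transformers model. A minimal sketch; the document strings are placeholders.

```python
# Minimal embedding sketch (pip install sentence-transformers).
# BAAI/bge-m3 is the Hugging Face ID for the bge-m3 pick above.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")
docs = ["refund policy for annual plans", "how to rotate an API key"]  # placeholders
vectors = model.encode(docs, normalize_embeddings=True)  # unit-length, so dot product = cosine
print(vectors.shape)  # (2, 1024): bge-m3's dense embedding dimension
```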

The inference engine layer

Three engines cover almost every case.

Ollama — for ease

If your job is "stand up a model and use it," install Ollama. It runs Llama, Qwen, Mistral, DeepSeek, and basically every other open-weights model, with one-command installs and a clean API. It's not the fastest engine, but for most teams the operational simplicity is worth the roughly 20% throughput hit.
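
That API in a minimal sketch, via the official Python client. The model tag is a placeholder for whatever quantized build you actually pulled.

```python
# Minimal Ollama chat call (pip install ollama), assuming the daemon is
# running locally and a model has already been pulled.
import ollama

response = ollama.chat(
    model="llama4:70b-instruct-q4_K_M",  # placeholder tag; `ollama list` shows yours
    messages=[{"role": "user", "content": "Summarize this standup transcript: ..."}],
)
print(response["message"]["content"])
```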

vLLM — for serving

When you have multiple users hitting the same GPU and you care about throughput, run vLLM. It does PagedAttention, continuous batching, and tensor parallelism well. Most production self-hosted AI deployments end up here.
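
A minimal offline-batching sketch; the model ID is a placeholder, and tensor_parallel_size matches your GPU count (2 for the Tier 2 box, 8 for a Tier 3 node). For serving, the same engine also exposes an OpenAI-compatible HTTP server.

```python
# Minimal vLLM sketch (pip install vllm). The model ID is a placeholder;
# tensor_parallel_size shards the weights across GPUs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-70B-Instruct",  # placeholder Hugging Face ID
    tensor_parallel_size=2,                   # both H100s on a Tier 2 box
)
params = SamplingParams(temperature=0.2, max_tokens=512)

prompts = [
    "Classify this support ticket: ...",
    "Extract the parties from this contract: ...",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```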

llama.cpp — for control and CPU/edge

When you need to run on weird hardware (CPUs, ARM, mobile), or you want the absolute lowest memory footprint, run llama.cpp. It's also what Ollama uses under the hood.
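
From Python, the usual entry point is the llama-cpp-python binding. A minimal sketch; the GGUF path is a placeholder for whatever quantized file you downloaded.

```python
# Minimal llama.cpp sketch via llama-cpp-python (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama4-70b-instruct-Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,  # offload every layer that fits; set 0 for pure CPU
    n_ctx=8192,       # context window; raise it if memory allows
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Draft a one-paragraph release note for ..."}]
)
print(out["choices"][0]["message"]["content"])
```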

Skip the rest unless you have a specific reason. TensorRT-LLM is fastest but operationally heavy. SGLang is excellent but needs more setup. MLX-LM is great if you're on Apple Silicon, and only Apple Silicon.

The wrapper layer — what to actually build with

For agent and tooling layers on top of the model:

  • LiteLLM — translates everything into the OpenAI API shape. Lets you swap hosted/self-hosted without rewriting integration code (sketch after this list).
  • LangChain or LlamaIndex — load-bearing for retrieval-heavy workloads. Avoid for simple chat-style features; the abstraction tax isn't worth it.
  • Ollama's REST API — fine for prototypes, swap to vLLM under LiteLLM for production.
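
The swap the LiteLLM bullet describes, in a minimal sketch: one call shape, hosted or self-hosted depending on the model string. Both model names are placeholders.

```python
# LiteLLM routing sketch (pip install litellm). Model strings are placeholders;
# the point is that the call shape never changes.
from litellm import completion

def draft(prompt: str, local: bool = True) -> str:
    if local:
        # Self-hosted path: route to a local Ollama daemon.
        resp = completion(
            model="ollama/llama4:70b-instruct-q4_K_M",  # placeholder tag
            api_base="http://localhost:11434",
            messages=[{"role": "user", "content": prompt}],
        )
    else:
        # Hosted path: a frontier model for the quality-critical work.
        resp = completion(
            model="anthropic/claude-sonnet",  # placeholder model ID
            messages=[{"role": "user", "content": prompt}],
        )
    return resp.choices[0].message.content
```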

What to keep paying for

Self-hosting works best when you also keep paying for hosted in the right places.

Frontier hosted models for hard reasoning, complex agent loops, and product-critical user-facing AI features. Pay Anthropic or OpenAI here.

Hosted Whisper / hosted ElevenLabs / hosted Midjourney when the volume is low and you don't want the operations burden. Self-hosting these only wins at scale.

A managed embedding API (OpenAI text-embedding-3-large, Voyage v3) for the search and retrieval surface. Self-hosting embeddings adds complexity without enough ROI for most teams.

A real reference stack

The stack we run for ANN Live's editorial production:

  • Hosted frontier: Claude (Anthropic API) for long-form writing, GPT-5.5 for cheap iterative work.
  • Self-hosted on a single 4090: Llama 4 70B for internal tooling — script summarization, transcript cleanup, brief generation, search queries that don't need to be perfect.
  • Self-hosted on the same 4090: Whisper Large v3 Turbo for transcript generation off our recorded videos.
  • Hosted Replicate for occasional image generation when Flux quality matters.
  • Hosted ElevenLabs and MiniMax for voice (we tested self-hosted Coqui; quality wasn't there for production).

Total monthly cost: ~$300 in hosted spend, ~$120 in power for the 4090 box. About 60% of our editorial AI work runs locally; the 40% that goes hosted is the work where quality matters most.

The trap to avoid

The most common self-hosting trap is over-investing too early. People buy two H100s, spend three months getting a 405B running, and then realize their actual workload would have run fine on a 4090 with a 70B at one tenth the cost.

Start small. Run Ollama with a 70B on whatever GPU you can borrow. Use it for two weeks. Notice the workloads where it falls down. Buy more GPU only for those workloads.

The rest of your AI stack should be hosted until proven otherwise.

Where this guide goes next

We'll publish a follow-up on fine-tuning Llama 4 on real production data — when it's worth it, what it costs, and how to actually do it without a research team. Subscribe to the briefing if you want it when it lands.