Cloud APIs win for light usage and frontier reasoning. Local fleet inference wins when you run multiple agents, care about data privacy, or want zero marginal cost at scale. The smart answer is usually both.
Cloud LLM APIs are hosted inference services from providers like OpenAI (GPT-4o, GPT-5), Anthropic (Claude Opus, Sonnet), and Google (Gemini). You send requests over the internet, pay per token, and get access to the largest frontier models without managing any hardware. Setup is an API key and an SDK. Scaling is instant. The tradeoffs are cost, latency, rate limits, and data leaving your network.
Ollama Herd is an open-source multimodal AI router that turns multiple Ollama instances across Apple Silicon devices into one intelligent endpoint. It routes LLMs, embeddings, image generation, speech-to-text, and vision with a 7-signal scoring engine, mDNS auto-discovery, and an 8-tab real-time dashboard. Two commands to set up, zero config files: `pip install ollama-herd` or `brew install ollama-herd`.
The most common objection to local inference isn't another tool — it's "just use OpenAI" or "just use Claude." Cloud APIs are easy, fast, and the models are the best available.
That argument holds for one user, one agent, light usage. It falls apart at fleet scale.
This page makes the case for when local fleet inference (via Ollama Herd) beats cloud APIs, when it doesn't, and why the smart answer is usually both.
One developer, one agent, a few hundred calls per day. At $3/MTok (Sonnet), that's $10-30/month. A Mac Mini costs $599. Payback period: 20+ months. Cloud is the obvious choice.
Eight agents running 24/7. Each makes 200-400 calls/day. Realistic fleet consumption: 5-20M tokens/day.
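That fleet consumption maps directly onto a monthly bill. A minimal sketch of the arithmetic, assuming a blended price of $3 per million tokens (Sonnet-class) and a 30-day month — real bills vary with the input/output token mix:

```python
# Rough cloud-bill arithmetic for the heavy scenario above.
# Assumes a blended $3/MTok rate and a 30-day month (both assumptions).
PRICE_PER_MTOK = 3.00

def monthly_cloud_cost(tokens_per_day: float) -> float:
    """Cloud spend per 30-day month at the assumed blended rate."""
    return tokens_per_day / 1_000_000 * PRICE_PER_MTOK * 30

low = monthly_cloud_cost(5_000_000)    # 5M tokens/day
high = monthly_cloud_cost(20_000_000)  # 20M tokens/day
print(f"${low:,.0f}-${high:,.0f}/month")  # $450-$1,800/month
```

Those endpoints are where the heavy-scenario figures in the table come from.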
| Scenario | Cloud cost/month | Local cost/month | Break-even |
|---|---|---|---|
| Light (1 agent, casual) | $10-30 | $4 electricity | 20+ months |
| Medium (3 agents, daily) | $150-400 | $8 electricity | 5-10 months |
| Heavy (8 agents, 24/7) | $450-1,800 | $15 electricity | 1-4 months |
After break-even, every month is pure savings. Year two saves $864-$14,220. The hardware lasts 5-7 years.
Every additional agent increases cloud costs linearly. Local costs stay flat. The eighth agent costs exactly the same to run as the first: zero marginal cost.
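The break-even arithmetic is simple enough to sanity-check yourself. A sketch assuming a single $599 Mac Mini and the electricity figures above (an 8-agent fleet would presumably need more than one machine, which is why the table's range is wider):

```python
def break_even_months(hardware_cost: float,
                      cloud_monthly: float,
                      electricity_monthly: float) -> float:
    """Months until the hardware pays for itself vs. the cloud bill."""
    return hardware_cost / (cloud_monthly - electricity_monthly)

# Heavy scenario: $450-$1,800/month cloud vs. $15/month electricity
slow = break_even_months(599, 450, 15)   # ~1.4 months
fast = break_even_months(599, 1800, 15)  # ~0.3 months
```

Even at the low end of heavy usage, one machine pays for itself in under two months.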
| Feature | Cloud APIs | Ollama Herd (local fleet) |
|---|---|---|
| Setup | API key + SDK | pip install ollama-herd + 2 commands |
| Cost model | Per-token (scales linearly) | Fixed hardware + electricity (flat) |
| Marginal cost per request | $0.003-0.06 | $0 |
| Model quality (frontier) | Best available (GPT-5, Claude Opus) | Open-source (85-95% of frontier) |
| Model quality (routine) | Overkill for 80% of tasks | Right-sized per task |
| Latency | Network round-trip (100-500ms) | LAN only (1-5ms overhead) |
| Rate limits | Yes — 429 errors at scale | None |
| Data privacy | Data leaves your network | Everything stays on LAN |
| Uptime | Provider outages affect you | Your hardware, your uptime |
| Concurrent requests | Throttled by provider | Limited only by your hardware |
| Multimodal | Provider-dependent | LLM + image gen + STT + embeddings routed locally |
| Observability | Limited (usage dashboard) | Full traces, 17 health checks, 8-tab dashboard |
| Offline capability | None | Full functionality without internet |
| Model choice | Provider's catalog | Any Ollama model + mflux + Qwen3-ASR |
| Fine-tuning | Limited/expensive | Full control over model selection |
| Retries | You pay for retries | Free — retry as much as needed |
Be honest about the quality gap:
In late 2023, the best open-source model scored 70.5% on MMLU. GPT-4 scored 88%. A 17.5-point gap.
By 2026, that gap is effectively zero on knowledge benchmarks and single digits on most reasoning tasks.
The realistic workload mix for agents: roughly 80% routine work (classification, extraction, summarization, embeddings), 15% harder tasks that benefit from a larger model, and 5% genuine frontier reasoning.

The smart architecture: route the 80% locally (free), send the 15% to larger local models (still free), and only send the 5% to cloud APIs (cheap precisely because it's 5%, not 100%).
You don't have to choose. Ollama Herd handles local inference. LiteLLM or your agent framework handles cloud API calls. The agent decides which requests need frontier quality and which can run locally.
```python
from openai import OpenAI

cloud = OpenAI()  # frontier provider; or route Claude models via LiteLLM
local = OpenAI(base_url="http://herd-router:11435/v1", api_key="unused")

# Agent decides based on task complexity
if task.requires_frontier_reasoning:
    response = cloud.chat.completions.create(
        model="gpt-5", messages=messages)            # cloud
else:
    response = local.chat.completions.create(
        model="deepseek-r1:70b", messages=messages)  # local fleet
```
This captures the best of both worlds: frontier quality when you need it, zero cost for everything else.
Cloud APIs are where everyone starts, just like single Ollama is where everyone starts locally. The question is when the economics stop making sense.
One agent, light use: Cloud wins. Don't buy hardware.
Multiple agents, daily use: Do the math. If your cloud bill exceeds $150/month, a Mac Mini pays for itself in under a year.
Agent fleet, 24/7: Local wins decisively. Zero marginal cost beats per-token pricing every time at scale. Use cloud for the 5% of tasks that genuinely need frontier models.
The future isn't cloud OR local — it's cloud AND local, with intelligent routing deciding which requests go where. Ollama Herd is the local half of that equation.
```bash
pip install ollama-herd   # or: brew install ollama-herd
herd        # start the router
herd-node   # on each device
```
You don't have to go all-local overnight. The hybrid approach works best:
1. `pip install ollama-herd`, then start `herd` + `herd-node` on your Macs.
2. Pull the models you use most: `curl http://router:11435/api/pull -d '{"name":"deepseek-r1:70b"}'`
3. Point your agents at `http://router:11435/v1` for the 80% of tasks that don't need frontier models.
4. Keep cloud API keys for the 5% that do.

Start with one Mac. Add devices as you see the savings. Every additional machine is zero marginal cost.
For 80-95% of daily tasks — classification, extraction, code generation, summarization, embeddings — open-source models running locally match cloud quality. The gap only matters for the hardest frontier reasoning tasks (top 5%), where models like Claude Opus and GPT-5 still lead. The smart approach is routing the routine work locally and sending only the hardest problems to the cloud.
It depends on volume. At 3 agents running daily, cloud costs $150-400/month while local electricity costs about $8/month after a one-time hardware investment. At 8 agents running 24/7, cloud costs $450-1,800/month. After break-even (1-4 months at heavy usage), every month is pure savings. Year two saves $864-$14,220.
Yes. This is the recommended hybrid approach. Your agent framework decides which requests need frontier quality and sends those to the cloud. Everything else goes to your local fleet via Herd at zero marginal cost. LiteLLM or your agent framework handles the cloud routing; Herd handles the local routing.
Yes. Herd exposes an OpenAI-compatible API, so any tool, agent framework, or script that works with OpenAI's API can point at Herd by changing the base URL. No code changes beyond the endpoint.
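What "OpenAI-compatible" means in practice: the request shape stays identical and only the host changes. A sketch using only the standard library so it runs anywhere (with the official `openai` SDK you'd simply pass `base_url=` to the client instead):

```python
import json
from urllib.request import Request

# OpenAI-style chat request; only the host differs between cloud and local.
CLOUD_URL = "https://api.openai.com/v1/chat/completions"
LOCAL_URL = "http://herd-router:11435/v1/chat/completions"  # your Herd router

def chat_request(url: str, model: str, prompt: str) -> Request:
    """Build (but don't send) an OpenAI-style chat completion request."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return Request(url, data=body, headers={"Content-Type": "application/json"})

req = chat_request(LOCAL_URL, "deepseek-r1:70b", "Summarize this changelog.")
print(req.full_url)  # http://herd-router:11435/v1/chat/completions
```

Swapping `LOCAL_URL` for `CLOUD_URL` (plus an `Authorization` header) is the entire migration.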
Your local fleet keeps running. Herd operates entirely on your LAN with no internet dependency. Agents that use the hybrid approach can fall back to local inference when cloud APIs are unreachable.
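One way to sketch that fallback, with the two calls stubbed out for illustration (the function names are hypothetical, not part of Herd — in practice they'd wrap your cloud SDK and a request to the Herd router):

```python
def call_cloud(prompt: str) -> str:
    # Illustrative stub: a cloud SDK call that raises ConnectionError
    # or TimeoutError when the provider is unreachable.
    raise ConnectionError("cloud API unreachable")

def call_local_fleet(prompt: str) -> str:
    # Illustrative stub: in practice, a request to http://herd-router:11435/v1.
    return f"[local] {prompt}"

def complete(prompt: str) -> str:
    """Prefer the cloud frontier model; fall back to the local fleet."""
    try:
        return call_cloud(prompt)
    except (ConnectionError, TimeoutError):
        return call_local_fleet(prompt)

print(complete("hello"))  # → [local] hello  (cloud stub is down)
```

The agent keeps working through a provider outage; only the top-5% frontier tasks degrade to the best local model.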