Single Ollama is perfect for one machine. The moment you add a second device, Ollama Herd turns your idle hardware into a unified AI fleet with zero-config discovery, intelligent routing, and thermal-aware load balancing.
Ollama is an open-source tool for running large language models locally on your machine. You install it, run ollama serve, and interact with models through a local API on port 11434. It handles model downloading, quantization, GPU acceleration, and memory management — all on a single device. Simple, fast, and free.
Ollama Herd is an open-source smart multimodal AI router that turns multiple Ollama instances across Apple Silicon devices into one intelligent endpoint. It routes LLMs, embeddings, image generation, speech-to-text, and vision with a 7-signal scoring engine, mDNS auto-discovery, and an 8-tab real-time dashboard. Two commands to set up, zero config files. pip install ollama-herd or brew install ollama-herd.
Most people start with Ollama on one machine. It works great — ollama run llama3.3:70b and you're talking to a local model in seconds. No cloud, no API keys, no subscription.
The question isn't whether single Ollama works. It does. The question is what happens when you add a second machine, run agent frameworks that fire parallel requests, or need to know why inference suddenly got slow.
Ollama Herd doesn't replace Ollama. It connects multiple Ollama instances into one intelligent endpoint.
| Feature | Single Ollama | Ollama Herd |
|---|---|---|
| Setup | ollama serve | herd + herd-node (2 commands) |
| Devices | 1 | Unlimited (auto-discovered via mDNS) |
| Model routing | Manual (you pick the model) | Automatic (best node selected per request) |
| Concurrent requests | Queued on one machine | Distributed across fleet |
| Load balancing | None | 7-signal scoring (thermal, memory, queue, latency, affinity, availability, context) |
| Failover | None — if it's down, it's down | Auto-retry on different node before first chunk |
| Model fallbacks | None | Client-specified backup models tried automatically |
| Queue management | Single queue | Per node:model queues with rebalancing |
| Thermal awareness | None — runs until it throttles | Routes away from hot machines |
| Memory awareness | None — loads until OOM | Scores by memory fit, dynamic ceiling |
| Meeting detection | None | Pauses inference when camera/mic active (macOS) |
| Capacity learning | None | 168-slot weekly behavioral model per device |
| Image generation | Ollama native models only | mflux (FLUX.1) + DiffusionKit + Ollama native, capability-routed |
| Speech-to-text | Not supported | Qwen3-ASR routed to capable nodes |
| Embeddings | Single node | Routed to nodes with embedding models |
| Dashboard | None | 8-tab real-time UI with SSE |
| Health monitoring | None | 17 automated health checks |
| Request tracing | None | Every request traced to SQLite |
| Per-tag analytics | None | Tag requests, see usage by app/tool |
| Context optimization | Manual num_ctx | Tracks actual usage, auto-adjusts to save VRAM |
| Benchmarking | None | Smart benchmark across all model types |
| API compatibility | Ollama API | Ollama API + OpenAI API (both formats) |
| Thinking models | Manual config | Auto-detected, token budget inflated 4x |
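The 7-signal scoring row above can be pictured as a weighted sum over normalized signals. A minimal sketch, assuming hypothetical signal weights and example values — Herd's actual weights and normalization are not documented here:

```python
# Hypothetical sketch of a multi-signal node scorer.
# Signal names come from the table; weights and values are illustrative.
WEIGHTS = {
    "thermal": 0.20,       # cooler machines score higher
    "memory": 0.20,        # does the model fit comfortably?
    "queue": 0.15,         # shorter queue is better
    "latency": 0.15,       # recent response times
    "affinity": 0.10,      # is the model already loaded here?
    "availability": 0.10,  # is the node up and responding?
    "context": 0.10,       # headroom for the requested context window
}

def score(node):
    """Each signal is normalized to [0, 1]; higher is better."""
    return sum(WEIGHTS[k] * node[k] for k in WEIGHTS)

def pick_node(nodes):
    """Route the request to the highest-scoring node."""
    return max(nodes, key=lambda name: score(nodes[name]))

nodes = {
    "mac-studio": {"thermal": 0.9, "memory": 0.95, "queue": 0.6,
                   "latency": 0.8, "affinity": 1.0, "availability": 1.0,
                   "context": 0.9},
    "macbook":    {"thermal": 0.4, "memory": 0.5, "queue": 0.9,
                   "latency": 0.7, "affinity": 0.0, "availability": 1.0,
                   "context": 0.6},
}
```

The point of the weighted-sum shape is that no single signal vetoes a node outright; a hot machine with an already-loaded model can still beat a cool machine that would have to cold-load.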
To be honest, not everyone needs Herd. If you have one machine and no concurrent demand, single Ollama already does everything you need.
The trigger to switch: The moment you have a second machine doing nothing, or the moment you run an agent framework that makes parallel LLM calls, single Ollama becomes the bottleneck.
You have a Mac Studio with 192GB and a MacBook Pro with 36GB. The Studio runs your big model. The MacBook does nothing. That's 36GB of unified memory — enough for a 32B model — contributing zero value.
With Herd: Both machines serve requests. Big models route to the Studio, small models to the MacBook. Every device contributes what it can.
CrewAI, LangChain, OpenClaw, Aider — these frameworks make rapid sequential or parallel LLM calls. On single Ollama, each call queues behind the last. A 5-agent pipeline with 4 calls each = 20 requests, all serialized.
With Herd: Requests distribute across the fleet. The Mac Studio handles the reasoning model, the MacBook handles the summarizer, the Mini handles embeddings — simultaneously.
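To make the serialization cost concrete, total wall-clock time scales with how many "waves" the queue has to drain. The 3-second per-call figure below is an assumption for illustration, not a benchmark:

```python
import math

# Illustrative only: 20 agent calls on one queue vs. spread over a fleet.
def makespan(calls, seconds_per_call, workers):
    """Wall-clock time if calls spread evenly over identical workers."""
    waves = math.ceil(calls / workers)
    return waves * seconds_per_call

single = makespan(20, 3.0, 1)  # one Ollama queue: 20 sequential waves
fleet  = makespan(20, 3.0, 4)  # four nodes: 5 waves
```

Real nodes aren't identical, so the fleet number is optimistic, but the shape of the win — dividing the queue depth by the number of capable nodes — holds.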
Your MacBook is running inference for 30 minutes. Fans spin up. The chip throttles. Token generation drops from 40 tok/s to 15 tok/s. You don't even notice until the response takes forever.
With Herd: The scoring engine sees the MacBook's thermal state deteriorating and routes new requests to cooler machines. The MacBook recovers while other devices take the load.
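A quick back-of-envelope using the throttling figures above (the 600-token response length is an assumption):

```python
# What thermal throttling costs, using the 40 -> 15 tok/s drop from the text.
tokens = 600
cool_seconds = tokens / 40  # full speed
hot_seconds  = tokens / 15  # throttled
```

The same response takes well over twice as long on the throttled machine, which is exactly the gap a thermal-aware router can route around.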
You need llama3.3:70b for coding and nomic-embed-text for RAG. On a 64GB machine, loading both means one evicts the other constantly (model thrashing). Cold-loading a 70B model takes 15-30 seconds each time.
With Herd: The 70B model stays hot on the big machine. Embeddings run on the smaller machine. No eviction, no cold-loading delay.
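The placement decision reduces to a memory-fit check per node. A toy sketch — the model sizes are rough quantized estimates, and this ignores context/KV-cache overhead that Herd's actual accounting would include:

```python
# Rough quantized model sizes in GB (illustrative estimates).
MODEL_GB = {"llama3.3:70b": 42, "nomic-embed-text": 0.5}

# Unified memory per device, from the scenario above.
NODE_GB = {"mac-studio": 192, "macbook": 36}

def fits(model, node, already_loaded_gb=0.0):
    """Would loading this model stay within the node's memory?"""
    return already_loaded_gb + MODEL_GB[model] <= NODE_GB[node]
```

The 70B model only fits on the Studio, so it stays pinned and hot there, while the embedding model lands on the MacBook — neither ever evicts the other.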
Single Ollama has no dashboard, no health checks, no request tracing. When something is slow, you don't know why — is the model thrashing? Is memory pressure high? Is the queue backed up?
With Herd: 8-tab dashboard, 17 health checks, SQLite traces you can query, per-tag analytics showing which tools consume the most resources.
The beauty of Herd is that the upgrade from single Ollama is minimal:
```bash
# What you're doing now
ollama serve

# Add Herd (on the same machine or a different one)
pip install ollama-herd   # or: brew install ollama-herd
herd                      # starts the router

# On each machine (including this one)
herd-node                 # discovers the router via mDNS
```
Your existing Ollama installation, models, and configuration stay exactly the same. Herd sits in front of Ollama, not instead of it. Every tool that currently points at localhost:11434 just needs to point at router-ip:11435 instead.
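Because the request body is unchanged, repointing a client is just a base-URL swap. A sketch using the standard Ollama /api/chat payload shape (the router IP below is a placeholder):

```python
# Only the base URL changes; the request body stays identical.
OLLAMA_URL = "http://localhost:11434"     # single Ollama
HERD_URL   = "http://192.168.1.50:11435"  # Herd router (placeholder IP)

def chat_request(base_url):
    """Build the URL and JSON body for an Ollama-style chat request."""
    payload = {
        "model": "llama3.3:70b",
        "messages": [{"role": "user", "content": "hello"}],
    }
    return f"{base_url}/api/chat", payload

url_single, body_single = chat_request(OLLAMA_URL)
url_herd, body_herd = chat_request(HERD_URL)
```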
Running multiple Ollama instances without Herd means picking a machine by hand for every request, with no failover, no load balancing, and a separate queue on every node. You can solve some of this with nginx reverse proxy rules and manual scripting, but that's rebuilding what Herd already does, without the scoring engine, capacity learning, or health monitoring.
Single Ollama: free.
Ollama Herd: also free. Open source. MIT licensed.
The only cost is the 2 minutes it takes to install and start. If you have a second machine, there's no reason not to try it.
Single Ollama is where everyone starts. It's great for what it is — local inference on one machine. Ollama Herd is where you go when you outgrow one machine, when agents need more throughput, when your spare hardware should be contributing instead of sitting idle.
The question isn't "should I switch from Ollama?" — you're not switching. You're adding an orchestration layer that makes all your Ollama instances work together. Ollama is the engine. Herd is the fleet manager.
If you have one Mac and no concurrent demand, stay with single Ollama. The moment you have two machines or an agent workload, Herd pays for itself in the first hour.
```bash
pip install ollama-herd   # or: brew install ollama-herd
herd                      # start the router
herd-node                 # on each device
```
Do I need Herd right now? Probably not. Single Ollama handles one machine beautifully. But the moment you add a second device or start running agent frameworks that make parallel LLM calls, you will hit the limits of a single queue on a single machine. Herd is a 2-minute install, so the barrier is low when the time comes.
Does Herd replace my existing Ollama setup? No. Herd sits in front of Ollama, not instead of it. Your existing Ollama installation, models, and configuration stay exactly the same. Herd is the orchestration layer that connects multiple Ollama instances into one smart endpoint.
Will my current tools keep working? Yes. Herd exposes both the Ollama API and the OpenAI-compatible API. Any tool currently pointing at localhost:11434 just needs to point at your router's address on port 11435 instead.
How do nodes find the router? mDNS (multicast DNS) auto-discovery. Start herd-node on any device on your local network and the router finds it automatically. No IP addresses to configure, no config files to edit.
What happens if a node dies mid-request? Herd detects the failure and automatically retries the request on the next-best available node — before the first token is sent to the client. With single Ollama, if it's down, it's down.
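The retry behavior can be sketched as a loop over nodes in score order. The handler and failure below are simulated stand-ins, not Herd's actual internals:

```python
# Toy failover loop: try nodes best-first until one answers.
# A node "fails" by raising; we fall through to the next candidate.
def route_with_failover(nodes, handler):
    last_err = None
    for name in nodes:  # assumed already sorted best-score-first
        try:
            return name, handler(name)
        except ConnectionError as err:
            last_err = err  # node unreachable: try the next one
    raise RuntimeError("no nodes available") from last_err

def fake_handler(name):
    """Simulate the top-ranked node being offline."""
    if name == "mac-studio":
        raise ConnectionError("node unreachable")
    return "response from " + name

winner, result = route_with_failover(["mac-studio", "macbook"], fake_handler)
```

Because the retry happens before anything is streamed back, the client never sees a half-finished response from the dead node — it just gets the answer from whichever node succeeded.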