Single Ollama is perfect for one machine. The moment you add a second device, Ollama Herd turns your idle hardware into a unified AI fleet with zero-config discovery, intelligent routing, and thermal-aware load balancing.
Ollama is an open-source tool for running large language models locally on your machine. You install it, run ollama serve, and interact with models through a local API on port 11434. It handles model downloading, quantization, GPU acceleration, and memory management — all on a single device. Simple, fast, and free.
Ollama Herd is an open-source smart multimodal AI router that turns multiple Ollama instances across Apple Silicon devices into one intelligent endpoint. It routes LLMs, embeddings, image generation, speech-to-text, and vision with a 7-signal scoring engine, mDNS auto-discovery, and an 8-tab real-time dashboard. Two commands to set up, zero config files. pip install ollama-herd or brew install ollama-herd.
Most people start with Ollama on one machine. It works great — ollama run llama3.3:70b and you're talking to a local model in seconds. No cloud, no API keys, no subscription.
The question isn't whether single Ollama works. It does. The question is what happens when you add a second machine, run agent frameworks that fire parallel requests, or need to know why inference suddenly got slow.
Ollama Herd doesn't replace Ollama. It connects multiple Ollama instances into one intelligent endpoint.
| Feature | Single Ollama | Ollama Herd |
|---|---|---|
| Setup | ollama serve | herd + herd-node (2 commands) |
| Devices | 1 | Unlimited (auto-discovered via mDNS) |
| Model routing | Manual (you pick the model) | Automatic (best node selected per request) |
| Concurrent requests | Queued on one machine | Distributed across fleet |
| Load balancing | None | 7-signal scoring (thermal, memory, queue, latency, affinity, availability, context) |
| Failover | None — if it's down, it's down | Auto-retry on different node before first chunk |
| Model fallbacks | None | Client-specified backup models tried automatically |
| Queue management | Single queue | Per node:model queues with rebalancing |
| Thermal awareness | None — runs until it throttles | Routes away from hot machines |
| Memory awareness | None — loads until OOM | Scores by memory fit, dynamic ceiling |
| Meeting detection | None | Pauses inference when camera/mic active (macOS) |
| Capacity learning | None | 168-slot weekly behavioral model per device |
| Image generation | Ollama native models only | mflux (FLUX.1) + DiffusionKit + Ollama native, capability-routed |
| Speech-to-text | Not supported | Qwen3-ASR routed to capable nodes |
| Embeddings | Single node | Routed to nodes with embedding models |
| Dashboard | None | 8-tab real-time UI with SSE |
| Health monitoring | None | 17 automated health checks |
| Request tracing | None | Every request traced to SQLite |
| Per-tag analytics | None | Tag requests, see usage by app/tool |
| Context optimization | Manual num_ctx | Tracks actual usage, auto-adjusts to save VRAM |
| Benchmarking | None | Smart benchmark across all model types |
| API compatibility | Ollama API | Ollama API + OpenAI API (both formats) |
| Thinking models | Manual config | Auto-detected, token budget inflated 4x |
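The 7-signal scoring row above can be pictured as a weighted sum over normalized signals. A minimal sketch, assuming hypothetical signal weights and example values — Herd's actual weights and normalization are not documented here:

```python
# Hypothetical sketch of a multi-signal node scorer.
# Signal names come from the table; weights and values are illustrative.
WEIGHTS = {
    "thermal": 0.20,       # cooler machines score higher
    "memory": 0.20,        # does the model fit comfortably?
    "queue": 0.15,         # shorter queue is better
    "latency": 0.15,       # recent response times
    "affinity": 0.10,      # is the model already loaded here?
    "availability": 0.10,  # is the node up and responding?
    "context": 0.10,       # headroom for the requested context window
}

def score(node):
    """Each signal is normalized to [0, 1]; higher is better."""
    return sum(WEIGHTS[k] * node[k] for k in WEIGHTS)

def pick_node(nodes):
    """Route the request to the highest-scoring node."""
    return max(nodes, key=lambda name: score(nodes[name]))

nodes = {
    "mac-studio": {"thermal": 0.9, "memory": 0.95, "queue": 0.6,
                   "latency": 0.8, "affinity": 1.0, "availability": 1.0,
                   "context": 0.9},
    "macbook":    {"thermal": 0.4, "memory": 0.5, "queue": 0.9,
                   "latency": 0.7, "affinity": 0.0, "availability": 1.0,
                   "context": 0.6},
}
```

The point of the weighted-sum shape is that no single signal vetoes a node outright; a hot machine with an already-loaded model can still beat a cool machine that would have to cold-load.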
To be honest, not everyone needs Herd. If you have one machine and no concurrent demand, single Ollama already does everything you need.
The trigger to switch: The moment you have a second machine doing nothing, or the moment you run an agent framework that makes parallel LLM calls, single Ollama becomes the bottleneck.
You have a Mac Studio with 192GB and a MacBook Pro with 36GB. The Studio runs your big model. The MacBook does nothing. That's 36GB of unified memory — enough for a 32B model — contributing zero value.
With Herd: Both machines serve requests. Big models route to the Studio, small models to the MacBook. Every device contributes what it can.
CrewAI, LangChain, OpenClaw, Aider — these frameworks make rapid sequential or parallel LLM calls. On single Ollama, each call queues behind the last. A 5-agent pipeline with 4 calls each = 20 requests, all serialized.
With Herd: Requests distribute across the fleet. The Mac Studio handles the reasoning model, the MacBook handles the summarizer, the Mini handles embeddings — simultaneously.
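To make the serialization cost concrete, total wall-clock time scales with how many "waves" the queue has to drain. The 3-second per-call figure below is an assumption for illustration, not a benchmark:

```python
import math

# Illustrative only: 20 agent calls on one queue vs. spread over a fleet.
def makespan(calls, seconds_per_call, workers):
    """Wall-clock time if calls spread evenly over identical workers."""
    waves = math.ceil(calls / workers)
    return waves * seconds_per_call

single = makespan(20, 3.0, 1)  # one Ollama queue: 20 sequential waves
fleet  = makespan(20, 3.0, 4)  # four nodes: 5 waves
```

Real nodes aren't identical, so the fleet number is optimistic, but the shape of the win — dividing the queue depth by the number of capable nodes — holds.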
Your MacBook is running inference for 30 minutes. Fans spin up. The chip throttles. Token generation drops from 40 tok/s to 15 tok/s. You don't even notice until the response takes forever.
With Herd: The scoring engine sees the MacBook's thermal state deteriorating and routes new requests to cooler machines. The MacBook recovers while other devices take the load.
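A quick back-of-envelope using the throttling figures above (the 600-token response length is an assumption):

```python
# What thermal throttling costs, using the 40 -> 15 tok/s drop from the text.
tokens = 600
cool_seconds = tokens / 40  # full speed
hot_seconds  = tokens / 15  # throttled
```

The same response takes well over twice as long on the throttled machine, which is exactly the gap a thermal-aware router can route around.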
You need llama3.3:70b for coding and nomic-embed-text for RAG. On a 64GB machine, loading both means one evicts the other constantly (model thrashing). Cold-loading a 70B model takes 15-30 seconds each time.
With Herd: The 70B model stays hot on the big machine. Embeddings run on the smaller machine. No eviction, no cold-loading delay.
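The placement decision reduces to a memory-fit check per node. A toy sketch — the model sizes are rough quantized estimates, and this ignores context/KV-cache overhead that Herd's actual accounting would include:

```python
# Rough quantized model sizes in GB (illustrative estimates).
MODEL_GB = {"llama3.3:70b": 42, "nomic-embed-text": 0.5}

# Unified memory per device, from the scenario above.
NODE_GB = {"mac-studio": 192, "macbook": 36}

def fits(model, node, already_loaded_gb=0.0):
    """Would loading this model stay within the node's memory?"""
    return already_loaded_gb + MODEL_GB[model] <= NODE_GB[node]
```

The 70B model only fits on the Studio, so it stays pinned and hot there, while the embedding model lands on the MacBook — neither ever evicts the other.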
Single Ollama has no dashboard, no health checks, no request tracing. When something is slow, you don't know why — is the model thrashing? Is memory pressure high? Is the queue backed up?
With Herd: 8-tab dashboard, 17 health checks, SQLite traces you can query, per-tag analytics showing which tools consume the most resources.
The beauty of Herd is that the upgrade from single Ollama is minimal:
```bash
# What you're doing now
ollama serve

# Add Herd (on the same machine or a different one)
pip install ollama-herd   # or: brew install ollama-herd
herd                      # starts the router

# On each machine (including this one)
herd-node                 # discovers the router via mDNS
```
Your existing Ollama installation, models, and configuration stay exactly the same. Herd sits in front of Ollama, not instead of it. Every tool that currently points at localhost:11434 just needs to point at router-ip:11435 instead.
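Because the request body is unchanged, repointing a client is just a base-URL swap. A sketch using the standard Ollama /api/chat payload shape (the router IP below is a placeholder):

```python
# Only the base URL changes; the request body stays identical.
OLLAMA_URL = "http://localhost:11434"     # single Ollama
HERD_URL   = "http://192.168.1.50:11435"  # Herd router (placeholder IP)

def chat_request(base_url):
    """Build the URL and JSON body for an Ollama-style chat request."""
    payload = {
        "model": "llama3.3:70b",
        "messages": [{"role": "user", "content": "hello"}],
    }
    return f"{base_url}/api/chat", payload

url_single, body_single = chat_request(OLLAMA_URL)
url_herd, body_herd = chat_request(HERD_URL)
```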
Running multiple Ollama instances without Herd means picking a machine by hand for every request, with no failover, no load balancing, and a separate queue on every node. You can solve some of this with nginx reverse proxy rules and manual scripting, but that's rebuilding what Herd already does, without the scoring engine, capacity learning, or health monitoring.
Single Ollama: free.
Ollama Herd: also free. Open source. MIT licensed.
The only cost is the 2 minutes it takes to install and start. If you have a second machine, there's no reason not to try it.
Single Ollama is where everyone starts. It's great for what it is — local inference on one machine. Ollama Herd is where you go when you outgrow one machine, when agents need more throughput, when your spare hardware should be contributing instead of sitting idle.
The question isn't "should I switch from Ollama?" — you're not switching. You're adding an orchestration layer that makes all your Ollama instances work together. Ollama is the engine. Herd is the fleet manager.
If you have one Mac and no concurrent demand, stay with single Ollama. The moment you have two machines or an agent workload, Herd pays for itself in the first hour.
```bash
pip install ollama-herd   # or: brew install ollama-herd
herd                      # start the router
herd-node                 # on each device
```
Do I need Herd right now? Probably not. Single Ollama handles one machine beautifully. But the moment you add a second device or start running agent frameworks that make parallel LLM calls, you will hit the limits of a single queue on a single machine. Herd is a 2-minute install, so the barrier is low when the time comes.
Does Herd replace my existing Ollama setup? No. Herd sits in front of Ollama, not instead of it. Your existing Ollama installation, models, and configuration stay exactly the same. Herd is the orchestration layer that connects multiple Ollama instances into one smart endpoint.
Will my current tools keep working? Yes. Herd exposes both the Ollama API and the OpenAI-compatible API. Any tool currently pointing at localhost:11434 just needs to point at your router's address on port 11435 instead.
How do nodes find the router? mDNS (multicast DNS) auto-discovery. Start herd-node on any device on your local network and the router finds it automatically. No IP addresses to configure, no config files to edit.
What happens if a node dies mid-request? Herd detects the failure and automatically retries the request on the next-best available node — before the first token is sent to the client. With single Ollama, if it's down, it's down.
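The retry behavior can be sketched as a loop over nodes in score order. The handler and failure below are simulated stand-ins, not Herd's actual internals:

```python
# Toy failover loop: try nodes best-first until one answers.
# A node "fails" by raising; we fall through to the next candidate.
def route_with_failover(nodes, handler):
    last_err = None
    for name in nodes:  # assumed already sorted best-score-first
        try:
            return name, handler(name)
        except ConnectionError as err:
            last_err = err  # node unreachable: try the next one
    raise RuntimeError("no nodes available") from last_err

def fake_handler(name):
    """Simulate the top-ranked node being offline."""
    if name == "mac-studio":
        raise ConnectionError("node unreachable")
    return "response from " + name

winner, result = route_with_failover(["mac-studio", "macbook"], fake_handler)
```

Because the retry happens before anything is streamed back, the client never sees a half-finished response from the dead node — it just gets the answer from whichever node succeeded.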