Ollama Herd vs exo

exo makes one big model from many devices. Herd makes many devices serve many models intelligently. They solve fundamentally different problems — and work great together.

What is exo?

exo (~42K GitHub stars) is an open-source distributed inference framework built by EXO Labs. It splits a single large AI model across multiple devices using tensor and pipeline parallelism, allowing you to run models that would not fit on any one machine. exo targets Apple Silicon clusters connected via Thunderbolt or network, and exposes an OpenAI-compatible API endpoint. It is the leading project for model-sharding across consumer hardware.

What is Ollama Herd?

Ollama Herd is an open-source smart multimodal AI router that turns multiple Ollama instances across Apple Silicon devices into one intelligent endpoint. It routes LLMs, embeddings, image generation, speech-to-text, and vision with a 7-signal scoring engine, mDNS auto-discovery, and an 8-tab real-time dashboard. Two commands to set up, zero config files. pip install ollama-herd or brew install ollama-herd.

How exo Works

exo takes a model too large for one machine and splits it into shards distributed across multiple devices using two parallelism strategies: tensor parallelism (each device holds a slice of every layer and computes in lockstep) and pipeline parallelism (each device holds a contiguous block of layers, and activations flow from device to device).

exo auto-discovers peers on the network, builds a topology, and exposes an OpenAI-compatible API endpoint. Underneath, it uses MLX on Apple Silicon and supports heterogeneous device mixes (different Mac models, different memory sizes).

Performance numbers: ~1.8x speedup on 2 devices, ~3.2x speedup on 4 devices for single-request latency. Multi-request throughput scales better — a 3-device cluster handles ~2.2x the tokens per second of a single device.
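
The layer-splitting idea behind pipeline parallelism can be sketched in a few lines of Python. This is an illustrative sketch, not exo's actual partitioning code; the layer and device counts are made up.

```python
# Minimal sketch of pipeline-parallel sharding: a model's layers are split
# into contiguous blocks, one per device, and activations flow block to
# block. Counts are illustrative, not exo's internals.

def shard_layers(num_layers: int, num_devices: int) -> list[range]:
    """Assign contiguous layer ranges to devices, spreading any remainder."""
    base, extra = divmod(num_layers, num_devices)
    shards, start = [], 0
    for d in range(num_devices):
        size = base + (1 if d < extra else 0)  # first `extra` devices get one more
        shards.append(range(start, start + size))
        start += size
    return shards

# An 80-layer model (a 70B-class transformer) across 3 devices:
print([len(s) for s in shard_layers(80, 3)])  # → [27, 27, 26]
```

Tensor parallelism instead splits every layer across all devices, which is why it benefits so much from a fast interconnect: devices must exchange partial results at every layer.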

Feature Comparison

Feature | exo | Ollama Herd
Core approach | Model sharding (tensor/pipeline parallelism) | Request routing (7-signal scoring)
Primary use case | Run models too large for one device | Route requests to best available device
Model types | LLMs only | LLMs, embeddings, image gen, STT, vision
Device discovery | Automatic peer discovery | mDNS auto-discovery
API compatibility | OpenAI-compatible | OpenAI + Ollama dual API
Backend | MLX, tinygrad | Ollama (any backend Ollama supports)
Queue management | None (single-model focus) | Per-node queue depth tracking
Health monitoring | Basic peer status | 17 health checks, 7-signal scoring
Load balancing | N/A (all devices serve one model) | Adaptive capacity learning per model/node
Dashboard | Minimal web UI | 8-tab dashboard (fleet, models, routing, benchmarks)
Benchmarking | Manual | Smart benchmark with statistical analysis
Context optimization | None | Dynamic context window optimization
Meeting detection | None | Detects video calls, reduces load on busy machines
Multi-model serving | One model at a time across the cluster | Many models across many nodes simultaneously
Setup | pip install exo + run on each device | pip install ollama-herd on one machine
Config required | None (auto-topology) | None (mDNS auto-discovery)
Interconnect | Benefits from Thunderbolt/RDMA | Standard network (WiFi or Ethernet)
Tests | Limited | 480+ tests, 17 health checks
License | GPL-3.0 | MIT

Where exo Wins

  1. Running models that don't fit on one machine. If you need to run Llama 3.1 405B and your biggest Mac has 192GB RAM, exo is the only option. Herd can't help — it routes to nodes, it doesn't split models.
  2. Maximum single-request throughput for huge models. Tensor parallelism across Thunderbolt-connected Macs gives near-linear speedup for large model inference. A single request to a 70B model is faster on 2 exo nodes than on 1 Herd node.
  3. Simplicity of mental model for single-model use. If your entire use case is "run one big model as fast as possible," exo's model is simpler: shard it and go.
  4. Thunderbolt/RDMA optimization. exo has invested heavily in low-latency device-to-device communication. Up to 99% latency reduction with Thunderbolt 5 RDMA compared to TCP.

Where Ollama Herd Wins

  1. Multi-model, multi-user workloads. Real teams don't run one model. They run coding assistants, embeddings for RAG, image generation, and speech-to-text — often simultaneously. Herd routes each request to the best node for that specific model.
  2. Intelligent routing. 7-signal scoring (VRAM, queue depth, historical latency, model affinity, context fit, thermal state, capacity) means requests go to the right machine, not just any machine.
  3. Multimodal support. Five model types (LLMs, embeddings, image generation, STT, and vision) with type-aware routing. exo is LLM-only.
  4. Operational visibility. 8-tab dashboard showing fleet status, model distribution, routing decisions, and benchmark results. Know what's happening across your fleet at a glance.
  5. Adaptive capacity learning. Herd learns each node's actual performance per model over time and adjusts routing. No manual tuning needed.
  6. Meeting detection. Automatically reduces load on machines running video calls. Small feature, huge quality-of-life improvement for real teams.
  7. Queue management. Tracks per-node queue depth and avoids piling requests on busy nodes. exo has no concept of request queuing — it's one model, one cluster.
  8. Ollama ecosystem. Works with the full Ollama model library and tooling. No special model format or conversion needed.
  9. Setup simplicity at fleet scale. Install on one machine, it discovers the rest. exo requires running the process on every participating device.
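
The 7-signal scoring from point 2 can be sketched as a weighted sum over normalized signals. The signal names come from the list above; the weights, normalization, and node data are illustrative assumptions, not Herd's actual values.

```python
# Hypothetical sketch of multi-signal node scoring, in the spirit of Herd's
# 7-signal engine. Weights and node values are made up for illustration.

WEIGHTS = {
    "free_vram": 0.25, "queue_depth": 0.20, "latency": 0.15,
    "model_affinity": 0.15, "context_fit": 0.10,
    "thermal": 0.10, "capacity": 0.05,
}

def score(node: dict) -> float:
    """Weighted sum of signals; each is normalized to [0, 1], higher is
    better (queue depth and latency are assumed pre-inverted)."""
    return sum(WEIGHTS[k] * node[k] for k in WEIGHTS)

def pick_node(nodes: dict[str, dict]) -> str:
    """Route the request to the highest-scoring node."""
    return max(nodes, key=lambda name: score(nodes[name]))

nodes = {
    "mac-studio":  {"free_vram": 0.9, "queue_depth": 0.4, "latency": 0.8,
                    "model_affinity": 1.0, "context_fit": 1.0,
                    "thermal": 0.9, "capacity": 0.8},
    "macbook-air": {"free_vram": 0.3, "queue_depth": 0.9, "latency": 0.5,
                    "model_affinity": 0.5, "context_fit": 1.0,
                    "thermal": 0.6, "capacity": 0.4},
}
print(pick_node(nodes))  # → mac-studio
```

The point of multiple signals is that no single one dominates: a node with free VRAM but a deep queue, or a fast node in a thermal throttle, loses to a merely adequate idle machine.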

Why They're Complementary

exo and Herd operate at different layers of the stack:

+-----------------------------------------+
|          Ollama Herd (routing)          |
|  Routes requests to the best node       |
+----------+----------+-------------------+
|  Mac #1  |  Mac #2  |  exo cluster      |
|  Ollama  |  Ollama  |  (Mac #3 + #4)    |
|  7B-70B  |  7B-70B  |  Running 405B     |
+----------+----------+-------------------+

An exo cluster running a large sharded model can expose an OpenAI-compatible endpoint. Herd can route to that endpoint as if it were any other node. This gives you the best of both worlds: everyday models served directly from individual Ollama nodes, frontier-scale models served by the sharded exo cluster, and one unified API in front of all of it.
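
The layered setup in the diagram can be sketched as a node table in which the exo cluster is just one more OpenAI-compatible backend. Host names and the model-size cutoffs are illustrative assumptions; exo's default API port is 52415 and Ollama's is 11434.

```python
# Hypothetical node table: two single-machine Ollama backends plus an exo
# cluster, all addressed the same way. Hosts and capacities are made up.

NODES = {
    "mac-studio":  {"url": "http://mac-studio.local:11434/v1",  "max_b": 70},
    "macbook-pro": {"url": "http://macbook-pro.local:11434/v1", "max_b": 70},
    "exo-cluster": {"url": "http://exo-head.local:52415/v1",    "max_b": 405},
}

def backend_for(model_size_b: int) -> str:
    """Pick the smallest backend that can hold a model of the given size
    (billions of parameters), reserving the exo cluster for huge models."""
    fitting = [n for n in NODES.values() if n["max_b"] >= model_size_b]
    return min(fitting, key=lambda n: n["max_b"])["url"]

print(backend_for(70))   # a single Ollama node suffices
print(backend_for(405))  # only the sharded exo cluster fits
```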

When to Choose

Scenario | Choose
Need to run a model too large for any single device | exo
Team of 2-10 people sharing a fleet of Macs | Ollama Herd
Multiple model types (LLM + embeddings + image gen) | Ollama Herd
Maximum throughput for one huge model | exo
Need operational visibility and health monitoring | Ollama Herd
Thunderbolt-connected Mac cluster for one workload | exo
WiFi/Ethernet fleet serving diverse workloads | Ollama Herd
Want both large model access and smart routing | Both together

Bottom Line

exo is a distributed compute layer — it makes small machines act like one big machine. Ollama Herd is a distributed routing layer — it makes many machines serve many users intelligently. They don't compete; they solve adjacent problems.

The typical exo user has 2-4 Macs hardwired together running one frontier model. The typical Herd user has 3-8 Macs on a network running a dozen different models for a team. When you need both (large model access + fleet routing), run them together.

Getting Started

Ollama Herd works alongside exo — you can try it without changing your existing setup. If you already have Ollama running on your Macs, Herd discovers them automatically and starts routing in under two minutes.

pip install ollama-herd    # or: brew install ollama-herd
herd                       # start router
herd-node                  # on each device

Switching from exo to Ollama Herd

exo and Ollama Herd solve different problems, so you likely won't "switch" — you may use both. But if you want fleet routing instead of model sharding:

  1. Install Ollama on each device — exo uses its own runtime, Herd uses Ollama. Run ollama serve and pull your models on each machine.
  2. Install Ollama Herd — pip install ollama-herd on your router machine, then herd to start.
  3. Start node agents — run herd-node on each device. They discover the router automatically via mDNS.

Your existing tools just need to point at http://router-ip:11435 instead of the exo endpoint. If you still want to run one massive model across devices, keep exo — and register the exo cluster as a Herd node for the best of both worlds.
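The repointing above amounts to swapping one base URL. A minimal sketch, assuming the port from the paragraph above and the standard OpenAI-compatible chat path; the model name is illustrative, and no request is actually sent here.

```python
# Build the request an OpenAI-style client would POST to the Herd router.
# BASE_URL comes from the paragraph above; path and payload shape follow
# the OpenAI-compatible API convention.
import json

BASE_URL = "http://router-ip:11435/v1"  # previously: the exo endpoint

def chat_request(model: str, prompt: str) -> tuple[str, bytes]:
    """Return the (url, body) pair for an OpenAI-compatible chat call."""
    url = f"{BASE_URL}/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return url, body

url, body = chat_request("llama3.1:70b", "Say hello")
print(url)  # → http://router-ip:11435/v1/chat/completions
```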

FAQ

Is Ollama Herd a good alternative to exo?

They solve different problems. exo shards one large model across devices so you can run models that do not fit on a single machine. Ollama Herd routes requests across a fleet of devices, picking the best node for each request. If you need multi-model routing rather than single-model sharding, Herd is the right choice.

Can I use Ollama Herd with exo?

Yes. An exo cluster exposes an OpenAI-compatible endpoint, which Herd can route to as if it were any other node. This gives you large-model sharding via exo and intelligent fleet routing via Herd through a single unified API.

How does Ollama Herd compare to exo for running multiple models?

Herd is built for multi-model workloads. It routes LLMs, embeddings, image generation, speech-to-text, and vision across your fleet simultaneously, picking the best device for each request type. exo focuses on running one model at a time across the cluster.

Does Ollama Herd require Thunderbolt connections?

No. Ollama Herd works over standard WiFi or Ethernet. It routes requests to devices rather than sharding model layers, so it does not need the high-bandwidth interconnect that exo benefits from.

Is Ollama Herd free?

Yes. Ollama Herd is open-source under the MIT license. No paid tiers, no API keys, no subscriptions.
