exo makes one big model from many devices. Herd makes many devices serve many models intelligently. They solve fundamentally different problems — and work great together.
exo (~42K GitHub stars) is an open-source distributed inference framework built by EXO Labs. It splits a single large AI model across multiple devices using tensor and pipeline parallelism, allowing you to run models that would not fit on any one machine. exo targets Apple Silicon clusters connected via Thunderbolt or network, and exposes an OpenAI-compatible API endpoint. It is the leading project for model-sharding across consumer hardware.
Ollama Herd is an open-source smart multimodal AI router that turns multiple Ollama instances across Apple Silicon devices into one intelligent endpoint. It routes LLMs, embeddings, image generation, speech-to-text, and vision with a 7-signal scoring engine, mDNS auto-discovery, and an 8-tab real-time dashboard. Two commands to set up, zero config files: install with `pip install ollama-herd` or `brew install ollama-herd`.
exo takes a model too large for one machine and splits it into shards distributed across multiple devices using two parallelism strategies: tensor parallelism, which splits each layer's weight matrices across devices, and pipeline parallelism, which assigns contiguous groups of layers to different devices.
exo auto-discovers peers on the network, builds a topology, and exposes an OpenAI-compatible API endpoint. Underneath, it uses MLX on Apple Silicon and supports heterogeneous device mixes (different Mac models, different memory sizes).
Performance numbers: ~1.8x speedup on 2 devices, ~3.2x speedup on 4 devices for single-request latency. Multi-request throughput scales better — a 3-device cluster handles ~2.2x the tokens per second of a single device.
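Those speedups translate into per-device efficiency that drops as the cluster grows, which is typical of sharded inference. A quick sketch computing parallel efficiency from the quoted figures (the formula is simply speedup divided by device count; the numbers come from the paragraph above):

```python
# Parallel efficiency = measured speedup / number of devices.
# Speedup figures are the single-request latency numbers quoted above.
def efficiency(speedup: float, devices: int) -> float:
    return speedup / devices

print(f"2 devices: {efficiency(1.8, 2):.0%} efficient")  # 90%
print(f"4 devices: {efficiency(3.2, 4):.0%} efficient")  # 80%
```

The trend matters when sizing a cluster: each added device contributes a bit less than the last, so exo shines most when the model simply cannot fit on fewer machines.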
| Feature | exo | Ollama Herd |
|---|---|---|
| Core approach | Model sharding (tensor/pipeline parallelism) | Request routing (7-signal scoring) |
| Primary use case | Run models too large for one device | Route requests to best available device |
| Model types | LLMs only | LLMs, embeddings, image gen, STT |
| Device discovery | Automatic peer discovery | mDNS auto-discovery |
| API compatibility | OpenAI-compatible | OpenAI + Ollama dual API |
| Backend | MLX, tinygrad | Ollama (any backend Ollama supports) |
| Queue management | None — single model focus | Per-node queue depth tracking |
| Health monitoring | Basic peer status | 17 health checks, 7-signal scoring |
| Load balancing | N/A (all devices serve one model) | Adaptive capacity learning per model/node |
| Dashboard | Minimal web UI | 8-tab dashboard (fleet, models, routing, benchmarks) |
| Benchmarking | Manual | Smart benchmark with statistical analysis |
| Context optimization | None | Dynamic context window optimization |
| Meeting detection | None | Detects video calls, reduces load on busy machines |
| Multi-model serving | One model at a time across the cluster | Many models across many nodes simultaneously |
| Setup | pip install exo + run on each device | pip install ollama-herd on one machine |
| Config required | None (auto-topology) | None (mDNS auto-discovery) |
| Interconnect | Benefits from Thunderbolt/RDMA | Standard network (WiFi or Ethernet) |
| Tests | Limited | 480+ tests, 17 health checks |
| License | GPL-3.0 | MIT |
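The "7-signal scoring" row refers to Herd's routing engine. The exact signals and weights are internal to Herd; the sketch below is a hypothetical weighted-sum scorer, with all signal names, weights, and the data model being illustrative assumptions rather than Herd's actual implementation:

```python
from dataclasses import dataclass

# Hypothetical per-node snapshot; field names are illustrative,
# not Herd's real data model.
@dataclass
class NodeStats:
    queue_depth: int       # requests waiting on this node
    latency_ms: float      # recent average response latency
    healthy: bool          # passed its last health checks
    has_model: bool        # model already loaded in memory
    free_ram_gb: float
    in_meeting: bool       # video call detected (Herd backs off)
    tokens_per_sec: float  # learned throughput for this model

def score(n: NodeStats) -> float:
    """Toy weighted sum over seven signals; weights are made up."""
    if not n.healthy:
        return float("-inf")   # never route to an unhealthy node
    s = 0.0
    s += 2.0 * n.tokens_per_sec / 100.0
    s -= 1.0 * n.queue_depth
    s -= 0.5 * n.latency_ms / 1000.0
    s += 3.0 if n.has_model else 0.0   # avoid a cold model load
    s += 0.1 * n.free_ram_gb
    s -= 5.0 if n.in_meeting else 0.0  # don't disturb a call
    return s

nodes = {
    "mac-1": NodeStats(0, 120.0, True, True, 24.0, False, 60.0),
    "mac-2": NodeStats(3, 300.0, True, False, 8.0, True, 40.0),
}
best = max(nodes, key=lambda name: score(nodes[name]))
print(best)  # mac-1
```

The design point is that routing is a cheap scalar comparison per request, which is why Herd can run over ordinary WiFi while exo's sharding needs fast interconnects.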
exo and Herd operate at different layers of the stack:
```
+-----------------------------------------+
|          Ollama Herd (routing)          |
|    Routes requests to the best node     |
+----------+----------+-------------------+
|  Mac #1  |  Mac #2  |    exo cluster    |
|  Ollama  |  Ollama  |   (Mac #3 + #4)   |
|  7B-70B  |  7B-70B  |   Running 405B    |
+----------+----------+-------------------+
```
An exo cluster running a large sharded model can expose an OpenAI-compatible endpoint. Herd can route to that endpoint as if it were any other node. This gives you the best of both worlds: large-model sharding via exo and intelligent fleet routing via Herd.
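Concretely, the router only needs to know that one "node" in its table is actually the exo cluster's endpoint. A minimal sketch of that dispatch decision (node names, URLs, ports, and the first-match routing rule are all illustrative assumptions, not Herd's internals):

```python
# Each node advertises the models it can serve; the exo cluster
# appears as one ordinary node that happens to serve the 405B model.
FLEET = {
    "mac-1":       {"url": "http://mac-1.local:11434", "models": {"llama3:8b", "llama3:70b"}},
    "mac-2":       {"url": "http://mac-2.local:11434", "models": {"llama3:8b", "llama3:70b"}},
    "exo-cluster": {"url": "http://mac-3.local:52415", "models": {"llama3:405b"}},
}

def pick_node(model: str) -> str:
    """Return the name of the first node able to serve `model`."""
    for name, node in FLEET.items():
        if model in node["models"]:
            return name
    raise LookupError(f"no node serves {model}")

print(pick_node("llama3:8b"))    # mac-1
print(pick_node("llama3:405b"))  # exo-cluster
```

Small requests land on single Macs; only requests for the sharded model are forwarded to the exo cluster.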
| Scenario | Choose |
|---|---|
| Need to run a model too large for any single device | exo |
| Team of 2-10 people sharing a fleet of Macs | Ollama Herd |
| Multiple model types (LLM + embeddings + image gen) | Ollama Herd |
| Maximum throughput for one huge model | exo |
| Need operational visibility and health monitoring | Ollama Herd |
| Thunderbolt-connected Mac cluster for one workload | exo |
| WiFi/Ethernet fleet serving diverse workloads | Ollama Herd |
| Want both large model access and smart routing | Both together |
exo is a distributed compute layer — it makes small machines act like one big machine. Ollama Herd is a distributed routing layer — it makes many machines serve many users intelligently. They don't compete; they solve adjacent problems.
The typical exo user has 2-4 Macs hardwired together running one frontier model. The typical Herd user has 3-8 Macs on a network running a dozen different models for a team. When you need both (large model access + fleet routing), run them together.
Ollama Herd works alongside exo — you can try it without changing your existing setup. If you already have Ollama running on your Macs, Herd discovers them automatically and starts routing in under two minutes.
```shell
pip install ollama-herd   # or: brew install ollama-herd
herd        # start router
herd-node   # on each device
```
exo and Ollama Herd solve different problems, so you likely won't "switch" — you may use both. But if you want fleet routing instead of model sharding:
1. Run `ollama serve` and pull your models on each machine.
2. `pip install ollama-herd` on your router machine, then run `herd` to start.
3. Run `herd-node` on each device. They discover the router automatically via mDNS.

Your existing tools just need to point at `http://router-ip:11435` instead of the exo endpoint. If you still want to run one massive model across devices, keep exo, and register the exo cluster as a Herd node for the best of both worlds.
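Since Herd speaks the OpenAI-compatible protocol, repointing an existing tool is a one-line change to its base URL. A hedged sketch of the request a client would send (the router address is the placeholder from the text above; the model tag is an assumption):

```python
import json
import urllib.request

# Point any OpenAI-compatible client at the Herd router instead of exo.
BASE_URL = "http://router-ip:11435"  # placeholder address from the steps above

payload = {
    "model": "llama3:8b",            # assumed model tag
    "messages": [{"role": "user", "content": "Hello from the fleet"}],
}
req = urllib.request.Request(
    f"{BASE_URL}/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req) would send it; omitted here since no
# router is running in this sketch.
print(req.full_url)
```

Everything else about the client stays the same; Herd decides which node actually serves the request.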
**Should I use exo or Ollama Herd?** They solve different problems. exo shards one large model across devices so you can run models that do not fit on a single machine. Ollama Herd routes requests across a fleet of devices, picking the best node for each request. If you need multi-model routing rather than single-model sharding, Herd is the right choice.
**Can I use exo and Ollama Herd together?** Yes. An exo cluster exposes an OpenAI-compatible endpoint, which Herd can route to as if it were any other node. This gives you large-model sharding via exo and intelligent fleet routing via Herd through a single unified API.
**Which one handles multiple model types?** Herd is built for multi-model workloads. It routes LLMs, embeddings, image generation, speech-to-text, and vision across your fleet simultaneously, picking the best device for each request type. exo focuses on running one model at a time across the cluster.
**Does Ollama Herd need Thunderbolt like exo?** No. Ollama Herd works over standard WiFi or Ethernet. It routes requests to devices rather than sharding model layers, so it does not need the high-bandwidth interconnect that exo benefits from.
**Is Ollama Herd free?** Yes. Ollama Herd is open-source under the MIT license. No paid tiers, no API keys, no subscriptions.