DIY nginx round-robin works on day one and breaks on day thirty. Building feature parity with Herd's 7-signal scoring, thermal awareness, multimodal routing, and fleet dashboard takes 3–6 months of weekends. Herd takes 2 minutes to install.
DIY routing is the most common first attempt at distributing Ollama requests across multiple machines. It typically starts with an nginx upstream block that round-robins requests across backend servers, then evolves into health-check scripts, model-aware routing in Python, and eventually a homegrown dashboard. Each stage adds maintenance burden and still lacks the fleet intelligence that purpose-built routers provide.
Ollama Herd is an open-source smart multimodal AI router that turns multiple Ollama instances across Apple Silicon devices into one intelligent endpoint. It routes LLMs, embeddings, image generation, speech-to-text, and vision with a 7-signal scoring engine, mDNS auto-discovery, and an 8-tab real-time dashboard. Two commands to set up, zero config files: `pip install ollama-herd` or `brew install ollama-herd`.
The most common "competitor" to Ollama Herd isn't another product — it's the impulse to build something yourself. Every engineer with multiple Macs running Ollama has the same thought: "I'll just throw nginx in front of these and round-robin requests." It works on day one. It breaks on day thirty.
Most DIY setups follow a predictable evolution:
**Stage 1: nginx round-robin.**

```nginx
upstream ollama {
    server mac-mini-1:11434;
    server mac-mini-2:11434;
    server macbook:11434;
}
```

Round-robin distribution. No health checks. No awareness of which models live where. It works fine until one machine is asleep or running a different model.

**Stage 2: health-check scripts.** A cron job or bash script that pings `/api/tags` on each node, updates the nginx config, and reloads. Now you handle nodes going offline, but you still don't know which node is best for a given request.

**Stage 3: model-aware routing.** A Python Flask/FastAPI app that queries each node for its available models and routes by model name. You now have a basic model registry, but it goes stale, handles concurrent requests poorly, and has no retry logic.

**Stage 4: homegrown observability.** You realize you have no visibility. You start logging to a file. Then you want graphs. Then you want to see queue depth. Then you want tracing. Each addition is another weekend project that you maintain forever.
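The Stage-3 router usually boils down to a static model registry plus a first-match lookup. A minimal sketch of that pattern (node addresses and model names here are illustrative, not from any real setup) makes its weaknesses visible:

```python
# Minimal DIY model-aware router: a static registry mapping each node
# to the models it reported the last time we asked. The weakness is
# baked in: the registry is a snapshot, so it silently goes stale when
# a node pulls or deletes a model.
registry = {
    "http://mac-mini-1:11434": ["llama3:8b", "nomic-embed-text"],
    "http://mac-mini-2:11434": ["llama3:8b"],
    "http://macbook:11434": ["llava:13b"],
}

def pick_node(model):
    """Return the first node whose last-known model list contains
    `model`; no load awareness, no health check, no retry."""
    for node, models in registry.items():
        if model in models:
            return node
    return None

print(pick_node("llava:13b"))   # routes vision requests to the macbook
print(pick_node("mistral:7b"))  # None: no fallback, the request just fails
```

Note that `pick_node` always returns the *first* match, so two nodes hosting the same model never share load intelligently, and an offline node keeps receiving requests until someone refreshes the registry by hand.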
| Feature | DIY / nginx | Ollama Herd |
|---|---|---|
| Basic load balancing | Round-robin or random | 7-signal weighted scoring |
| Health checks | Manual scripting required | 16 built-in health checks |
| Model discovery | Query each node manually | mDNS auto-discovery, zero config |
| Thermal awareness | Not available | Reads macOS thermal state, avoids throttled nodes |
| GPU/memory monitoring | Custom scripts per node | Real-time VRAM and system memory tracking |
| Queue management | Not available | Per-node queue depth tracking with backpressure |
| Capacity learning | Not available | Adaptive capacity engine learns node performance over time |
| Meeting detection | Not available | Detects active meetings, reduces load on presenter's machine |
| Auto-retry on failure | Manual implementation | Automatic retry with next-best node selection |
| Model fallbacks | Not available | Falls back to compatible alternative models |
| Multimodal routing | Separate configs per model type | Unified routing for LLMs, embeddings, image gen, STT |
| Request tracing | DIY logging | Built-in distributed tracing across nodes |
| Dashboard | Build your own | 8-tab dashboard out of the box |
| API compatibility | Whatever you build | OpenAI + Ollama API compatible |
| Dynamic context optimization | Not available | Adjusts context window based on node capabilities |
| Smart benchmarking | Not available | Benchmarks nodes to calibrate scoring weights |
| Adding a new node | Edit configs, restart services | Automatic via mDNS — just start Ollama |
The gap between "requests go to different machines" and "requests go to the right machine" is enormous.
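To make that gap concrete, here is weighted scoring in miniature. The signal names and weights below are illustrative guesses, not Herd's actual 7-signal engine; the point is the shape of the decision, not the numbers:

```python
# Hypothetical weighted scoring across a few health/performance signals.
# A hard requirement (the node has the model) dominates; a throttled
# node loses its thermal bonus; a deep queue subtracts from the score.
WEIGHTS = {
    "has_model": 5.0,      # 0 or 1: must host the requested model
    "free_vram_gb": 0.2,   # more headroom scores higher
    "queue_depth": -1.0,   # negative: a busy node scores lower
    "thermal_ok": 6.0,     # 0 if throttled, 1 if nominal
}

def score(node):
    return sum(WEIGHTS[k] * node[k] for k in WEIGHTS)

nodes = [
    {"name": "mini-1", "has_model": 1, "free_vram_gb": 10, "queue_depth": 4, "thermal_ok": 1},
    {"name": "mini-2", "has_model": 1, "free_vram_gb": 6,  "queue_depth": 0, "thermal_ok": 1},
    {"name": "mbp",    "has_model": 1, "free_vram_gb": 24, "queue_depth": 0, "thermal_ok": 0},
]

best = max(nodes, key=score)
print(best["name"])  # mini-2: idle and cool beats big-but-throttled
```

Round-robin would have sent every third request to the throttled MacBook Pro; a scoring router sends it nothing until it cools down. Tuning those weights well is exactly the part that takes weeks.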
DIY routing has a specific failure mode: it works perfectly on day 1 and degrades as your fleet changes.
Every fleet change — adding a node, removing a node, pulling a new model, changing hardware — requires manual config updates. Herd handles all of these automatically via mDNS discovery and real-time model polling.
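The usual DIY answer to staleness is to re-poll each node's `/api/tags` on an interval. A sketch of that loop, with the HTTP call injected as a plain function so the logic is testable (the node list and response shape are placeholders), shows what it does and doesn't solve:

```python
# Re-polling sketch: rebuild the model registry from whatever each node
# reports right now. `fetch_tags` stands in for a real HTTP GET of
# {node}/api/tags; nodes that error out are dropped until the next poll.
# Note what this does NOT solve: the node list itself is still
# hardcoded, which is the manual work mDNS discovery replaces.
def refresh_registry(nodes, fetch_tags):
    registry = {}
    for node in nodes:
        try:
            tags = fetch_tags(node)
        except OSError:
            continue  # node asleep or offline: skip until next poll
        registry[node] = [m["name"] for m in tags.get("models", [])]
    return registry

# Stubbed fetcher standing in for the real HTTP call:
def fake_fetch(node):
    if node.endswith("macbook:11434"):
        raise OSError("host unreachable")  # lid closed
    return {"models": [{"name": "llama3:8b"}]}

reg = refresh_registry(
    ["http://mac-mini-1:11434", "http://macbook:11434"], fake_fetch
)
print(reg)  # only the reachable node appears
```

Between polls the registry is still wrong, and adding a fourth Mac still means editing the hardcoded list and redeploying the script.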
| Capability | DIY Time | Herd Time |
|---|---|---|
| Basic round-robin | 30 minutes | 2 minutes (pip install ollama-herd && herd) |
| Round-robin + health checks | 1–3 days | Included |
| Model-aware routing | 1–2 weeks | Included |
| Performance-based scoring | 2–4 weeks | Included |
| Thermal + memory awareness | Not practical without macOS internals | Included |
| Queue management + backpressure | 1–2 weeks | Included |
| Dashboard + observability | 2–4 weeks | Included |
| Multimodal routing (5 model types) | Multiply above by 4 | Included |
| Auto-retry + fallbacks | 1 week | Included |
| Total for feature parity | 3–6 months of weekends | 2 minutes |
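The "auto-retry + fallbacks" row is a good example of hidden DIY cost: the first version is ten lines, and the edge cases are the week. A minimal sketch of retry-with-next-best-node (the ranking and the send function are placeholders, not any real API):

```python
def send_with_retry(ranked_nodes, send, max_attempts=3):
    """Try nodes in score order; on failure, fall through to the
    next-best candidate. `send` stands in for the actual HTTP request."""
    last_error = None
    for node in ranked_nodes[:max_attempts]:
        try:
            return send(node)
        except OSError as err:
            last_error = err  # mark this node failed, try the next-best
    raise RuntimeError("all candidates failed") from last_error

# Stub: the first-choice node is down, the second answers.
def flaky_send(node):
    if node == "mini-1":
        raise OSError("connection refused")
    return f"handled by {node}"

print(send_with_retry(["mini-1", "mini-2", "mbp"], flaky_send))
# handled by mini-2
```

The week goes into everything this sketch skips: distinguishing retryable from non-retryable failures, not retrying a request that already streamed half a response, and feeding failures back into the ranking so a flapping node stops being "next best."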
DIY routing is a valid choice for simple, static, single-model-type setups — or as a learning exercise. But the engineering effort to match Herd's capabilities is measured in months, not hours. The question isn't "can I build this myself?" — of course you can. The question is "is building and maintaining a custom AI routing engine the best use of my engineering time?"
For most people running local AI on Apple Silicon, the answer is: install Herd in 2 minutes and spend those engineering weekends on the things you're actually building with AI.
```shell
pip install ollama-herd   # or: brew install ollama-herd
herd                      # start the router
herd-node                 # on each device
```
If you've been running nginx round-robin or custom scripts:
1. Run `pip install ollama-herd` on the machine currently running your proxy/scripts.
2. Run `herd`. It listens on port 11435. Update your clients to point here instead of your nginx/script endpoint.
3. Run `herd-node` on each Ollama machine. Herd discovers them via mDNS, so there is no backend IP list to maintain.

You can run Herd alongside your existing setup during the transition. Once you've verified it works, remove the old scripts. Your health checks, retry logic, and load balancing are now handled automatically, and they're better than what you built, because Herd has 7-signal scoring and 480 tests behind it.
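Before deleting the old scripts, it's worth checking that the new endpoint serves at least every model the old one did. A sketch with the HTTP calls stubbed out (the `/api/tags` response shape follows Ollama's list-models format; the sample payloads are invented):

```python
def model_names(tags_response):
    """Extract model names from an Ollama-style /api/tags payload."""
    return {m["name"] for m in tags_response.get("models", [])}

def covers(old_tags, new_tags):
    """True if every model the old endpoint served is also
    visible through the new one."""
    return model_names(old_tags) <= model_names(new_tags)

# Stubbed responses standing in for GET <endpoint>/api/tags
# against the old proxy and against Herd on port 11435:
old = {"models": [{"name": "llama3:8b"}]}
new = {"models": [{"name": "llama3:8b"}, {"name": "llava:13b"}]}
print(covers(old, new))  # True: safe to point clients at the new endpoint
```

If `covers` returns False, a node probably hasn't started `herd-node` yet; fix that before cutting clients over.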
**How long does DIY Ollama routing actually take?** Basic round-robin takes 30 minutes. Adding health checks takes 1–3 days. Model-aware routing takes 1–2 weeks. Reaching feature parity with Herd (scoring, thermal awareness, multimodal routing, dashboard, retries) takes 3–6 months of weekends, and then you maintain it forever.

**Can I switch from a DIY setup to Herd without changing my clients?** Yes. Herd uses the standard Ollama and OpenAI APIs, so any client pointing at your DIY proxy can switch to Herd by changing the endpoint URL. Your Ollama installations and models stay exactly the same.

**Does Herd replace nginx?** For Ollama routing, yes, and more. Herd handles load balancing, health checks, retries, and model-aware routing, plus scoring intelligence, thermal awareness, and multimodal support that nginx cannot provide. For general web serving or reverse-proxy needs unrelated to Ollama, nginx remains the right tool.

**What if I need custom routing logic?** Herd covers the vast majority of routing scenarios through its 7-signal scoring and configuration options. If you need deeply custom logic, such as routing based on request content analysis or integration with an internal scheduling system, DIY may still make sense. But evaluate whether Herd's configuration handles your needs before building from scratch.

**Is building it myself still worthwhile as a learning exercise?** Absolutely. Building a load balancer teaches valuable lessons about distributed systems, health checking, and failure modes. Build it, learn from it, then decide whether maintaining it long-term is worth the ongoing effort compared to a 2-minute Herd install.