Ollama Herd vs DIY Scripts & nginx Round-Robin

DIY nginx round-robin works on day one and breaks on day thirty. Building feature parity with Herd's 7-signal scoring, thermal awareness, multimodal routing, and fleet dashboard takes 3–6 months of weekends. Herd takes 2 minutes to install.

What are DIY Scripts and nginx Round-Robin?

DIY routing is the most common first attempt at distributing Ollama requests across multiple machines. It typically starts with an nginx upstream block that round-robins requests across backend servers, then evolves into health-check scripts, model-aware routing in Python, and eventually a homegrown dashboard. Each stage adds maintenance burden and still lacks the fleet intelligence that purpose-built routers provide.

What is Ollama Herd?

Ollama Herd is an open-source smart multimodal AI router that turns multiple Ollama instances across Apple Silicon devices into one intelligent endpoint. It routes LLMs, embeddings, image generation, speech-to-text, and vision with a 7-signal scoring engine, mDNS auto-discovery, and an 8-tab real-time dashboard. Two commands to set up, zero config files. pip install ollama-herd or brew install ollama-herd.

Overview

The most common "competitor" to Ollama Herd isn't another product — it's the impulse to build something yourself. Every engineer with multiple Macs running Ollama has the same thought: "I'll just throw nginx in front of these and round-robin requests." It works on day one. It breaks on day thirty.

What DIY Typically Looks Like

Most DIY setups follow a predictable evolution:

Stage 1 — nginx upstream block (30 minutes)

upstream ollama {
    server mac-mini-1:11434;    # each backend is a plain Ollama instance
    server mac-mini-2:11434;
    server macbook:11434;
}
server {
    listen 11434;               # on the proxy host
    location / { proxy_pass http://ollama; }
}

Round-robin distribution. No health checks. No awareness of which models live where. Works fine until one machine is asleep or running a different model.

Stage 2 — Health check scripts (1–3 days)

A cron job or bash script that pings /api/tags on each node, updates an nginx config, and reloads. Now you handle nodes going offline, but you still don't know which node is best for a given request.
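A minimal Python sketch of that same health-check idea, using Ollama's standard /api/tags endpoint (the hostnames are hypothetical, matching the examples above):

```python
import json
import urllib.request

NODES = ["mac-mini-1:11434", "mac-mini-2:11434", "macbook:11434"]  # hypothetical hosts

def probe(node, timeout=2.0):
    """Return the model names a node reports via /api/tags, or None if unreachable."""
    try:
        with urllib.request.urlopen(f"http://{node}/api/tags", timeout=timeout) as resp:
            tags = json.load(resp)
        return [m["name"] for m in tags.get("models", [])]
    except OSError:
        return None

def healthy_nodes(probe_results):
    """Keep only nodes whose probe succeeded (pure function, easy to test)."""
    return [node for node, models in probe_results.items() if models is not None]

# statuses = {node: probe(node) for node in NODES}
# healthy_nodes(statuses) -> the nodes worth routing to right now
```

Even this small script needs a scheduler, a config-rewrite step, and an nginx reload around it before it does anything useful.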

Stage 3 — Model-aware routing (1–2 weeks)

A Python Flask/FastAPI app that queries each node for available models and routes by model name. You now have a basic model registry, but it gets stale, doesn't handle concurrent requests well, and has no retry logic.
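The routing core of such an app can be sketched in a few lines; note that the registry is a static snapshot, which is exactly why it goes stale:

```python
import itertools

class ModelRouter:
    """Stage-3 sketch: round-robin only among nodes that serve the model.

    The registry (model name -> node list) is filled once from each
    node's /api/tags; nothing here refreshes it when models are pulled
    or removed, and there is no retry on failure.
    """

    def __init__(self, registry):
        self.registry = registry
        self._cycles = {}  # per-model round-robin state

    def pick(self, model):
        nodes = self.registry.get(model)
        if not nodes:
            raise LookupError(f"no node serves {model!r}")
        if model not in self._cycles:
            self._cycles[model] = itertools.cycle(nodes)
        return next(self._cycles[model])
```

Usage: `ModelRouter({"llama3": ["mac-mini-1", "mac-mini-2"]}).pick("llama3")` alternates between the two nodes, regardless of which one is busy, hot, or out of memory.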

Stage 4 — The "I need a dashboard" moment (ongoing)

You realize you have no visibility. You start logging to a file. Then you want graphs. Then you want to see queue depth. Then you want tracing. Each addition is another weekend project that you maintain forever.

Feature Comparison

Feature | DIY / nginx | Ollama Herd
------- | ----------- | -----------
Basic load balancing | Round-robin or random | 7-signal weighted scoring
Health checks | Manual scripting required | 16 built-in health checks
Model discovery | Query each node manually | mDNS auto-discovery, zero config
Thermal awareness | Not available | Reads macOS thermal state, avoids throttled nodes
GPU/memory monitoring | Custom scripts per node | Real-time VRAM and system memory tracking
Queue management | Not available | Per-node queue depth tracking with backpressure
Capacity learning | Not available | Adaptive capacity engine learns node performance over time
Meeting detection | Not available | Detects active meetings, reduces load on presenter's machine
Auto-retry on failure | Manual implementation | Automatic retry with next-best node selection
Model fallbacks | Not available | Falls back to compatible alternative models
Multimodal routing | Separate configs per model type | Unified routing for LLMs, embeddings, image gen, STT
Request tracing | DIY logging | Built-in distributed tracing across nodes
Dashboard | Build your own | 8-tab dashboard out of the box
API compatibility | Whatever you build | OpenAI + Ollama API compatible
Dynamic context optimization | Not available | Adjusts context window based on node capabilities
Smart benchmarking | Not available | Benchmarks nodes to calibrate scoring weights
Adding a new node | Edit configs, restart services | Automatic via mDNS, just start Ollama

What DIY Gets You

- Requests spread across machines, via round-robin or random selection
- Basic liveness checks, if you script and maintain them
- Full control over the routing logic
- A hands-on lesson in distributed systems

What DIY Misses

The gap between "requests go to different machines" and "requests go to the right machine" is enormous:
- No awareness of which node is coolest, least loaded, or already has the model in memory
- No queue depth tracking or backpressure when a node saturates
- No automatic retry against the next-best node when a request fails
- No model fallbacks, multimodal routing, or request tracing
- No dashboard: you find out about problems from users, not from graphs

The Maintenance Burden

DIY routing has a specific failure mode: it works perfectly on day one and degrades as your fleet changes.

Every fleet change — adding a node, removing a node, pulling a new model, changing hardware — requires manual config updates. Herd handles all of these automatically via mDNS discovery and real-time model polling.
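The manual step that Herd eliminates looks roughly like this in a DIY setup: a helper that re-renders the nginx upstream block from the current node list (followed by an `nginx -s reload`), run by hand or by a script on every fleet change:

```python
def render_upstream(nodes, name="ollama"):
    """Re-render an nginx upstream block from the current node list.

    nodes: list of (host, port) tuples. In a DIY setup this runs on
    every fleet change; mDNS discovery is what removes the step.
    """
    servers = "\n".join(f"    server {host}:{port};" for host, port in nodes)
    return f"upstream {name} {{\n{servers}\n}}\n"

# render_upstream([("mac-mini-1", 11434), ("macbook", 11434)])
# produces the upstream block from Stage 1, ready to write to disk
```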

Time-to-Build Comparison

Capability | DIY Time | Herd Time
---------- | -------- | ---------
Basic round-robin | 30 minutes | 2 minutes (pip install ollama-herd && herd)
Round-robin + health checks | 1–3 days | Included
Model-aware routing | 1–2 weeks | Included
Performance-based scoring | 2–4 weeks | Included
Thermal + memory awareness | Not practical without macOS internals | Included
Queue management + backpressure | 1–2 weeks | Included
Dashboard + observability | 2–4 weeks | Included
Multimodal routing (5 model types) | Multiply above by 4 | Included
Auto-retry + fallbacks | 1 week | Included
Total for feature parity | 3–6 months of weekends | 2 minutes

When DIY Makes Sense

- Your fleet is small, static, and serves a single model type
- You need deeply custom routing logic, such as routing on request content or integrating an internal scheduler
- You want the learning exercise of building a load balancer yourself

When to Just Use Herd

- You run multiple machines with more than one model or model type
- You want health checks, retries, and failover without maintaining them
- Your fleet changes: nodes sleep, models get pulled, hardware rotates
- You want a dashboard and request tracing without building them

Bottom Line

DIY routing is a valid choice for simple, static, single-model-type setups — or as a learning exercise. But the engineering effort to match Herd's capabilities is measured in months, not hours. The question isn't "can I build this myself?" — of course you can. The question is "is building and maintaining a custom AI routing engine the best use of my engineering time?"

For most people running local AI on Apple Silicon, the answer is: install Herd in 2 minutes and spend those engineering weekends on the things you're actually building with AI.

Getting Started

pip install ollama-herd    # or: brew install ollama-herd
herd                       # start the router
herd-node                  # on each device
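Since Herd speaks the standard Ollama API, pointing a client at it is just a URL change. A minimal sketch, assuming the router is reachable at localhost on its default port 11435:

```python
import json
import urllib.request

HERD_URL = "http://localhost:11435"  # assumption: router running locally on its default port

def build_generate_request(model, prompt, base=HERD_URL):
    """Build a standard Ollama /api/generate request aimed at the Herd
    endpoint instead of a single node; the URL is the only client-side change."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return urllib.request.Request(
        f"{base}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )

# with urllib.request.urlopen(build_generate_request("llama3", "hello")) as resp:
#     print(json.load(resp)["response"])
```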

Replacing Your DIY Setup with Ollama Herd

If you've been running nginx round-robin or custom scripts:

  1. Install Ollama Herd — run pip install ollama-herd on the machine currently running your proxy/scripts.
  2. Start the router — run herd. It listens on port 11435. Update your clients to point here instead of your nginx/script endpoint.
  3. Start node agents — run herd-node on each Ollama machine. Herd discovers them via mDNS — no need to maintain your list of backend IPs.

You can run Herd alongside your existing setup during transition. Once you've verified it works, remove the old scripts. Your health checks, retry logic, and load balancing are now handled automatically — and they're better than what you built, because Herd has 7-signal scoring and 480 tests behind it.

Frequently Asked Questions

How long does it take to build a DIY Ollama load balancer?

Basic round-robin takes 30 minutes. Adding health checks takes 1–3 days. Model-aware routing takes 1–2 weeks. Reaching feature parity with Herd (scoring, thermal awareness, multimodal routing, dashboard, retries) takes 3–6 months of weekends — and then you maintain it forever.

Can I start with DIY and switch to Herd later?

Yes. Herd uses the standard Ollama and OpenAI APIs, so any client pointing at your DIY proxy can switch to Herd by changing the endpoint URL. Your Ollama installations and models stay exactly the same.

Does Herd handle everything nginx does?

For Ollama routing, yes and more. Herd handles load balancing, health checks, retries, and model-aware routing — plus scoring intelligence, thermal awareness, and multimodal support that nginx cannot provide. For general web serving or reverse proxy needs unrelated to Ollama, nginx remains the right tool.

What if I need custom routing logic?

Herd covers the vast majority of routing scenarios through its 7-signal scoring and configuration options. If you need deeply custom logic — like routing based on request content analysis or integration with an internal scheduling system — DIY may still make sense. But evaluate whether Herd's configuration handles your needs before building from scratch.

Is the DIY approach viable for a learning exercise?

Absolutely. Building a load balancer teaches valuable lessons about distributed systems, health checking, and failure modes. Build it, learn from it, then decide whether maintaining it long-term is worth the ongoing effort compared to a 2-minute Herd install.
