Use Cases

Who Uses Ollama Herd

Real scenarios showing how developers, teams, and enthusiasts turn multiple devices into one smart AI endpoint.

Solo Developer with 2+ Machines

The pain: You have a Mac Studio for heavy work and a MacBook for portability. When you're running Aider or Continue.dev on the MacBook, it heats up, fans spin, and inference slows down. Meanwhile the Mac Studio sits idle. You keep SSH-ing between machines or manually switching base URLs.

With Herd: Point all your tools at http://router-ip:11435. The Mac Studio handles the heavy models (70B+), the MacBook handles quick tasks (7B–14B). When the MacBook is in a Zoom call, requests automatically route to the Mac Studio. When you're at your desk with both machines free, they share the load.

Example Setup
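A minimal sketch of the two-machine setup, using only the commands from the quickstart below. `router-ip` is a placeholder, and the `OLLAMA_API_BASE` line assumes your tool (Aider reads this variable; other tools have an equivalent base-URL setting):

```shell
# On the router machine (either Mac works; the router itself is lightweight):
pip install ollama-herd
herd                # starts the router on :11435

# On the Mac Studio and the MacBook:
pip install ollama-herd
herd-node           # joins the fleet

# Point your tools at the router instead of a local Ollama:
export OLLAMA_API_BASE=http://router-ip:11435
```

For Continue.dev, set the model's `apiBase` to the same router URL in its config instead of the environment variable.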

Agent-Heavy Workflows

The pain: You're running CrewAI crews, LangChain chains, or OpenClaw agents that fire dozens of concurrent LLM requests. A single Ollama instance queues them all sequentially. A 5-agent crew that should take 2 minutes takes 10 because every request waits in line.

With Herd: Concurrent requests fan out across your fleet. Agent #1 goes to the Mac Studio, agent #2 goes to the MacBook, agent #3 goes to the Mac Mini. Because independent requests run on separate machines, throughput scales roughly linearly with the number of nodes. Auto-retry means agent failures don't crash the crew — the router re-routes to the next best node.

Example Setup
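A sketch of fanning a crew's prompts out through the router, assuming Herd exposes the standard Ollama `/api/generate` endpoint at the router URL. The router address and model name are placeholders:

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

HERD_URL = "http://router-ip:11435"  # the single endpoint for the whole fleet

def build_request(prompt: str, model: str = "llama3.1:8b") -> urllib.request.Request:
    """Build a standard Ollama /api/generate request aimed at the router."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return urllib.request.Request(
        f"{HERD_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )

def run_agent(prompt: str) -> str:
    """One agent's LLM call; the router picks the node."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["response"]

def run_crew(prompts: list[str]) -> list[str]:
    """Fire all prompts concurrently; the router spreads them across nodes."""
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        return list(pool.map(run_agent, prompts))
```

Against a single Ollama instance these calls would queue sequentially; against the router they land on different nodes and run in parallel.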

Small Team / Office

The pain: Your team has 4–5 Macs. Everyone runs Ollama locally, but nobody's machine is powerful enough for the big models. People share a "team Mac Studio" by manually coordinating who's using it. No visibility into who's queued where.

With Herd: One router, all machines as nodes. Everyone points their tools at the same URL. The router handles contention — no manual coordination. The dashboard shows who's using what, queue depths, and per-app analytics (via request tagging). The Mac Studio handles the big models, personal laptops handle lightweight tasks.

Example Setup

Home Lab Enthusiast

The pain: You've accumulated hardware — a Mac Mini, an older MacBook, maybe a Linux box with an NVIDIA GPU. You want a unified local AI setup but every tool assumes a single machine. Managing multiple Ollama instances manually is tedious.

With Herd: Every device joins the fleet automatically via mDNS. Mix and match platforms — macOS, Linux, Windows. The router knows each device's capabilities and routes accordingly. NVIDIA GPU boxes handle what they're good at, Apple Silicon handles the rest. Image generation routes to the Mac with mflux installed. Embeddings route to whichever node has the model loaded.

Example Setup
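An illustration of the capability-aware routing idea — matching a request to a node that can serve it. This mirrors the concept, not Herd's internal algorithm; the node names and capability labels below are made up:

```python
# Each node advertises what it can serve; the router consults this table.
NODES = [
    {"name": "mac-studio", "caps": {"llm-large", "llm-small", "embeddings"}},
    {"name": "macbook",    "caps": {"llm-small", "embeddings"}},
    {"name": "linux-gpu",  "caps": {"llm-small", "image"}},
    {"name": "mac-mini",   "caps": {"embeddings", "speech"}},
]

def pick_node(capability: str, nodes=NODES):
    """Return the first node advertising the capability, or None if no node can serve it."""
    for node in nodes:
        if capability in node["caps"]:
            return node
    return None
```

A real router would also weigh load and queue depth among the capable nodes, but the capability check is what lets an NVIDIA box and a Mac Mini coexist behind one URL.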

Multimodal AI Pipeline

The pain: You need LLM inference, embeddings for RAG, image generation, and speech-to-text. Each service runs on a different port, different machine, different API. Your application code is full of conditional routing logic.

With Herd: One endpoint handles all four model types. The router knows which nodes can serve which modality and routes accordingly. Your app talks to one URL for everything.

Example Setup
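A sketch of what "one URL for everything" looks like from the application side. `/api/chat` and `/api/embed` are standard Ollama routes; the image and speech paths below are hypothetical OpenAI-style placeholders — check Herd's docs for the real ones:

```python
HERD_URL = "http://router-ip:11435"

ROUTES = {
    "chat": "/api/chat",                   # LLM inference (standard Ollama route)
    "embeddings": "/api/embed",            # RAG embeddings (standard Ollama route)
    "image": "/v1/images/generations",     # hypothetical OpenAI-style path
    "speech": "/v1/audio/transcriptions",  # hypothetical OpenAI-style path
}

def url_for(modality: str, base: str = HERD_URL) -> str:
    """Every modality shares one host -- no conditional routing in app code."""
    return base + ROUTES[modality]
```

The conditional logic that used to live in your application (which port, which machine, which API) collapses into a lookup against a single base URL.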

Is This For You?

Herd is a great fit if:

- You have two or more machines that can run Ollama and want them to act as one endpoint
- You run agent frameworks (CrewAI, LangChain, etc.) that fire many concurrent requests
- Your team shares hardware and wants one URL instead of manual coordination
- You're mixing platforms or modalities: LLMs, embeddings, image generation, speech-to-text

Herd is probably overkill if:

- You have one machine and one user, and a single Ollama instance keeps up with your workload
- Your requests never queue; there's nothing for a router to balance

Getting started takes 60 seconds:

```shell
pip install ollama-herd
herd                    # on your router machine
herd-node               # on each device
```