Quickstart — Ollama Herd Guides

Prerequisites

Two or more machines on the same local network
Ollama installed on each machine
At least one model pulled (e.g., ollama pull llama3.2:3b)
Python 3.10+ on the router machine, and on each node machine if you install the node agent with pip as in Step 3

Step 1: Install Ollama Herd

On the machine you want to use as the router (typically your most powerful device):

pip install ollama-herd

Or with Homebrew (macOS/Linux):

brew tap geeks-accelerator/ollama-herd
brew install ollama-herd

Step 2: Start the Router

herd

The router starts on port 11435. You'll see:

Ollama Herd ready on port 11435

Step 3: Start Node Agents

On each device running Ollama (including the router machine if it also runs Ollama):

pip install ollama-herd
herd-node

The node discovers the router automatically via mDNS:

Discovered router at 10.0.0.100:11435
Heartbeat sent: 2 models loaded, 128GB available

Can't use mDNS? Connect directly: herd-node --router-url http://10.0.0.100:11435

Step 4: Verify the Fleet

Check that nodes are online:

curl -s http://localhost:11435/fleet/status | python3 -m json.tool

You should see your nodes listed with their models, memory, and status.
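For scripted setups, the same check can be done from Python. The sketch below polls the /fleet/status endpoint until it answers with JSON or the attempts run out. The helper names (fetch_json, wait_for_fleet), the retry count, and the delay are illustrative choices rather than part of Ollama Herd, and the response is returned as-is because its schema is not described in this guide.

```python
import json
import time
import urllib.request

STATUS_URL = "http://localhost:11435/fleet/status"  # endpoint from Step 4


def fetch_json(url, timeout=3.0):
    """GET a URL and decode the response body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_fleet(url=STATUS_URL, attempts=10, delay=1.0,
                   fetch=fetch_json, sleep=time.sleep):
    """Poll the status endpoint until it answers; return the parsed JSON.

    Hypothetical helper: the retry count and delay are arbitrary defaults.
    Raises TimeoutError if the router never responds.
    """
    for attempt in range(attempts):
        try:
            return fetch(url)
        except (OSError, json.JSONDecodeError):
            # Connection refused, timeout, or a non-JSON body: retry.
            if attempt < attempts - 1:
                sleep(delay)
    raise TimeoutError(f"no response from {url} after {attempts} attempts")


# Usage (needs a running router):
#   print(json.dumps(wait_for_fleet(), indent=2))
```

Passing fetch and sleep as parameters keeps the retry loop testable without a live router.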

Or open the dashboard in your browser:

http://localhost:11435/dashboard

Step 5: Send Your First Request

OpenAI format:

curl http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Hello from the fleet!"}],
    "stream": false
  }'

Ollama format:

curl http://localhost:11435/api/chat -d '{
  "model": "llama3.2:3b",
  "messages": [{"role": "user", "content": "Hello from the fleet!"}],
  "stream": false
}'

The router scores all available nodes and routes the request to the best one. Check the response headers to see which node handled it:

curl -v http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b", "messages": [{"role": "user", "content": "Which node am I on?"}], "stream": false}' \
  2>&1 | grep -i x-fleet

< X-Fleet-Node: mac-studio-ultra
< X-Fleet-Score: 85
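The same lookup can be scripted. The sketch below sends the OpenAI-format request with Python's standard library and collects every response header whose name starts with X-Fleet-, matching case-insensitively because HTTP header names are not case-sensitive. The helper names (fleet_headers, ask) are illustrative. Only X-Fleet-Node and X-Fleet-Score appear in the output above, so the code makes no assumption about other headers.

```python
import json
import urllib.request

ROUTER = "http://localhost:11435"  # replace with the router's address if remote


def fleet_headers(headers):
    """Return the X-Fleet-* entries from an iterable of (name, value) pairs."""
    return {
        name: value
        for name, value in headers
        if name.lower().startswith("x-fleet-")
    }


def ask(prompt, model="llama3.2:3b", base_url=ROUTER):
    """POST one non-streaming chat request; return (reply JSON, X-Fleet headers)."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read()), fleet_headers(resp.getheaders())


# Usage (needs a running router):
#   reply, routing = ask("Which node am I on?")
#   print(routing)
```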

Step 6: Use with Your Tools

Point any OpenAI-compatible tool at the router — no code changes needed:

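from openai import OpenAI

# The client requires an api_key argument; the router does not need a real key.
client = OpenAI(base_url="http://localhost:11435/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
# Print tokens as they arrive; delta.content can be None on some chunks.
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()  # end the streamed output with a newline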

Replace localhost with the router's LAN IP if connecting from another machine (for example, http://10.0.0.100:11435/v1 for the router address used in Step 3).

What Just Happened

Your node agents discovered the router via mDNS (or connected directly if you passed --router-url)
Each node sends heartbeats every 5 seconds with system state (CPU, memory, thermal, loaded models)
Your request hit the router, which scored all nodes on 7 signals
The highest-scoring node received the request through its dedicated queue
The response streamed back through the router to your client
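The score-then-route step can be pictured with a toy example. The sketch below is not Ollama Herd's scoring: the router uses 7 signals that this guide does not list, so the four inputs (borrowed from the heartbeat fields mentioned above), their weights, and every name in the code are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Heartbeat:
    """Hypothetical snapshot of the state a node reports."""
    node: str
    cpu_load: float            # 0.0 (idle) to 1.0 (saturated)
    free_memory_gb: float
    thermal_throttled: bool
    loaded_models: set = field(default_factory=set)


def score(hb, model, needed_gb):
    """Toy score with made-up weights; higher is better, None means ineligible."""
    loaded = model in hb.loaded_models
    if not loaded and hb.free_memory_gb < needed_gb:
        return None  # the node has no room to load the model
    points = 50.0 if loaded else 0.0                        # loaded model: no load delay
    points += 30.0 * (1.0 - hb.cpu_load)                    # prefer idle CPUs
    points += min(hb.free_memory_gb, 64.0) / 64.0 * 10.0    # prefer headroom, capped
    points += 0.0 if hb.thermal_throttled else 10.0         # avoid throttled machines
    return points


def pick_node(heartbeats, model, needed_gb):
    """Return the name of the highest-scoring eligible node, or None."""
    scored = [(score(hb, model, needed_gb), hb.node) for hb in heartbeats]
    eligible = [(points, name) for points, name in scored if points is not None]
    return max(eligible)[1] if eligible else None


# Usage with invented heartbeats:
#   fleet = [Heartbeat("mac-studio-ultra", 0.2, 128.0, False, {"llama3.2:3b"}),
#            Heartbeat("macbook-air", 0.1, 4.0, False)]
#   pick_node(fleet, "llama3.2:3b", needed_gb=3.0)
```

The point of the sketch is the shape of the decision: each heartbeat becomes a number, ineligible nodes are dropped, and the request goes to the maximum.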

Next Steps

Core Concepts — Understand the mental model behind scoring, queues, and capacity
Integrations — Connect Open WebUI, LangChain, CrewAI, and other tools
Deployment — Production setup, monitoring, and tuning
Dashboard — Open http://localhost:11435/dashboard to see your fleet in real time

Upgrading

pip install --upgrade ollama-herd
# or: brew upgrade ollama-herd

Restart the router and node agents after upgrading. See CHANGELOG for what's new.

Coming From Another Tool?

See how Ollama Herd compares:

Coming from single Ollama — what changes and what stays the same
Using Open WebUI? — point it at Herd for intelligent routing
Switching from cloud APIs? — the economics of local fleet inference
Compared to exo — fleet routing vs model sharding