Deployment Guide — Ollama Herd

Architecture Decisions

Before deploying, decide:

Which machine runs the router? Pick a machine that's always on. The router is lightweight (minimal CPU/memory) — it doesn't run inference, just coordinates. Your most powerful machine is usually the best choice since it's likely always on and also runs Ollama.

Which devices enable capacity learning? Dedicated servers (Mac Studio, Linux boxes) should leave it disabled — they always run at full capacity. Laptops and shared devices should enable it so routing adapts to usage patterns.

Which devices run which models? The router handles this dynamically, but you control which models are pulled to each node. Large models go on high-RAM machines. Small fast models go on everything else. Embedding models go on one or two nodes.

Starting the Fleet

Router:

pip install ollama-herd
herd

Each node:

pip install ollama-herd
herd-node

For nodes that double as workstations:

FLEET_NODE_ENABLE_CAPACITY_LEARNING=true herd-node

Running as Background Services

macOS (launchd):

# Router
nohup herd &>/dev/null & disown

# Node
nohup herd-node &>/dev/null & disown

Or create a ~/Library/LaunchAgents/com.ollama-herd.router.plist for automatic startup.

Linux (systemd):

# /etc/systemd/system/ollama-herd.service
[Unit]
Description=Ollama Herd Router
After=network.target

[Service]
ExecStart=/usr/local/bin/herd
Restart=always
User=your-user

[Install]
WantedBy=multi-user.target

sudo systemctl enable --now ollama-herd

Monitoring

Dashboard

Open http://router-ip:11435/dashboard for real-time fleet monitoring with 8 tabs:

Fleet Overview — Live node cards, queue depths, request counts
Trends — Requests/hour, latency, token throughput (24h–7d)
Model Insights — Per-model performance comparison
Tags — Per-tag analytics (requires request tagging)
Benchmarks — Capacity growth over time
Health — 30+ automated health checks
Recommendations — AI-powered model mix suggestions
Settings — Runtime toggles and node versions

Health API

curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool

Returns 30+ automated checks covering fleet liveness, routing quality, backend reliability, and observability. Each check carries a severity (INFO / WARNING / CRITICAL) and an actionable recommendation. Highlights:

Fleet liveness — offline nodes, degraded nodes, memory pressure (OS-reported), underutilized nodes
Routing quality — VRAM fallbacks (cross-category escalates to ERROR), model thrashing, request timeouts, retry rates, context waste detection
MLX backend — server down (CRITICAL), server quarantined (after 5 crashes in 5 minutes), memory-blocked (skipped start due to memory gate)
Text embedding backend — embed error rate, text-embedding backend missing (fastembed not installed), text-embedding Ollama bypass (native server not running — embed requests still contend with LLMs), nomic still loaded in Ollama despite native server traffic (VRAM waste)
Vision backend — backend missing (weights cached but onnxruntime not loadable)
Observability — trace-store write failures (closes a silent SQLite-contention black hole), version mismatch, KV-cache bloat, zombie reaper activity
Stream + client integrity — client disconnects, incomplete streams, context protection events

The dashboard Health tab renders every check with current status and recommended remediation. For per-model context window utilization, the /dashboard/api/context-usage endpoint shows allocated vs. actual usage and recommends right-sized values to reclaim wasted VRAM.

Fleet Status

curl -s http://localhost:11435/fleet/status | python3 -m json.tool

Returns per-node details: status, hardware, memory, CPU, loaded models, queue depths.

Queue Depths (Lightweight)

curl -s http://localhost:11435/fleet/queue | python3 -m json.tool

Returns just queue depths — designed for client-side backoff logic.

Log Analysis

Structured Logs (JSONL)

All events are written to ~/.fleet-manager/logs/herd.jsonl — one JSON object per line, daily rotation, 30-day retention.

# Tail the live log
tail -f ~/.fleet-manager/logs/herd.jsonl | python3 -m json.tool

# Find errors
grep '"level":"ERROR"' ~/.fleet-manager/logs/herd.jsonl

# Find events for a specific model
grep '"model":"llama3.3:70b"' ~/.fleet-manager/logs/herd.jsonl

# Count errors by component
grep '"level":"ERROR"' ~/.fleet-manager/logs/herd.jsonl | \
  python3 -c "import sys,json; from collections import Counter; \
  c=Counter(json.loads(l)['logger'] for l in sys.stdin); \
  print('\n'.join(f'{v:4d} {k}' for k,v in c.most_common()))"

Log Levels

Variable	Default	Controls
`FLEET_LOG_LEVEL`	`DEBUG`	What's written to JSONL
`FLEET_CONSOLE_LOG_LEVEL`	`INFO`	What's printed to terminal

Set FLEET_LOG_LEVEL=INFO in production to reduce file size.

Request Traces (SQLite)

Every routing decision is recorded in ~/.fleet-manager/latency.db:

# Recent requests
sqlite3 ~/.fleet-manager/latency.db \
  "SELECT model, node_id, latency_ms, status FROM request_traces ORDER BY timestamp DESC LIMIT 10"

# Failures
sqlite3 ~/.fleet-manager/latency.db \
  "SELECT model, node_id, error_message FROM request_traces WHERE status='failed' ORDER BY timestamp DESC LIMIT 10"

# Average latency per model
sqlite3 ~/.fleet-manager/latency.db \
  "SELECT model, ROUND(AVG(latency_ms)/1000.0, 1) as avg_secs, COUNT(*) as requests \
   FROM request_traces WHERE status='completed' GROUP BY model ORDER BY requests DESC"

# Error rate over the last hour
sqlite3 ~/.fleet-manager/latency.db \
  "SELECT status, COUNT(*) FROM request_traces \
   WHERE timestamp > strftime('%s','now') - 3600 GROUP BY status"

Resilience Features

Auto-Retry

If a node fails before the first chunk, the router re-scores and retries on the next-best node. Up to 2 retries (configurable via FLEET_MAX_RETRIES).

Model Fallbacks

Clients specify backup models: "fallback_models": ["qwen2.5:32b", "qwen2.5:7b"]. The router tries each in order through the full scoring pipeline.

Auto-Pull

Missing models are automatically pulled to the best available node. Configurable via FLEET_AUTO_PULL (default: true).

Context Protection

Strips unnecessary num_ctx from requests to prevent model reload hangs. Auto-upgrades to a larger loaded model when possible. Configurable via FLEET_CONTEXT_PROTECTION (default: strip).

Graceful Drain

Send SIGTERM to a node agent:

Capacity learner state saves to disk
Drain heartbeat sent to router
Router stops routing new requests to this node
In-flight requests complete normally
Pending requests rebalance to other nodes
Agent shuts down cleanly

Zombie Reaper

Background task detects in-flight requests that never completed (connection drops, Ollama crashes) and cleans them up so queues stay accurate.

Configuration

All settings via environment variables. No config files.

Key Server Variables

Variable	Default	What
`FLEET_PORT`	`11435`	Router listen port
`FLEET_HEARTBEAT_TIMEOUT`	`15.0`	Seconds before node is degraded
`FLEET_HEARTBEAT_OFFLINE`	`30.0`	Seconds before node is offline
`FLEET_MAX_RETRIES`	`2`	Max retry attempts per request
`FLEET_AUTO_PULL`	`true`	Auto-pull missing models
`FLEET_CONTEXT_PROTECTION`	`strip`	Context size protection mode
`FLEET_DYNAMIC_NUM_CTX`	`false`	Enable automatic context window optimization
`FLEET_LOG_LEVEL`	`DEBUG`	JSONL log level

Key Node Variables

Variable	Default	What
`FLEET_NODE_ENABLE_CAPACITY_LEARNING`	`false`	Enable adaptive capacity
`FLEET_NODE_DATA_DIR`	`~/.fleet-manager`	State file directory

See the full configuration reference for all 44+ variables with tuning guidance.

Ollama Settings

For best results with a fleet, set these Ollama environment variables on each node:

# In ~/.zshrc (macOS) or ~/.bashrc (Linux)
export OLLAMA_NUM_PARALLEL=2        # Allow 2 concurrent requests
export OLLAMA_KEEP_ALIVE=-1         # Never unload models
export OLLAMA_MAX_LOADED_MODELS=-1  # No limit on loaded models

KEEP_ALIVE=-1 prevents model thrashing. MAX_LOADED_MODELS=-1 lets Ollama manage memory naturally.

Data Storage

All persistent data lives in ~/.fleet-manager/ (configurable via FLEET_DATA_DIR):

~/.fleet-manager/
  latency.db                           # SQLite: traces, latency, usage, benchmarks
  logs/
    herd.jsonl                         # Structured logs (daily rotation)
  capacity-learner-{node-id}.json      # Learned behavioral data (per node)

SQLite uses WAL mode for concurrent read/write. All files are human-readable and can be backed up, queried, or deleted at will.

Next Steps

Routing Engine — Understanding and tuning scoring decisions
Adaptive Capacity — Configuring capacity learning per device
API Reference — All endpoints and response formats