The mental model behind Ollama Herd. Read this if terms like node, heartbeat, scoring signal, queue, or capacity mode are still fuzzy.
A node is any device running Ollama with the herd-node agent. Each node is sovereign — it runs its own Ollama, manages its own models, and works fine standalone. The router coordinates but never controls.
Nodes have three states:
| State | Meaning |
|---|---|
| Online | Heartbeating normally, available for routing |
| Degraded | Heartbeat delayed (>15s), still routable but penalized |
| Offline | No heartbeat for 30+ seconds, excluded from routing |
A node can also be paused (meeting detected, manual override, or very low availability score) — paused nodes are online but temporarily excluded from routing.
Every 5 seconds, each node agent sends a heartbeat to the router containing:
The router uses heartbeats to maintain a real-time picture of every device in the fleet. If a heartbeat is missed, the node's state degrades. Three misses and it's offline.
When a request arrives, the router scores every eligible node across 7 weighted signals:
| # | Signal | What It Answers | Weight |
|---|---|---|---|
| 1 | Model thermal state | Is the model already loaded in GPU memory? | Up to +50 |
| 2 | Memory fit | How comfortably does the model fit? | Up to +20 |
| 3 | Queue depth | How many requests are waiting? | Up to -30 |
| 4 | Estimated wait time | How long until this request starts? | Up to -25 |
| 5 | Role affinity | Does this machine match the model's weight class? | Up to +15 |
| 6 | Availability trend | Is this device getting busier or freeing up? | Up to +10 |
| 7 | Context fit | Can this node handle the requested context size? | Up to +10 |
The highest total score wins. A hot model on an idle, powerful machine with plenty of memory scores 80+. A cold model on a busy laptop with a rising workload scores under 20.
Each node + model pair has its own queue. When a request is routed to a node, it enters that node's queue for that model.
Queues have dynamic concurrency — the router calculates how many parallel requests each node can handle based on available memory and model size. A Mac Studio with 512GB can run 8 parallel requests. A MacBook with 16GB might handle 2.
Key queue behaviors:
Each node operates in a capacity mode that determines how much memory the router is allowed to use:
| Mode | Memory Ceiling | When |
|---|---|---|
| Full | 80% of total RAM | Dedicated server or high availability |
| Learned high | 50% (max 64GB) | Normal learned availability |
| Learned medium | 25% (max 32GB) | Moderate workload detected |
| Learned low | 12.5% (max 16GB) | Heavy workload detected |
| Paused | 0GB | Meeting, critical memory pressure, or manual pause |
| Bootstrap | 0GB | First 7 days of capacity learning (observation only) |
Dedicated servers (like a Mac Studio) skip capacity learning and always run in full mode. Laptops and shared devices use the capacity learner to dynamically adjust.
Ollama Herd routes four types of models:
| Type | What | Where |
|---|---|---|
| LLM | Chat completions, text generation | Any Ollama node |
| Embedding | Vector embeddings for RAG | Any Ollama node with embedding model |
| Image | Image generation (FLUX, Stable Diffusion) | Apple Silicon nodes with mflux/DiffusionKit |
| Speech-to-text | Audio transcription | Apple Silicon nodes with MLX + Qwen3-ASR |
A model can be in three thermal states on a given node:
| State | Meaning | Scoring Impact |
|---|---|---|
| Hot | Currently loaded in GPU memory | +50 points |
| Warm | On disk, loaded within last 30 minutes (likely OS-cached) | +30 points |
| Cold | On disk, not recently used | +10 points |
The router strongly prefers hot models — cold-loading a 40GB model takes 15–30 seconds.
When a requested model doesn't exist on any node and auto-pull is enabled (default), the router automatically pulls it onto the node with the most available memory. The client waits for the download, then gets served normally.
Clients can specify fallback models — backup models to try if the primary isn't available. The router tries each in order through the same scoring pipeline:
{
"model": "llama3.3:70b",
"fallback_models": ["qwen2.5:32b", "qwen2.5:7b"]
}
If a node fails before the first response chunk is sent, the router re-scores remaining nodes (excluding the failed one) and retries. Up to 2 retries. Clients never see the failure.
Any request can carry tags for per-app analytics:
{"metadata": {"tags": ["my-app", "production"]}}
Or via header: X-Herd-Tags: my-app, production
The dashboard breaks down usage, latency, and error rates per tag — so you can see which tools consume the most fleet resources.
Every routing decision is recorded as a trace in SQLite. Each trace includes: model, node, score, latency, tokens, retry/fallback status, and tags. You can query traces through the dashboard API or directly with SQLite:
sqlite3 ~/.fleet-manager/latency.db \
"SELECT model, node_id, latency_ms FROM request_traces ORDER BY timestamp DESC LIMIT 5"
A real-time web UI at /dashboard with 8 tabs: Fleet Overview, Trends, Model Insights, Apps, Benchmarks, Health, Recommendations, and Settings. Updated via Server-Sent Events — no polling, no page refresh.