Features — Ollama Herd

Intelligent Routing

7-Signal Scoring Engine

Every request is scored across seven signals to find the optimal node. This isn't round-robin or random selection — it's a weighted decision that considers the physical reality of each device.

Signal	What It Measures
Model thermal state	A model already loaded in GPU memory (hot) gets +50 points. Cold-loading a 40GB model takes 15–30 seconds — the router avoids it whenever possible.
Memory fit	Not just "is there enough RAM?" but how comfortably the model fits given current utilization and the node's dynamic memory ceiling.
Queue depth	A hot model on a saturated node loses to a warm model on an empty node. Load spreads naturally.
Estimated wait time	Uses real per-node, per-model latency history (p75) to estimate actual wait time. A queue of 3 on a fast model differs from a queue of 3 on a slow one.
Role affinity	Large models route to powerful machines. Small models route to lighter hardware, preserving big-machine capacity.
Availability trend	Is this device freeing up or getting busier right now? Prevents sending a long request to a machine whose owner just sat down.
Context fit	Can this node handle the requested context size without triggering a model reload?

Model Fallbacks

Clients can specify backup models. If the primary model isn't available anywhere in the fleet, the router tries alternatives in order — same scoring pipeline, just a different model.

Auto-Retry

If a node fails before the first response chunk is sent, the router re-scores the remaining nodes and retries on the next-best option. Up to 2 retries. Clients never see the failure.

Context Protection

Strips unnecessary num_ctx from requests to prevent Ollama model reload hangs. Auto-upgrades to a larger loaded model in the same category when the requested model is cold but a compatible one is hot.

Thinking Model Support

Auto-detects chain-of-thought models (DeepSeek-R1, QwQ, Phi-4-Reasoning, GPT-OSS) and inflates num_predict by 4× to accommodate thinking tokens. Diagnostic headers (X-Thinking-Tokens, X-Output-Tokens, X-Budget-Used, X-Done-Reason) let clients see exactly how the token budget was spent.

Device-Aware Scoring (v0.6.0)

Every node's chip and memory bandwidth flow through the heartbeat into the scoring pipeline. Role affinity scales continuously with bandwidth instead of flat memory tiers — an M3 Ultra at 800 GB/s scores +25, an M4 Max at 546 GB/s scores +18, an M3 Pro at 150 GB/s scores +8.75. Queue-depth penalty normalizes by each node's bandwidth share of the fleet median, so a queue of 4 on a 4×-faster node is treated like a queue of 1. Expected steady-state load distribution equals each node's bandwidth share of the fleet total.

How does this compare? No other local routing tool offers 7-signal hardware-aware scoring. See how Herd's routing compares to exo, GPUStack, and vLLM.

Smart Benchmark

Multimodal Benchmark Suite

Benchmark your entire fleet across all five model types — LLMs, embeddings, image generation, speech-to-text, and vision — in a single run. Smart mode auto-discovers fleet capabilities and selects an optimal model mix to fill available memory.

Two Benchmark Modes

Default mode benchmarks whatever models are currently loaded — quick sanity check. Smart mode analyzes fleet hardware, pulls recommended models, and runs a comprehensive benchmark that fills available memory with an optimal mix. Duration, concurrency, and model types are all configurable.

Per-Model and Per-Node Charts

Four new benchmark visualizations: per-model latency and throughput, per-model success rates, per-node concurrency utilization, and an overall timeline. Results persist in SQLite — compare runs over time.

Dynamic Context Optimization

Context Usage Tracking

Every request's actual token usage (prompt + completion) is tracked. The context optimizer computes p50, p75, p95, p99, and max distributions per model — revealing that most models use under 5% of their allocated context window.

Automatic Right-Sizing

Three-phase optimization: observe actual usage, recommend optimal context sizes, then auto-adjust. A model allocated 131K context but using only 5K at p99? Herd recommends 16K — saving 50GB+ of VRAM that can be used to load additional models.

Context Usage API

The /dashboard/api/context-usage endpoint shows per-model utilization percentage, recommended context size, and potential memory savings. The health engine warns when allocated context exceeds actual usage by 4× or more.

Zero-Config Discovery

mDNS Auto-Discovery

Run herd-node on any device on the same network. It finds the router automatically via mDNS (Bonjour/Avahi). No IP addresses to configure, no config files to maintain, no DNS entries to manage.

Heartbeat-Based Health

Each node sends heartbeats every 5 seconds with full system state: CPU, memory, GPU utilization, thermal state, loaded models, disk space, Ollama version. The router knows the exact state of every device in real time.

LAN Proxy

The node agent automatically bridges LAN traffic to localhost Ollama. Other devices can reach each node's Ollama through the fleet without manual port forwarding.

Adaptive Learning

Capacity Learner

A 168-slot behavioral model (one slot per hour of the week) learns each device's availability patterns. After a few weeks, the router knows your MacBook is busy Tuesday mornings and your Mac Studio is always available. Routing decisions reflect these patterns.

Meeting Detection (macOS)

Detects active cameras and microphones and hard-pauses the node. No inference competes with your video calls. The node resumes automatically when the meeting ends.

App Fingerprinting

Classifies the current workload on each device (idle / light / moderate / heavy / intensive) using CPU, memory, and network patterns — without reading app names or window titles. Heavy workloads reduce the node's memory ceiling, shifting requests to other machines.

Latency Tables

Per-node, per-model response times tracked in SQLite. The scoring engine uses historical latency to estimate wait times accurately. A node that's consistently slow for a particular model gradually gets fewer requests for that model.

Queue Management

Per Node:Model Queues

Each node+model pair has its own queue with dynamic concurrency. The router knows how many parallel requests each device can handle without degrading performance.

Holding Queue

When all nodes are at capacity, requests wait in a holding queue instead of failing. The router retries scoring every 5 seconds as node states change.

Pre-Warming

When a primary node's queue gets deep, the router proactively loads the same model on the runner-up node. The next request hits a hot model instead of waiting.

Background Rebalancer

Runs every 5 seconds, moving queued requests from overloaded nodes to nodes with spare capacity — but only where the model is already loaded.

Zombie Reaper

Detects and cleans up stuck in-flight requests that never completed. Keeps queues accurate.

Backends

Ollama — the default runtime

Every node that runs herd-node needs Ollama. The router speaks Ollama's native API for chat, embeddings, model pulling, and lifecycle management. All mainstream GGUF models work out of the box — Gemma, Qwen, DeepSeek, Llama, Phi, GPT-OSS, hundreds more.

MLX — first-class alongside Ollama (v0.6.0)

Apple Silicon nodes can optionally spawn one or more mlx_lm.server processes alongside Ollama. Useful for MLX-specific models (Qwen3-Coder-Next MoE, Qwen3-Coder-30B-A3B-Instruct) and for running a dedicated compactor model side-by-side with the main coding model without Ollama eviction risk.

Multi-MLX-Server per Node (v0.6.0)

Configure N MLX servers on N ports via FLEET_NODE_MLX_SERVERS. Each runs in its own process with independent logs. Memory-pressure startup gate estimates weight size from the HuggingFace disk cache and refuses to spawn when the total (model + headroom) won't fit — surfaces the skip reason on the dashboard instead of failing silently.

Multi-Node MLX Aggregation (v0.6.0)

Set FLEET_NODE_MLX_BIND_HOST=0.0.0.0 to expose MLX servers on the LAN. The router walks every online node and routes each MLX request to a healthy server hosting the requested model. Per-URL httpx.AsyncClient cache isolation prevents a slow server from back-pressuring a fast one.

Native Text Embedding Server (v0.7.0)

A dedicated FastAPI server on port 11439 runs nomic-embed-text via fastembed + ONNX Runtime — no PyTorch, no Ollama. The router intercepts /api/embed calls for nomic-embed-text before they reach Ollama and proxies to the best node's text embedding server. Zero contention with LLM inference slots — the structural fix for an embed-timeout incident where a running 120B inference held both Ollama parallel slots and embedding requests queued indefinitely despite 14% CPU and 291 GB free RAM. Weights download automatically on first request (130 MB Q-quantized int8) and cache in ~/.fleet-manager/models/text-embedding/. Enable via uv sync --extra embedding — same flag that enables vision embeddings.

Vision Embedding Service

A separate vision-embedding service exposes DINOv2, SigLIP2, and CLIP via ONNX Runtime. The dashboard shows availability per node and the health engine fires vision_backend_missing when weights are cached on disk but onnxruntime isn't loadable — closes the "chip says available but /embed-image returns 500" footgun.

Multimodal Support

LLM Inference

Full support for chat completions and text generation. Both streaming and non-streaming. OpenAI and Ollama API formats.

Embeddings

Route embedding requests to the node with the embedding model loaded. Supports /api/embed, /api/embeddings, and /v1/embeddings. nomic-embed-text is intercepted before reaching Ollama and routed to the dedicated native fastembed server (port 11439) for zero-LLM-contention embedding traffic. Other embedding models still route to Ollama as a fallback.

Image Generation

Routes image generation requests to Apple Silicon nodes running mflux (FLUX models) or DiffusionKit. Supports FLUX Schnell, FLUX Dev, Stable Diffusion 3, and Ollama native image models. OpenAI-compatible /v1/images/generations endpoint included.

Speech-to-Text

Routes transcription requests to nodes with MLX and Qwen3-ASR installed. Apple Silicon only.

Model Pulling

Pull models onto fleet nodes through the router. Auto-selects the node with the most available memory, or target a specific node. Streams progress in real time.

Real-Time Dashboard

A web dashboard at /dashboard with eight tabs:

Fleet Overview — Live node cards, queue depths, request counts via Server-Sent Events
Node Models — (renamed from "Request Queues" in v0.7.0) Per-model cards for every backend on every node — Ollama (grey), MLX (purple), native fastembed (green), vision embedding (cyan). Instant-response backends show 24h completed/failed counts and avg latency from a 60s-TTL trace cache; an amber "DL ON DEMAND" chip surfaces models whose weights aren't yet cached.
Trends — Requests per hour, average latency, token throughput charts (24h–7d)
Model Insights — Per-model latency, tokens/sec, usage comparison
Tags — Per-tag analytics with request volume, latency, tokens, error rates
Benchmarks — Capacity growth over time with per-run throughput and latency percentiles
Health — 30+ automated health checks with severity levels including context waste detection, MLX server monitoring, vision-backend availability, and text-embedding backend health
Recommendations — AI-powered model mix recommendations per node
Settings — Runtime toggles, config overview, node version tracking

No external dependencies. No build process. Opens in any browser.

Fleet Overview tab of the Ollama Herd dashboard showing two node cards. Left card: Neons-Mac-Studio online, CPU 63.8%, Memory 489 GB / 512 GB, 32 cores, with 5 models loaded (gpt-oss:120b, qwen3:8b, gemma3:27b via Ollama; Qwen3-Coder-Next-4bit and Qwen3-Coder-30B-A3B-Instruct-4bit via MLX) plus image generation, speech-to-text, embedding, and vision services. Right card: Twin2-Macbook-Pro-M4 online, CPU 47.3%, Memory 61.4 GB / 128 GB, 16 cores, with gemma3:4b, gemma3:27b, qwen3-coder:30b-agent, and nomic-embed-text loaded. — Fleet Overview tab. Each node surfaces CPU, memory, core count, hot-loaded models with backend + quant + context window, and the full set of services (image gen, STT, embeddings, vision) available for routing.

Health Monitoring

30+ Automated Health Checks

The health engine continuously monitors fleet liveness, routing quality, backend reliability, and observability. Each check carries a severity (INFO / WARNING / CRITICAL) and an actionable recommendation. Highlights:

Category	What it catches
Fleet liveness	Offline nodes, degraded nodes, memory pressure (OS-reported, not just %), underutilized nodes
Routing quality	VRAM fallbacks (cross-category escalates to ERROR with QUALITY RISK note), model thrashing, request timeouts, retry rates, context waste detection
MLX backend	Server down (CRITICAL), server quarantined (crash-loop containment after 5 crashes/5min), memory-blocked (skipped start due to memory gate)
Text embedding backend	Embed error rate (WARN at ≥5/hr, CRIT at ≥25/hr), text-embedding backend missing (nomic weights cached but `fastembed` not installed), text-embedding Ollama bypass (native server not running so embed requests still contend for LLM inference slots), nomic still loaded in Ollama despite native server handling traffic (VRAM waste signal)
Vision backend	Backend missing (weights cached but onnxruntime not loadable) — closes the "chip says available but `/embed-image` returns 500" footgun
Observability	Trace-store write failures (closes a silent SQLite-contention black hole), version mismatch, KV-cache bloat, zombie reaper activity
Stream + client integrity	Client disconnects, incomplete streams, context protection events

Each check has a severity level and actionable recommendation. Available via the dashboard and the /dashboard/api/health endpoint.

API Compatibility

OpenAI Format

Endpoint	Purpose
`POST /v1/chat/completions`	Chat completions (streaming + non-streaming)
`GET /v1/models`	List available models
`POST /v1/embeddings`	Generate embeddings
`POST /v1/images/generations`	Generate images

Ollama Format

Endpoint	Purpose
`POST /api/chat`	Chat completions
`POST /api/generate`	Text generation
`POST /api/pull`	Pull models onto fleet nodes
`GET /api/tags`	List all models
`GET /api/ps`	List loaded models
`POST /api/embed`	Generate embeddings
`POST /api/embeddings`	Generate embeddings (alternative)

Anthropic Messages Format (v0.6.0)

Point Claude Code CLI at your herd with one env var: ANTHROPIC_BASE_URL=http://localhost:11435. The router translates Anthropic Messages ↔ Ollama format transparently, including tool-use, streaming SSE events, and count_tokens. No LiteLLM sidecar.

Endpoint	Purpose
`POST /v1/messages`	Anthropic Messages API (streaming + non-streaming)
`POST /v1/messages/count_tokens`	Token budget pre-flight (tiktoken estimate)

See the Claude Code integration guide for model mapping, tier routing (claude-haiku-* vs claude-sonnet-* / claude-opus-*), and the four stability knobs for long sessions.

Claude Code Reliability Layer (v0.6.0)

Three-Layer Context Management

Mirrors hosted Claude's behavior so local Qwen3-Coder sessions don't fall apart past 30K tokens.

Layer 1 — mechanical tool-result clearing. When the Anthropic request exceeds FLEET_ANTHROPIC_AUTO_CLEAR_TOOL_USES_TRIGGER_TOKENS (default 100K), older tool_result blocks are replaced with a short placeholder before the request reaches the model. No LLM call, microsecond-scale. Real Claude Code session verified: first fire reclaimed 81K tokens (206K → 125K, 60.8% reduction).
Layer 2 — LLM-based compactor with dynamic curator selection. Summary work goes to whatever capable model is already hot and idle rather than cold-loading a default. Cache key deliberately excludes curator_model so MLX prefix-cache bytes stay stable across curator-selection events.
Layer 3 — pre-inference 413 cap. If the prompt is still oversized after clearing + compaction, the route returns HTTP 413 with a run /compact and resubmit message before the request reaches the model — no multi-minute MLX prefill wedge.

Tool-Schema Fixup

Claude Code's 27-tool schema has heavy optional-param usage (Grep has 13 optional params); llama.cpp#20164 documents that Qwen3-Coder starts silently dropping optional params at ~30K tokens and loops tool calls with a field consistently missing. Herd promotes optional params with known-safe defaults (Bash.timeout=120000, Grep.head_limit=250, Read.offset=0) to required-with-default in the outbound schema.

Tool-Call JSON Repair

Local coding models occasionally emit tool_use.input with minor syntax errors. A repair cascade tries strict parse → json-repair → a 4-pattern regex catalog adapted from open-source proxies → pass-through. Every repair attempt is schema-validated before substitution — never hides real failures silently. Per-model repair counters exposed on /fleet/queue.

MLX Wall-Clock Timeout

FLEET_MLX_WALL_CLOCK_TIMEOUT_S (default 300s) catches wedged-request syndrome where mlx_lm.server keeps emitting tokens slowly but never stops. On timeout, the slot is released and the route returns 413 with the /compact hint. No silent server-side retry — client owns the decision of whether to resubmit.

Warm-Prompt Preload

After mlx_lm.server passes its health check, a fire-and-forget 1-token request primes the prompt cache with the system-prompt prefix. Based on measured 1.3–2.25× TTFT improvement on the first real request.

Why does Claude Code break at 30K tokens on local models? See the full research in Claude Code integration guide and the why-claude-code-degrades-at-30k analysis.

Fleet Management

Endpoint	Purpose
`GET /fleet/status`	Full fleet state
`GET /fleet/queue`	Lightweight queue depths

Works with Open WebUI, LangChain, CrewAI, AutoGen, Aider, Continue.dev, LlamaIndex, LiteLLM, and any OpenAI-compatible client. Just change the base URL.

Request Tagging

Tag requests with an app identifier to get per-tag analytics. Add X-Herd-Tags: my-app to any request and the dashboard breaks down usage by app — request volume, latency, tokens, error rates. See which tools consume the most fleet resources.

Platform Support

Feature	macOS	Linux	Windows
LLM routing, scoring, queues	Yes	Yes	Yes
Embeddings	Yes	Yes	Yes
mDNS auto-discovery	Yes	Yes	Yes
Dashboard & traces	Yes	Yes	Yes
Image gen (mflux, DiffusionKit)	Apple Silicon	—	—
Image gen (Ollama native)	Yes	Yes	Yes
Speech-to-text (MLX)	Apple Silicon	—	—
Meeting detection	Yes	—	—
Memory pressure detection	Yes	Yes	—

Core routing works identically on all platforms. macOS-only features degrade gracefully on other OSes.

Configuration

All settings via environment variables with FLEET_ prefix (server) or FLEET_NODE_ prefix (node). 44+ configuration options covering scoring weights, queue behavior, retry limits, heartbeat intervals, and more. Sensible defaults mean you don't need to touch any of them to get started.

See How We Compare

Honest, detailed comparisons with feature tables, pros/cons, and when to choose each tool:

Ollama Herd vs Single Ollama — when to upgrade from one machine
Ollama Herd vs exo — fleet routing vs model sharding
Ollama Herd vs vLLM — Apple Silicon fleet vs GPU serving
Ollama Herd vs Open WebUI — routing engine vs chat interface
Ollama Herd vs Cloud APIs — local fleet vs per-token pricing
View all 13 comparisons →

Everything Ollama Herd Does

Intelligent Routing

7-Signal Scoring Engine

Model Fallbacks

Auto-Retry

Context Protection

Thinking Model Support

Device-Aware Scoring (v0.6.0)

Smart Benchmark

Multimodal Benchmark Suite

Two Benchmark Modes

Per-Model and Per-Node Charts

Dynamic Context Optimization

Context Usage Tracking

Automatic Right-Sizing

Context Usage API

Zero-Config Discovery

mDNS Auto-Discovery

Heartbeat-Based Health

LAN Proxy

Adaptive Learning

Capacity Learner

Meeting Detection (macOS)

App Fingerprinting

Latency Tables

Queue Management

Per Node:Model Queues

Holding Queue

Pre-Warming

Background Rebalancer

Zombie Reaper

Backends

Ollama — the default runtime

MLX — first-class alongside Ollama (v0.6.0)

Multi-MLX-Server per Node (v0.6.0)

Multi-Node MLX Aggregation (v0.6.0)

Native Text Embedding Server (v0.7.0)

Vision Embedding Service

Multimodal Support

LLM Inference

Embeddings

Image Generation

Speech-to-Text

Model Pulling

Real-Time Dashboard

Health Monitoring

30+ Automated Health Checks

API Compatibility

OpenAI Format

Ollama Format

Anthropic Messages Format (v0.6.0)

Claude Code Reliability Layer (v0.6.0)

Three-Layer Context Management

Tool-Schema Fixup

Tool-Call JSON Repair

MLX Wall-Clock Timeout

Warm-Prompt Preload

Fleet Management

Request Tagging

Platform Support

Configuration

See How We Compare