v0.7.0 · Open Source · MIT Licensed ·

Turn idle Macs into an
AI compute fleet

Your spare Mac has 36GB of RAM doing nothing. Run DeepSeek-R1 70B on the Studio, FLUX image gen on the MacBook, Qwen3-ASR transcription on the Mini, all through one endpoint. Fix that.

Get Started → See how it works

# Install (or upgrade to v0.7.0), works on macOS, Linux, and Windows

pip install ollama-herd --upgrade

# macOS/Linux also: brew tap geeks-accelerator/ollama-herd && brew install ollama-herd

# Windows: pip works; uv is the fast path (curl -sSf https://astral.sh/uv/install.sh | sh, then `uv tool install ollama-herd`)

# Start the router (on your most powerful machine)

herd

# On each device in your fleet

herd-node

# That's it. Nodes auto-discover the router via mDNS.

Live dashboard

See your whole fleet at a glance

Every node, every model loaded, every image-gen / speech-to-text / embedding / vision service ready to route — one view. No ops console to build. No Grafana to wire up.

Ollama Herd dashboard showing two nodes side by side: a 512 GB Mac Studio running gpt-oss:120b and Qwen3-Coder models via Ollama and MLX, plus image generation (FLUX, SD3.5), speech-to-text, embeddings, and vision models; and a 128 GB MacBook Pro M4 running Gemma and Qwen3-Coder models. — Two real nodes in one fleet — a Mac Studio running `gpt-oss:120b` and MLX coding models, and a MacBook Pro M4 running Gemma. Both visible from one URL.

The Problem

You switched to local. Now you're stuck on one machine.

Sound familiar?

💰

Cloud API costs are bleeding you dry

You're running Aider, CrewAI, OpenClaw, or other AI agents. Cloud API bills hit hundreds a month and keep climbing. Every token costs money. Every request leaves your network.

💻

Local LLMs freed you — partially

You switched to Ollama on your Mac. Free, private, fast. But now you're constrained to a single device. Requests queue up behind each other. Larger models need more RAM than your laptop has. Agents stall waiting for inference.

⚡

Meanwhile, your other devices sit idle

Your Mac Studio with 256GB. Your old MacBook Air with 16GB. Your Mac Mini in the closet. All that memory and compute, doing nothing. Herd connects them all into one endpoint. Big models route to the machine with the most memory. Small models run on the lightweight device. Every machine contributes what it can.

Mac Studio 256GB → llama3.3:70b + FLUX image gen

MacBook Pro 36GB → qwen3.5:32b + Qwen3-ASR

MacBook Air 16GB → llama3.3:8b

Who it's for

Built for people with more Macs than they're using

If you have a few Apple Silicon machines and one of them is always the bottleneck, Herd is for you. Here are a few of the fleets people run it on.

💻

Solo developers

You've got a Studio, a laptop, and maybe a Mini in a drawer. Point Claude Code and your agents at one endpoint and let Herd pick the right machine every time.

🤖

Agent-heavy workflows

Aider, CrewAI, OpenClaw, and a dozen background agents all hitting local models at once. Herd spreads the load so nothing stalls waiting in a queue.

👥

Small teams and offices

Everyone points their tools at one URL. Herd handles contention, shows who's using what, and keeps the shared Mac Studio from melting down.

🎬

Creative studios

Image generation on one Mac, transcription on another, a big model on the Studio. Herd sends each kind of job to the machine that can actually do it.

🔬

Research labs

Mixed hardware, mixed workloads, shared fairly. Embeddings for RAG, vision models, and chat all routed by capability instead of guesswork.

🏠

Home lab enthusiasts

You already run Ollama on everything you own. Herd turns the pile into one smart cluster in two commands, with no Kubernetes in sight.

See detailed setups and example configs on the use cases page.

Features

Everything your fleet needs

Intelligent routing that gets smarter the longer it runs. Every component exists to serve one thing: getting the best response as fast as possible.

⚙

7-Signal Scoring Engine

Thermal state, memory fit, queue depth, latency history, role affinity, availability trend, and context fit. Every request goes to the best machine.

↻

Auto-Retry & Fallbacks

Transparent retry on node failure before the first chunk. Client-specified fallback models. Holding queue when all nodes are busy.

🔌

Zero-Config Discovery

mDNS auto-discovery. Nodes find the router on the LAN automatically. No config files, no service registries, no manual IP addresses.

📈

Real-Time Dashboard

8-tab live dashboard with SSE. Fleet overview, trends, model insights, per-tag analytics, benchmarks, health, recommendations, and settings. Multimodal type badges and per-node capability matrix. All backed by SQLite.

💡

Adaptive Capacity Learning

168-slot behavioral model learns each device's weekly patterns. Meeting detection pauses inference when you're on a call.

🔒

Multi-Protocol API

OpenAI-compatible endpoints for chat, images, and transcription. Plus native Ollama format. Drop-in replacement for any existing client, framework, or agent pipeline.

🎨

Multimodal Routing

Route LLMs, embeddings, image generation, speech-to-text, and vision across the fleet. Capability-aware — image requests only go to nodes with mflux, transcription only to nodes with Qwen3-ASR.

🧠

Thinking Model Support

Auto-detects chain-of-thought models like DeepSeek-R1 and inflates token budgets by 4×. Diagnostic headers show exactly how thinking tokens were spent.

📊

Smart Benchmark

Auto-discovers fleet capabilities, selects an optimal model mix to fill available memory, and benchmarks LLMs, embeddings, image gen, STT, and vision together. Per-model and per-node charts.

💫

Dynamic Context Optimization

Measures actual token usage per model, recommends optimal context sizes, and auto-adjusts to reclaim wasted VRAM. Most models use under 5% of allocated context — Herd fixes that.

Multimodal

Beyond text inference

One fleet, five model types. Every request routes to a node with the right capabilities.

💬

LLM

Chat, completion, reasoning. Smart routing by memory fit and model size.

Llama 3.3, Qwen 3.5, DeepSeek-V3

🔎

Embeddings

Vector search and RAG pipelines. Route to nodes with embedding models loaded.

nomic-embed-text, mxbai-embed

🎨

Image Generation

Text-to-image via FLUX. OpenAI-compatible endpoint. Routes to nodes with mflux installed and GPU capacity.

FLUX.1 Schnell, FLUX.1 Dev

🎤

Speech-to-Text

Audio transcription routed to capable nodes. OpenAI Whisper-compatible endpoint.

Qwen3-ASR

👁

Vision

Image understanding via multimodal models. Send images with prompts, get text descriptions and analysis.

Gemma3, LLaVA, Llama3.2-Vision

Workstation-aware

Your Macs aren't servers. Herd knows it.

A rack server is always ready. A Mac is somebody's actual computer. It joins meetings, runs builds, renders video, and throttles when it gets hot. Herd treats every machine as what it really is, a workstation with a life of its own, and it schedules around the person using it.

🎥

Meeting detection

Camera or mic goes live and Herd stops routing to that Mac until the call ends. It won't land inference on the laptop you're presenting from.

🖥

Foreground-app awareness

A heavy app in front, like Final Cut, Logic, or a big Xcode build, tells Herd the machine is busy with real work, so it steps back and routes elsewhere.

🌡️

Thermal and memory pressure

A throttling or memory-starved Mac drops down the ranking automatically, with no config and no babysitting.

📅

Learned weekly rhythm

Herd builds a 168-slot model of when each device is usually free, then quietly favors the machines that tend to be idle right now.

Every other local router treats your machines as interchangeable endpoints. Herd is the only one built to schedule around the people using them.

How It Works

Request flow

From client request to streamed response in milliseconds. Every step is traced, logged, and queryable.

Request arrives

Client hits any endpoint — chat completion, image generation, transcription, or embeddings. The request is normalized and routed by type.

Score & rank

The scoring engine eliminates unhealthy nodes, scores survivors on 7 signals, and selects the best. Fallback models are tried if the primary isn't available.

Queue & dispatch

The request enters a per-node:model queue with dynamic concurrency. The queue manager balances load and auto-rebalances if conditions change.

Stream & retry

The streaming proxy forwards to Ollama. If the node fails before the first chunk, auto-retry kicks in with a different node. Format conversion (SSE / NDJSON) is transparent.

Learn & trace

Every request is traced to SQLite. Latency data feeds back into the scoring engine. The fleet gets smarter with every request it serves.

Claude Code CLI

Point Claude Code at your own hardware

Ollama Herd speaks the Anthropic Messages API. One env var redirects Claude Code CLI to your local fleet — agentic coding with your models, your hardware, your data. No rate limits, no per-token bills, no prompts leaving your network.

# Point Claude Code at your herd router

export ANTHROPIC_BASE_URL=http://localhost:11435

export ANTHROPIC_AUTH_TOKEN=dummy

claude

Includes three-layer context management that fixes the "Claude Code breaks at 30K tokens" failure mode on local Qwen3-Coder models. Per-tier model routing maps claude-haiku-* to a fast Ollama model and claude-sonnet-* / claude-opus-* to an 80B MoE via MLX — configurable via FLEET_ANTHROPIC_MODEL_MAP.

Full integration guide →

Compatibility

Works with everything

One base_url change connects any framework. Ollama Herd is the orchestration layer, not a replacement.

Open WebUI

LangChain

CrewAI

OpenHands

AutoGen

Aider

Cline

Continue.dev

LlamaIndex

OpenClaw

LiteLLM

exo

And it works on top of the inference stack you already run. Herd routes to Ollama and MLX out of the box, and a distributed cluster like exo or Apple's MLX distributed stack can register as a single node in your fleet. Herd is the layer above your runtimes, not a replacement for any of them. Nothing to rip out, nothing to migrate.

Any client that supports a custom OpenAI, Ollama, or Anthropic Messages base URL works out of the box.
Beyond LLMs, it also routes image generation (FLUX via mflux) and speech-to-text (Qwen3-ASR) to capable nodes.

Wondering about LM Studio's LM Link? LM Link connects your Macs to each other. Ollama Herd routes across your whole team's mixed fleet (Mac, Linux, Windows, any Ollama or MLX node) with scoring, context management, and team admin. See the detailed comparison →

What's Next

The fleet that works while you sleep

Multimodal routing, smart benchmarking, and dynamic context optimization are shipped. Now we're building an agentic router — a fleet that doesn't just wait for requests, but generates its own work, learns your patterns, and uses idle compute proactively.

✓ Multimodal routing (LLM + Image + STT + Embeddings)

✓ Smart benchmark (multimodal)

✓ Dynamic context optimization

○ Video generation + TTS routing

○ Pattern-driven model pre-warming

○ Agentic task decomposition

Enterprise

Your Mac fleet is an untapped AI platform

500 MacBooks with Apple Silicon. Tens of terabytes of unified memory. Sitting idle during meetings, after hours, and weekends. Ollama Herd turns your existing hardware into a private AI compute platform — LLM inference, image generation, transcription, and embeddings — at zero additional cost.

SSO, RBAC, audit logging, compliance dashboards, fleet management, and SLA support. Everything enterprises need to run a full AI stack on the hardware they already own.

Additional hardware cost

58%

Enterprise employees now on Macs

96%

CIOs expect Mac fleet growth

50-70%

Savings vs cloud API costs

Turn idle Macs into anAI compute fleet