Open Source · MIT Licensed

Turn idle Macs into an
AI compute fleet

Your spare Mac has 36GB of RAM doing nothing. Your main machine is bottlenecked running inference alone. Fix that.

Get Started → See how it works
# Install
pip install ollama-herd

# Start the router
herd

# On each device
herd-node

# That's it. Nodes discover the router via mDNS.
× Kubernetes
× Docker
× YAML
× Config files
× Cloud costs
× Manual load balancing
The Problem

You switched to local. Now you're stuck on one machine.

Sound familiar?

💰

Cloud API costs are bleeding you dry

You're running Aider, CrewAI, OpenClaw, or other AI agents. Cloud API bills run to hundreds of dollars a month and keep climbing. Every token costs money. Every request leaves your network.

💻

Local LLMs freed you — partially

You switched to Ollama on your Mac. Free, private, fast. But now you're constrained to a single device. Requests queue up behind each other. Larger models need more RAM than your laptop has. Agents stall waiting for inference.

Meanwhile, your other devices sit idle

Your Mac Studio with 96GB. Your old MacBook Air with 16GB. Your Mac Mini in the closet. All that memory and compute, doing nothing. Herd connects them all into one endpoint. Big models route to the machine with the most memory. Small models run on the lighter devices. Every machine contributes what it can.

Mac Studio 96GB llama3:70b
MacBook Pro 36GB qwen2.5:32b
MacBook Air 16GB llama3:8b
7
Scoring signals per request
212+
Tests passing in ~4s
2
Commands to deploy
0
Config files needed
Features

Everything your fleet needs

Intelligent routing that gets smarter the longer it runs. Every component exists to serve one thing: getting the best response as fast as possible.

7-Signal Scoring Engine

Thermal state, memory fit, queue depth, latency history, role affinity, availability trend, and context fit. Every request goes to the best machine.

Auto-Retry & Fallbacks

Transparent retry on node failure before the first chunk. Client-specified fallback models. Holding queue when all nodes are busy.

🔌

Zero-Config Discovery

mDNS auto-discovery. Nodes find the router on the LAN automatically. No config files, no service registries, no manual IP addresses.

📈

Real-Time Dashboard

5-tab live dashboard with SSE. Fleet overview, trends, model insights, per-app analytics, and benchmarks. All backed by SQLite.

💡

Adaptive Capacity Learning

A 168-slot behavioral model (one slot per hour of the week) learns each device's weekly patterns. Meeting detection pauses inference when you're on a call.
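As a rough sketch of how a 168-slot weekly model could work (the slot layout follows from 7 days × 24 hours; the exponential-moving-average update rule and the 0.5 prior are illustrative assumptions, not Herd's actual implementation):

```python
# Sketch of a 168-slot (7 days x 24 hours) availability model.
# The EMA smoothing and neutral 0.5 prior are assumptions for illustration.
from datetime import datetime


def slot_index(dt: datetime) -> int:
    """Map a timestamp to one of 168 hourly slots (Monday 00:00 = slot 0)."""
    return dt.weekday() * 24 + dt.hour


class CapacityModel:
    def __init__(self, alpha: float = 0.2):
        self.slots = [0.5] * 168   # prior: availability unknown
        self.alpha = alpha         # EMA smoothing factor

    def observe(self, dt: datetime, available: bool) -> None:
        """Blend a new availability observation into that hour's slot."""
        i = slot_index(dt)
        self.slots[i] += self.alpha * (float(available) - self.slots[i])

    def expected(self, dt: datetime) -> float:
        """Predicted availability for this hour of the week, in 0..1."""
        return self.slots[slot_index(dt)]
```

After a few weeks of observations, a device that is always busy on Monday mornings scores near 0 for that slot, so the router stops sending it work then — without touching its score for any other hour.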

🔒

Dual API Compatibility

Both OpenAI and Ollama format endpoints. Drop-in replacement for any existing client, framework, or agent pipeline.

7 signals. Every request.

The scoring engine evaluates every available node on 7 dimensions before routing. The system learns from every request and improves over time.

🌡️ Thermal
💾 Memory
📊 Queue
⏱️ Latency
🎯 Affinity
📈 Availability
🧠 Context
How It Works

Request flow

From client request to streamed response in milliseconds. Every step is traced, logged, and queryable.

1

Request arrives

Client hits the OpenAI-compatible or Ollama-compatible endpoint. The request is normalized into a unified format.

2

Score & rank

The scoring engine eliminates unhealthy nodes, scores survivors on 7 signals, and selects the best. Fallback models are tried if the primary isn't available.

3

Queue & dispatch

The request enters a per-node:model queue with dynamic concurrency. The queue manager balances load and auto-rebalances if conditions change.

4

Stream & retry

The streaming proxy forwards to Ollama. If the node fails before the first chunk, auto-retry kicks in with a different node. Format conversion (SSE / NDJSON) is transparent.

5

Learn & trace

Every request is traced to SQLite. Latency data feeds back into the scoring engine. The fleet gets smarter with every request it serves.
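The retry rule in step 4 — transparent failover only before the first chunk — can be sketched as a generator that walks the ranked node list (the node and stream interfaces here are assumptions, not Herd's internal API):

```python
# Sketch of "retry before the first chunk": if a node dies before streaming
# anything, the next-ranked node is tried; once any chunk has reached the
# client, the failure is surfaced instead of silently restarting the stream.
def stream_with_retry(ranked_nodes, request):
    last_error = None
    for node in ranked_nodes:
        first_chunk_sent = False
        try:
            for chunk in node.stream(request):
                first_chunk_sent = True
                yield chunk
            return  # stream completed normally
        except ConnectionError as exc:
            if first_chunk_sent:
                raise          # mid-stream failure: retrying would duplicate output
            last_error = exc   # failed before first chunk: try the next node
    raise RuntimeError("all nodes failed") from last_error
```

The asymmetry is the point: before the first chunk, a retry is invisible to the client; after it, replaying the stream from another node would emit duplicated or inconsistent tokens, so the error must propagate.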

Compatibility

Works with everything

One base_url change connects any framework. Ollama Herd is an orchestration layer in front of your existing tools, not a replacement for them.

Open WebUI
LangChain
CrewAI
OpenHands
AutoGen
Aider
Cline
Continue.dev
LlamaIndex
OpenClaw
LiteLLM
exo

Any client that supports a custom OpenAI or Ollama base URL works out of the box.
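For example, pointing the official `openai` Python client at the router is a one-line config change. The host, port, and `/v1` path below are placeholders — use the address the `herd` router prints at startup:

```python
# Hypothetical client config: substitute the router address `herd` reports.
# The API key is required by the client but ignored for local inference.
from openai import OpenAI

client = OpenAI(
    base_url="http://<router-host>:<port>/v1",  # Herd's OpenAI-compatible endpoint
    api_key="unused",
)
```

Ollama-native clients work the same way: point their host setting at the router instead of a single Ollama instance.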

The fleet that works while you sleep

We're building an agentic router — a fleet that doesn't just wait for requests, but generates its own work, learns your patterns, and uses idle compute proactively.

Pattern-driven model pre-warming
Task backlog with idle-time processing
Agentic task decomposition
Fleet health opinions

Your Mac fleet is an untapped AI cluster

500 MacBooks with Apple Silicon. Tens of terabytes of unified memory. Sitting idle during meetings, after hours, and weekends. Ollama Herd turns your existing hardware into a private AI inference cluster — zero additional cost.

SSO, RBAC, audit logging, compliance dashboards, fleet management, and SLA support. Everything enterprises need to run AI inference on the hardware they already own.

Contact Us →
$0
Additional hardware cost
58%
Enterprise employees now on Macs
96%
CIOs expect Mac fleet growth
50-70%
Savings vs cloud API costs

Your hardware deserves
an orchestrator

Stop leaving compute on the table. Start herding.

pip install ollama-herd
View on GitHub →