Envoy AI Gateway is an enterprise Kubernetes-native gateway for managing cloud LLM API traffic across teams and providers. Ollama Herd is a zero-config fleet router for local Apple Silicon devices.
Envoy AI Gateway (~1,500 GitHub stars) is an open-source AI gateway built on Envoy Proxy, co-developed by Tetrate and Bloomberg and donated to the CNCF community. It provides multi-provider routing, credential injection, token-based rate limiting, and failover across 16+ cloud LLM APIs (OpenAI, Anthropic, Bedrock, Vertex AI, etc.), all deployed on Kubernetes via Helm charts and CRDs.
Ollama Herd is an open-source smart multimodal AI router that turns multiple Ollama instances across Apple Silicon devices into one intelligent endpoint. It routes LLMs, embeddings, image generation, speech-to-text, and vision with a 7-signal scoring engine, mDNS auto-discovery, and an 8-tab real-time dashboard. Two commands to set up, zero config files: `pip install ollama-herd` or `brew install ollama-herd`.
They solve fundamentally different problems at fundamentally different scales. Think enterprise cloud API gateway vs local fleet router.
Envoy AI Gateway extends Envoy Gateway (a Kubernetes-native API gateway built on Envoy Proxy) with AI-specific capabilities. It sits in front of cloud LLM APIs and gives enterprise teams a single endpoint that handles multi-provider routing, credential injection, token-aware rate limiting, and failover.

The result is a two-tier gateway pattern: Envoy Gateway handles standard API gateway concerns (routing, TLS, policy attachment), and Envoy AI Gateway layers the AI-specific behavior on top of it.
Kubernetes is mandatory. No Docker-only, no bare-metal, no laptop setup. Installation requires Kubernetes Gateway API CRDs, Envoy Gateway via Helm chart, Envoy AI Gateway via Helm chart, and extension manager configuration. You need familiarity with Kubernetes Gateway API, Envoy's xDS configuration model, Helm, and CRD-based configuration.
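To make the CRD-based configuration model concrete, here is a sketch of an AI route resource. The `apiVersion`, field names, and the `x-ai-eg-model` header reflect our reading of the project's CRDs and can differ between releases, so treat this as illustrative rather than copy-paste config:

```yaml
# Illustrative AIGatewayRoute sketch — verify the schema against the
# Envoy AI Gateway docs for your release before applying.
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: llm-route
  namespace: default
spec:
  parentRefs:
    - name: my-gateway              # the underlying Envoy Gateway
  rules:
    - matches:
        - headers:
            - name: x-ai-eg-model   # route on the requested model
              value: gpt-4o
      backendRefs:
        - name: openai-backend      # an AIServiceBackend resource
```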
| Dimension | Envoy AI Gateway | Ollama Herd |
|---|---|---|
| Core problem | Govern cloud LLM API calls across enterprise teams | Route inference across local Ollama devices |
| Deployment | Kubernetes + Helm + CRDs | `pip install ollama-herd` (two commands, zero config) |
| Infrastructure | K8s cluster required | Any machine with Python |
| Provider focus | 16+ cloud APIs | Ollama instances on LAN |
| Routing intelligence | Weight-based, failover, A/B | 7-signal scoring (thermal, memory, queue, wait, affinity, availability, context fit) |
| Hardware awareness | None | Thermal state, memory pressure, CPU utilization, disk space, model loading state |
| Device intelligence | None | Capacity learning, meeting detection, dynamic context optimization |
| Auth model | CEL policies, credential injection, cross-namespace isolation | Trusted LAN, no auth needed |
| Rate limiting | Token-aware, policy-based | Per node:model queue with dynamic concurrency |
| Observability | OpenTelemetry + GenAI conventions | JSONL + SQLite + live dashboard + Fleet Intelligence |
| Scale target | Enterprise multi-cluster, multi-team | 1–5 machines, home/office fleet |
| Operational overhead | High (Envoy xDS, Gateway API, Helm, CRDs) | Near-zero (mDNS, SQLite, HTTP) |
| Language | Go (90.6%) | Python (async, FastAPI) |
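The 7-signal scoring row above can be sketched as a weighted sum. The signal names come from the table, but the weights, normalization, and combination rule here are illustrative assumptions, not Herd's published formula:

```python
def score_node(signals, weights=None):
    """Hypothetical weighted-sum sketch of a multi-signal scoring engine.
    Each signal is assumed normalized to [0, 1], where higher means
    'better candidate'. Weights are illustrative, not Herd's actual values."""
    weights = weights or {
        "thermal": 0.20,       # cooler device preferred
        "memory": 0.20,        # more free memory preferred
        "queue": 0.15,         # shorter queue preferred
        "wait": 0.15,          # lower expected wait preferred
        "affinity": 0.10,      # model already loaded on this node?
        "availability": 0.10,  # node reachable and healthy
        "context_fit": 0.10,   # request fits the node's context window
    }
    return sum(weights[k] * signals.get(k, 0.0) for k in weights)

# Pick the best of two candidate nodes.
nodes = {
    "mac-studio": {"thermal": 0.9, "memory": 0.8, "queue": 1.0,
                   "wait": 0.9, "affinity": 1.0, "availability": 1.0,
                   "context_fit": 1.0},
    "macbook-air": {"thermal": 0.4, "memory": 0.5, "queue": 0.7,
                    "wait": 0.6, "affinity": 0.0, "availability": 1.0,
                    "context_fit": 1.0},
}
best = max(nodes, key=lambda n: score_node(nodes[n]))  # -> "mac-studio"
```

The point of a multiplicative-free weighted sum like this is that any single degraded signal (say, thermal throttling) lowers a node's rank without disqualifying it outright.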
An agent fleet that calls both cloud APIs and local models could use both.
This is the hybrid architecture that makes sense for cost-sensitive agent fleets: expensive/complex requests go to cloud (Claude, GPT-4), routine inference stays local (120B open-source models).
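Under those assumptions, the split can be sketched as a tiny dispatcher. The request fields and thresholds are hypothetical, not part of either project's API:

```python
def dispatch(request):
    """Toy sketch of the hybrid split: frontier-model or oversized
    requests go to the cloud tier (Envoy AI Gateway picks the provider);
    everything else stays local (Herd picks the machine)."""
    needs_frontier = request.get("complexity", "routine") == "complex"
    too_big_for_local = request.get("est_tokens", 0) > 32_000
    if needs_frontier or too_big_for_local:
        return "cloud"
    return "local"

dispatch({"complexity": "complex"})                        # -> "cloud"
dispatch({"complexity": "routine", "est_tokens": 2_000})   # -> "local"
```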
Envoy AI Gateway is the enterprise counterpart to Herd. It governs cloud LLM API traffic for Kubernetes teams — multi-provider routing, credential injection, token-based rate limiting. Ollama Herd routes local inference across Apple Silicon devices with zero configuration.
The fact that Bloomberg and Tetrate are building an AI gateway under CNCF governance validates that AI traffic management is a real problem. The fact that their solution requires Kubernetes and enterprise infrastructure validates Herd's niche: the same problem solved at personal/small-team scale with zero operational overhead.
If someone says "use Envoy AI Gateway instead of Herd" — they're solving a different problem. The hybrid integration pattern (Envoy for cloud, Herd for local) is genuinely compelling.
```shell
pip install ollama-herd   # or: brew install ollama-herd
herd        # start the router
herd-node   # on each device
```
Using Envoy AI Gateway for cloud API routing? Herd fits naturally as the local backend — Envoy decides "cloud or local?", Herd decides "which local machine?" Try Herd on two Macs to see fleet routing in action.
**How do they differ, and which should I use?**

They solve fundamentally different problems. Envoy AI Gateway governs cloud LLM API traffic for enterprise Kubernetes teams — multi-provider routing, credential injection, token-based rate limiting. Ollama Herd routes local inference across Apple Silicon devices with zero configuration. If you need to manage cloud API spend across teams, use Envoy AI Gateway. If you need your Macs working together as one AI system, use Herd.

**Can they be used together?**

Yes, and this is a compelling hybrid pattern. Envoy AI Gateway handles cloud-facing routing (deciding between OpenAI, Anthropic, Bedrock, etc.) while Ollama Herd handles local fleet routing (deciding which Mac handles a local inference request). Together they create a unified system where expensive requests go to cloud APIs and routine inference stays local.

**Can Envoy AI Gateway route to local Ollama instances?**

Technically yes, via the OpenAI-compatible API, but with zero hardware awareness — no knowledge of VRAM pressure, thermal state, or device capability. Herd's 7-signal scoring makes genuinely intelligent routing decisions based on real-time device conditions. For local inference, Herd is purpose-built; Envoy AI Gateway is cloud-first.

**Does Herd require Kubernetes?**

No. Herd installs via pip or Homebrew and runs as a lightweight Python service. No Kubernetes, no Helm charts, no CRDs, no Gateway API configuration. Envoy AI Gateway requires a full Kubernetes cluster with multiple layers of infrastructure.

**Is Herd free?**

Yes. Open-source, MIT license. No paid tiers, no API keys, no subscriptions.