Envoy AI Gateway is an enterprise Kubernetes-native gateway for managing cloud LLM API traffic across teams and providers. Ollama Herd is a zero-config fleet router for local Apple Silicon devices.
Envoy AI Gateway (~1,500 GitHub stars) is an open-source AI gateway built on Envoy Proxy, co-developed by Tetrate and Bloomberg and donated to the CNCF community. It provides multi-provider routing, credential injection, token-based rate limiting, and failover across 16+ cloud LLM APIs (OpenAI, Anthropic, Bedrock, Vertex AI, etc.), all deployed on Kubernetes via Helm charts and CRDs.
Ollama Herd is an open-source smart multimodal AI router that turns multiple Ollama instances across Apple Silicon devices into one intelligent endpoint. It routes LLMs, embeddings, image generation, speech-to-text, and vision with a 7-signal scoring engine, mDNS auto-discovery, and an 8-tab real-time dashboard. Two commands to set up, zero config files: `pip install ollama-herd` or `brew install ollama-herd`.
They solve fundamentally different problems at fundamentally different scales. Think enterprise cloud API gateway vs local fleet router.
Envoy AI Gateway extends Envoy Gateway (a Kubernetes-native API gateway built on Envoy Proxy) with AI-specific capabilities. It sits in front of cloud LLM APIs and gives enterprise teams a single endpoint that handles multi-provider routing, credential injection, token-aware rate limiting, and failover.

The result is a two-tier gateway pattern: Envoy Gateway handles standard API gateway concerns (routing, TLS, policy attachment), and Envoy AI Gateway layers the AI-specific behavior on top of it.
Kubernetes is mandatory. No Docker-only, no bare-metal, no laptop setup. Installation requires Kubernetes Gateway API CRDs, Envoy Gateway via Helm chart, Envoy AI Gateway via Helm chart, and extension manager configuration. You need familiarity with Kubernetes Gateway API, Envoy's xDS configuration model, Helm, and CRD-based configuration.
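To make the CRD-based configuration model concrete, here is a sketch of an AI route resource. The `apiVersion`, field names, and the `x-ai-eg-model` header reflect our reading of the project's CRDs and can differ between releases, so treat this as illustrative rather than copy-paste config:

```yaml
# Illustrative AIGatewayRoute sketch — verify the schema against the
# Envoy AI Gateway docs for your release before applying.
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: llm-route
  namespace: default
spec:
  parentRefs:
    - name: my-gateway              # the underlying Envoy Gateway
  rules:
    - matches:
        - headers:
            - name: x-ai-eg-model   # route on the requested model
              value: gpt-4o
      backendRefs:
        - name: openai-backend      # an AIServiceBackend resource
```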
| Dimension | Envoy AI Gateway | Ollama Herd |
|---|---|---|
| Core problem | Govern cloud LLM API calls across enterprise teams | Route inference across local Ollama devices |
| Deployment | Kubernetes + Helm + CRDs | `pip install ollama-herd` (two commands, zero config) |
| Infrastructure | K8s cluster required | Any machine with Python |
| Provider focus | 16+ cloud APIs | Ollama instances on LAN |
| Routing intelligence | Weight-based, failover, A/B | 7-signal scoring (thermal, memory, queue, wait, affinity, availability, context fit) |
| Hardware awareness | None | Thermal state, memory pressure, CPU utilization, disk space, model loading state |
| Device intelligence | None | Capacity learning, meeting detection, dynamic context optimization |
| Auth model | CEL policies, credential injection, cross-namespace isolation | Trusted LAN, no auth needed |
| Rate limiting | Token-aware, policy-based | Per node:model queue with dynamic concurrency |
| Observability | OpenTelemetry + GenAI conventions | JSONL + SQLite + live dashboard + Fleet Intelligence |
| Scale target | Enterprise multi-cluster, multi-team | 1–5 machines, home/office fleet |
| Operational overhead | High (Envoy xDS, Gateway API, Helm, CRDs) | Near-zero (mDNS, SQLite, HTTP) |
| Language | Go (90.6%) | Python (async, FastAPI) |
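The 7-signal scoring row above can be sketched as a weighted sum. The signal names come from the table, but the weights, normalization, and combination rule here are illustrative assumptions, not Herd's published formula:

```python
def score_node(signals, weights=None):
    """Hypothetical weighted-sum sketch of a multi-signal scoring engine.
    Each signal is assumed normalized to [0, 1], where higher means
    'better candidate'. Weights are illustrative, not Herd's actual values."""
    weights = weights or {
        "thermal": 0.20,       # cooler device preferred
        "memory": 0.20,        # more free memory preferred
        "queue": 0.15,         # shorter queue preferred
        "wait": 0.15,          # lower expected wait preferred
        "affinity": 0.10,      # model already loaded on this node?
        "availability": 0.10,  # node reachable and healthy
        "context_fit": 0.10,   # request fits the node's context window
    }
    return sum(weights[k] * signals.get(k, 0.0) for k in weights)

# Pick the best of two candidate nodes.
nodes = {
    "mac-studio": {"thermal": 0.9, "memory": 0.8, "queue": 1.0,
                   "wait": 0.9, "affinity": 1.0, "availability": 1.0,
                   "context_fit": 1.0},
    "macbook-air": {"thermal": 0.4, "memory": 0.5, "queue": 0.7,
                    "wait": 0.6, "affinity": 0.0, "availability": 1.0,
                    "context_fit": 1.0},
}
best = max(nodes, key=lambda n: score_node(nodes[n]))  # -> "mac-studio"
```

The point of a multiplicative-free weighted sum like this is that any single degraded signal (say, thermal throttling) lowers a node's rank without disqualifying it outright.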
An agent fleet that calls both cloud APIs and local models could use both.
This is the hybrid architecture that makes sense for cost-sensitive agent fleets: expensive/complex requests go to cloud (Claude, GPT-4), routine inference stays local (120B open-source models).
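Under those assumptions, the split can be sketched as a tiny dispatcher. The request fields and thresholds are hypothetical, not part of either project's API:

```python
def dispatch(request):
    """Toy sketch of the hybrid split: frontier-model or oversized
    requests go to the cloud tier (Envoy AI Gateway picks the provider);
    everything else stays local (Herd picks the machine)."""
    needs_frontier = request.get("complexity", "routine") == "complex"
    too_big_for_local = request.get("est_tokens", 0) > 32_000
    if needs_frontier or too_big_for_local:
        return "cloud"
    return "local"

dispatch({"complexity": "complex"})                        # -> "cloud"
dispatch({"complexity": "routine", "est_tokens": 2_000})   # -> "local"
```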
Envoy AI Gateway is the enterprise counterpart to Herd. It governs cloud LLM API traffic for Kubernetes teams — multi-provider routing, credential injection, token-based rate limiting. Ollama Herd routes local inference across Apple Silicon devices with zero configuration.
The fact that Bloomberg and Tetrate are building an AI gateway under CNCF governance validates that AI traffic management is a real problem. The fact that their solution requires Kubernetes and enterprise infrastructure validates Herd's niche: the same problem solved at personal/small-team scale with zero operational overhead.
If someone says "use Envoy AI Gateway instead of Herd" — they're solving a different problem. The hybrid integration pattern (Envoy for cloud, Herd for local) is genuinely compelling.
```shell
pip install ollama-herd   # or: brew install ollama-herd
herd        # start the router
herd-node   # on each device
```
Using Envoy AI Gateway for cloud API routing? Herd fits naturally as the local backend — Envoy decides "cloud or local?", Herd decides "which local machine?" Try Herd on two Macs to see fleet routing in action.
**How do they differ, and which should I use?**

They solve fundamentally different problems. Envoy AI Gateway governs cloud LLM API traffic for enterprise Kubernetes teams — multi-provider routing, credential injection, token-based rate limiting. Ollama Herd routes local inference across Apple Silicon devices with zero configuration. If you need to manage cloud API spend across teams, use Envoy AI Gateway. If you need your Macs working together as one AI system, use Herd.

**Can they be used together?**

Yes, and this is a compelling hybrid pattern. Envoy AI Gateway handles cloud-facing routing (deciding between OpenAI, Anthropic, Bedrock, etc.) while Ollama Herd handles local fleet routing (deciding which Mac handles a local inference request). Together they create a unified system where expensive requests go to cloud APIs and routine inference stays local.

**Can Envoy AI Gateway route to local Ollama instances?**

Technically yes, via the OpenAI-compatible API, but with zero hardware awareness — no knowledge of VRAM pressure, thermal state, or device capability. Herd's 7-signal scoring makes genuinely intelligent routing decisions based on real-time device conditions. For local inference, Herd is purpose-built; Envoy AI Gateway is cloud-first.

**Does Herd require Kubernetes?**

No. Herd installs via pip or Homebrew and runs as a lightweight Python service. No Kubernetes, no Helm charts, no CRDs, no Gateway API configuration. Envoy AI Gateway requires a full Kubernetes cluster with multiple layers of infrastructure.

**Is Herd free?**

Yes. Open-source, MIT license. No paid tiers, no API keys, no subscriptions.