Point Claude Code at your own hardware. Native Anthropic Messages API, three-layer context management, per-tier model routing, and the stability knobs that fix the "breaks at 30K tokens" failure mode on local Qwen3-Coder models.
If you already have a herd running, three env vars:
export ANTHROPIC_BASE_URL=http://localhost:11435
export ANTHROPIC_AUTH_TOKEN=dummy # any non-empty value
claude
That's it. Claude Code now talks to your local model with full tool use, streaming, and the standard agentic loop — routed through Ollama Herd's scoring engine, queue pipeline, and context management layers.
Want the full setup from scratch? Start with the Quickstart to get herd + herd-node running, then come back here for Claude Code specifics.
Ollama Herd exposes a native Anthropic Messages API (/v1/messages and /v1/messages/count_tokens), translates requests to Ollama's wire format, runs them through the same scoring + queue + trace pipeline as every other route, and translates the response back to Anthropic SSE event sequences.
No LiteLLM sidecar. No OpenAI-format proxy. Claude Code's ANTHROPIC_BASE_URL points straight at the herd router.
| Anthropic concept | Translated to | Notes |
|---|---|---|
| messages[].content blocks (text, image, tool_use, tool_result) | Ollama messages (string content + images[] + tool_calls[] + role:"tool" for results) | Order preserved; thinking blocks dropped on input |
| system (string or text-block array) | Prepended role:"system" message | Both forms supported |
| tools[] with input_schema | Ollama tools[] with parameters | JSON schema passes through; optional-param defaults injected (see Tool-Schema Fixup) |
| tool_choice: auto / none / any / tool | auto / strip / system-prompt nudge / system-prompt nudge | any and tool are best-effort (Ollama doesn't natively force tool calls) |
| Streaming SSE | message_start → content_block_start/delta/stop → message_delta → message_stop | Full event protocol; tool calls open new content blocks mid-stream |
| count_tokens | tiktoken cl100k estimate | Best-effort; budget-gating only, not billing |
Claude Code is agentic. It makes many tool calls. You want a model that's good at following a structured schema and not easily derailed. Recommended starting points:
# Fast, decent-quality (fits on a 36GB Mac)
ollama pull qwen3-coder:30b
# Higher quality (needs a 64GB+ machine)
ollama pull qwen3:32b
# Lightweight tier (for claude-haiku-* mapping)
ollama pull qwen3:14b
export ANTHROPIC_BASE_URL=http://localhost:11435
export ANTHROPIC_AUTH_TOKEN=dummy # any non-empty string
claude
Claude Code sends model IDs like claude-sonnet-4-5. Ollama Herd maps these to your local models via FLEET_ANTHROPIC_MODEL_MAP:
# Default (no env var needed):
# claude-opus-* -> qwen3:32b
# claude-sonnet-* -> qwen3-coder:30b
# claude-haiku-* -> qwen3:14b
# Custom:
export FLEET_ANTHROPIC_MODEL_MAP='{
"claude-opus-4-7": "mlx:Qwen3-Coder-Next-4bit",
"claude-sonnet-4-5": "qwen3-coder:30b",
"claude-haiku-4-5": "gpt-oss:120b"
}'
The map values can reference Ollama models (qwen3-coder:30b) or MLX models (mlx:Qwen3-Coder-Next-4bit) — see MLX Backend below.
Claude Code's --model flag lets users trade speed for quality per-invocation:
claude --model claude-haiku-4-5 # fast turns, cheaper model
claude --model claude-sonnet-4-5 # default balance
claude --model claude-opus-4-7 # highest quality
A good production mapping splits speed from quality and diversifies failure modes:
- claude-haiku-* → a smaller hot Ollama model (fast, already pinned)
- claude-sonnet-* → a solid coding model for the main loop
- claude-opus-* → an 80B-class MoE via MLX for the tough cases

Different model families on different tiers also means that if Qwen3 has a bad day, your haiku fallback still works.
The /compact command works out of the box. Claude Code's /compact slash command is client-side orchestration over the standard /v1/messages endpoint — it sends a normal request with a trailing user message asking the model to summarise the conversation, then locally replaces the in-memory history with the response. No special beta header, endpoint, or body field.
That means:
- /compact works against Ollama Herd with no special support required — same as against hosted Claude.
- In practice, a 2,700-message session that would have timed out or produced garbage on raw local inference can hit /compact successfully on our fleet — Layer 1 alone typically shrinks the prompt by 60%+ before it hits the model.
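The client-side flow can be sketched in two pure steps: build one ordinary Messages API body with a trailing summarisation prompt, then swap the local history for the summary. The prompt wording below is a placeholder — Claude Code's real summarisation prompt is internal to the CLI.

```python
# Illustrative sketch of the client-side /compact flow; not Claude Code's
# actual implementation.
SUMMARY_PROMPT = "Summarise this conversation so we can continue from the summary."

def build_compact_request(history, model="claude-sonnet-4-5"):
    # A completely normal /v1/messages body -- no beta header or extra field
    return {
        "model": model,
        "max_tokens": 2048,
        "messages": history + [{"role": "user", "content": SUMMARY_PROMPT}],
    }

def apply_compact(summary_text):
    # The client replaces its whole in-memory history with the summary
    return [{"role": "user", "content": "[Conversation summary]\n" + summary_text}]
```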
The biggest structural difference between hosted Claude and raw local inference is context hygiene. Hosted Claude silently strips stale tool results, summarises long sessions, and refuses oversized requests. Raw local Ollama doesn't. Ollama Herd ships three layers that close that gap.
When the Anthropic request exceeds FLEET_ANTHROPIC_AUTO_CLEAR_TOOL_USES_TRIGGER_TOKENS (default 100K), older tool_result blocks are replaced with a short placeholder before the request reaches the model. No LLM call, microsecond-scale. Matches hosted Claude's Context Editing API.
Configurable:
FLEET_ANTHROPIC_AUTO_CLEAR_TOOL_USES_TRIGGER_TOKENS=100000
FLEET_ANTHROPIC_AUTO_CLEAR_TOOL_USES_KEEP_RECENT=3 # keep 3 most-recent tool_result blocks verbatim
tool_use blocks (the model's own output) are never cleared — conversation structure stays intact, only stale bodies are dropped. Per-request log line shows tokens_before → tokens_after and cleared count for observability.
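The clearing pass described above can be sketched as a single scan over the conversation, replacing all but the N most-recent tool_result bodies with a placeholder. A minimal sketch with illustrative names — the real placeholder text and message shapes may differ:

```python
# Layer 1 auto-clear sketch: stale tool_result bodies are replaced,
# conversation structure (including tool_use blocks) is left intact.
PLACEHOLDER = "[tool result cleared to save context]"

def clear_old_tool_results(messages, keep_recent=3):
    # locate every tool_result block, oldest first
    locations = []
    for mi, msg in enumerate(messages):
        if isinstance(msg.get("content"), list):
            for bi, block in enumerate(msg["content"]):
                if block.get("type") == "tool_result":
                    locations.append((mi, bi))
    to_clear = locations[:-keep_recent] if keep_recent else locations
    for mi, bi in to_clear:
        # body dropped, block (and its tool_use_id pairing) stays
        messages[mi]["content"][bi]["content"] = PLACEHOLDER
    return len(to_clear)
```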
After clearing, if the prompt still exceeds FLEET_CONTEXT_COMPACTION_FORCE_TRIGGER_TOKENS (default 150K), an LLM-based summarizer runs on remaining content. Summary work goes to whatever capable model is already hot and idle rather than cold-loading a configured default.
Ranking: hot + eligible + idle (pinned models preferred when idle, penalized when busy, quality tiebreaks by params_b); falls back to the configured default when nothing suitable is hot; fails-open (no compaction) when even the default is saturated.
FLEET_CONTEXT_COMPACTION_ENABLED=true
FLEET_CONTEXT_COMPACTION_MODEL=mlx:mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit
FLEET_CONTEXT_COMPACTION_IDLE_WINDOW_S=120
FLEET_CONTEXT_COMPACTION_CURATOR_MIN_PARAMS_B=7.0
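The "hot + eligible + idle" ranking can be sketched as a score over candidate models. The record fields (hot, idle_s, pinned, params_b) and weights are assumptions for illustration, not the router's real scoring constants:

```python
# Hedged sketch of compactor-model selection: prefer hot, idle,
# sufficiently large models; fall back to the configured default.
def pick_compactor(candidates, default_model,
                   min_params_b=7.0, idle_window_s=120):
    def eligible(c):
        return c["hot"] and c["params_b"] >= min_params_b

    def score(c):
        s = 0.0
        if c["idle_s"] >= idle_window_s:
            s += 10.0
            if c.get("pinned"):
                s += 5.0      # pinned models preferred when idle...
        elif c.get("pinned"):
            s -= 5.0          # ...but penalised when busy
        s += c["params_b"] / 1000.0  # quality tiebreak by params_b
        return s

    pool = [c for c in candidates if eligible(c)]
    if not pool:
        return default_model  # nothing suitable is hot
    return max(pool, key=score)["model"]
```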
If the prompt is still oversized after clearing + compaction, the route returns HTTP 413 with a "run /compact and resubmit" message before the request ever reaches the model — no 5-minute MLX prefill wedge, no silent retry.
FLEET_ANTHROPIC_MAX_PROMPT_TOKENS=180000 # default
No silent server-side retry — client owns the decision of whether to resubmit, because correctness of agentic tool-use workflows depends on not altering context mid-turn.
Claude Code's 27-tool schema has heavy optional-param usage — Grep alone has 13 optional params. llama.cpp#20164 documents that Qwen3-Coder starts silently dropping optional params at ~30K tokens and loops tool calls with a field consistently missing.
Ollama Herd fixes this by promoting optional params with known-safe defaults (Bash.timeout=120000, Grep.head_limit=250, Read.offset=0, etc.) to required-with-default in the outbound schema.
FLEET_ANTHROPIC_TOOL_SCHEMA_FIXUP=inject # default
# Other modes:
# FLEET_ANTHROPIC_TOOL_SCHEMA_FIXUP=promote # only existing defaults
# FLEET_ANTHROPIC_TOOL_SCHEMA_FIXUP=off # pass-through
Backed by a CLAUDE_CODE_TOOL_DEFAULTS table keyed by (tool, param). Unknown tools pass through unchanged.
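In inject mode, the fixup amounts to promoting each known optional param to required and attaching its default. A minimal sketch — the table entries come from the text above, but the function shape and full table contents are illustrative:

```python
# Sketch of inject-mode tool-schema fixup keyed by (tool, param).
CLAUDE_CODE_TOOL_DEFAULTS = {
    ("Bash", "timeout"): 120000,
    ("Grep", "head_limit"): 250,
    ("Read", "offset"): 0,
}

def fixup_schema(tool_name, schema):
    props = schema.get("properties", {})
    required = set(schema.get("required", []))
    for (tool, param), default in CLAUDE_CODE_TOOL_DEFAULTS.items():
        if tool != tool_name or param not in props or param in required:
            continue
        # promote optional param to required-with-default so the model
        # always emits it instead of silently dropping it at long context
        props[param] = dict(props[param], default=default)
        required.add(param)
    schema["required"] = sorted(required)
    return schema
```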
Local coding models occasionally emit tool_use.input with minor syntax errors (trailing commas, unescaped quotes, missing brackets). The repair cascade:
1. The json-repair library runs first.
2. Pattern-based fixes (nicedreamzapp/claude-code-local): parameter=key>value, <parameter_key>value, malformed "arguments" objects, single-arg tool inference for Bash/Read/Write/Glob/Grep/WebFetch/WebSearch/TodoWrite.
3. Schema-gated — no repair substitutes unless it passes structural validation against the tool's input_schema.

Per-model repair counters are exposed on /fleet/queue so operators can see if a model's repair rate is climbing (>1% sustained = signal to reconsider the model).
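The schema-gating step can be sketched as follows. This is a simplified stand-in: the trailing-comma fix represents just one narrow repair, where the real cascade uses the json-repair library plus the pattern fixes described above, and the validation shown only checks required keys rather than full JSON Schema:

```python
import json

def strip_trailing_commas(s):
    # one narrow example of a syntax repair (stand-in for json-repair)
    return s.replace(",}", "}").replace(",]", "]")

def validates(obj, input_schema):
    # simplified structural check: required keys must be present
    if not isinstance(obj, dict):
        return False
    return all(k in obj for k in input_schema.get("required", []))

def repair_tool_input(raw, input_schema):
    for attempt in (raw, strip_trailing_commas(raw)):
        try:
            obj = json.loads(attempt)
        except json.JSONDecodeError:
            continue
        if validates(obj, input_schema):
            return obj  # first candidate that parses AND validates wins
    return None  # unrepairable -> surface the error instead of guessing
```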
FLEET_ANTHROPIC_TOOLS_DENY strips specified Claude Code tools from every /v1/messages request before translation. Saves 200–600 prompt tokens per turn.
export FLEET_ANTHROPIC_TOOLS_DENY="WebSearch,WebFetch,NotebookEdit"
Pairs with client-side permissions.deny in .claude/settings.json — client-side only blocks execution; this removes the definitions from the wire entirely. Names matched exactly (case-sensitive).
FLEET_ANTHROPIC_SIZE_ESCALATION_TOKENS + FLEET_ANTHROPIC_SIZE_ESCALATION_MODEL auto-route prompts over N tokens to a different (larger) model:
export FLEET_ANTHROPIC_SIZE_ESCALATION_TOKENS=50000
export FLEET_ANTHROPIC_SIZE_ESCALATION_MODEL=mlx:Qwen3-Coder-Next-4bit
Sonnet maps to qwen3-coder:30b for fast turns; escalates to the 480B MoE above 50K tokens. Trades small-request throughput for large-request quality where it matters.
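The routing decision is a simple threshold on the estimated prompt size, applied after the model map. A sketch with illustrative names:

```python
# Size-escalation sketch: prompts over the threshold go to the heavier
# model; everything else stays on the fast mapped default.
def route_model(mapped_model, prompt_tokens,
                escalation_tokens=50000,
                escalation_model="mlx:Qwen3-Coder-Next-4bit"):
    if escalation_tokens and prompt_tokens > escalation_tokens:
        return escalation_model  # big prompt -> quality matters most
    return mapped_model          # small prompt -> throughput matters most
```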
Apple Silicon nodes can run mlx_lm.server alongside Ollama. Useful for MLX-specific models (Qwen3-Coder-Next MoE, Qwen3-Coder-30B-A3B-Instruct) and for running a dedicated compactor model side-by-side with the main coding model without Ollama eviction risk.
export FLEET_NODE_MLX_SERVERS='[
{"model":"mlx-community/Qwen3-Coder-Next-4bit","port":11440,"kv_bits":8},
{"model":"mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit","port":11441,"kv_bits":8}
]'
Each server runs in its own process with independent logs at ~/.fleet-manager/logs/mlx-server-<port>.log. Memory-pressure startup gate refuses to spawn when total (model + FLEET_NODE_MLX_MEMORY_HEADROOM_GB) won't fit. Requires the Ollama Herd MLX patch — run scripts/setup-mlx.sh.
Once configured, reference MLX models in your model map with the mlx: prefix:
FLEET_ANTHROPIC_MODEL_MAP='{"claude-opus-*": "mlx:Qwen3-Coder-Next-4bit"}'
FLEET_MLX_WALL_CLOCK_TIMEOUT_S (default 300s) catches wedged-request syndrome where mlx_lm.server keeps emitting tokens slowly but never stops. On timeout, the slot is released and the route returns 413 with the /compact hint.
The default of 300s is reasonable for most workloads, but long Claude Code sessions (2,000+ messages) on Qwen3-Coder-Next-4bit routinely run 200–245s and need 600s to avoid edge-case timeouts just past the 300s mark.
export FLEET_MLX_WALL_CLOCK_TIMEOUT_S=600
After mlx_lm.server passes its health check, a fire-and-forget 1-token request primes the prompt cache with the system-prompt prefix. Measured 1.3–2.25× TTFT improvement on the first real request. Non-fatal on failure. Enabled by default.
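The warm-up can be sketched as a 1-token request carrying the system-prompt prefix, wrapped so that failure never blocks startup. The body shape is an assumption about an OpenAI-style chat endpoint; `send` stands in for whatever HTTP client the node uses:

```python
# Fire-and-forget prompt-cache warm-up sketch (illustrative names).
def build_warmup_request(system_prompt, model):
    return {
        "model": model,
        "max_tokens": 1,   # we only care about prefill, not the output
        "messages": [{"role": "system", "content": system_prompt}],
    }

def warm_cache(send, system_prompt, model):
    try:
        send(build_warmup_request(system_prompt, model))
    except Exception:
        pass  # non-fatal: warm-up failure never blocks startup
```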
Claude Code on qwen3-coder:30b-agent at 131K ctx on 128GB MacBooks triggered Jetsam OOM kills under real load. The combination that makes it reliable:
OLLAMA_NUM_PARALLEL=1
OLLAMA_KV_CACHE_TYPE=q8_0
OLLAMA_FLASH_ATTENTION=1
OLLAMA_KEEP_ALIVE=-1
Observed result on an M4 Max 128GB: 0% → 100% success on the big_agentic stress pattern (55 msgs, 27 tools).
Qwen3-Coder models can emit <|im_start|> or <|endoftext|> at ~30K tokens when attention to role separators weakens. Ollama Herd adds these to the MLX stop[] list and defensively strips them from any text that leaks through before the stop fires. If you still see literal <|im_start|> in your output, your OSS version is out of date — upgrade to v0.6.0+.
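The defensive scrub amounts to removing the leaked separator tokens from any streamed text before it reaches the client. A minimal sketch — the real implementation also registers these tokens in the MLX stop[] list so the stop usually fires first:

```python
# Strip role-separator tokens that leak past the stop list.
LEAK_TOKENS = ("<|im_start|>", "<|endoftext|>")

def scrub(text):
    for tok in LEAK_TOKENS:
        text = text.replace(tok, "")
    return text
```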
This is the llama.cpp#20164 bug. Confirm FLEET_ANTHROPIC_TOOL_SCHEMA_FIXUP=inject is active. Check /fleet/queue for tool_repair counters. If repair rate exceeds 1% sustained, consider escalating to a larger model or using a different family (e.g. gpt-oss:120b for haiku tier).
Expected behavior when a prompt exceeds FLEET_ANTHROPIC_MAX_PROMPT_TOKENS after clearing + compaction. Claude Code users should run /compact to trim history, then resubmit. This protects against multi-minute MLX prefill wedges.
Signal that the model is struggling with structured output. Options: (1) switch to a larger model for this tier, (2) reduce tool count via FLEET_ANTHROPIC_TOOLS_DENY, (3) enable size escalation to route long prompts to a heavier model.
Two Anthropic-side context-management betas are deliberately not handled:

- anthropic-beta: compact-2026-01-12 (with the context_management.edits body field) — an Ant-only beta; external Claude Code users don't send it.
- cache_edits content blocks (cache-editing-20250919) — also Ant-only today.

We log the first occurrence of any unknown block type, so if microcompact ever starts firing we notice without spam. Full three-mechanism analysis in the research doc: why-claude-code-degrades-at-30k.md.