
Claude Code CLI Integration

Point Claude Code at your own hardware. Native Anthropic Messages API, three-layer context management, per-tier model routing, and the stability knobs that fix the "breaks at 30K tokens" failure mode on local Qwen3-Coder models.

TL;DR

If you already have a herd running, it's two env vars:

export ANTHROPIC_BASE_URL=http://localhost:11435
export ANTHROPIC_AUTH_TOKEN=dummy   # any non-empty value
claude

That's it. Claude Code now talks to your local model with full tool use, streaming, and the standard agentic loop — routed through Ollama Herd's scoring engine, queue pipeline, and context management layers.

Want the full setup from scratch? Start with the Quickstart to get herd + herd-node running, then come back here for Claude Code specifics.

What Ollama Herd does for Claude Code

Ollama Herd exposes a native Anthropic Messages API (/v1/messages and /v1/messages/count_tokens), translates requests to Ollama's wire format, runs them through the same scoring + queue + trace pipeline as every other route, and translates the response back to Anthropic SSE event sequences.

No LiteLLM sidecar. No OpenAI-format proxy. Claude Code's ANTHROPIC_BASE_URL points straight at the herd router.

Anthropic concept → what it becomes (notes):

- messages[].content blocks (text, image, tool_use, tool_result) → Ollama messages (string content + images[] + tool_calls[] + role:"tool" for results). Order preserved; thinking blocks dropped on input.
- system (string or text-block array) → prepended role:"system" message. Both forms supported.
- tools[] with input_schema → Ollama tools[] with parameters. JSON schema passes through; optional-param defaults injected (see Tool-Schema Fixup).
- tool_choice: auto / none / any / tool → auto / strip / system-prompt nudge / system-prompt nudge. any and tool are best-effort (Ollama doesn't natively force tool calls).
- Streaming SSE: message_start → content_block_start/delta/stop → message_delta → message_stop. Full event protocol; tool calls open new content blocks mid-stream.
- count_tokens → tiktoken cl100k estimate. Best-effort; budget-gating only, not billing.
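The content-block translation described above can be sketched in a few lines. This is an illustrative reduction, not Ollama Herd's actual code: text blocks are joined into string content, tool_use becomes tool_calls[], tool_result becomes its own role:"tool" message, and thinking blocks are dropped on input.

```python
def translate_messages(anthropic_messages):
    """Sketch: Anthropic content blocks -> Ollama-style messages."""
    ollama_messages = []
    for msg in anthropic_messages:
        content = msg["content"]
        if isinstance(content, str):  # plain-string content passes through
            ollama_messages.append({"role": msg["role"], "content": content})
            continue
        text_parts, tool_calls = [], []
        for block in content:
            if block["type"] == "text":
                text_parts.append(block["text"])
            elif block["type"] == "tool_use":
                tool_calls.append({"function": {"name": block["name"],
                                                "arguments": block["input"]}})
            elif block["type"] == "tool_result":
                # tool results become their own role:"tool" message
                ollama_messages.append({"role": "tool",
                                        "content": str(block["content"])})
            # "thinking" blocks are intentionally dropped on input
        if text_parts or tool_calls:
            out = {"role": msg["role"], "content": "".join(text_parts)}
            if tool_calls:
                out["tool_calls"] = tool_calls
            ollama_messages.append(out)
    return ollama_messages
```

Block order within a message is preserved, which is what keeps multi-step agentic turns coherent after translation.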

Setup

Step 1: Pull a coding model

Claude Code is agentic. It makes many tool calls. You want a model that's good at following a structured schema and not easily derailed. Recommended starting points:

# Fast, decent-quality (fits on a 36GB Mac)
ollama pull qwen3-coder:30b

# Higher quality (needs a 64GB+ machine)
ollama pull qwen3:32b

# Lightweight tier (for claude-haiku-* mapping)
ollama pull qwen3:14b

Step 2: Point Claude Code at your router

export ANTHROPIC_BASE_URL=http://localhost:11435
export ANTHROPIC_AUTH_TOKEN=dummy   # any non-empty string
claude

Step 3: (Optional) Configure model mapping

Claude Code sends model IDs like claude-sonnet-4-5. Ollama Herd maps these to your local models via FLEET_ANTHROPIC_MODEL_MAP:

# Default (no env var needed):
#   claude-opus-*    -> qwen3:32b
#   claude-sonnet-*  -> qwen3-coder:30b
#   claude-haiku-*   -> qwen3:14b

# Custom:
export FLEET_ANTHROPIC_MODEL_MAP='{
  "claude-opus-4-7": "mlx:Qwen3-Coder-Next-4bit",
  "claude-sonnet-4-5": "qwen3-coder:30b",
  "claude-haiku-4-5": "gpt-oss:120b"
}'

The map values can reference Ollama models (qwen3-coder:30b) or MLX models (mlx:Qwen3-Coder-Next-4bit) — see MLX Backend below.
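Resolution against the map can be pictured as follows. This is a hedged sketch, not the router's internals: exact keys win, then fnmatch-style wildcards (the claude-opus-* patterns above), then the built-in tier defaults.

```python
import fnmatch

# Built-in tier defaults quoted from the section above
DEFAULTS = {
    "claude-opus-*": "qwen3:32b",
    "claude-sonnet-*": "qwen3-coder:30b",
    "claude-haiku-*": "qwen3:14b",
}

def resolve_model(model_id, model_map=None):
    """Sketch: map a claude-* model ID to a local model name."""
    model_map = model_map or {}
    if model_id in model_map:            # exact match wins
        return model_map[model_id]
    for table in (model_map, DEFAULTS):  # then wildcard patterns
        for pattern, target in table.items():
            if fnmatch.fnmatch(model_id, pattern):
                return target
    return model_id                      # unknown IDs pass through
```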

Per-Tier Model Routing

Claude Code's --model flag lets users trade speed for quality per-invocation:

claude --model claude-haiku-4-5      # fast turns, cheaper model
claude --model claude-sonnet-4-5     # default balance
claude --model claude-opus-4-7       # highest quality

A good production mapping splits speed from quality and diversifies failure modes. Putting different model families on different tiers also means that if Qwen3 has a bad day, your haiku fallback still works.

The /compact command works out of the box

Claude Code's /compact slash command is client-side orchestration over the standard /v1/messages endpoint — it sends a normal request with a trailing user message asking the model to summarise the conversation, then locally replaces the in-memory history with the response. No special beta header, endpoint, or body field.

That means no special support is needed on the server side: /compact works against Ollama Herd's /v1/messages exactly as it does against the hosted API. In practice, a 2,700-message session that would have timed out or produced garbage on raw local inference can hit /compact successfully on our fleet; Layer 1 alone typically shrinks the prompt by 60%+ before it hits the model.
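The /compact flow described above can be re-enacted as a toy sketch. Everything here is illustrative: summarize() stands in for an ordinary /v1/messages round-trip, and the point is that the server keeps no session state and needs no special endpoint.

```python
def compact(history, summarize):
    """Sketch of /compact's client-side orchestration."""
    # One normal request with a trailing user message asking for a summary
    prompt = history + [{"role": "user",
                         "content": "Summarise the conversation so far."}]
    summary = summarize(prompt)  # plain request through the same route
    # The client then replaces its in-memory history with the summary
    return [{"role": "user",
             "content": "Summary of prior session: " + summary}]
```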

Three-Layer Context Management

The biggest structural difference between hosted Claude and raw local inference is context hygiene. Hosted Claude silently strips stale tool results, summarises long sessions, and refuses oversized requests. Raw local Ollama doesn't. Ollama Herd ships three layers that close that gap.

Layer 1 — Mechanical tool-result clearing

When the Anthropic request exceeds FLEET_ANTHROPIC_AUTO_CLEAR_TOOL_USES_TRIGGER_TOKENS (default 100K), older tool_result blocks are replaced with a short placeholder before the request reaches the model. No LLM call, microsecond-scale. Matches hosted Claude's Context Editing API.

Configurable:

FLEET_ANTHROPIC_AUTO_CLEAR_TOOL_USES_TRIGGER_TOKENS=100000
FLEET_ANTHROPIC_AUTO_CLEAR_TOOL_USES_KEEP_RECENT=3   # keep 3 most-recent tool_result blocks verbatim

tool_use blocks (the model's own output) are never cleared — conversation structure stays intact; only stale result bodies are dropped. A per-request log line records tokens_before → tokens_after and the cleared count for observability.
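Layer 1's behaviour can be sketched minimally. Assumptions here (flat block lists, a crude 4-chars-per-token estimate) are mine for illustration; only the shape of the operation matches the description above: past the trigger, older tool_result blocks become placeholders while the most recent KEEP_RECENT stay verbatim, and tool_use blocks are untouched.

```python
PLACEHOLDER = "[tool result cleared to save context]"

def clear_tool_results(messages, trigger_tokens=100_000, keep_recent=3):
    """Sketch of Layer 1 mechanical tool-result clearing."""
    est = sum(len(str(m)) for m in messages) // 4  # rough token estimate
    if est <= trigger_tokens:
        return messages
    results = [b for m in messages if isinstance(m["content"], list)
               for b in m["content"] if b["type"] == "tool_result"]
    for block in results[:-keep_recent or None]:   # all but the newest N
        block["content"] = PLACEHOLDER
    return messages
```

No LLM call is involved, which is why this layer is effectively free.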

Layer 2 — LLM-based compactor with dynamic curator selection

After clearing, if the prompt still exceeds FLEET_CONTEXT_COMPACTION_FORCE_TRIGGER_TOKENS (default 150K), an LLM-based summarizer runs on remaining content. Summary work goes to whatever capable model is already hot and idle rather than cold-loading a configured default.

Ranking: hot + eligible + idle (pinned models preferred when idle, penalized when busy, quality tiebreaks by params_b); falls back to the configured default when nothing suitable is hot; fails-open (no compaction) when even the default is saturated.

FLEET_CONTEXT_COMPACTION_ENABLED=true
FLEET_CONTEXT_COMPACTION_MODEL=mlx:mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit
FLEET_CONTEXT_COMPACTION_IDLE_WINDOW_S=120
FLEET_CONTEXT_COMPACTION_CURATOR_MIN_PARAMS_B=7.0
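The ranking rule can be sketched as a scoring function. Names and weights below are illustrative, not Ollama Herd's internals; the ordering matches the description: hot and eligible candidates only, pinned-when-idle preferred, busy penalized, params_b as the quality tiebreak, with fallback to the configured default.

```python
def pick_curator(candidates, default, min_params_b=7.0):
    """Sketch of dynamic curator selection for Layer 2 compaction."""
    eligible = [c for c in candidates
                if c["hot"] and c["params_b"] >= min_params_b]
    if not eligible:
        return default                     # configured-default fallback

    def score(c):
        s = c["params_b"]                  # quality tiebreak
        if c["idle"] and c["pinned"]:
            s += 100                       # pinned models preferred when idle
        if not c["idle"]:
            s -= 50                        # busy models penalized
        return s

    return max(eligible, key=score)["name"]
```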

Layer 3 — Pre-inference 413 cap

If the prompt is still oversized after clearing + compaction, the route returns HTTP 413 with a "run /compact and resubmit" message before the request ever reaches the model — no 5-minute MLX prefill wedge, no silent retry.

FLEET_ANTHROPIC_MAX_PROMPT_TOKENS=180000   # default

No silent server-side retry — client owns the decision of whether to resubmit, because correctness of agentic tool-use workflows depends on not altering context mid-turn.
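The gate itself is a one-decision check, sketched below with an illustrative response shape (the real route returns a proper Anthropic-format error body):

```python
MAX_PROMPT_TOKENS = 180_000  # FLEET_ANTHROPIC_MAX_PROMPT_TOKENS default

def gate(prompt_tokens, limit=MAX_PROMPT_TOKENS):
    """Sketch of the Layer 3 pre-inference cap: reject before the model."""
    if prompt_tokens > limit:
        return (413, "Prompt too large after context management; "
                     "run /compact and resubmit.")
    return (200, None)
```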

Tool-Schema Fixup (Qwen3-Coder)

Claude Code's 27-tool schema leans heavily on optional params — Grep alone has 13. llama.cpp#20164 documents that Qwen3-Coder starts silently dropping optional params at ~30K tokens and loops on tool calls with the same field consistently missing.

Ollama Herd fixes this by promoting optional params with known-safe defaults (Bash.timeout=120000, Grep.head_limit=250, Read.offset=0, etc.) to required-with-default in the outbound schema.

FLEET_ANTHROPIC_TOOL_SCHEMA_FIXUP=inject   # default
# Other modes:
# FLEET_ANTHROPIC_TOOL_SCHEMA_FIXUP=promote   # only existing defaults
# FLEET_ANTHROPIC_TOOL_SCHEMA_FIXUP=off       # pass-through

Backed by a CLAUDE_CODE_TOOL_DEFAULTS table keyed by (tool, param). Unknown tools pass through unchanged.

Tool-Call JSON Repair

Local coding models occasionally emit tool_use.input with minor syntax errors (trailing commas, unescaped quotes, missing brackets). The repair cascade:

  1. Strict parse
  2. json-repair library
  3. 4-pattern regex catalog (adapted from nicedreamzapp/claude-code-local): <parameter=key>value, <parameter_key>value, malformed "arguments" objects, single-arg tool inference for Bash/Read/Write/Glob/Grep/WebFetch/WebSearch/TodoWrite
  4. Pass-through original

Schema-gated — no repair substitutes unless it passes structural validation against the tool's input_schema. Per-model repair counters exposed on /fleet/queue so operators can see if a model's repair rate is climbing (>1% sustained = signal to reconsider the model).
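A stdlib-only reduction of the cascade (the json-repair step is approximated here by a trailing-comma fix, and the schema gate by a required-keys check; the real pipeline does more):

```python
import json
import re

def parse_tool_input(raw, required_keys=()):
    """Sketch: strict parse -> cheap repair -> pass-through original.
    Returns (result, was_repaired)."""
    repaired = re.sub(r",\s*([}\]])", r"\1", raw)  # drop trailing commas
    for candidate in (raw, repaired):
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        # schema gate: a repair only substitutes if structurally plausible
        if all(k in parsed for k in required_keys):
            return parsed, candidate is not raw
    return raw, False  # pass through the original unmodified
```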

Token-Saving Knobs

Drop unused tool definitions

FLEET_ANTHROPIC_TOOLS_DENY strips specified Claude Code tools from every /v1/messages request before translation. Saves 200–600 prompt tokens per turn.

export FLEET_ANTHROPIC_TOOLS_DENY="WebSearch,WebFetch,NotebookEdit"

Pairs with client-side permissions.deny in .claude/settings.json — client-side only blocks execution; this removes the definitions from the wire entirely. Names matched exactly (case-sensitive).
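The filter itself is a simple exact-name exclusion over the request's tools array, sketched here:

```python
def apply_tools_deny(tools, deny_csv):
    """Sketch: drop denied tool definitions before translation.
    Matching is exact and case-sensitive, per the docs above."""
    deny = {name.strip() for name in deny_csv.split(",") if name.strip()}
    return [t for t in tools if t["name"] not in deny]
```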

Size-based model escalation

FLEET_ANTHROPIC_SIZE_ESCALATION_TOKENS + FLEET_ANTHROPIC_SIZE_ESCALATION_MODEL auto-route prompts over N tokens to a different (larger) model:

export FLEET_ANTHROPIC_SIZE_ESCALATION_TOKENS=50000
export FLEET_ANTHROPIC_SIZE_ESCALATION_MODEL=mlx:Qwen3-Coder-Next-4bit

Sonnet maps to qwen3-coder:30b for fast turns; escalates to the 480B MoE above 50K tokens. Trades small-request throughput for large-request quality where it matters.
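The escalation decision reduces to one threshold check after the token estimate, sketched here with the values from the example above:

```python
def route_model(mapped_model, prompt_tokens,
                escalation_tokens=50_000,
                escalation_model="mlx:Qwen3-Coder-Next-4bit"):
    """Sketch: swap in the escalation model past the size threshold."""
    if escalation_tokens and prompt_tokens > escalation_tokens:
        return escalation_model
    return mapped_model
```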

MLX Backend

Apple Silicon nodes can run mlx_lm.server alongside Ollama. Useful for MLX-specific models (Qwen3-Coder-Next MoE, Qwen3-Coder-30B-A3B-Instruct) and for running a dedicated compactor model side-by-side with the main coding model without Ollama eviction risk.

export FLEET_NODE_MLX_SERVERS='[
  {"model":"mlx-community/Qwen3-Coder-Next-4bit","port":11440,"kv_bits":8},
  {"model":"mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit","port":11441,"kv_bits":8}
]'

Each server runs in its own process with independent logs at ~/.fleet-manager/logs/mlx-server-<port>.log. Memory-pressure startup gate refuses to spawn when total (model + FLEET_NODE_MLX_MEMORY_HEADROOM_GB) won't fit. Requires the Ollama Herd MLX patch — run scripts/setup-mlx.sh.
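The startup gate's logic is a fit check; the sketch below assumes an 8 GB headroom default purely for illustration (the actual FLEET_NODE_MLX_MEMORY_HEADROOM_GB default may differ):

```python
def can_spawn(model_gb, free_gb, headroom_gb=8.0):
    """Sketch of the memory-pressure startup gate: refuse to spawn
    when model footprint + headroom won't fit in free memory."""
    return model_gb + headroom_gb <= free_gb
```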

Once configured, reference MLX models in your model map with the mlx: prefix:

FLEET_ANTHROPIC_MODEL_MAP='{"claude-opus-*": "mlx:Qwen3-Coder-Next-4bit"}'

Stability Techniques for Long Sessions

MLX wall-clock timeout

FLEET_MLX_WALL_CLOCK_TIMEOUT_S (default 300s) catches wedged-request syndrome where mlx_lm.server keeps emitting tokens slowly but never stops. On timeout, the slot is released and the route returns 413 with the /compact hint.

The 300s default is reasonable for most workloads, but long Claude Code sessions (2,000+ messages) on Qwen3-Coder-Next-4bit routinely run 200–245s per turn and need 600s to avoid edge-case timeouts that land just past the mark (observed 300.5s-type near-misses).

export FLEET_MLX_WALL_CLOCK_TIMEOUT_S=600

Warm-prompt preload

After mlx_lm.server passes its health check, a fire-and-forget 1-token request primes the prompt cache with the system-prompt prefix. Measured 1.3–2.25× TTFT improvement on the first real request. Non-fatal on failure. Enabled by default.

Ollama tuning for 128GB Mac Studios and below

Claude Code on qwen3-coder:30b-agent at 131K ctx on 128GB MacBooks triggered Jetsam OOM kills under real load. The combination that makes it reliable:

OLLAMA_NUM_PARALLEL=1
OLLAMA_KV_CACHE_TYPE=q8_0
OLLAMA_FLASH_ATTENTION=1
OLLAMA_KEEP_ALIVE=-1

Observed result on an M4 Max 128GB: 0% → 100% success on the big_agentic stress pattern (55 msgs, 27 tools).

Troubleshooting

"Claude Code is truncating responses mid-stream"

Qwen3-Coder models can emit <|im_start|> or <|endoftext|> at ~30K tokens when attention to role separators weakens. Ollama Herd adds these to the MLX stop[] list and defensively strips them from any text that leaks through before the stop fires. If you still see literal <|im_start|> in your output, your OSS version is out of date — upgrade to v0.6.0+.
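The defensive strip mentioned above amounts to removing the known role-separator tokens from any text that leaks past the stop list; a minimal sketch:

```python
# Role-separator tokens Qwen3-Coder can leak at long context
LEAK_TOKENS = ("<|im_start|>", "<|endoftext|>")

def strip_leaked_tokens(text):
    """Sketch: scrub leaked separator tokens from streamed text."""
    for tok in LEAK_TOKENS:
        text = text.replace(tok, "")
    return text
```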

"The model keeps calling tools with missing params"

This is the llama.cpp#20164 bug. Confirm FLEET_ANTHROPIC_TOOL_SCHEMA_FIXUP=inject is active. Check /fleet/queue for tool_repair counters. If repair rate exceeds 1% sustained, consider escalating to a larger model or using a different family (e.g. gpt-oss:120b for haiku tier).

"HTTP 413 responses mid-session"

Expected behavior when a prompt exceeds FLEET_ANTHROPIC_MAX_PROMPT_TOKENS after clearing + compaction. Claude Code users should run /compact to trim history, then resubmit. This protects against multi-minute MLX prefill wedges.

"My fleet dashboard shows tool_repair counters climbing"

Signal that the model is struggling with structured output. Options: (1) switch to a larger model for this tier, (2) reduce tool count via FLEET_ANTHROPIC_TOOLS_DENY, (3) enable size escalation to route long prompts to a heavier model.

What Ollama Herd does not implement

See the research doc for the full three-mechanism analysis: why-claude-code-degrades-at-30k.md.

Related Reading