Running Claude Code and Codex with Local Models

One overnight report-generation workflow cost me about $280. It used Claude Code with Opus 4.8 and launched more than 80 research subagents.

The strongest model was valuable for resolving ambiguity, comparing sources, and synthesizing the final report. Much of the remaining work, however, involved reading files, running commands, collecting information, and writing intermediate notes.

I built a small CLI to run those existing coding-agent workflows against local open-weight models on my 64GB M1 Max, without repeatedly reconfiguring models, runtimes, endpoints, and environment variables.

The $280 figure was the provider usage total for that single overnight Opus 4.8 run.

In this context, a research subagent was a separate Claude Code subtask spawned by the report workflow to inspect a source, collect evidence, or produce intermediate notes for the parent task.

The integration problem

Getting a model to generate text locally on my 64GB M1 Max was not the hard part. The friction was making the full coding-agent workflow repeatable.

Each time I changed a model, backend, or agent, I had to remember:

which model would fit;
which runtime should serve it;
which protocol and environment the agent expected;
which old processes were still holding memory.

In my current setup, Claude Code uses an Anthropic Messages-compatible endpoint, while Codex uses an OpenAI Responses-compatible endpoint. Rapid-MLX, LM Studio, and LiteLLM each cover different parts of that path.

The individual tools already existed. I needed a small command-line layer that connected them for daily use.

The CLI I built

I built Local Coding Agent Runner, a small CLI that starts local model backends and launches existing coding agents against them.

The public command is local-agent. I use llm as a shorter shell alias.

A normal session looks like this:

local-agent models
local-agent up qwen3-coder-30b
local-agent claude

Or, for Codex:

local-agent codex

The same CLI also handles the less interesting but necessary operations:

local-agent status
local-agent logs
local-agent switch qwen3.5-9b
local-agent down
local-agent down-all

For example:

$ llm claude
rapid-mlx is not running - starting it first...
-> ready (model=mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit)
   endpoint: http://127.0.0.1:8000
-> launching claude against the local endpoint

The command removes the repeated setup work: server command, model identifier, endpoint, agent environment variables, and cleanup.

How the stack fits together

Current tested configurations:

Agent	Local runtime	Model	Quantization	Observed outcome
Claude Code	Rapid-MLX	`mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit`	4-bit	Starts Rapid-MLX and launches Claude Code against the local `/v1/messages` endpoint.
Codex CLI	LM Studio	configured LM Studio model	depends on selected model	Managed fallback route is available; startup does not imply reliable task completion.
Codex CLI	LiteLLM	configured LiteLLM upstream	depends on configured backend	External route is implemented; it requires the proxy and profile to be configured already.

My main Claude Code path currently uses Rapid-MLX with an MLX model:

Claude Code
    ↓
Anthropic-compatible /v1/messages
    ↓
Rapid-MLX
    ↓
MLX model

Codex can be launched through the configured LM Studio or LiteLLM routes.

The runner keeps the active backend, model, and endpoint visible because those details matter when something fails. A model may load successfully but still produce malformed tool calls, use an incompatible chat template, or fail under an agent’s expected streaming behavior.

It also handles model lifecycle. Unified memory is shared by macOS, the browser, the editor, Docker, test processes, model weights, and context cache.

Leaving multiple large model servers running can push the system into swap and make local inference much less useful.

That is why commands such as these are part of the main workflow:

local-agent status
local-agent switch qwen3.5-9b
local-agent down-all

status shows what is running. switch replaces the active model. down-all releases memory when I want the laptop back for other work.

The surprising first-turn latency

The biggest surprise was that generation speed did not fully describe the experience.

A model could answer a short prompt at an acceptable rate and still take two or three minutes before Claude Code performed its first useful action.

The first Claude Code request is much larger than a normal chat prompt. It can include:

the system instructions;
descriptions of available tools;
project instructions;
environment information;
conversation context;
the actual user request.

This creates several different kinds of waiting:

Model loading
    ↓
Agent startup
    ↓
Initial prompt processing
    ↓
First output token
    ↓
First valid tool action

Tokens per second still matters, but for an interactive coding agent, time to first useful action can be more important. A model that generates quickly after a three-minute initial wait may be acceptable for a long-running background task and frustrating for a short interactive one.

Current limitations

The project is currently focused on Apple Silicon and tested primarily on my M1 Max with 64GB of unified memory.

Claude Code through Rapid-MLX is the path I use most often. Codex can use the configured LM Studio or LiteLLM routes, but local model compatibility changes quickly.

There is an important distinction between “the agent starts successfully” and “the model completes realistic tasks reliably.” A route can start and expose the expected endpoint while still failing on tool-call format, long first prompts, context handling, or recovery from bad intermediate steps.

I have not yet tested:

Gemini CLI;
several recent GLM model releases;
every available Codex-to-Rapid-MLX route;
serving the model from another machine over a private network such as Tailscale;
enough models and tasks to make broad recommendations;
exactly why subsequent Claude Code turns are faster than the first one.

For now, the utility is a practical way to start a local model, launch the agent I already use, and clean up afterward.

The source code is available at github.com/lzongren/local-coding-agent-runner.

The integration problem#

The CLI I built#

How the stack fits together#

The surprising first-turn latency#

Current limitations#

The integration problem

The CLI I built

How the stack fits together

The surprising first-turn latency

Current limitations