Article
Running Claude Code and Codex with Local Models
A small CLI for running existing coding agents with local open-weight models on Apple Silicon.
One overnight report-generation workflow cost me about $280. It used Claude Code with Opus 4.8 and launched more than 80 research subagents.
The strongest model was valuable for resolving ambiguity, comparing sources, and synthesizing the final report. Much of the remaining work, however, involved reading files, running commands, collecting information, and writing intermediate notes.
I built a small CLI to run those existing coding-agent workflows against local open-weight models on my 64GB M1 Max, without repeatedly reconfiguring models, runtimes, endpoints, and environment variables.
The $280 figure was the provider usage total for that single overnight Opus 4.8 run.
In this context, a research subagent was a separate Claude Code subtask spawned by the report workflow to inspect a source, collect evidence, or produce intermediate notes for the parent task.
The integration problem
Getting a model to generate text locally on my 64GB M1 Max was not the hard part. The friction was making the full coding-agent workflow repeatable.
Each time I changed a model, backend, or agent, I had to remember:
- which model would fit;
- which runtime should serve it;
- which protocol and environment the agent expected;
- which old processes were still holding memory.
In my current setup, Claude Code uses an Anthropic Messages-compatible endpoint, while Codex uses an OpenAI Responses-compatible endpoint. Rapid-MLX, LM Studio, and LiteLLM each cover different parts of that path.
The individual tools already existed. I needed a small command-line layer that connected them for daily use.
The CLI I built
I built Local Coding Agent Runner, a small CLI that starts local model backends and launches existing coding agents against them.
The public command is local-agent. I use llm as a shorter shell alias.
A normal session looks like this:
local-agent models
local-agent up qwen3-coder-30b
local-agent claude
Or, for Codex:
local-agent codex
The same CLI also handles the less interesting but necessary operations:
local-agent status
local-agent logs
local-agent switch qwen3.5-9b
local-agent down
local-agent down-all
For example:
$ llm claude
rapid-mlx is not running - starting it first...
-> ready (model=mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit)
endpoint: http://127.0.0.1:8000
-> launching claude against the local endpoint
The command removes the repeated setup work: server command, model identifier, endpoint, agent environment variables, and cleanup.
How the stack fits together
Current tested configurations:
| Agent | Local runtime | Model | Quantization | Observed outcome |
|---|---|---|---|---|
| Claude Code | Rapid-MLX | mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit | 4-bit | Starts Rapid-MLX and launches Claude Code against the local /v1/messages endpoint. |
| Codex CLI | LM Studio | configured LM Studio model | depends on selected model | Managed fallback route is available; startup does not imply reliable task completion. |
| Codex CLI | LiteLLM | configured LiteLLM upstream | depends on configured backend | External route is implemented; it requires the proxy and profile to be configured already. |
My main Claude Code path currently uses Rapid-MLX with an MLX model:
Claude Code
↓
Anthropic-compatible /v1/messages
↓
Rapid-MLX
↓
MLX model
Codex can be launched through the configured LM Studio or LiteLLM routes.
The runner keeps the active backend, model, and endpoint visible because those details matter when something fails. A model may load successfully but still produce malformed tool calls, use an incompatible chat template, or fail under an agent’s expected streaming behavior.
It also handles model lifecycle. Unified memory is shared by macOS, the browser, the editor, Docker, test processes, model weights, and context cache.
Leaving multiple large model servers running can push the system into swap and make local inference much less useful.
That is why commands such as these are part of the main workflow:
local-agent status
local-agent switch qwen3.5-9b
local-agent down-all
status shows what is running. switch replaces the active model. down-all releases memory when I want the laptop back for other work.
The surprising first-turn latency
The biggest surprise was that generation speed did not fully describe the experience.
A model could answer a short prompt at an acceptable rate and still take two or three minutes before Claude Code performed its first useful action.
The first Claude Code request is much larger than a normal chat prompt. It can include:
- the system instructions;
- descriptions of available tools;
- project instructions;
- environment information;
- conversation context;
- the actual user request.
This creates several different kinds of waiting:
Model loading
↓
Agent startup
↓
Initial prompt processing
↓
First output token
↓
First valid tool action
Tokens per second still matters, but for an interactive coding agent, time to first useful action can be more important. A model that generates quickly after a three-minute initial wait may be acceptable for a long-running background task and frustrating for a short interactive one.
Current limitations
The project is currently focused on Apple Silicon and tested primarily on my M1 Max with 64GB of unified memory.
Claude Code through Rapid-MLX is the path I use most often. Codex can use the configured LM Studio or LiteLLM routes, but local model compatibility changes quickly.
There is an important distinction between “the agent starts successfully” and “the model completes realistic tasks reliably.” A route can start and expose the expected endpoint while still failing on tool-call format, long first prompts, context handling, or recovery from bad intermediate steps.
I have not yet tested:
- Gemini CLI;
- several recent GLM model releases;
- every available Codex-to-Rapid-MLX route;
- serving the model from another machine over a private network such as Tailscale;
- enough models and tasks to make broad recommendations;
- exactly why subsequent Claude Code turns are faster than the first one.
For now, the utility is a practical way to start a local model, launch the agent I already use, and clean up afterward.
The source code is available at github.com/lzongren/local-coding-agent-runner.