Let’s be real, local-first tooling is an essential capability when your internet is not stable, in a world where every cloud service out-there is frequently crashing, not to mention speed-privacy-ownership concerns while sending code, prompts, logs, or half-baked ideas to a remote API.
To avoid all these noise and make my workflow more autonomous-quicker-stable-private I made a machine where I can code with an AI assistant Locally and Fully Offline.
I do not think everyone should run a local coding model, this is still a serious tech task. But if you enjoy owning and playing with your stack, and you have the hardware for it, it can be a very satisfying experience.
What I’ve used for the AI coding assistant that can run without the cloud
This is a Debian machine with 64 GB GPU and OpenCode as the coding agent with llama.cpp .
When you get your local llama-server running, OpenCode talks to it like it would talk to any OpenAI-compatible provider.
The difference is that the whole loop stays on my local machine.
Neat!
Be warned – Local AI assistant is hungry for GPU memory
Running Qwen3.6 27B Q8_0 with 256k context in reasoning mode loads around 50GB of the GPU memory and gives around 64 tokens/s for prompt+generation.
That is quite good for a local model with that much context.
There are some Caveats
Qwen3.6 27B at 256k context is about three times slower compared to a hosted frontier model.
You have to care about model storage, updates, server flags, GPU memory, and cooling.