Three Tools That Run a Full AI Coding Agent on Your Laptop

Why Local Is a Legitimate Option Now

Cloud-hosted AI coding tools have made it easy to offload work - paste a function, get a suggestion, ship faster. That convenience carries a cost, sometimes literal, sometimes a matter of where your code ends up after you hit send. For teams with IP constraints, solo developers watching API bills, or anyone who wants to actually understand what’s happening inside the agent stack, running the whole thing locally has become worth the setup effort. The pieces to do it well now exist.

This walkthrough covers exactly three tools: Ollama to serve the model, Gemma 4 as the local LLM, and OpenCode as the agent interface. Each one handles a distinct layer. Get all three talking to each other and you have a coding agent that runs entirely on your own machine.

Ollama: The Local Model Server

Ollama is a runtime for downloading, running, and serving local language models directly from your machine. Once it’s running, it exposes a local API endpoint - other tools can query the model through that endpoint the same way they’d call a cloud provider, except the request never leaves your hardware.

On Windows, installation comes from the official installer at ollama.com/download, or via PowerShell with a single winget command: winget install Ollama.Ollama. On Linux, a curl script handles it: curl -fsSL https://ollama.com/install.sh | sh. After either install, running ollama --version confirms the CLI is available. On Windows, Ollama also appears in the Start menu and, once launched, sits in the system tray as a background service - meaning it’s ready to accept requests whenever you need it.

The reason Ollama matters here isn’t just convenience. Tools like OpenCode need somewhere to send prompts. Normally that’s an external API. Ollama replaces that external destination with a local one, so the agent’s calls to the model go to your machine, not a data center.

Gemma 4: Google’s April 2026 Open Model

Google released Gemma 4 on April 2, 2026. The model is designed for reasoning, coding, multimodal understanding, and agentic workflows - which makes it a reasonable fit for a local coding agent rather than a general-purpose chat model. It comes in multiple sizes, from smaller edge-oriented variants up to larger workstation-grade ones.

For local laptop use, the relevant variants are E2B (gemma4:e2b) and E4B (gemma4:e4b). The “E” in Ollama’s naming stands for “effective” parameters, not total parameters. E4B offers more capability; E2B is the fallback if E4B runs too slowly on your hardware.

Pulling E4B takes one command, identical on Windows and Linux: ollama pull gemma4:e4b. Once downloaded, ollama list shows what’s on your machine. The E4B model weighs in at 9.6 GB. It’s a 4-bit quantized model stored in GGUF format - the local model format Ollama uses - and it ships with a 128K context window. That context length matters for coding tasks where you’re feeding in large files or multi-file context.

The hardware floor for running E4B comfortably is roughly what a mid-range developer laptop carries. The reference machine used here has an Intel i7-13800H CPU, 32 GB RAM, and an NVIDIA RTX 2000 Ada Laptop GPU with 8 GB VRAM. Before wiring up the rest of the stack, a quick sanity check is worth running: ollama run gemma4:e4b "what's the capital of France?". A response of “Paris” means the model loaded and is responding. The first call runs slow because Ollama has to load the model into memory; subsequent prompts come back faster once the model is warm.

OpenCode: The Agent Interface That Closes the Loop

OpenCode is the third piece - the agent runtime that sits on top of the model and actually does the work inside a codebase. If you’ve used Claude Code or Codex, OpenCode belongs to the same category: it can operate within a local repository, inspect files, run commands, and carry out tasks across a project. The critical difference for this setup is that OpenCode can be pointed at a local Ollama endpoint instead of a cloud provider, which is what connects it to Gemma 4 running on your own machine.

That single capability - directing the agent’s LLM calls to a local server - is what makes the whole stack coherent. Ollama exposes the endpoint. Gemma 4 answers on that endpoint. OpenCode drives the agent logic against it.

What the Stack Actually Looks Like in Practice

The architecture is straightforward once you see all three layers together. Ollama runs as a background service, holding the Gemma 4 model and waiting for API calls. OpenCode operates in your terminal or within a project directory, sending prompts to the local Ollama endpoint rather than reaching out to Anthropic, OpenAI, or Google’s cloud. Gemma 4 processes those prompts and returns completions, which OpenCode uses to take actions - reading files, writing changes, running commands - inside the repository.

Nothing in that chain requires an internet connection after the initial model download. No API keys. No per-token billing after setup. No code leaving the machine.

Choosing Between E2B and E4B

The decision between the two Gemma 4 edge variants comes down to hardware and patience. E4B at 9.6 GB fits within 8 GB VRAM on the RTX 2000 Ada with some memory management overhead, and it delivers noticeably better reasoning on complex coding tasks. E2B is the right call if inference on E4B stalls out or bogs down your machine during extended agent sessions - it runs lighter and still handles most coding queries adequately.

Context length stays the same across both at 128K tokens, so the difference is purely in output quality and inference speed, not in how much code you can feed into a single prompt.

The Practical Value of Owning the Stack

Running a local coding agent through these three tools is not faster to set up than signing up for a cloud service. The 9.6 GB model download alone takes time, and configuring OpenCode to hit the local endpoint adds a step that cloud tools skip entirely. What you get in exchange is a setup where every component is auditable - you can see what version of the model you’re running, what the agent interface is doing, and where every prompt goes.

For developers who want to understand how the agent layer actually works rather than treating it as a black box, this stack makes the internals inspectable. Ollama’s local API endpoint behaves the same way a cloud API behaves, so anything you learn about how OpenCode talks to models here applies directly to how it would talk to a cloud model in production.

The 4-bit quantization in the GGUF-format E4B model does represent a tradeoff compared to running the full-precision version on cloud infrastructure. On most practical coding tasks inside a real repository, that tradeoff is acceptable - and the 128K context window means you’re not constantly hitting a ceiling when working across large files.

The E4B model download runs 9.6 GB and, on the reference hardware, sits fully in VRAM once loaded - meaning after that first slow prompt, latency stays low enough for interactive use.