While I Wait: Running Gemma 4 on Laptop Hardware

LLM on a Budget series

Part I — The hardware plan
Interlude — Gemma 4 on laptop hardware
Part II — The build
Part III — Software, the 4am saga, and benchmarks
Companion — Model comparison deep dive
Part IV — GPU upgrade, six models, 16GB VRAM

The hardware

While the parts for the dedicated build make their way here, I thought I'd see what running a local model on existing hardware actually looked like in practice. The laptop isn't new, but it's not weak either:

CPU: Intel Core i7-8850H (6 cores / 12 threads, 8th gen) — RAM: 64GB — GPU: NVIDIA Quadro P1000 Mobile (4GB GDDR5 VRAM, Pascal architecture)

The 64GB of RAM is the real asset here. The Quadro P1000 Mobile is a 2017-era professional mobile GPU — capable of assisting with inference via CUDA, but 4GB of VRAM is a tight budget for a 4-billion-parameter model. Gemma 4 E4B in a standard quantisation lands at roughly 2.5–3GB, which means it can fit in VRAM — just barely, and with little headroom for the context overhead that builds up during a long session. Anything that doesn't fit spills over to system RAM, at which point you're back to relying on the CPU for that portion of the work.
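
To put rough numbers on that headroom, here is a back-of-the-envelope sketch in Python. The 4GB and 2.5–3GB figures come from the paragraph above; `runtime_overhead_gb` is a hypothetical placeholder for whatever the CUDA runtime and the display take, which I haven't measured on this machine.

```python
def vram_headroom_gb(vram_gb: float, model_gb: float,
                     runtime_overhead_gb: float = 0.0) -> float:
    """Memory left for the context cache once the weights are loaded.

    runtime_overhead_gb is an assumed, unmeasured allowance for the
    driver, CUDA context and display; it defaults to zero.
    """
    return vram_gb - model_gb - runtime_overhead_gb


VRAM_GB = 4.0  # Quadro P1000 Mobile

if __name__ == "__main__":
    # Low and high ends of the quantised model size quoted above.
    for model_gb in (2.5, 3.0):
        headroom = vram_headroom_gb(VRAM_GB, model_gb)
        print(f"model {model_gb:.1f} GB -> {headroom:.1f} GB left for context")
```

Best case that leaves somewhere between 1GB and 1.5GB for everything that isn't weights, and any real runtime overhead comes out of that.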

I'd already downloaded Gemma 4 E4B and 31B locally in anticipation of the new machine. The 31B was never going to work on the laptop — the RAM might technically hold it, but the inference speed would be measured in sentences per hour. E4B was the realistic option.

I installed LM Studio, loaded E4B, and got started.

The promising start

Initial impressions were genuinely reasonable. Responses were coming back at roughly reading speed, slightly faster than I could comfortably absorb, actually. For a 4-billion-parameter model on a laptop with a 4GB GPU and the CPU picking up whatever didn't fit, that was better than expected.

I set up a well-structured prompt with clear guidelines: what I wanted built, how it was to be structured, and specific architectural requirements. One of those requirements was that everything should run in Docker except the database, which would sit in its own dedicated VM to make scaling easier down the track. Clear, specific, documented in the context from the start.

For a while, it was doing exactly what I'd asked. Making progress, following the architecture, producing useful output.

The gradual decline

After a couple of hours, things started to drift.

The first sign was the model attempting to run Docker commands against what appeared to be the database container — the one that was explicitly not in Docker, in its own VM, in the requirements it had been given. I corrected it. It acknowledged the correction, confirmed it understood, and carried on. A while later, a variation of the same mistake surfaced. Corrected again. Acknowledged again. Another loop. Different detail, same pattern.

By this point responses were also noticeably slower. What had been faster than reading speed was becoming a wait. Long enough that I was stopping responses mid-sentence — not because they were wrong, but because I could tell from the first few words they were heading somewhere unhelpful, and I wasn't willing to sit through two minutes of generation to confirm it.

What's actually happening here is context saturation compounded by VRAM constraints. The Quadro P1000's 4GB is enough to hold the base model, but the key-value cache (the stored attention keys and values for every token already in the context) grows with every token the conversation adds. When that overflows available VRAM, it spills to system RAM, and the CPU has to handle an increasing share of the inference work. Each new token has to attend to everything before it, so per-token cost rises with context length and the total cost of a session grows faster than linearly. At the same time, the model's effective attention on earlier instructions weakens as the window fills. The architectural decisions you established at the start of a long session carry less and less weight by the time you're two hours in. What looks like the model "forgetting" your requirements is the context window working against you on hardware that can't keep up with it.
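
The cache growth is easy to estimate with the standard formula: two tensors (keys and values) per layer, per KV head, per token. The sketch below is only an illustration. The `AttentionShape` numbers are placeholders I picked for round arithmetic, not Gemma 4 E4B's actual architecture, and models that use sliding-window or shared KV layers will cache less than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionShape:
    """Hypothetical model shape; NOT Gemma 4 E4B's real configuration."""
    n_layers: int
    n_kv_heads: int
    head_dim: int
    bytes_per_value: int = 2  # assumes an fp16 cache


def kv_cache_bytes(shape: AttentionShape, n_tokens: int) -> int:
    """Bytes needed to cache keys and values for n_tokens of context."""
    per_token = (2  # one key tensor and one value tensor
                 * shape.n_layers
                 * shape.n_kv_heads
                 * shape.head_dim
                 * shape.bytes_per_value)
    return per_token * n_tokens


def tokens_that_fit(shape: AttentionShape, budget_bytes: int) -> int:
    """How many tokens of cache fit in a given memory budget."""
    return budget_bytes // kv_cache_bytes(shape, 1)


if __name__ == "__main__":
    placeholder = AttentionShape(n_layers=32, n_kv_heads=8, head_dim=128)
    per_token_kib = kv_cache_bytes(placeholder, 1) // 1024
    print(f"{per_token_kib} KiB of cache per token")
    print(f"{tokens_that_fit(placeholder, 1024**3)} tokens fit in 1 GiB")
```

With those placeholder numbers each token costs 128 KiB of cache, so 1 GiB of spare VRAM holds about 8,192 tokens. A long prompt plus a couple of hours of back-and-forth gets past that quickly, and everything beyond it lives in system RAM.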

The correction loop was the clearest symptom. Each correction added more tokens to the context. More tokens meant slower responses and a progressively greater distance between the current generation and the original requirements. I was spending more time correcting the same mistakes than making forward progress.
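
Why the loop compounds can be shown with a simple count. The sketch below is an idealised model, not a measurement: it assumes every token attends to all tokens before it (plain causal attention with a cache), and the 500 tokens per turn is a made-up figure for illustration.

```python
def cumulative_attention_work(tokens_per_turn: int, n_turns: int) -> list[int]:
    """Cumulative token-to-token attention lookups after each turn.

    Idealised: each new token attends to every earlier token and itself.
    Ignores sliding windows, batching and memory-transfer costs.
    """
    context = 0      # tokens already in the conversation
    total = 0        # running count of attention lookups
    history = []
    for _ in range(n_turns):
        t = tokens_per_turn
        # Token i of this turn sees (context + i) positions, for i = 1..t.
        total += t * context + t * (t + 1) // 2
        context += t
        history.append(total)
    return history


if __name__ == "__main__":
    work = cumulative_attention_work(tokens_per_turn=500, n_turns=20)
    print(f"after 10 turns: {work[9]:,} lookups")
    print(f"after 20 turns: {work[19]:,} lookups")
    print(f"ratio: {work[19] / work[9]:.2f}x")
```

Doubling the number of turns roughly quadruples the total work in this model. Every correction I typed was another turn, so the act of fixing the problem was also feeding it.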

The conclusion

After a couple of hours I called it. The session had reached a point where the ratio of useful output to time spent correcting had inverted. The model wasn't getting worse at reasoning — it was running out of the resources it needed to reason well over a long context on constrained hardware.

A few things became clear from the exercise:

E4B on this laptop is viable for short, focused tasks. Quick questions, isolated code snippets, things that don't require extensive context: it handles those well at a reasonable pace.
Long sessions with complex, stateful requirements are where it breaks down. The architecture of a multi-service system is exactly the kind of task that needs the model to hold a lot of context reliably over time. A 2018 laptop chip paired with a 4GB GPU isn't the right tool for that.
GPU VRAM is the binding constraint, not raw GPU presence. Having a GPU helped, but 4GB of VRAM on a Pascal-era mobile chip is not the same thing as having enough fast memory to sustain a long session. The model fit, barely, but the context cache that builds up during extended work didn't. It isn't enough for a GPU to hold the weights; it has to hold the weights and the growing cache in fast memory to keep the whole session in play.

It also reinforced why the dedicated build matters. Not because the laptop can't run a local model at all — it can — but because "runs" and "useful for extended technical work" are meaningfully different thresholds.

The hardware is still on its way. Part II will be more interesting.
