AI & Hardware LLM on a Budget — Interlude

While I wait: running Gemma 4 on laptop hardware

The dedicated build hardware is still in transit. In the meantime, I figured I'd try running Gemma 4 E4B on the laptop. It started well. It did not end well. Here's what happened and why.

The hardware

While the parts for the dedicated build make their way here, I thought I'd see what running a local model on existing hardware actually looked like in practice. The laptop isn't new, but it's not weak either:

CPU: Intel Core i7-8850H (6 cores / 12 threads, 8th gen)
RAM: 64GB
GPU: NVIDIA Quadro P1000 Mobile (4GB GDDR5 VRAM, Pascal architecture)

The 64GB of RAM is the real asset here. The Quadro P1000 Mobile is a 2017-era professional mobile GPU — capable of assisting with inference via CUDA, but 4GB of VRAM is a tight budget for a 4-billion-parameter model. Gemma 4 E4B in a standard quantisation lands at roughly 2.5–3GB, which means it can fit in VRAM — just barely, and with little headroom for the context overhead that builds up during a long session. Anything that doesn't fit spills over to system RAM, at which point you're back to relying on the CPU for that portion of the work.
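
To put rough numbers on that headroom, here is a minimal sketch of the arithmetic. The 4GB and 2.5–3GB figures are the ones quoted above; the `runtime_overhead_gb` parameter is a placeholder of mine for whatever the driver and inference runtime reserve, not a measured value.

```python
def vram_headroom_gb(vram_gb: float, model_gb: float,
                     runtime_overhead_gb: float = 0.0) -> float:
    """VRAM left over for context (KV cache) once the weights are loaded."""
    return vram_gb - model_gb - runtime_overhead_gb


if __name__ == "__main__":
    # Both ends of the quoted model-size range on the 4GB card.
    for model_gb in (2.5, 3.0):
        free = vram_headroom_gb(4.0, model_gb)
        print(f"model {model_gb}GB -> {free}GB left before any runtime overhead")
```

Even in the best case that leaves about 1.5GB for everything that isn't model weights, and any real overhead only shrinks it.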

I'd already downloaded Gemma 4 E4B and 31B locally in anticipation of the new machine. The 31B was never going to work on the laptop — the RAM might technically hold it, but the inference speed would be measured in sentences per hour. E4B was the realistic option.

I installed LM Studio, loaded E4B, and got started.

The promising start

Initial impressions were genuinely reasonable. Responses were coming back at a pace roughly on par with reading speed, slightly faster than I could comfortably absorb, actually. For a 4-billion-parameter model on a laptop with a 4GB mobile GPU, that was better than expected.

I set up a well-structured prompt with clear guidelines: what I wanted built, how it was to be structured, and specific architectural requirements. One of those requirements was that everything should run in Docker except the database, which would sit in its own dedicated VM to make scaling easier down the track. Clear, specific, documented in the context from the start.

For a while, it was doing exactly what I'd asked. Making progress, following the architecture, producing useful output.

The gradual decline

After a couple of hours, things started to drift.

The first sign was the model attempting to run Docker commands against what appeared to be the database container — the one that was explicitly not in Docker, in its own VM, in the requirements it had been given. I corrected it. It acknowledged the correction, confirmed it understood, and carried on. A while later, a variation of the same mistake surfaced. Corrected again. Acknowledged again. Another loop. Different detail, same pattern.

By this point responses were also noticeably slower. What had been faster than reading speed was becoming a wait. Long enough that I was stopping responses mid-sentence — not because they were wrong, but because I could tell from the first few words they were heading somewhere unhelpful, and I wasn't willing to sit through two minutes of generation to confirm it.

What's most likely happening here is context saturation compounded by VRAM constraints. The Quadro P1000's 4GB is enough to hold the base model, but the key-value (KV) cache, which stores the attention keys and values for every token already in the context, grows with every token of every exchange. Once it overflows available VRAM, it spills to system RAM, and the CPU has to handle an increasing share of the inference work. Each new token also has to attend over everything already in the cache, so per-token generation time climbs as the context grows, and processing a long prompt from scratch scales worse than linearly. At the same time, the model's effective attention on earlier instructions weakens as the window fills. The architectural decisions you established at the start of a long session carry less and less weight by the time you're two hours in. What looks like the model "forgetting" your requirements is the context window working against you on hardware that can't keep up with it.
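
The cache growth can be estimated with a back-of-the-envelope formula: one key tensor and one value tensor per layer, per KV head, per token. The sketch below applies that formula. The layer count, head count, and head dimension are hypothetical placeholders, not Gemma 4 E4B's actual architecture, and real runtimes can shrink the cache with techniques such as cache quantisation or sliding-window attention.

```python
def kv_cache_bytes(tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size: keys + values for every layer and token."""
    # Leading 2 = one key tensor plus one value tensor; bytes_per_value=2 is fp16.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * tokens


def tokens_until_spill(headroom_bytes: int, n_layers: int, n_kv_heads: int,
                       head_dim: int, bytes_per_value: int = 2) -> int:
    """How many tokens fit in the given VRAM headroom before the cache overflows."""
    per_token = kv_cache_bytes(1, n_layers, n_kv_heads, head_dim, bytes_per_value)
    return headroom_bytes // per_token


if __name__ == "__main__":
    # Hypothetical architecture, chosen only to make the arithmetic concrete.
    layers, kv_heads, dim = 32, 8, 128
    per_token = kv_cache_bytes(1, layers, kv_heads, dim)
    print(f"{per_token / 1024:.0f} KiB of cache per token")
    print(f"{tokens_until_spill(1024 ** 3, layers, kv_heads, dim)} tokens fit in 1 GiB")
```

With those placeholder values each token costs 128 KiB of cache, so 1 GiB of headroom is used up after about 8,192 tokens. A long back-and-forth session with code in it gets there sooner than you'd think.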

The correction loop was the clearest symptom. Each correction added more tokens to the context. More tokens meant slower responses and a progressively greater distance between the current generation and the original requirements. I was spending more time correcting hallucinations than making forward progress.
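
A toy model of that loop, as a sketch only: every round of "wrong answer, correction, acknowledgement" appends tokens, so the original requirements make up a shrinking share of the context. The token counts are made-up illustrative numbers, and token share is a crude proxy of mine, not a measurement of how much attention the model actually pays to the requirements.

```python
def context_growth(requirement_tokens: int, tokens_per_round: int,
                   rounds: int) -> list[tuple[int, int, float]]:
    """Return (round, total context tokens, requirements' share of context)."""
    rows = []
    for r in range(rounds + 1):
        total = requirement_tokens + tokens_per_round * r
        rows.append((r, total, requirement_tokens / total))
    return rows


if __name__ == "__main__":
    # Hypothetical: a 500-token brief, 1,500 tokens added per correction round.
    for r, total, share in context_growth(500, 1500, 5):
        print(f"round {r}: {total} tokens in context, brief is {share:.0%} of it")
```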

The conclusion

Eventually I called it. The session had reached a point where the ratio of useful output to time spent correcting had inverted. The model wasn't getting worse at reasoning. It was running out of the resources it needed to reason well over a long context on constrained hardware.

A few things became clear from the exercise:

It also reinforced why the dedicated build matters. Not because the laptop can't run a local model at all — it can — but because "runs" and "useful for extended technical work" are meaningfully different thresholds.

The hardware is still on its way. Part II will be more interesting.
