LLM on a Budget series
- Part I — The hardware plan
- Interlude — Gemma 4 on laptop hardware
- Part II — The build
- Part III — Software, the 4am saga, and benchmarks
- This page — Model comparison deep dive
The setup
All four models were given the identical prompt on the same hardware — the budget build from this series, RTX 2060 Super (8GB VRAM), Ollama, 32768 context window, num_predict unlocked. The prompt:
# The prompt given to all four models, verbatim
I'm building a multi-tenant SaaS platform with the following stack: React frontend,
Python FastAPI backend, PostgreSQL (one database per tenant), Redis for session
management, and Celery for background jobs. Everything runs in Docker except
PostgreSQL, which runs in dedicated VMs — one per tenant for isolation.
I need a comprehensive technical specification document covering:
1. Complete database schema for tenant management, including tables for tenants,
users, roles, permissions, audit logs, and billing records — with full column
definitions, types, and relationships
2. The full authentication and authorization flow, including JWT structure,
refresh token handling, and role-based access control implementation
3. API endpoint specification for all CRUD operations on tenants, users, and the
core business objects (assume the platform manages "projects" and "tasks")
4. Docker Compose configuration for the full stack
5. Background job definitions for tenant provisioning, billing reconciliation,
and data export
6. A complete security review covering every component, with specific mitigations
Be exhaustive and thorough in every section. Provide complete code examples,
not summaries.
Two things in this prompt matter more than they might look like at first glance:
- The PostgreSQL constraint. "Everything runs in Docker except PostgreSQL, which runs in dedicated VMs" is a specific, testable architectural rule. Any response that includes PostgreSQL as a Docker Compose service has failed to follow the brief — regardless of how good the rest of the response is.
- "One database per tenant" as a routing problem. If each tenant's data lives in a separate database on a separate VM, the application needs some way to work out, at request time, which database to talk to for the current request. This is the actual hard part of the architecture — everything else is fairly standard SaaS plumbing.
The scorecard
| Model | Speed | VRAM | PostgreSQL constraint followed | Solved the "tenant switch" routing problem | Code runs as written |
|---|---|---|---|---|---|
| gemma4:e4b | ~59 t/s | 3.6GB | ✓ Yes | ✗ No | Mostly |
| gemma4:31b | ~1.9 t/s | — | ✓ Yes | ✓ Yes | Mostly |
| qwen2.5-coder:7b | ~53 t/s | 6.2GB | ✗ No | ✗ No | Too thin to assess |
| qwen2.5-coder:32b | ~2.2 t/s | 6.5GB | ✗ No | ✗ No | ✗ No |
gemma4:e4b — the best balance
gemma4:e4b (7.5B params, Q4_K_M)
~59 t/s · 3.6GB VRAM · 7,081 output tokensBefore writing a word of the response, e4b produced 16 seconds of visible reasoning — including a genuine self-correction. It initially considered placing tenant application data in the central metadata database, then explicitly reasoned that this was wrong: application data belongs in the per-tenant isolated PostgreSQL instance, not the shared metadata store. That's the kind of mid-thought correction you want to see.
Section 1 — Database schema
Full CREATE TABLE statements for all eight required tables, plus a sensible addition: a separate role_permissions many-to-many junction table — a detail none of the other three models included, despite "permissions" being explicitly named in the brief.
Section 2 — Authentication
JWT structure with tenant_id embedded in the claims. Refresh tokens correctly implemented as opaque random strings stored in Redis — not JWTs — with explicit rotation on use. RBAC implemented as FastAPI dependency-injection pseudocode that is structurally correct, even if not 100% runnable as-is.
Section 3 — API specification
Proper tabular format: method, path, auth requirement, request body, response shape. Explicit note that every endpoint must scope queries by tenant_id — carrying the isolation requirement from section 1 through to the API layer.
Section 4 — Docker Compose
This is the section that separates e4b from both qwen models. e4b's Compose file correctly omits PostgreSQL as a containerised service. Only a metadata_db (the central, shared metadata store) appears in Docker; the comments explicitly state that tenant databases live on dedicated VMs outside the Docker Compose stack — exactly matching the brief. redis, backend, celeryworker, and frontend are all present with healthchecks and environment variables.
Section 5 — Celery jobs
Full, working-style Python for provision_tenant, reconcile_billing, and export_data, including retry/backoff logic on the provisioning task.
Section 6 — Security review
Five sub-tables (Infrastructure/Network, Auth, Backend, Frontend, DevOps), each with vulnerability/mitigation/implementation columns. Generic in places, but every row is at least plausibly tied to the stated stack.
Verdict
The best all-round result. Fast, complete, internally consistent, and the only model in this comparison that got the PostgreSQL constraint right and ran fast enough to be usable in a real working session. Its one gap — not addressing how the application finds the correct tenant database at request time — is the thing gemma4:31b solved instead.
gemma4:31b — best ideas, 34 minutes
gemma4:31b (31B params)
~1.9 t/s · 3,936 output tokens · 33m 59s · 225s reasoningThis model spent 225 seconds thinking before writing anything — nearly four minutes of visible reasoning, working through the architecture, the schema, the auth flow, and explicitly noting "How is the VM boundary handled?" as an open question to resolve. Then it took a further 30+ minutes to write the response itself.
The standout: the "tenant switch" pattern
Every other model in this comparison treated "one database per tenant on dedicated VMs" as something to mention in the Docker Compose file and otherwise ignore. gemma4:31b is the only model that recognised it as a routing problem: how does the running application know which database to talk to for a given request?
Its solution: the central tenants table stores db_host, db_name, and db_user for each tenant. The JWT carries the authenticated user's tenant_id. On each request, the backend looks up that tenant's connection details from the central table and dynamically selects (or creates) a connection pool for that specific VM. This is, genuinely, the crux of the architecture the brief described — and three other models, including two "coding" specialists, didn't address it at all.
# gemma4:31b's tenants table — the core of the "tenant switch" design
CREATE TABLE tenants (
tenant_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name VARCHAR(255) NOT NULL,
slug VARCHAR(64) UNIQUE NOT NULL,
db_host VARCHAR(255) NOT NULL, -- IP/hostname of the dedicated VM
db_name VARCHAR(64) NOT NULL,
db_user VARCHAR(64) NOT NULL,
plan_id UUID REFERENCES plans(id),
status VARCHAR(20) CHECK (status IN ('active','suspended','deleted')),
created_at TIMESTAMP DEFAULT now()
);
Section 4 — Docker Compose
Like e4b, correctly omits tenant PostgreSQL. Only postgres-central (the control plane database) is containerised — consistent with the brief.
Section 6 — Security review
The most architecture-specific of all four responses. Each row addresses a risk that's specific to this design: the Silo isolation pattern, JWT-based tenant-switch integrity (preventing a client from spoofing tenant_id in a header to access another tenant's data), VPC peering requirements for the dedicated database VMs, and a deliberately under-privileged service account for the provisioning worker.
Where it falls short of e4b
Section 1 is presented as markdown tables for some entities rather than full CREATE TABLE SQL throughout — less "complete code" than requested in places. The Celery code in section 5 references db and model classes without defining them, similar looseness to the other models.
Verdict
The most architecturally sophisticated response by a clear margin — the only one to solve the actual hard problem. But 34 minutes for a single response (after 4 minutes of thinking) makes this a "leave it running and check back later" tool, not something usable in an interactive session. On this hardware, that's the trade-off: best ideas, worst latency.
qwen2.5-coder:7b — fast, but too thin
qwen2.5-coder:7b
~53 t/s · 6.2GB VRAM · 1,493 output tokensAt first glance, qwen2.5-coder:7b's result looked like the most efficient of the four: it covered all six section headings in under a quarter of the tokens that gemma4:e4b used, at a comparable speed. Reading the actual content changes that assessment considerably.
Section 1 — Database schema
Present, but thin. The roles and permissions tables have no tenant_id scoping at all — in a multi-tenant system, this means roles and permissions would be shared globally rather than scoped per tenant, which is a significant gap given the brief's emphasis on tenant isolation. The audit log is a single generic JSONB blob with no structure.
Section 2 — Authentication
Bare JWT payload shapes for access and refresh tokens — both shown as JWTs, including the refresh token, which is a weaker pattern than e4b's Redis-stored opaque token. No refresh rotation described. No RBAC implementation at all, despite RBAC being explicitly named in the brief.
Section 3 — API specification
A route list: method and path only. No request/response schemas, no Pydantic models, no indication of what each endpoint accepts or returns.
Section 4 — Docker Compose
# qwen2.5-coder:7b's Docker Compose — note the 'db' service
services:
frontend:
build: ./frontend
ports: ["3000:3000"]
depends_on: [backend]
backend:
build: ./backend
environment:
- DATABASE_URL=postgres://user:password@db/tenant_db
depends_on: [db]
redis:
image: "redis:latest"
db: # ← PostgreSQL as a Docker service
image: "postgres:13" # ← directly contradicts the brief
environment:
- POSTGRES_USER=user
- POSTGRES_PASSWORD=password
- POSTGRES_DB=tenant_db
volumes:
- db_data:/var/lib/postgresql/data
This is the central problem with this response. The brief states, in plain language, that "everything runs in Docker except PostgreSQL, which runs in dedicated VMs." This Compose file includes PostgreSQL as a Docker service named db. It also omits Celery entirely — despite Celery being named as part of the stack in the prompt's first sentence, and section 5 explicitly asking for Celery job definitions.
Section 5 — Background jobs
Three bullet-point descriptions ("Create a new database for the tenant. Initialize the schema.") with no code at all, despite the brief's explicit instruction: "provide complete code examples, not summaries."
Section 6 — Security review
Generic bullet points — "Use Redis ACLs," "Implement rate limiting on login attempts" — not tied to the specific architecture described in the prompt.
Verdict
"Completed all six sections" only in the sense that six headings appeared. The content under each heading is a fraction of what was asked for, several explicit instructions ("complete code examples, not summaries") were not followed, and the Docker Compose file directly contradicts the architecture described in the same prompt. The brevity that initially looked like efficiency was incompleteness.
qwen2.5-coder:32b — slow, and the code doesn't run
qwen2.5-coder:32b
~2.2 t/s · 6.5GB VRAM · 3,423 output tokens · 25m 37s26 times slower than e4b, qwen2.5-coder:32b produced a longer response than the 7B version, with more code overall. Unfortunately, more code meant more opportunities for the code to be wrong — and several pieces of it are not just incomplete, but would fail outright if run.
Section 1 — Database schema
An improvement on the 7B version: roles includes tenant_id. However, permissions uses a flat resource_name + action pair with only a role_id link, rather than a proper many-to-many role_permissions table like e4b's design.
Section 2 — Authentication: a real bug
# Earlier in the response, TokenData is constructed like this:
token_data = TokenData(user_id=user_id)
# ...but get_current_user() later does this:
async def get_current_user(token_data: TokenData = Depends(verify_token)):
user = db.query(User).filter(User.username == token_data.username).first()
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# TokenData has no 'username' attribute — only 'user_id' was set above.
# This raises an AttributeError at runtime, every time.
Section 3 — Broken dependency injection
# How it's defined:
def has_permission(required_permissions: list[str], user: User) -> bool:
return all(p in user.permissions for p in required_permissions)
# How it's used as a FastAPI dependency:
@app.get("/projects", dependencies=[Depends(has_permission(["read:project"]))])
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# has_permission(["read:project"]) is CALLED IMMEDIATELY here, with only one
# argument — 'user' is missing — and its return value (a bool, or a
# TypeError) is passed to Depends(), not a callable dependency function.
# This does not work as written.
Section 3 — Model confusion
# Tenant is defined as a Pydantic schema (request/response model)...
class Tenant(BaseModel):
name: str
# ...but is then used as if it were a SQLAlchemy ORM model:
db_tenant = Tenant(name=tenant.name)
db.add(db_tenant) # Pydantic models have no .add() relationship with db
db.commit() # 'db' is also never defined in this scope
Section 4 — Docker Compose
The same fundamental issue as the 7B version: db: image: postgres:12 appears as a Docker Compose service, again contradicting the brief's explicit instruction that PostgreSQL runs on dedicated VMs, not in Docker.
Section 5 — Celery jobs
Present but minimal — task functions defined with pass as their entire body for provisioning and billing reconciliation. Doesn't meet "complete code examples, not summaries."
Section 6 — Security review
Genuinely the strongest section of this response — reasonable breadth across Redis, Celery, PostgreSQL, and the frontend, each with specific mitigations. If the rest of the response matched this section's quality, this would be a much stronger result overall.
Verdict
Slowest of the four by a wide margin, and the only response containing code that would throw a runtime error if executed — not just incomplete, but actively broken. Larger parameter count and a "coder" specialisation did not translate into more reliable code in this test.
What this actually tells us
A few things stood out clearly enough to be worth stating plainly:
- "Coder" in the model name didn't mean better code. Both qwen2.5-coder models missed an explicit architectural constraint stated in plain English in the first paragraph of the prompt. The 32B version additionally produced code with real bugs — an undefined attribute access, a broken dependency-injection pattern, and a Pydantic/ORM model confusion that would fail immediately if run.
- Bigger wasn't faster, and wasn't more correct. Both 30B+ class models (qwen 32b and gemma 31b) ran at roughly 2 t/s — 25–34 minutes for a single response on this hardware. Of the two, only gemma4:31b produced code free of the kind of bugs found in qwen's 32B response.
- The one model that solved the hardest part of the brief was also the slowest. gemma4:31b's "tenant switch" design was the only response that addressed how the running application actually finds the correct database for a given tenant at request time — arguably the central challenge the prompt was describing. It took 34 minutes to get there.
- For this hardware, gemma4:e4b is the practical choice. Fast enough to be usable in a real session, correct on the architectural constraint that both specialists missed, and complete without needing the context-window and
num_predictfixes to be pushed to their absolute limits.
The takeaway for anyone building similar hardware: don't assume a "coding" model will handle a coding task better than a general-purpose model of similar size, and don't assume a bigger model is worth the wait unless the task genuinely needs the kind of architectural reasoning that only showed up in the slowest response here. Test on your own representative prompts — the labels on the model names didn't predict the outcome in this case.
