AI & Hardware — Deep Dive LLM on a Budget — companion

Local LLM model comparison: gemma4 vs qwen2.5-coder on a real spec

Four models, one demanding technical specification prompt, and a result that runs against the usual assumptions: the general-purpose model beat both code-specialist models — on speed, on following the brief, and on not writing code that's actually broken.

The setup

All four models were given the identical prompt on the same hardware — the budget build from this series, RTX 2060 Super (8GB VRAM), Ollama, 32768 context window, num_predict unlocked. The prompt:

# The prompt given to all four models, verbatim
I'm building a multi-tenant SaaS platform with the following stack: React frontend,
Python FastAPI backend, PostgreSQL (one database per tenant), Redis for session
management, and Celery for background jobs. Everything runs in Docker except
PostgreSQL, which runs in dedicated VMs — one per tenant for isolation.

I need a comprehensive technical specification document covering:
1. Complete database schema for tenant management, including tables for tenants,
   users, roles, permissions, audit logs, and billing records — with full column
   definitions, types, and relationships
2. The full authentication and authorization flow, including JWT structure,
   refresh token handling, and role-based access control implementation
3. API endpoint specification for all CRUD operations on tenants, users, and the
   core business objects (assume the platform manages "projects" and "tasks")
4. Docker Compose configuration for the full stack
5. Background job definitions for tenant provisioning, billing reconciliation,
   and data export
6. A complete security review covering every component, with specific mitigations

Be exhaustive and thorough in every section. Provide complete code examples,
not summaries.

Two things in this prompt matter more than they might look like at first glance:

The scorecard

Model Speed VRAM PostgreSQL constraint followed Solved the "tenant switch" routing problem Code runs as written
gemma4:e4b ~59 t/s 3.6GB ✓ Yes ✗ No Mostly
gemma4:31b ~1.9 t/s ✓ Yes ✓ Yes Mostly
qwen2.5-coder:7b ~53 t/s 6.2GB ✗ No ✗ No Too thin to assess
qwen2.5-coder:32b ~2.2 t/s 6.5GB ✗ No ✗ No ✗ No

gemma4:e4b — the best balance

gemma4:e4b (7.5B params, Q4_K_M)

~59 t/s · 3.6GB VRAM · 7,081 output tokens

Before writing a word of the response, e4b produced 16 seconds of visible reasoning — including a genuine self-correction. It initially considered placing tenant application data in the central metadata database, then explicitly reasoned that this was wrong: application data belongs in the per-tenant isolated PostgreSQL instance, not the shared metadata store. That's the kind of mid-thought correction you want to see.

Section 1 — Database schema

Full CREATE TABLE statements for all eight required tables, plus a sensible addition: a separate role_permissions many-to-many junction table — a detail none of the other three models included, despite "permissions" being explicitly named in the brief.

Section 2 — Authentication

JWT structure with tenant_id embedded in the claims. Refresh tokens correctly implemented as opaque random strings stored in Redis — not JWTs — with explicit rotation on use. RBAC implemented as FastAPI dependency-injection pseudocode that is structurally correct, even if not 100% runnable as-is.

Section 3 — API specification

Proper tabular format: method, path, auth requirement, request body, response shape. Explicit note that every endpoint must scope queries by tenant_id — carrying the isolation requirement from section 1 through to the API layer.

Section 4 — Docker Compose

This is the section that separates e4b from both qwen models. e4b's Compose file correctly omits PostgreSQL as a containerised service. Only a metadata_db (the central, shared metadata store) appears in Docker; the comments explicitly state that tenant databases live on dedicated VMs outside the Docker Compose stack — exactly matching the brief. redis, backend, celeryworker, and frontend are all present with healthchecks and environment variables.

Section 5 — Celery jobs

Full, working-style Python for provision_tenant, reconcile_billing, and export_data, including retry/backoff logic on the provisioning task.

Section 6 — Security review

Five sub-tables (Infrastructure/Network, Auth, Backend, Frontend, DevOps), each with vulnerability/mitigation/implementation columns. Generic in places, but every row is at least plausibly tied to the stated stack.

Verdict

The best all-round result. Fast, complete, internally consistent, and the only model in this comparison that got the PostgreSQL constraint right and ran fast enough to be usable in a real working session. Its one gap — not addressing how the application finds the correct tenant database at request time — is the thing gemma4:31b solved instead.

gemma4:31b — best ideas, 34 minutes

gemma4:31b (31B params)

~1.9 t/s · 3,936 output tokens · 33m 59s · 225s reasoning

This model spent 225 seconds thinking before writing anything — nearly four minutes of visible reasoning, working through the architecture, the schema, the auth flow, and explicitly noting "How is the VM boundary handled?" as an open question to resolve. Then it took a further 30+ minutes to write the response itself.

The standout: the "tenant switch" pattern

Every other model in this comparison treated "one database per tenant on dedicated VMs" as something to mention in the Docker Compose file and otherwise ignore. gemma4:31b is the only model that recognised it as a routing problem: how does the running application know which database to talk to for a given request?

Its solution: the central tenants table stores db_host, db_name, and db_user for each tenant. The JWT carries the authenticated user's tenant_id. On each request, the backend looks up that tenant's connection details from the central table and dynamically selects (or creates) a connection pool for that specific VM. This is, genuinely, the crux of the architecture the brief described — and three other models, including two "coding" specialists, didn't address it at all.

# gemma4:31b's tenants table — the core of the "tenant switch" design
CREATE TABLE tenants (
    tenant_id   UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name        VARCHAR(255) NOT NULL,
    slug        VARCHAR(64)  UNIQUE NOT NULL,
    db_host     VARCHAR(255) NOT NULL,  -- IP/hostname of the dedicated VM
    db_name     VARCHAR(64)  NOT NULL,
    db_user     VARCHAR(64)  NOT NULL,
    plan_id     UUID REFERENCES plans(id),
    status      VARCHAR(20)  CHECK (status IN ('active','suspended','deleted')),
    created_at  TIMESTAMP DEFAULT now()
);

Section 4 — Docker Compose

Like e4b, correctly omits tenant PostgreSQL. Only postgres-central (the control plane database) is containerised — consistent with the brief.

Section 6 — Security review

The most architecture-specific of all four responses. Each row addresses a risk that's specific to this design: the Silo isolation pattern, JWT-based tenant-switch integrity (preventing a client from spoofing tenant_id in a header to access another tenant's data), VPC peering requirements for the dedicated database VMs, and a deliberately under-privileged service account for the provisioning worker.

Where it falls short of e4b

Section 1 is presented as markdown tables for some entities rather than full CREATE TABLE SQL throughout — less "complete code" than requested in places. The Celery code in section 5 references db and model classes without defining them, similar looseness to the other models.

Verdict

The most architecturally sophisticated response by a clear margin — the only one to solve the actual hard problem. But 34 minutes for a single response (after 4 minutes of thinking) makes this a "leave it running and check back later" tool, not something usable in an interactive session. On this hardware, that's the trade-off: best ideas, worst latency.

qwen2.5-coder:7b — fast, but too thin

qwen2.5-coder:7b

~53 t/s · 6.2GB VRAM · 1,493 output tokens

At first glance, qwen2.5-coder:7b's result looked like the most efficient of the four: it covered all six section headings in under a quarter of the tokens that gemma4:e4b used, at a comparable speed. Reading the actual content changes that assessment considerably.

Section 1 — Database schema

Present, but thin. The roles and permissions tables have no tenant_id scoping at all — in a multi-tenant system, this means roles and permissions would be shared globally rather than scoped per tenant, which is a significant gap given the brief's emphasis on tenant isolation. The audit log is a single generic JSONB blob with no structure.

Section 2 — Authentication

Bare JWT payload shapes for access and refresh tokens — both shown as JWTs, including the refresh token, which is a weaker pattern than e4b's Redis-stored opaque token. No refresh rotation described. No RBAC implementation at all, despite RBAC being explicitly named in the brief.

Section 3 — API specification

A route list: method and path only. No request/response schemas, no Pydantic models, no indication of what each endpoint accepts or returns.

Section 4 — Docker Compose

# qwen2.5-coder:7b's Docker Compose — note the 'db' service
services:
  frontend:
    build: ./frontend
    ports: ["3000:3000"]
    depends_on: [backend]

  backend:
    build: ./backend
    environment:
      - DATABASE_URL=postgres://user:password@db/tenant_db
    depends_on: [db]

  redis:
    image: "redis:latest"

  db:                          # ← PostgreSQL as a Docker service
    image: "postgres:13"       # ← directly contradicts the brief
    environment:
      - POSTGRES_USER=user
      - POSTGRES_PASSWORD=password
      - POSTGRES_DB=tenant_db
    volumes:
      - db_data:/var/lib/postgresql/data

This is the central problem with this response. The brief states, in plain language, that "everything runs in Docker except PostgreSQL, which runs in dedicated VMs." This Compose file includes PostgreSQL as a Docker service named db. It also omits Celery entirely — despite Celery being named as part of the stack in the prompt's first sentence, and section 5 explicitly asking for Celery job definitions.

Section 5 — Background jobs

Three bullet-point descriptions ("Create a new database for the tenant. Initialize the schema.") with no code at all, despite the brief's explicit instruction: "provide complete code examples, not summaries."

Section 6 — Security review

Generic bullet points — "Use Redis ACLs," "Implement rate limiting on login attempts" — not tied to the specific architecture described in the prompt.

Verdict

"Completed all six sections" only in the sense that six headings appeared. The content under each heading is a fraction of what was asked for, several explicit instructions ("complete code examples, not summaries") were not followed, and the Docker Compose file directly contradicts the architecture described in the same prompt. The brevity that initially looked like efficiency was incompleteness.

qwen2.5-coder:32b — slow, and the code doesn't run

qwen2.5-coder:32b

~2.2 t/s · 6.5GB VRAM · 3,423 output tokens · 25m 37s

26 times slower than e4b, qwen2.5-coder:32b produced a longer response than the 7B version, with more code overall. Unfortunately, more code meant more opportunities for the code to be wrong — and several pieces of it are not just incomplete, but would fail outright if run.

Section 1 — Database schema

An improvement on the 7B version: roles includes tenant_id. However, permissions uses a flat resource_name + action pair with only a role_id link, rather than a proper many-to-many role_permissions table like e4b's design.

Section 2 — Authentication: a real bug

# Earlier in the response, TokenData is constructed like this:
token_data = TokenData(user_id=user_id)

# ...but get_current_user() later does this:
async def get_current_user(token_data: TokenData = Depends(verify_token)):
    user = db.query(User).filter(User.username == token_data.username).first()
    #                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    # TokenData has no 'username' attribute — only 'user_id' was set above.
    # This raises an AttributeError at runtime, every time.

Section 3 — Broken dependency injection

# How it's defined:
def has_permission(required_permissions: list[str], user: User) -> bool:
    return all(p in user.permissions for p in required_permissions)

# How it's used as a FastAPI dependency:
@app.get("/projects", dependencies=[Depends(has_permission(["read:project"]))])
#                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# has_permission(["read:project"]) is CALLED IMMEDIATELY here, with only one
# argument — 'user' is missing — and its return value (a bool, or a
# TypeError) is passed to Depends(), not a callable dependency function.
# This does not work as written.

Section 3 — Model confusion

# Tenant is defined as a Pydantic schema (request/response model)...
class Tenant(BaseModel):
    name: str

# ...but is then used as if it were a SQLAlchemy ORM model:
db_tenant = Tenant(name=tenant.name)
db.add(db_tenant)      # Pydantic models have no .add() relationship with db
db.commit()            # 'db' is also never defined in this scope

Section 4 — Docker Compose

The same fundamental issue as the 7B version: db: image: postgres:12 appears as a Docker Compose service, again contradicting the brief's explicit instruction that PostgreSQL runs on dedicated VMs, not in Docker.

Section 5 — Celery jobs

Present but minimal — task functions defined with pass as their entire body for provisioning and billing reconciliation. Doesn't meet "complete code examples, not summaries."

Section 6 — Security review

Genuinely the strongest section of this response — reasonable breadth across Redis, Celery, PostgreSQL, and the frontend, each with specific mitigations. If the rest of the response matched this section's quality, this would be a much stronger result overall.

Verdict

Slowest of the four by a wide margin, and the only response containing code that would throw a runtime error if executed — not just incomplete, but actively broken. Larger parameter count and a "coder" specialisation did not translate into more reliable code in this test.

What this actually tells us

A few things stood out clearly enough to be worth stating plainly:

The takeaway for anyone building similar hardware: don't assume a "coding" model will handle a coding task better than a general-purpose model of similar size, and don't assume a bigger model is worth the wait unless the task genuinely needs the kind of architectural reasoning that only showed up in the slowest response here. Test on your own representative prompts — the labels on the model names didn't predict the outcome in this case.

← Back to Part III All posts →