Pavel (Pasha) Simakov - Testing the Scaffolding, Not the Brain: A Pattern for Reliable LLM Systems

by Pasha Simakov, 2026-04-30

LLMs challenge one of the oldest assumptions in software testing: same input, same output. I have been looking at this through the architecture of gbrain and its evaluation suite, gbrain-evals. They offer a useful case study in how to test an LLM-backed system without putting the model in the critical path of every test.

Traditional software systems are built around deterministic assertions enforced fully automatically by Continuous Integration (CI) pipelines. But LLM-backed systems can produce different wording, structure, or reasoning paths for the same prompt. If we test them with exact string matches, tests become flaky. If we test them only with another LLM, we add more probabilistic behavior to the system.

This matters because many teams are building artificial intelligence (AI) systems whose tests are either too brittle or too vague. Exact output tests fail on harmless wording changes. LLM-as-judge tests can be slow, expensive, and themselves inconsistent. Without better evaluation patterns, teams lose trust in their own CI pipeline.

The better approach is not to force the LLM to become deterministic. It is to design the surrounding system so that most of the work remains deterministic, testable, and observable. Here are a few patterns that gbrain and gbrain-evals use to achieve reliable, deterministic testing.

1. Architectural Separation: "Thin Harness, Fat Skills"

Reliable testing requires separating mechanical execution from probabilistic reasoning.

The architecture divides operations into two domains:

Deterministic Space: Data retrieval, tool execution, and state mutation. This is the "Thin Harness" written in traditional code such as TypeScript and SQL. It is predictable enough to test with conventional unit and integration tests.
Latent Space: Intelligence, synthesis, and classification. This is the domain of the LLM.

The system relies on "Fat Skills"—documents that define procedures. The skill dictates how to solve a problem, but execution occurs entirely in the deterministic harness.

A key implementation detail is the "Deterministic Collector" pattern. Tasks with a single correct answer (e.g., fetching a calendar event or parsing an API response) should not be assigned to the LLM when they can be handled by deterministic code. These tasks run as deterministic scripts or service calls. The LLM is invoked only to apply judgment to the collected data.
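To make the collector idea concrete, here is a minimal Python sketch (the harness itself is described as TypeScript; Python is used here only for brevity). The names `CalendarEvent`, `collect_events`, and `summarize_day`, and the shape of the raw records, are my own assumptions for illustration, not gbrain's actual API. The point is that the model is passed in as a plain function, so everything before it can be unit-tested without a model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class CalendarEvent:
    title: str
    start: str  # ISO 8601 timestamp, so string order matches time order
    attendees: tuple[str, ...]


def collect_events(raw: list[dict]) -> list[CalendarEvent]:
    """Deterministic collector: parse raw records into typed, sorted events."""
    events = [
        CalendarEvent(r["title"], r["start"], tuple(sorted(r.get("attendees", []))))
        for r in raw
    ]
    return sorted(events, key=lambda e: (e.start, e.title))


def summarize_day(raw: list[dict], judge: Callable[[str], str]) -> str:
    """Only the final judgment step touches the model; `judge` is injected."""
    events = collect_events(raw)
    prompt = "\n".join(
        f"{e.start} {e.title} ({', '.join(e.attendees)})" for e in events
    )
    return judge(prompt)
```

In a test, `judge` can be any stub, for example `lambda prompt: prompt`, which lets the assertions cover parsing and ordering while the real model call stays out of the critical path.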

By pushing the majority of operations into deterministic code, engineers can write standard unit tests for the ingestion, routing, and storage layers, isolating non-determinism to the edges of the pipeline.

2. Synthetic Static Corpora

Testing a retrieval-augmented generation (RAG) system against live APIs or dynamic databases introduces test drift. To solve this, the evaluation suite uses committed, static datasets.

The primary benchmark corpus (world-v1) contains hundreds of fictional entities (people, companies, meetings). This dataset was generated once by an LLM from a fixed seed. The resulting JSON files are committed to the repository and act as an immutable baseline.

For ingestion testing, a secondary dataset of raw inputs (emails, chat messages, and transcripts) is used. The engineering team planted specific, hardcoded perturbations within this data—such as a fact changing between an email and a subsequent meeting transcript. These seeded contradictions serve as deterministic integration tests, verifying that the ingestion and graph resolution logic correctly handles conflicting temporal data.
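A planted contradiction can be checked with a very small amount of code. The sketch below is an illustration under stated assumptions: the `Claim` record, the entity IDs, and the "most recent observation wins" policy are all hypothetical, since the article does not specify how gbrain resolves conflicts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    entity_id: str
    field: str
    value: str
    observed_at: str  # ISO 8601 date, so string order matches time order
    source: str


def resolve_latest(claims: list[Claim]) -> dict[tuple[str, str], Claim]:
    """Keep the most recent claim for each (entity, field) pair.

    Assumed policy for this sketch: the latest observation wins.
    """
    resolved: dict[tuple[str, str], Claim] = {}
    for claim in claims:
        key = (claim.entity_id, claim.field)
        current = resolved.get(key)
        if current is None or claim.observed_at > current.observed_at:
            resolved[key] = claim
    return resolved


# Planted contradiction: an email and a later transcript disagree on a title.
PLANTED = [
    Claim("person-001", "title", "VP Engineering", "2026-01-10", "transcript"),
    Claim("person-001", "title", "Director of Engineering", "2026-01-03", "email"),
]
```

Because the contradiction is hardcoded, the expected winner is known in advance, and the test can also feed the claims in reverse order to confirm that the result does not depend on ingestion order.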

3. Evaluating Scaffolding via Ground Truth Metadata

When benchmarking the system, the evaluation suite does not use an "LLM-as-a-judge" to evaluate answers. It asserts against hardcoded metadata.

Every document in the synthetic corpus contains a hidden _facts block. For example, a meeting document explicitly lists the IDs of its attendees.

When the test suite runs the query "Who attended the Strategy Meeting?", it compares the array of entity IDs returned by the retrieval engine (whether vector search, keyword search, or graph-traversal) directly against the _facts array.

This exact matching of entity IDs allows the suite to objectively score different retrieval implementations on precision and recall. Because the corpus is static and the retrieval logic is deterministic code, the benchmark expects a standard deviation of exactly zero across runs. Any variance in a deterministic adapter indicates an order-dependent bug in the code, not an LLM hallucination.
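Scoring against ground truth reduces to set arithmetic over IDs. The following Python sketch shows one way to do it; the document layout, the `attendees` key inside `_facts`, and the entity IDs are assumptions made for illustration, and returning 0.0 for an empty result set is a convention I chose, not something the suite is known to do.

```python
def precision_recall(
    returned_ids: list[str], expected_ids: list[str]
) -> tuple[float, float]:
    """Compare retrieved entity IDs against ground-truth IDs as sets."""
    returned, expected = set(returned_ids), set(expected_ids)
    hits = len(returned & expected)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical corpus document with its hidden ground-truth block.
meeting_doc = {
    "id": "meeting-042",
    "title": "Strategy Meeting",
    "_facts": {"attendees": ["person-001", "person-007", "person-019"]},
}

# Hypothetical retrieval result: two correct attendees and one stray company.
retrieved = ["person-007", "person-001", "company-003"]
precision, recall = precision_recall(retrieved, meeting_doc["_facts"]["attendees"])
```

Here two of the three returned IDs are correct and two of the three true attendees were found, so both scores come out to 2/3. No model is involved in producing or checking that number.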

4. Managing Unavoidable Variance

While the architecture isolates the LLM, external vector embedding APIs introduce floating-point variance. This is mitigated through specific test runner configurations.

To account for embedding non-determinism, evaluations are run multiple times and the scores are averaged. To ensure retrieval algorithms are not succeeding due to the sequential order of file ingestion, the runner shuffles the ingestion order. Crucially, it uses a seeded randomizer. This surfaces order-dependent bugs while maintaining reproducibility across developer machines.
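A seeded, repeated run might look roughly like the sketch below. The function name `run_benchmark`, its parameters, and the default seed are hypothetical; `evaluate` stands in for whatever ingests the documents in the given order and returns a score.

```python
import random
import statistics
from typing import Callable


def run_benchmark(
    docs: list[str],
    evaluate: Callable[[list[str]], float],
    runs: int = 5,
    seed: int = 42,
) -> tuple[float, float]:
    """Score `evaluate` over several seeded shuffles.

    Returns (mean score, population standard deviation).
    """
    rng = random.Random(seed)  # local generator; global random state is untouched
    scores = []
    for _ in range(runs):
        order = docs[:]  # copy, so the caller's list is never mutated
        rng.shuffle(order)
        scores.append(evaluate(order))
    return statistics.fmean(scores), statistics.pstdev(scores)
```

With a fixed seed, every machine sees the same sequence of shuffles, so a failure caused by one particular ingestion order can be replayed exactly. For a deterministic adapter, a nonzero standard deviation is then a signal of order dependence rather than noise.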

5. The Fail-Improve Loop for Classifiers

For tasks requiring classification—such as inferring the relationship between two entities from raw text—the system implements a deterministic-first fallback pattern.

It first attempts to classify the relationship using deterministic regular expressions (e.g., parsing "founded by" or "invested in"). If the deterministic code fails to match, it falls back to the LLM.

When the LLM successfully extracts the relationship, the system logs the input and the LLM's output. Engineers use these logs to write new regular expressions, generating new deterministic test cases. This "fail-improve" loop ensures that the deterministic test suite continuously expands, gradually reducing the system’s reliance on probabilistic execution for repeatable cases.
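The deterministic-first fallback can be sketched in a few lines of Python. The rule table, the label names, and the `classify_relationship` signature are assumptions for illustration; only the "founded by" and "invested in" phrases come from the description above. The model is again an injected function, and the log is a plain list standing in for whatever logging the real system uses.

```python
import re
from typing import Callable, Optional

# Deterministic rules, tried in order. Labels are hypothetical.
RULES = [
    (re.compile(r"\bfounded by\b", re.IGNORECASE), "founded_by"),
    (re.compile(r"\binvested in\b", re.IGNORECASE), "invested_in"),
]


def classify_relationship(
    text: str,
    llm_classify: Callable[[str], Optional[str]],
    fallback_log: list[dict],
) -> tuple[Optional[str], str]:
    """Return (label, path), where path is "regex" or "llm"."""
    for pattern, label in RULES:
        if pattern.search(text):
            return label, "regex"
    # No rule matched: fall back to the model and record what it decided.
    label = llm_classify(text)
    if label is not None:
        fallback_log.append({"text": text, "label": label})
    return label, "llm"
```

Each entry in `fallback_log` is a candidate for a new rule and a new test case: once an engineer adds a pattern for it, the same input takes the "regex" path and the model is no longer consulted for that case.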

6. Enforcing Structured Outputs

The same principle applies at runtime, not just in tests. Whenever possible, the LLM should return structured outputs that are validated against schemas before the rest of the system acts on them. The model can propose; deterministic code should validate, route, persist, and execute.
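As a rough illustration of "the model proposes, deterministic code validates", here is a standard-library Python sketch. The field names, the `ALLOWED_RELATIONS` set, and the `RelationProposal` type are hypothetical; a real system would more likely use a schema library, which this sketch deliberately avoids so that it has no dependencies.

```python
import json
from dataclasses import dataclass

# Hypothetical closed vocabulary of relations the system knows how to store.
ALLOWED_RELATIONS = {"founded_by", "invested_in", "works_at"}


@dataclass(frozen=True)
class RelationProposal:
    source_id: str
    relation: str
    target_id: str


def parse_proposal(raw_output: str) -> RelationProposal:
    """Validate raw model output; raise ValueError rather than pass bad data on."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    expected_keys = {"source_id", "relation", "target_id"}
    if set(data) != expected_keys:
        raise ValueError(f"unexpected keys: {sorted(data)}")
    if not all(isinstance(data[k], str) and data[k] for k in expected_keys):
        raise ValueError("all fields must be non-empty strings")
    if data["relation"] not in ALLOWED_RELATIONS:
        raise ValueError(f"unknown relation: {data['relation']}")
    return RelationProposal(**data)
```

Anything that fails validation is rejected before it reaches routing or persistence, and the validator itself is ordinary code that can be tested with fixed strings.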

Conclusion

The broader lesson is that reliable LLM systems are not built by pretending the model is deterministic. They are built by surrounding probabilistic reasoning with deterministic collectors, stable datasets, explicit ground truth, reproducible test runners, and carefully bounded model judgment.

In production AI systems, the model may be the most visible component, but the scaffolding determines whether engineers can trust the system.

Here are my other articles on generative and agentic LLMs and applied AI:

Testing the Scaffolding, Not the Brain: A Pattern for Reliable LLM Systems (2026/04/30)
From Luck to Skill: Using AI to Consistently Win System Design Interviews (2025/09/10)
Gemini CLI: A Developer's Mental Model (2025/08/21)
Architecting AI Memory: Lessons from Gemini CLI (2025/08/13)
Inside the Mind: Gemini CLI's System Prompts Deep Dive (2025/07/19)
Meet the Agent: The Brain Behind Gemini CLI (2025/07/18)