Six Failed Runs

Before RAG Proxy worked, fine-tuning failed six times. Here is what each run tried and why it fell short.

Run 1: No System Prompt
Strategy: hand the model 94 policy PDFs and let it cook with default settings.

Skipped one detail: no system prompt (the standing instructions that tell the model how to behave).
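For context, the system prompt is just the first message in the request. A minimal sketch of what Run 1 omitted, assuming the common OpenAI-style message schema (the question and policy wording here are illustrative, not the project's actual prompt):

```python
# Sketch: the difference between a Run 1-style request (no standing
# instructions) and one with a system prompt. Message format follows the
# common OpenAI-style chat schema; the content is illustrative.

def build_messages(user_question, system_prompt=None):
    """Assemble a chat request; Run 1 effectively passed system_prompt=None."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_question})
    return messages

# Run 1: no standing instructions -- the model improvises its behavior.
run1 = build_messages("What form do I use for clinical leave?")

# Later runs: explicit standing instructions.
run2 = build_messages(
    "What form do I use for clinical leave?",
    system_prompt="Answer only from faculty policy documents; "
                  "refuse questions you cannot ground in them.",
)
```

Without that first message, the model falls back to whatever behavior its base training suggests, which is exactly the polished-but-ungrounded tone Run 1 produced.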

The output came back polished and professional. It also made up email addresses, invented form names the faculty has never used, and cited policies that don't exist.

Verdict: it learned to sound like a policy assistant. It didn't learn any actual policies.

Run 2: More Data, More Bloat
Strategy: more training examples, plus a system prompt telling it to stick to policy content.

The model got more confident. The wrong answers got more convincing, not less frequent.

The system prompt itself mutated during training, ballooning to 716 characters of garbled text.

In a clinical setting, a confidently wrong answer is more dangerous than an obviously wrong one.

Verdict: more data made the wrong answers harder to spot, not less common.

Run 3: Fix the Plumbing
Strategy: clean up the document parser and the model-compression step.

Two real bugs found: the PDF reader was leaving HTML junk in the training data, and the compression step (quantization, which shrinks the model so it runs on a laptop) was mislabeling its own output.

Fixed both. Retrained. Output got slightly cleaner.

The model still couldn't recall specific policy content. It served up word salad, mixing real terminology with invented procedures.

Verdict: the plumbing was cleaner. The plumbing wasn't the problem.

Run 4: More Training Pairs
Strategy: scale up to 164 carefully curated question-answer pairs with clean data.

Looked great on the practice questions. Real users broke it the same way as before.

Fabricated email addresses (cas.uoft.me/drcm@mcmaster.ca). Made-up form names. Confident answers to questions it should have refused.

Each fact was seen roughly 15 times during training.

Verdict: 164 pairs was nowhere near enough. Research says a fact needs 100 to 1,000 repetitions to stick.

Run 5: RAFT
Strategy: try RAFT (Retrieval-Augmented Fine-Tuning, a Microsoft technique that trains the model to read documents alongside the question instead of memorizing them).

The paper looked promising. Practice tests: a perfect 6 out of 6.

Catch: that score only held while a document-fetching step was running alongside it. Strip that away and the model did worse than a stock model that had never been fine-tuned.

Verdict: RAFT works, but it needs the document-fetching layer, which defeats the whole point of training a self-contained model.
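To make the dependence on retrieval concrete, a RAFT-style training pair bundles the question with a document that contains the answer plus distractor documents, so the model learns to read context rather than recall facts. A sketch of what one pair might look like (field names and content are illustrative, not the paper's exact schema):

```python
# Sketch of a RAFT-style training pair: the question ships with an "oracle"
# document (contains the answer) and distractors (plausible but irrelevant),
# and the target answer cites the oracle. All content is illustrative.

raft_example = {
    "question": "How many days of clinical leave are allowed per term?",
    "context": [
        # Oracle document: actually contains the answer.
        {"id": "policy-12", "text": "Clinical leave is capped at 5 days per term."},
        # Distractors: teach the model to filter irrelevant context.
        {"id": "policy-31", "text": "Lab coats must be laundered weekly."},
        {"id": "policy-07", "text": "Parking permits renew each September."},
    ],
    # Target output reasons over the context and cites its source.
    "answer": "Per policy-12, clinical leave is capped at 5 days per term.",
}

def to_prompt(example):
    """Flatten a RAFT pair into the text the model is fine-tuned on."""
    docs = "\n".join(f"[{d['id']}] {d['text']}" for d in example["context"])
    return f"Documents:\n{docs}\n\nQuestion: {example['question']}\nAnswer:"
```

Because every training prompt includes the documents, the fine-tuned model only knows how to answer when something fills that Documents slot at inference time, which is why stripping the fetching layer collapsed the score.
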

Run 6: Maximum Effort
Strategy: throw everything at it. 2,198 training pairs, the deepest fine-tuning configuration available, extended training time.

Practice tests hit a perfect 10 out of 10.

Real users found five fresh ways for it to fail: vague answers, references to documents that don't exist, wrong content attributed to real document numbers, false refusals (it claimed not to know things it had been trained on), and missing citations.

The model knew the style perfectly. It just didn't know the facts.

Verdict: fine-tuning teaches style, not knowledge. Six runs to confirm it was the wrong tool.

The Research That Confirmed It

After six failures, the academic literature provided the explanation:

RAG: 87.5% accuracy vs. Fine-tuning: 50.4%

Microsoft's EMNLP 2024 study found fine-tuning was barely better than a coin flip for knowledge injection tasks. Stanford's FineTuneBench showed a 37% generalization ceiling regardless of method or model size.

Meta's LIMA paper introduced the "Superficial Alignment Hypothesis": fine-tuning teaches a model how to format and what tone to use. It does not inject new factual knowledge.

Allen-Zhu and Li (ICLR 2025) quantified the gap: each fact requires 100 to 1,000 training exposures to be reliably learned. The fine-tuning runs above gave each fact roughly 15 exposures, one to two orders of magnitude short.
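The shortfall is easy to put in numbers. A back-of-the-envelope check, using the ~15 exposures per fact observed in the runs against the 100-to-1,000 threshold from the paper:

```python
# Back-of-the-envelope: observed exposures per fact vs. the range the
# research says is needed for reliable recall.

exposures_per_fact = 15          # roughly what the runs above achieved
required_low, required_high = 100, 1_000  # Allen-Zhu & Li (ICLR 2025)

shortfall_low = required_low / exposures_per_fact    # ~6.7x below the floor
shortfall_high = required_high / exposures_per_fact  # ~66.7x below the ceiling

print(f"Shortfall: {shortfall_low:.0f}x to {shortfall_high:.0f}x too few exposures")
```

Closing that gap by brute force would mean repeating every fact up to 1,000 times in the training data, which is why Run 6's 2,198 pairs still fell short.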

The Pivot

Instead of teaching the AI to memorize policies, hand it the relevant pages at the moment someone asks. That is RAG (Retrieval-Augmented Generation), and it is what the RAG Proxy does. The accuracy jumped from 50.4% to 87.5% overnight, with zero fine-tuning, zero training time, and zero cost.
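In outline, the pivot swaps training time for retrieval time: find the relevant passage, then hand it to the model inside the prompt. A minimal sketch of the idea (the toy keyword-overlap retriever and two-document corpus are stand-ins, not the RAG Proxy's actual implementation):

```python
# Minimal RAG sketch: retrieve the most relevant policy passage for a
# question, then prepend it to the prompt so the model reads rather than
# recalls. The keyword-overlap scorer stands in for a real retriever
# (embeddings, BM25, etc.); the corpus is illustrative.

CORPUS = {
    "policy-12": "Clinical leave is capped at 5 days per term.",
    "policy-31": "Lab coats must be laundered weekly.",
}

def retrieve(question, corpus, k=1):
    """Rank passages by word overlap with the question; return the top k."""
    q_words = set(question.lower().split())
    scored = sorted(
        corpus.items(),
        key=lambda kv: len(q_words & set(kv[1].lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(question, corpus):
    """Prepend the retrieved passages so the answer is grounded in them."""
    passages = "\n".join(f"[{pid}] {text}" for pid, text in retrieve(question, corpus))
    return f"Answer using only these passages:\n{passages}\n\nQuestion: {question}"

prompt = build_prompt("How many days of clinical leave per term?", CORPUS)
```

No weights change and nothing is memorized: when a policy PDF is updated, the next question simply retrieves the new text.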

See How RAG Proxy Works →