Six Failed Runs
Before the RAG Proxy worked, fine-tuning failed six times. Here is what was tried and why each run fell short.
Skipped one detail: no system prompt (the standing instructions that tell the model how to behave).
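For reference, this is what that missing piece looks like in chat-format training data. A sketch using the common role/content schema; the wording and field names are illustrative, not the project's actual data:

```python
# One chat-format training example. The "system" turn carries the standing
# instructions; the failed runs trained on user/assistant pairs only.
# (Schema and content here are illustrative, not the project's data.)
example = [
    {"role": "system", "content": "You answer questions about faculty policy. "
                                  "If a source does not cover it, say so."},
    {"role": "user", "content": "How do I defer an exam?"},
    {"role": "assistant", "content": "Submit the deferred-exam request form to the office."},
]

# What the early runs actually trained on: the same example minus the system turn.
example_without_system = example[1:]
```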
The output came back polished and professional. It also made up email addresses, invented form names the faculty has never used, and cited policies that don't exist.
The model got more confident. The wrong answers got more convincing, not less frequent.
The system prompt itself mutated during training, ballooning to 716 characters of garbled text.
In a clinical setting, a confidently wrong answer is more dangerous than an obviously wrong one.
Two real bugs found: the PDF reader was leaving HTML junk in the training data, and the compression step (quantization, which shrinks the model so it runs on a laptop) was mislabeling its own output.
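The HTML half of that fix is the kind of thing a few lines handle. A sketch, assuming a BeautifulSoup-style pass; the pipeline's actual cleanup code isn't shown here:

```python
from bs4 import BeautifulSoup

def strip_html_remnants(text: str) -> str:
    """Remove stray tags and entities a PDF-to-text step left behind.
    Illustrative sketch, not the pipeline's actual code."""
    # Parse the fragment; get_text() drops markup and unescapes entities.
    cleaned = BeautifulSoup(text, "html.parser").get_text(separator=" ")
    # Collapse the whitespace the extraction left behind.
    return " ".join(cleaned.split())

print(strip_html_remnants("Deferred exams: <br/>see &sect; 4.2 <span>of the policy</span>"))
# -> "Deferred exams: see § 4.2 of the policy"
```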
Fixed both. Retrained. Output got slightly cleaner.
The model still couldn't recall specific policy content. It served up word salad, mixing real terminology with invented procedures.
Looked great on the practice questions. Real users broke it the same way as before.
Fabricated email addresses (cas.uoft.me/drcm@mcmaster.ca). Made-up form names. Confident answers to questions it should have refused.
Each fact was seen roughly 15 times during training.
The paper looked promising. Practice tests: a perfect 6 out of 6.
The catch: that score only held while a document-fetching step was running alongside the model. Strip that away and the fine-tuned model did worse than a stock model that had never been fine-tuned.
Practice tests hit a perfect 10 out of 10.
Real users found five fresh ways for it to fail: vague answers, references to documents that don't exist, wrong content attributed to real document numbers, false refusals (it claimed not to know things it had been trained on), and missing citations.
The model knew the style perfectly. It just didn't know the facts.
The Research That Confirmed It
After six failures, the academic literature provided the explanation:
RAG: 87.5% accuracy vs. Fine-tuning: 50.4%
Microsoft's EMNLP 2024 study found fine-tuning was barely better than a coin flip for knowledge injection tasks. Stanford's FineTuneBench showed a 37% generalization ceiling regardless of method or model size.
Meta's LIMA paper introduced the "Superficial Alignment Hypothesis": fine-tuning teaches a model how to format and what tone to use. It does not inject new factual knowledge.
Allen-Zhu and Li (ICLR 2025) quantified the gap: each fact requires 100 to 1,000 training exposures to be reliably learned. The RAG Proxy training runs had roughly 15 exposures per fact, one to nearly two orders of magnitude short.
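A back-of-envelope check of that gap, using the numbers above:

```python
# Exposure gap per fact (numbers from the text above).
needed_low, needed_high = 100, 1_000  # exposures required (Allen-Zhu & Li)
actual = 15                           # exposures per fact in these runs

print(f"{needed_low / actual:.0f}x to {needed_high / actual:.0f}x too few")
# -> 7x to 67x too few: one to nearly two orders of magnitude short
```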
The Pivot
Instead of teaching the AI to memorize policies, hand it the relevant pages at the moment someone asks. That is RAG (Retrieval-Augmented Generation), and it is what the RAG Proxy does. The accuracy jumped from 50.4% to 87.5% overnight, with zero fine-tuning, zero training time, and zero cost.
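In code, the pivot is roughly this shape. A minimal sketch, assuming an embedding model and a chat model are on hand; embed, generate, and answer are illustrative names, not the RAG Proxy's actual API:

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder: return one vector per text from an embedding model."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Placeholder: call the chat model with the assembled prompt."""
    raise NotImplementedError

def answer(question: str, chunks: list[str], k: int = 3) -> str:
    """Retrieve the k most relevant policy chunks, then answer grounded on them."""
    q_vec = embed([question])[0]
    c_vecs = embed(chunks)
    # Cosine similarity between the question and every chunk.
    sims = c_vecs @ q_vec / (np.linalg.norm(c_vecs, axis=1) * np.linalg.norm(q_vec))
    top = [chunks[i] for i in np.argsort(sims)[::-1][:k]]
    # The model answers from the retrieved pages, not from memorized weights.
    context = "\n\n".join(top)
    return generate(
        f"Answer using ONLY the excerpts below. Cite them. "
        f"If the answer is not there, say so.\n\n{context}\n\nQuestion: {question}"
    )
```

Nothing gets memorized. The model's job shrinks to reading comprehension over the retrieved pages, which is exactly what a stock model is already good at.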