The Story
How six failed fine-tuning runs led to a system that actually works.
The Problem
The Faculty of Dentistry has 94 policy documents scattered across shared drives. They cover everything from infection control procedures to clinic scheduling rules to emergency protocols. Clinicians need answers in seconds, not minutes of scrolling through PDFs.
We wanted an AI system that could answer questions like "What do I do if a patient has a needlestick injury?" and cite the exact policy document the answer came from.
The Wrong Answer: Fine-Tuning
The obvious first idea was fine-tuning. Take a small language model, train it directly on our policy documents, and let it absorb the knowledge. I spent weeks on this. Six training runs, each one a different approach.
The results were impressive in all the wrong ways. The AI learned our institutional tone perfectly. It sounded exactly like a Faculty of Dentistry document. But the facts? Made up. It invented email addresses that didn't exist. Fabricated form names. Cited documents that were never written. On one run, it flat-out refused to answer questions it had been explicitly trained on.
The Research Said We Were Doomed
After the sixth failure, I dug into the academic literature. What I found was clarifying.
Microsoft's EMNLP 2024 study tested RAG against fine-tuning head-to-head. RAG scored 87.5% accuracy. Fine-tuning scored 50.4%. Basically a coin flip.
Allen-Zhu and Li's work at ICLR 2025 explained why: for a language model to reliably learn a single fact through training, it needs to see that fact between 100 and 1,000 times in different contexts. Our documents had each fact mentioned maybe 15 times. We never had a chance.
The Pivot: RAG Proxy
RAG stands for Retrieval-Augmented Generation. Instead of teaching the AI our facts, we let it read the relevant documents right before answering. Like giving a student the textbook during an open-book exam.
I built RAG Proxy: a Perl/Mojolicious web service that sits between Open WebUI (our chat interface) and Ollama (our local AI engine). When someone asks a question, RAG Proxy intercepts it, searches our document database for the most relevant passages, injects those passages into the prompt, and lets the AI answer from what it just read.
The AI doesn't need to memorize anything. It just needs to read well. And modern language models are very good at reading.
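The retrieve-then-read loop is simple enough to sketch. The real service is written in Perl/Mojolicious and searches a proper document database; the Python below is only an illustration of the shape of the pipeline, with a toy keyword-overlap scorer standing in for the real search backend.

```python
def score(question, passage):
    """Crude relevance score: count of shared lowercase words.
    A stand-in for the real document search, illustration only."""
    return len(set(question.lower().split()) & set(passage.lower().split()))

def retrieve(question, passages, k=2):
    """Return the k passages most relevant to the question."""
    ranked = sorted(passages, key=lambda p: score(question, p), reverse=True)
    return ranked[:k]

def build_prompt(question, passages):
    """Inject the retrieved passages ahead of the question so the model
    answers from what it just read, not from memorized training data."""
    context = "\n\n".join(f"[Document {i + 1}] {p}"
                          for i, p in enumerate(passages))
    return ("Answer using ONLY the documents below, "
            "and cite the document number.\n\n"
            f"{context}\n\nQuestion: {question}")

docs = [
    "Needlestick injury: wash the site, report to the clinic supervisor.",
    "Clinic scheduling: appointments are booked in 30-minute blocks.",
]
question = "What do I do after a needlestick injury?"
prompt = build_prompt(question, retrieve(question, docs))
print(prompt)
```

The key property: every fact the model can cite is sitting in the prompt, so "memorizing" is replaced by "reading".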
The Stack
Everything runs on one Mac Studio sitting in a server room at the Faculty of Dentistry. No cloud. No subscriptions. No data leaving the building.
The architecture is intentionally simple. Users connect to Open WebUI on port 3000. Open WebUI thinks it's talking to a standard Ollama instance. But RAG Proxy sits in between: it reads the question, finds the relevant documents, and enriches the prompt before Ollama ever sees it.
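The interception step can be sketched as a pure transformation on the request body. The actual proxy is Perl/Mojolicious; this Python sketch assumes Ollama's standard /api/chat JSON shape (a model name plus a list of role/content messages) and uses a placeholder `search` callable for the document lookup.

```python
import json

OLLAMA_URL = "http://127.0.0.1:11434/api/chat"  # Ollama's default endpoint

def enrich_request(body, search):
    """Take the JSON body Open WebUI sends for /api/chat, look up passages
    relevant to the latest user message, and prepend them as a system
    message. `search` stands in for the document-database lookup."""
    payload = json.loads(body)
    # The newest user turn is the question to answer.
    question = next(m["content"] for m in reversed(payload["messages"])
                    if m["role"] == "user")
    context = "\n".join(search(question))
    # Injecting context as a system message leaves the chat history intact,
    # so Open WebUI and Ollama both see a perfectly ordinary conversation.
    payload["messages"].insert(0, {
        "role": "system",
        "content": "Answer from these policy excerpts and cite them:\n"
                   + context,
    })
    return json.dumps(payload)  # forward this to OLLAMA_URL

incoming = json.dumps({"model": "llama3", "messages": [
    {"role": "user", "content": "What is the needlestick protocol?"}]})
outgoing = enrich_request(
    incoming, lambda q: ["Needlestick: wash, report, document."])
print(json.loads(outgoing)["messages"][0]["role"])  # → system
```

Because the enriched body is still a valid /api/chat request, neither side of the proxy needs to know it exists.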
What's Novel
Two things set this apart from the typical RAG tutorial you'll find online.
First: vision plus RAG. A user can upload a photo of a needlestick injury scene, and the system will combine what it sees in the image with what it finds in the policy documents to produce a cited, procedure-accurate response. That's not something I've seen in other local RAG implementations.
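Mechanically, combining the two means one chat request that carries both the retrieved passages and the image. Ollama's chat API accepts base64-encoded images attached to a message; the model name ("llava") and prompt wording below are illustrative assumptions, not the system's actual configuration, and the sketch is Python rather than the service's Perl.

```python
import base64
import json

def build_vision_rag_request(image_bytes, question, passages, model="llava"):
    """Build one Ollama /api/chat body that pairs an uploaded photo with
    retrieved policy passages. Model name and wording are illustrative."""
    context = "\n".join(passages)
    return json.dumps({
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Use the policy excerpts below and cite them:\n"
                        + context},
            {"role": "user",
             "content": question,
             # Images ride alongside the text turn as base64 strings.
             "images": [base64.b64encode(image_bytes).decode("ascii")]},
        ],
    })

req = build_vision_rag_request(
    b"\x89PNG", "What should happen next at this scene?",
    ["Needlestick: wash site, report within 24 hours."])
print(json.loads(req)["model"])  # → llava
```

The vision model describes what it sees; the retrieved passages constrain what it recommends, so the answer stays grounded in written policy.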
Second: zero vendor dependency. No OpenAI API key. No Azure subscription. No Pinecone. No LangChain. Every component is open source and runs locally. If Ollama, Perl, SQLite, and a Mac exist, this system works.