The Story
How six failed fine-tuning runs led to a system that actually works.
The Problem
The Faculty of Dentistry has 94 policy documents scattered across shared drives. They cover everything from infection control procedures to clinic scheduling rules to emergency protocols. Clinicians need answers in seconds, not minutes of scrolling through PDFs.
We wanted an AI system that could answer questions like "What do I do if a patient has a needlestick injury?" and cite the exact policy document the answer came from.
The Wrong Answer: Fine-Tuning
The obvious first idea was fine-tuning. Take a small language model, train it directly on our policy documents, and let it absorb the knowledge. I spent weeks on this. Six training runs, each one a different approach.
The results were impressive in all the wrong ways. The AI learned our institutional tone perfectly. It sounded exactly like a Faculty of Dentistry document. But the facts? Made up. It invented email addresses that didn't exist. Fabricated form names. Cited documents that were never written. On one run, it flat-out refused to answer questions it had been explicitly trained on.
The Research Said We Were Doomed
After the sixth failure, I dug into the academic literature. What I found was clarifying.
Microsoft's EMNLP 2024 study tested RAG against fine-tuning head-to-head. RAG scored 87.5% accuracy. Fine-tuning scored 50.4%. Basically a coin flip.
Allen-Zhu and Li's work at ICLR 2025 explained why: for a language model to reliably learn a single fact through training, it needs to see that fact between 100 and 1,000 times in different contexts. Our documents had each fact mentioned maybe 15 times. We never had a chance.
The Pivot: RAG Proxy
RAG stands for Retrieval-Augmented Generation. Instead of teaching the AI our facts, we let it read the relevant documents right before answering. Like giving a student the textbook during an open-book exam.
I built RAG Proxy: a Perl/Mojolicious web service that sits between Open WebUI (our chat interface) and Ollama (our local AI engine). When someone asks a question, RAG Proxy intercepts it, searches our document database for the most relevant passages, injects those passages into the prompt, and lets the AI answer from what it just read.
The AI doesn't need to memorize anything. It just needs to read well. And modern language models are very good at reading.
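The retrieve-then-read loop is simple enough to sketch. The real service is written in Perl/Mojolicious and searches a proper document database; the Python below is only an illustration of the shape of the pipeline, with a toy keyword-overlap scorer standing in for the real search backend.

```python
def score(question, passage):
    """Crude relevance score: count of shared lowercase words.
    A stand-in for the real document search, illustration only."""
    return len(set(question.lower().split()) & set(passage.lower().split()))

def retrieve(question, passages, k=2):
    """Return the k passages most relevant to the question."""
    ranked = sorted(passages, key=lambda p: score(question, p), reverse=True)
    return ranked[:k]

def build_prompt(question, passages):
    """Inject the retrieved passages ahead of the question so the model
    answers from what it just read, not from memorized training data."""
    context = "\n\n".join(f"[Document {i + 1}] {p}"
                          for i, p in enumerate(passages))
    return ("Answer using ONLY the documents below, "
            "and cite the document number.\n\n"
            f"{context}\n\nQuestion: {question}")

docs = [
    "Needlestick injury: wash the site, report to the clinic supervisor.",
    "Clinic scheduling: appointments are booked in 30-minute blocks.",
]
question = "What do I do after a needlestick injury?"
prompt = build_prompt(question, retrieve(question, docs))
print(prompt)
```

The key property: every fact the model can cite is sitting in the prompt, so "memorizing" is replaced by "reading".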
The Stack
Everything runs on one Mac Studio sitting in a server room at the Faculty of Dentistry. No cloud. No subscriptions. No data leaving the building.
The architecture is intentionally simple. Users connect to Open WebUI on port 3000. Open WebUI thinks it's talking to a standard Ollama instance. But RAG Proxy sits in between: it reads the question, finds the relevant documents, and enriches the prompt before Ollama ever sees it.
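The interception step can be sketched as a pure transformation on the request body. The actual proxy is Perl/Mojolicious; this Python sketch assumes Ollama's standard /api/chat JSON shape (a model name plus a list of role/content messages) and uses a placeholder `search` callable for the document lookup.

```python
import json

OLLAMA_URL = "http://127.0.0.1:11434/api/chat"  # Ollama's default endpoint

def enrich_request(body, search):
    """Take the JSON body Open WebUI sends for /api/chat, look up passages
    relevant to the latest user message, and prepend them as a system
    message. `search` stands in for the document-database lookup."""
    payload = json.loads(body)
    # The newest user turn is the question to answer.
    question = next(m["content"] for m in reversed(payload["messages"])
                    if m["role"] == "user")
    context = "\n".join(search(question))
    # Injecting context as a system message leaves the chat history intact,
    # so Open WebUI and Ollama both see a perfectly ordinary conversation.
    payload["messages"].insert(0, {
        "role": "system",
        "content": "Answer from these policy excerpts and cite them:\n"
                   + context,
    })
    return json.dumps(payload)  # forward this to OLLAMA_URL

incoming = json.dumps({"model": "llama3", "messages": [
    {"role": "user", "content": "What is the needlestick protocol?"}]})
outgoing = enrich_request(
    incoming, lambda q: ["Needlestick: wash, report, document."])
print(json.loads(outgoing)["messages"][0]["role"])  # → system
```

Because the enriched body is still a valid /api/chat request, neither side of the proxy needs to know it exists.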
What's Novel
Two things set this apart from the typical RAG tutorial you'll find online.
First: vision plus RAG. A user can upload a photo of a needlestick injury scene, and the system will combine what it sees in the image with what it finds in the policy documents to produce a cited, procedure-accurate response. That's not something I've seen in other local RAG implementations.
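Mechanically, combining the two means one chat request that carries both the retrieved passages and the image. Ollama's chat API accepts base64-encoded images attached to a message; the model name ("llava") and prompt wording below are illustrative assumptions, not the system's actual configuration, and the sketch is Python rather than the service's Perl.

```python
import base64
import json

def build_vision_rag_request(image_bytes, question, passages, model="llava"):
    """Build one Ollama /api/chat body that pairs an uploaded photo with
    retrieved policy passages. Model name and wording are illustrative."""
    context = "\n".join(passages)
    return json.dumps({
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Use the policy excerpts below and cite them:\n"
                        + context},
            {"role": "user",
             "content": question,
             # Images ride alongside the text turn as base64 strings.
             "images": [base64.b64encode(image_bytes).decode("ascii")]},
        ],
    })

req = build_vision_rag_request(
    b"\x89PNG", "What should happen next at this scene?",
    ["Needlestick: wash site, report within 24 hours."])
print(json.loads(req)["model"])  # → llava
```

The vision model describes what it sees; the retrieved passages constrain what it recommends, so the answer stays grounded in written policy.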
Second: zero vendor dependency. No OpenAI API key. No Azure subscription. No Pinecone. No LangChain. Every component is open source and runs locally. If Ollama, Perl, SQLite, and a Mac exist, this system works.