How RAG Proxy Works
The full technical setup: architecture, vector search, proxy interception, and the code that makes it run.
Request Flow
User → Open WebUI → RAG Proxy (:7079) → Ollama (:11434)
                         ↓
               Brain.pm searches
               94 dental policy docs
                         ↓
               Injects top 5 chunks
               into the prompt
- A user types a question in Open WebUI (a ChatGPT-style chat interface, running locally on the same Mac).
- That question hits RAG Proxy (our middleman server; it sees every chat message before it reaches the AI).
- RAG Proxy searches the policy documents using Brain.pm (a custom vector-search library; vectors are numerical fingerprints of meaning, so “needlestick” and “percutaneous injury” find each other).
- The top 5 matching chunks get injected into the prompt as context.
- The enriched prompt is forwarded to Ollama (the local AI engine that runs the language model on your own hardware, no cloud).
- The AI reads the context and writes a cited answer.
- The user never knows RAG is happening. It looks like a chatbot that knows your policies.
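The steps above can be sketched in outline. This is a Python illustration, not the actual Perl code; `embed` and `search_brain` are placeholder stand-ins for the BGE-M3 embedding call and Brain.pm's retrieval:

```python
# Illustrative sketch of the RAG Proxy request flow. embed() and
# search_brain() are fakes; the real proxy is Perl/Mojolicious.

def embed(text):
    # Placeholder: the real system asks BGE-M3 (via Ollama) for a
    # 1,024-dimensional vector. Here we fake a tiny one.
    return [float(len(w)) for w in text.split()[:3]]

def search_brain(question_vector, k=5):
    # Placeholder: Brain.pm ranks every chunk by cosine similarity
    # and returns the top k.
    return ["Policy H.03: Percutaneous Injury Protocol ..."][:k]

def handle_chat(question):
    vec = embed(question)            # 1. vectorize the question
    chunks = search_brain(vec, k=5)  # 2. retrieve the top 5 chunks
    context = "\n\n".join(chunks)    # 3. join them as context
    enriched = (                     # 4. inject into the prompt
        f"Answer using only this context:\n{context}\n\n"
        f"Question: {question}"
    )
    return enriched                  # 5. this is what Ollama sees

prompt = handle_chat("I pricked my finger on a needle")
print("Percutaneous" in prompt)  # the policy text now rides along
```

The user's question arrives unchanged at the end of the enriched prompt, which is why the answer still reads as a direct reply.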
The Technology Stack
Key Components
RAG Proxy (bin/ragproxy)
A Perl/Mojolicious (a lightweight web framework, like Flask but for Perl) server that pretends to be Ollama.
Any client that already talks to ChatGPT or Ollama can talk to RAG Proxy with a single configuration change: the port number, from 11434 to 7079.
It speaks both the OpenAI API (the de facto standard for AI chat clients) and the Ollama API (the native protocol for the local engine), so both client types work out of the box.
It intercepts every request, searches the right brain, injects the context, and forwards to the real Ollama. Both streaming and non-streaming responses are supported.
Model-name routing picks the brain: dental/ministral-3 means “search the dental brain, forward to the Ministral model.” That's how multiple knowledge domains share one server.
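A minimal sketch of that routing rule. The parsing logic here is an assumption based on the `brain/model` naming convention described above, not the proxy's actual code:

```python
# Split a routed model name like "dental/ministral-3" into the brain
# to search and the model to forward to. A name without a slash is
# assumed to mean "no brain; pass straight through to Ollama".
def route(model_name):
    if "/" in model_name:
        brain, model = model_name.split("/", 1)
        return brain, model
    return None, model_name

print(route("dental/ministral-3"))  # ('dental', 'ministral-3')
print(route("ministral-3"))         # (None, 'ministral-3')
```

One server can then host any number of knowledge domains simply by registering more brain names.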
Brain.pm (lib/Brain.pm)
A custom Perl library that handles ingestion (parsing files), chunking (splitting them into bite-sized pieces), embedding (converting text into number vectors), and semantic search (finding chunks by meaning, not exact wording).
Each “brain” is a self-contained knowledge domain. Its core data lives in two files: a SQLite database (a single-file relational database that holds the chunk text and metadata) and a PDL matrix (Perl Data Language, a fast numerical-array format for the vectors).
Documents get split into chunks of roughly 500 tokens (a token is a word or word-fragment; 500 is about two paragraphs).
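A rough illustration of fixed-size chunking. The whitespace tokenizer is a simplification; real tokenizers split on sub-word units, so counts differ, but the idea is the same:

```python
def chunk(text, size=500):
    # Naive chunker: treat whitespace-separated words as "tokens"
    # and cut every `size` of them. ~500 tokens is roughly two
    # paragraphs of ordinary prose.
    tokens = text.split()
    return [" ".join(tokens[i:i + size])
            for i in range(0, len(tokens), size)]

doc = "word " * 1200          # a 1,200-"token" document
pieces = chunk(doc)
print(len(pieces))            # 3 chunks: 500 + 500 + 200
```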
Each chunk gets converted into a 1,024-dimensional vector (a list of 1,024 numbers, a numerical fingerprint of the chunk's meaning) using BGE-M3 (a multilingual embedding model from BAAI that runs locally through Ollama).
When a question arrives, it gets the same vector treatment. Brain.pm compares the question vector against every chunk vector using cosine similarity (a measure of the angle between two vectors: the smaller the angle, the closer the meaning).
The top 5 most similar chunks come back in milliseconds.
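Cosine-similarity retrieval, in miniature. Toy 3-dimensional vectors stand in for the real 1,024-dimensional BGE-M3 embeddings, and pure Python stands in for PDL; the vector values are invented for illustration:

```python
import math

def cosine(a, b):
    # cos(theta) = (a . b) / (|a| * |b|); 1.0 means same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

chunks = {
    "needlestick protocol": [0.9, 0.1, 0.0],
    "holiday schedule":     [0.0, 0.2, 0.9],
    "sterilization policy": [0.6, 0.6, 0.1],
}
# Pretend this is the vector for "I pricked my finger on a needle":
question = [0.8, 0.2, 0.1]

ranked = sorted(chunks, key=lambda c: cosine(question, chunks[c]),
                reverse=True)
print(ranked[0])  # needlestick protocol
```

In the real system the same comparison runs over all 94 documents' chunks, which is why answers come back in milliseconds: it is one matrix multiplication, not a document scan.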
RAG Proxy Uses Semantic Search
What we use: semantic search, also called vector search or dense retrieval. It finds chunks by meaning, not by exact wording. Cosine similarity over 1,024-dimensional BGE-M3 embeddings.
Not keyword search (the kind of matching file servers and old-school site search relied on 20 years ago, where the exact word has to be present).
Why it matters: ask “I pricked my finger on a needle” and a keyword search misses Policy H.03 (Percutaneous Injury Protocol) entirely, because the word “percutaneous” never appears in the question.
Semantic search handles it. The vector for “pricked my finger on a needle” sits mathematically close to the vector for “percutaneous exposure incident.” Same concept, different words.
Brain.pm finds the right policy even when the user doesn't know the clinical terminology.
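The failure mode is easy to demonstrate with a toy keyword matcher (a stand-in for old-style search; the policy title is from the example above):

```python
def keyword_search(query, documents):
    # Old-style matching: a document is a hit only if it shares
    # at least one literal word with the query.
    q_words = set(query.lower().split())
    return [d for d in documents if q_words & set(d.lower().split())]

docs = ["Policy H.03: Percutaneous Injury Protocol"]
hits = keyword_search("I pricked my finger on a needle", docs)
print(hits)  # [] -- no shared words, so keyword search finds nothing
```

Semantic search sidesteps this entirely because the comparison happens between vectors, not words: the two phrasings land near each other in embedding space even though they share no vocabulary.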
Transparent Proxy Architecture
RAG Proxy sits between the chat interface and the AI model. It's invisible to the user.
Open WebUI thinks it's talking directly to Ollama. Ollama thinks it's receiving a normal prompt. The RAG layer is entirely transparent (the technical term for “in the middle but unnoticed”).
Any LLM client that supports the OpenAI or Ollama API can use RAG Proxy without modification. No plugins, no custom integrations, no vendor lock-in.
Swap out the model, the chat interface, or the proxy itself without touching the others.
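Because the proxy mimics Ollama's API, repointing a client is a one-line change. The endpoint path and payload shape below follow Ollama's chat API; the model name uses the `brain/model` routing form described earlier:

```python
import json

OLLAMA_URL = "http://localhost:11434"  # talking to Ollama directly
PROXY_URL  = "http://localhost:7079"   # talking through RAG Proxy

# The request body is identical either way; only the base URL differs.
payload = json.dumps({
    "model": "dental/ministral-3",  # brain/model routing name
    "messages": [
        {"role": "user", "content": "What is the needlestick policy?"}
    ],
})

print(f"POST {PROXY_URL}/api/chat")  # swap PROXY_URL for OLLAMA_URL
                                     # and the client behaves the same
```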
Brain Anatomy
Each brain is a folder of five files. The entire knowledge base for 94 dental policy documents fits in under 1 MB, small enough to email.
How Indexing Works
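The write path (parse → chunk → embed → store) might look like this in outline. The SQLite schema and the `embed` stub here are assumptions for illustration, and the real store pairs the SQLite database with a separate PDL matrix file rather than keeping vectors in the database:

```python
import sqlite3

def embed(text):
    # Stand-in for BGE-M3 via Ollama; returns a short fake vector.
    return [float(len(text) % 7), 1.0, 0.5]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE chunks "
           "(id INTEGER PRIMARY KEY, doc TEXT, body TEXT)")

vectors = []  # in the real system this becomes the PDL matrix file
for doc, body in [("H.03", "Percutaneous injury protocol text ...")]:
    db.execute("INSERT INTO chunks (doc, body) VALUES (?, ?)",
               (doc, body))
    vectors.append(embed(body))  # row i of the matrix = chunk id i

count = db.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
print(count, len(vectors))  # 1 1
```

Keeping text in SQLite and vectors in a parallel matrix means search touches only the matrix; the database is consulted just to fetch the winning chunks' text.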
Sample Query Flow
View the Code
The core source files, all written in Perl.