How RAG Proxy Works

The full technical setup: architecture, vector search, proxy interception, and the code that makes it run.

Request Flow

User  →  Open WebUI  →  RAG Proxy (:7079)  →  Ollama (:11434)
                               ↓
                          Brain.pm searches
                          94 dental policy docs
                               ↓
                          Injects top 5 chunks
                          into the prompt

The Technology Stack

Perl + Mojolicious: the proxy server, ~550 lines
Brain.pm: custom vector search library
SQLite: document chunks + metadata
PDL: vector math (cosine similarity)
BGE-M3: embedding model (1,024 dimensions)
Ollama: local AI inference engine
Ministral 3 14B: language model (Apache 2.0)
Open WebUI: chat interface (Docker)
Mac Studio M2 Max: 96 GB unified memory

Key Components

The Proxy

RAG Proxy (bin/ragproxy)

A Perl server built on Mojolicious (a lightweight web framework, Perl's answer to Flask) that pretends to be Ollama.

Any client that already talks to ChatGPT or Ollama can talk to RAG Proxy without code changes. The only thing that changes is the port: 7079 instead of 11434.

It speaks both the OpenAI API (the de facto standard for AI chat clients) and the Ollama API (the native protocol for the local engine), so both client types work out of the box.
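For example, an OpenAI-style request from Perl might look like this (a sketch: the /v1/chat/completions path and the response shape are assumed from the OpenAI convention, not confirmed against the ragproxy source):

# Sketch: an OpenAI-style chat request aimed at the proxy instead of Ollama.
# Only the port differs from talking to Ollama directly.
use Mojo::UserAgent;

my $res = Mojo::UserAgent->new->post(
    'http://localhost:7079/v1/chat/completions' => json => {
        model    => 'dental/ministral-3',
        messages => [
            { role => 'user', content => 'What do I do after a needlestick injury?' },
        ],
    }
)->result;

print $res->json->{choices}[0]{message}{content}, "\n";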

It intercepts every request, searches the right brain, injects the retrieved context into the prompt, and forwards the augmented request to the real Ollama. Both streaming and non-streaming responses are supported.

Model-name routing picks the brain: dental/ministral-3 means “search the dental brain, forward to the Ministral model.” That's how multiple knowledge domains share one server.
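In code, the routing rule is just a split on the slash. A sketch (the helper name is hypothetical, not the actual proxy code):

# Split "brain/model" into the brain to search and the model to forward to.
sub route_model {
    my ($name) = @_;                            # e.g. "dental/ministral-3"
    my ($brain, $model) = split m{/}, $name, 2;
    return ($brain, $model);                    # ("dental", "ministral-3")
}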

The Brain

Brain.pm (lib/Brain.pm)

A custom Perl library that handles ingestion (parsing files), chunking (splitting them into bite-sized pieces), embedding (converting text into number vectors), and semantic search (finding chunks by meaning, not exact wording).

Each “brain” is a self-contained knowledge domain. It lives as two files: a SQLite database (a single-file relational database, holds the chunk text and metadata) and a PDL matrix (Perl Data Language, a fast numerical-array format for the vectors).
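For a concrete picture, here is a guess at the kind of table brain.db might hold. The schema below is hypothetical (the real columns live in Brain.pm); the comments note the assumed link between rows and the vector matrix:

use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=brain.db', '', '', { RaiseError => 1 });
$dbh->do(<<'SQL');
CREATE TABLE IF NOT EXISTS chunks (
    id     INTEGER PRIMARY KEY,  -- assumed to match a row in brain_vectors.pdl
    source TEXT NOT NULL,        -- originating document
    text   TEXT NOT NULL         -- the chunk itself
);
SQL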

Documents get split into chunks of roughly 500 tokens (a token is a word or word-fragment; 500 is about two paragraphs).
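A minimal sketch of that step, using whitespace-split words as a crude stand-in for real tokens. The overlap setting mirrors the one config.json mentions, though the 50-word value here is invented:

# Cut text into ~500-token chunks, carrying some overlap across boundaries.
sub chunk_text {
    my ($text, %opt) = @_;
    my $size    = $opt{size}    // 500;   # ~500 tokens per chunk
    my $overlap = $opt{overlap} // 50;    # assumed overlap value
    my @words = split ' ', $text;
    my @chunks;
    for (my $i = 0; $i < @words; $i += $size - $overlap) {
        my $end = $i + $size - 1;
        $end = $#words if $end > $#words;
        push @chunks, join ' ', @words[$i .. $end];
        last if $end == $#words;
    }
    return @chunks;
}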

Each chunk gets converted into a 1,024-dimensional vector (a list of 1,024 numbers, a numerical fingerprint of the chunk's meaning) using BGE-M3 (a multilingual embedding model from BAAI that runs locally through Ollama).
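That conversion is a plain HTTP call to Ollama. A sketch, following the /api/embed endpoint that the ingestion notes below also reference:

use Mojo::UserAgent;

# Send one chunk to BGE-M3 through Ollama; get a 1,024-number vector back.
sub embed {
    my ($text) = @_;
    my $res = Mojo::UserAgent->new->post(
        'http://localhost:11434/api/embed' => json => {
            model => 'bge-m3',
            input => $text,
        }
    )->result;
    return $res->json->{embeddings}[0];     # arrayref of 1,024 numbers
}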

When a question arrives, it gets the same vector treatment. Brain.pm compares the question vector against every chunk vector using cosine similarity (a math function that measures the angle between two vectors, closer angle equals closer meaning).

The top 5 most similar chunks come back in milliseconds.
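Under stated assumptions (random stand-in data, and straightforward PDL primitives in place of whatever Brain.pm actually does internally), the whole search step is a few lines of vectorized math:

use PDL;

my $vecs = random(1024, 1000);              # stand-in: one vector per chunk
my $q    = random(1024);                    # stand-in: the question vector

my $dots  = ($vecs * $q)->sumover;          # dot product against every chunk
my $norms = sqrt(($vecs ** 2)->sumover) * sqrt(($q ** 2)->sum);
my $sims  = $dots / $norms;                 # cosine similarity per chunk

my $top5  = $sims->qsorti->slice('-1:-5');  # indices of the 5 closest chunks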

The Search

RAG Proxy Uses Semantic Search

What we use: semantic search, also called vector search or dense retrieval. It finds chunks by meaning, not by exact wording. Cosine similarity over 1,024-dimensional BGE-M3 embeddings.

Not keyword search (the kind of matching most file servers and old-school site search did 20 years ago, where you need the exact word to be present).

Why it matters: ask “I pricked my finger on a needle” and a keyword search misses Policy H.03 (Percutaneous Injury Protocol) entirely, because the word “percutaneous” never appears in the question.

Semantic search handles it. The vector for “pricked my finger on a needle” sits mathematically close to the vector for “percutaneous exposure incident.” Same concept, different words.

Brain.pm finds the right policy even when the user doesn't know the clinical terminology.

The Interception

Transparent Proxy Architecture

RAG Proxy sits between the chat interface and the AI model. It's invisible to the user.

Open WebUI thinks it's talking directly to Ollama. Ollama thinks it's receiving a normal prompt. The RAG layer is entirely transparent (the technical term for “in the middle but unnoticed”).

Any LLM client that supports the OpenAI or Ollama API can use RAG Proxy without modification. No plugins, no custom integrations, no vendor lock-in.

Swap out the model, the chat interface, or the proxy itself without touching the others.
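To make the pattern concrete, here is a stripped-down, non-streaming sketch of the interception idea in Mojolicious::Lite. It is illustrative only; the real ragproxy also handles streaming and the OpenAI-style endpoints:

#!/usr/bin/env perl
use Mojolicious::Lite -signatures;
use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new;

# Accept the same endpoint a client would use against Ollama directly.
post '/api/chat' => sub ($c) {
    my $req = $c->req->json;
    $req->{stream} = \0;    # force a single JSON reply in this sketch
    # ... search the brain and prepend the top chunks to the prompt here ...
    my $res = $ua->post('http://localhost:11434/api/chat' => json => $req)->result;
    $c->render(json => $res->json);
};

app->start;    # e.g. perl proxy.pl daemon -l http://*:7079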

Brain Anatomy

Each brain is a folder of five entries. The entire knowledge base for 94 dental policy documents fits in under 1 MB, small enough to email.

# Each brain folder:
data/dental-brain/
  brain.db            # SQLite: chunks, metadata, source tracking
  brain_vectors.pdl   # PDL matrix: 1024-dim vectors for all chunks
  system-prompt.txt   # Instructions for the AI when answering
  config.json         # Brain settings (chunk size, overlap, etc.)
  sources/            # Original markdown documents

How Indexing Works

# Index documents into a brain:
ragproxy-ingest --brain dental --source corpus/dental/

# What happens under the hood:
# 1. Each document is split into ~500-token chunks
# 2. Each chunk is sent to BGE-M3 via Ollama's /api/embed
# 3. The 1024-number vector is stored in the PDL matrix
# 4. Chunk text + metadata goes into SQLite
# 5. The vector matrix is rebuilt and saved to disk

Sample Query Flow

# 1. User asks in Open WebUI:
#    "What do I do after a needlestick injury?"
# 2. RAG Proxy receives the request
# 3. Parses model name: dental/ministral-3
#    brain = "dental", model = "ministral-3"
# 4. Brain.pm converts question to a 1024-dim vector
# 5. Cosine similarity against all dental chunks
# 6. Top 5 chunks returned (all from Policy H.03)
# 7. Proxy builds augmented prompt:
#    "Answer using ONLY the following context:
#     [Policy H.03 - Percutaneous Exposure Protocol...]
#     [Policy H.03 - Post-Exposure Blood Testing...]
#     ..."
# 8. Forwards to Ollama (ministral-3:14b)
# 9. AI generates cited answer from the context
# 10. Response streams back to the user
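Step 7 is ordinary string assembly. A sketch, reusing the template from the flow above (the chunk fields "source" and "text" are assumed, not Brain.pm's real shape):

# Build the augmented prompt from the retrieved chunks.
sub augment_prompt {
    my ($question, @chunks) = @_;
    my $context = join "\n\n",
        map { "[$_->{source}]\n$_->{text}" } @chunks;
    return "Answer using ONLY the following context:\n\n"
         . "$context\n\nQuestion: $question";
}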

View the Code

The core source files. All Perl. Click to download.

ragproxy.pl (The Proxy)
Brain.pm (Vector Search)
ragproxy-ingest.pl (Document Loader)