How RAG Proxy Works

The full technical setup: architecture, vector search, proxy interception, and the code that makes it run.

Request Flow

User  →  Open WebUI  →  RAG Proxy (:7079)  →  Ollama (:11434)
                               ↓
                          Brain.pm searches
                          94 dental policy docs
                               ↓
                          Injects top 5 chunks
                          into the prompt

The Technology Stack

Perl + Mojolicious: the proxy server, ~550 lines
Brain.pm: custom vector search library
SQLite: document chunks + metadata
PDL: vector math (cosine similarity)
BGE-M3: embedding model (1,024 dimensions)
Ollama: local AI inference engine
Ministral 3 14B: language model (Apache 2.0)
Open WebUI: chat interface (Docker)
Mac Studio M2 Max: 96 GB unified memory

Key Components

The Proxy

RAG Proxy (bin/ragproxy)

A Perl server built on Mojolicious (a lightweight web framework, Perl's answer to Flask) that pretends to be Ollama.

Any client that already talks to ChatGPT or Ollama can talk to RAG Proxy without code changes. The only thing that changes is the port: 7079 instead of 11434.

It speaks both the OpenAI API (the de facto standard for AI chat clients) and the Ollama API (the native protocol for the local engine), so both client types work out of the box.
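For example, an OpenAI-style request from Perl might look like this (a sketch: the /v1/chat/completions path and the response shape are assumed from the OpenAI convention, not confirmed against the ragproxy source):

# Sketch: an OpenAI-style chat request aimed at the proxy instead of Ollama.
# Only the port differs from talking to Ollama directly.
use Mojo::UserAgent;

my $res = Mojo::UserAgent->new->post(
    'http://localhost:7079/v1/chat/completions' => json => {
        model    => 'dental/ministral-3',
        messages => [
            { role => 'user', content => 'What do I do after a needlestick injury?' },
        ],
    }
)->result;

print $res->json->{choices}[0]{message}{content}, "\n";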

It intercepts every request, searches the right brain, injects the retrieved context into the prompt, and forwards the augmented request to the real Ollama. Both streaming and non-streaming responses are supported.

Model-name routing picks the brain: dental/ministral-3 means “search the dental brain, forward to the Ministral model.” That's how multiple knowledge domains share one server.
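In code, the routing rule is just a split on the slash. A sketch (the helper name is hypothetical, not the actual proxy code):

# Split "brain/model" into the brain to search and the model to forward to.
sub route_model {
    my ($name) = @_;                            # e.g. "dental/ministral-3"
    my ($brain, $model) = split m{/}, $name, 2;
    return ($brain, $model);                    # ("dental", "ministral-3")
}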

The Brain

Brain.pm (lib/Brain.pm)

A custom Perl library that handles ingestion (parsing files), chunking (splitting them into bite-sized pieces), embedding (converting text into number vectors), and semantic search (finding chunks by meaning, not exact wording).

Each “brain” is a self-contained knowledge domain. It lives as two files: a SQLite database (a single-file relational database, holds the chunk text and metadata) and a PDL matrix (Perl Data Language, a fast numerical-array format for the vectors).
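For a concrete picture, here is a guess at the kind of table brain.db might hold. The schema below is hypothetical (the real columns live in Brain.pm); the comments note the assumed link between rows and the vector matrix:

use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=brain.db', '', '', { RaiseError => 1 });
$dbh->do(<<'SQL');
CREATE TABLE IF NOT EXISTS chunks (
    id     INTEGER PRIMARY KEY,  -- assumed to match a row in brain_vectors.pdl
    source TEXT NOT NULL,        -- originating document
    text   TEXT NOT NULL         -- the chunk itself
);
SQL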

Documents get split into chunks of roughly 500 tokens (a token is a word or word-fragment; 500 is about two paragraphs).
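A minimal sketch of that step, using whitespace-split words as a crude stand-in for real tokens. The overlap setting mirrors the one config.json mentions, though the 50-word value here is invented:

# Cut text into ~500-token chunks, carrying some overlap across boundaries.
sub chunk_text {
    my ($text, %opt) = @_;
    my $size    = $opt{size}    // 500;   # ~500 tokens per chunk
    my $overlap = $opt{overlap} // 50;    # assumed overlap value
    my @words = split ' ', $text;
    my @chunks;
    for (my $i = 0; $i < @words; $i += $size - $overlap) {
        my $end = $i + $size - 1;
        $end = $#words if $end > $#words;
        push @chunks, join ' ', @words[$i .. $end];
        last if $end == $#words;
    }
    return @chunks;
}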

Each chunk gets converted into a 1,024-dimensional vector (a list of 1,024 numbers, a numerical fingerprint of the chunk's meaning) using BGE-M3 (a multilingual embedding model from BAAI that runs locally through Ollama).
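That conversion is a plain HTTP call to Ollama. A sketch, following the /api/embed endpoint that the ingestion notes below also reference:

use Mojo::UserAgent;

# Send one chunk to BGE-M3 through Ollama; get a 1,024-number vector back.
sub embed {
    my ($text) = @_;
    my $res = Mojo::UserAgent->new->post(
        'http://localhost:11434/api/embed' => json => {
            model => 'bge-m3',
            input => $text,
        }
    )->result;
    return $res->json->{embeddings}[0];     # arrayref of 1,024 numbers
}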

When a question arrives, it gets the same vector treatment. Brain.pm compares the question vector against every chunk vector using cosine similarity (a math function that measures the angle between two vectors, closer angle equals closer meaning).

The top 5 most similar chunks come back in milliseconds.
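Under stated assumptions (random stand-in data, and straightforward PDL primitives in place of whatever Brain.pm actually does internally), the whole search step is a few lines of vectorized math:

use PDL;

my $vecs = random(1024, 1000);              # stand-in: one vector per chunk
my $q    = random(1024);                    # stand-in: the question vector

my $dots  = ($vecs * $q)->sumover;          # dot product against every chunk
my $norms = sqrt(($vecs ** 2)->sumover) * sqrt(($q ** 2)->sum);
my $sims  = $dots / $norms;                 # cosine similarity per chunk

my $top5  = $sims->qsorti->slice('-1:-5');  # indices of the 5 closest chunks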

The Search

RAG Proxy Uses Semantic Search

What we use: semantic search, also called vector search or dense retrieval. It finds chunks by meaning, not by exact wording. Cosine similarity over 1,024-dimensional BGE-M3 embeddings.

Not keyword search (the kind of matching most file servers and old-school site search did 20 years ago, where you need the exact word to be present).

Why it matters: ask “I pricked my finger on a needle” and a keyword search misses Policy H.03 (Percutaneous Injury Protocol) entirely, because the word “percutaneous” never appears in the question.

Semantic search handles it. The vector for “pricked my finger on a needle” sits mathematically close to the vector for “percutaneous exposure incident.” Same concept, different words.

Brain.pm finds the right policy even when the user doesn't know the clinical terminology.

The Interception

Transparent Proxy Architecture

RAG Proxy sits between the chat interface and the AI model. It's invisible to the user.

Open WebUI thinks it's talking directly to Ollama. Ollama thinks it's receiving a normal prompt. The RAG layer is entirely transparent (the technical term for “in the middle but unnoticed”).

Any LLM client that supports the OpenAI or Ollama API can use RAG Proxy without modification. No plugins, no custom integrations, no vendor lock-in.

Swap out the model, the chat interface, or the proxy itself without touching the others.
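To make the pattern concrete, here is a stripped-down, non-streaming sketch of the interception idea in Mojolicious::Lite. It is illustrative only; the real ragproxy also handles streaming and the OpenAI-style endpoints:

#!/usr/bin/env perl
use Mojolicious::Lite -signatures;
use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new;

# Accept the same endpoint a client would use against Ollama directly.
post '/api/chat' => sub ($c) {
    my $req = $c->req->json;
    $req->{stream} = \0;    # force a single JSON reply in this sketch
    # ... search the brain and prepend the top chunks to the prompt here ...
    my $res = $ua->post('http://localhost:11434/api/chat' => json => $req)->result;
    $c->render(json => $res->json);
};

app->start;    # e.g. perl proxy.pl daemon -l http://*:7079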

Brain Anatomy

Each brain is a folder of five entries. The entire knowledge base for 94 dental policy documents fits in under 1 MB, small enough to email.

# Each brain folder:
data/dental-brain/
  brain.db            # SQLite: chunks, metadata, source tracking
  brain_vectors.pdl   # PDL matrix: 1024-dim vectors for all chunks
  system-prompt.txt   # Instructions for the AI when answering
  config.json         # Brain settings (chunk size, overlap, etc.)
  sources/            # Original markdown documents

How Indexing Works

# Index documents into a brain:
ragproxy-ingest --brain dental --source corpus/dental/

# What happens under the hood:
# 1. Each document is split into ~500-token chunks
# 2. Each chunk is sent to BGE-M3 via Ollama's /api/embed
# 3. The 1024-number vector is stored in the PDL matrix
# 4. Chunk text + metadata goes into SQLite
# 5. The vector matrix is rebuilt and saved to disk

Sample Query Flow

# 1. User asks in Open WebUI:
#    "What do I do after a needlestick injury?"
# 2. RAG Proxy receives the request
# 3. Parses model name: dental/ministral-3
#    brain = "dental", model = "ministral-3"
# 4. Brain.pm converts question to a 1024-dim vector
# 5. Cosine similarity against all dental chunks
# 6. Top 5 chunks returned (all from Policy H.03)
# 7. Proxy builds augmented prompt:
#    "Answer using ONLY the following context:
#     [Policy H.03 - Percutaneous Exposure Protocol...]
#     [Policy H.03 - Post-Exposure Blood Testing...]
#     ..."
# 8. Forwards to Ollama (ministral-3:14b)
# 9. AI generates cited answer from the context
# 10. Response streams back to the user
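Step 7 is ordinary string assembly. A sketch, reusing the template from the flow above (the chunk fields "source" and "text" are assumed, not Brain.pm's real shape):

# Build the augmented prompt from the retrieved chunks.
sub augment_prompt {
    my ($question, @chunks) = @_;
    my $context = join "\n\n",
        map { "[$_->{source}]\n$_->{text}" } @chunks;
    return "Answer using ONLY the following context:\n\n"
         . "$context\n\nQuestion: $question";
}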

View the Code

The core source files. All Perl. Click to download.

ragproxy.pl (The Proxy)
Brain.pm (Vector Search)
ragproxy-ingest.pl (Document Loader)