
Build a RAG System for Free with Cloudflare Workers

Try the live demo · View the source code on GitHub

Intro

LLMs don’t know your business. They can’t — they weren’t trained on it. So when a customer asks about your specific service or product, the model guesses. That’s a context problem. Retrieval-Augmented Generation (RAG) is one way to fix it — and you can build a working prototype today, for free, on Cloudflare’s free tier.


What RAG Actually Is (The Short Version)

Think of it like an open-book exam. A standard LLM studied everything in its training data, and your service offerings most likely weren't on the syllabus. RAG lets the model open your books at test time: you retrieve the most relevant chunks from your documents and hand them to the model as context. It reads your material and answers from it instead of guessing.

Less hallucination, more accuracy.

If you want the deep technical version, Anthropic’s contextual retrieval documentation is worth reading.


Building the Pipeline on Cloudflare’s Free Tier

A basic implementation can get away with six pieces. Understanding each one matters when something breaks — or when you’re ready to swap something out.

1. Document processing. Take your internal documents and convert them to plain text. Document processing can get complex — managing different document types is its own problem.

  • Workers AI toMarkdown() converts PDFs to markdown. TXT and MD files pass through as-is.
  • Files stored in R2 (Cloudflare’s object storage).
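The pass-through logic above can be sketched as a small helper. This is a minimal sketch, not the demo's exact code: `documentToText` and the trimmed-down `AiBinding` interface are hypothetical names, and the `toMarkdown()` call follows the shape of Cloudflare's documented Workers AI method (in a real Worker, `ai` would be the `env.AI` binding).

```typescript
// Minimal result shape; the real toMarkdown() response carries more fields.
interface ToMarkdownResult {
  name: string;
  data: string;
}

// Trimmed-down stand-in for the env.AI binding.
interface AiBinding {
  toMarkdown(files: { name: string; blob: Blob }[]): Promise<ToMarkdownResult[]>;
}

// TXT and MD files pass through as-is; everything else (e.g. PDFs)
// goes through Workers AI's toMarkdown() conversion.
async function documentToText(ai: AiBinding, name: string, blob: Blob): Promise<string> {
  if (name.endsWith(".txt") || name.endsWith(".md")) {
    return await blob.text();
  }
  const [result] = await ai.toMarkdown([{ name, blob }]);
  return result.data;
}
```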

2. Chunking. When a customer asks about a specific product offering, you don’t hand them your whole catalog — you hand them a one-pager. Same idea here. We break documents into useful chunks. There are a lot of ways to do this — asking which is best is a bit like asking the best way to organize a book. In this demo I’m just splitting by text size, which is super naive. My go-to in a real project would be LlamaIndex’s semantic chunker. But for getting the pattern running, this works.

  • All runs in the Worker — no external service.
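Splitting by text size looks roughly like this. A sketch only: the chunk size and overlap values are illustrative, not the demo's exact numbers, and a small overlap is included so a sentence cut at a boundary still appears whole in one chunk.

```typescript
// Naive fixed-size chunking with a small overlap between neighbors.
// chunkSize and overlap are illustrative defaults.
function chunkText(text: string, chunkSize = 1000, overlap = 100): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached the end
    start += chunkSize - overlap;                // step back by the overlap
  }
  return chunks;
}
```

A semantic chunker would replace this function and leave the rest of the pipeline untouched, which is exactly the swap-one-piece property described above.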

3. Embeddings. Converting text chunks into vectors (a series of numbers) that encode meaning. Two sentences that say the same thing in different words will land close together in vector space.

  • Workers AI running bge-base-en-v1.5 (768-dimensional vectors).
  • Batches up to 100 chunks per request.
  • Runs on Cloudflare’s edge — no third-party API call.
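The batching step can be sketched as below. The model name and the `{ text: [...] }` / `{ data: [...] }` shapes follow Cloudflare's Workers AI docs for bge-base-en-v1.5; `embedChunks` is a hypothetical helper, and `ai` stands in for the `env.AI` binding.

```typescript
// Stand-in for the env.AI binding's embedding call.
interface EmbeddingAi {
  run(model: string, input: { text: string[] }): Promise<{ data: number[][] }>;
}

// Embed chunks in batches of up to 100 texts per request.
async function embedChunks(ai: EmbeddingAi, chunks: string[]): Promise<number[][]> {
  const vectors: number[][] = [];
  for (let i = 0; i < chunks.length; i += 100) {
    const batch = chunks.slice(i, i + 100);
    const res = await ai.run("@cf/baai/bge-base-en-v1.5", { text: batch });
    vectors.push(...res.data); // one 768-dimensional vector per chunk
  }
  return vectors;
}
```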

4. Vector database. Storing the vectors for your document chunks, then looking up the similarity between what the user asks and what content you have.

  • Vectorize (Cloudflare’s native vector DB). Cosine similarity search.
  • Metadata filtering scopes results to the user’s own documents.
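Vectorize does the similarity search for you, but the underlying math is small enough to show: cosine similarity is just the dot product of two vectors divided by the product of their lengths, so identical directions score 1 and unrelated directions score near 0.

```typescript
// Cosine similarity: normalized dot product of two vectors.
// 1 = same direction (same meaning), 0 = orthogonal (unrelated).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

In the Worker itself you never compute this by hand: you hand Vectorize the query vector and it returns the top matches, optionally filtered by metadata such as a user ID.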

5. LLM. Taking the retrieved chunks and the user’s question and generating an answer. If nothing relevant comes back, returning a clean fallback instead of a confident hallucination.

  • Workers AI running Llama 3.1 8B (fp8 quantized).
  • When a chunk matches, it goes into the system prompt as context and the LLM generates a conversational answer.
  • When nothing matches, the LLM responds with general knowledge and is honest about it.
  • Responses stream via SSE.
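The match/no-match branching above can be sketched as a prompt-assembly helper. This is a hedged sketch: `buildSystemPrompt`, the 0.7 threshold, and the prompt wording are all illustrative, not the demo's exact values.

```typescript
// Build the system prompt from retrieved matches. Chunks above the
// (illustrative) score threshold become context; with no matches, the
// model is told to answer from general knowledge and be honest about it.
function buildSystemPrompt(
  matches: { text: string; score: number }[],
  threshold = 0.7
): string {
  const relevant = matches.filter((m) => m.score >= threshold);
  if (relevant.length === 0) {
    return "No matching documents were found. Answer from general knowledge and tell the user the answer is not based on their documents.";
  }
  const context = relevant.map((m, i) => `[${i + 1}] ${m.text}`).join("\n\n");
  return `Answer using only the context below. If the context does not contain the answer, say so.\n\nContext:\n${context}`;
}
```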

6. Hosting. Serving the frontend and API from a single deploy. Cloudflare Workers with Static Assets handles both — no separate server.

  • Workers Static Assets handles the frontend (HTML/CSS/JS with the deep-chat web component).
  • Hono (lightweight router) handles the API routes.
  • D1 (SQLite) stores user and document metadata.
  • One wrangler deploy and it’s live.
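All of those pieces attach to the Worker through its configuration. A sketch of what the wrangler.toml might look like, assuming hypothetical binding and resource names (`DOCS`, `rag-index`, and so on are placeholders, not the demo's actual values):

```toml
name = "rag-demo"
main = "src/index.ts"
compatibility_date = "2024-09-01"

# Frontend served from static assets, API from the Worker: one deploy.
[assets]
directory = "./public"

# Workers AI (toMarkdown, embeddings, Llama 3.1 8B)
[ai]
binding = "AI"

# R2 bucket for uploaded files
[[r2_buckets]]
binding = "DOCS"
bucket_name = "rag-docs"

# Vectorize index for chunk embeddings
[[vectorize]]
binding = "VECTORIZE"
index_name = "rag-index"

# D1 for user and document metadata
[[d1_databases]]
binding = "DB"
database_name = "rag-meta"
database_id = "<your-database-id>"
```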

A production system would also have auth, file storage, and metadata tracking. Those are real infrastructure concerns, but they’re not RAG-specific.


Where This Is Actually Useful

RAG is a good fit when you have known information people keep asking about in unpredictable ways:

  • FAQ and knowledge base bots — stop answering the same questions by hand
  • Customer support grounded in your actual documentation
  • Documentation Q&A for developers
  • Internal onboarding, where new hires are asking questions your handbook already answers

There’s also a simpler variant: instead of uploading documents, you pre-embed known Q&A pairs and return exact matches. Zero hallucination risk, fully deterministic. The tradeoff is maintaining a curated list instead of uploading arbitrary documents. (Interested in that approach? Let me know and I’ll dive deeper into it.)
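The curated-pairs idea fits in a few lines. A sketch under stated assumptions: `answerFromQa` is a hypothetical helper, the 0.92 threshold is illustrative, and the query vector would come from the same embedding model as the stored pairs.

```typescript
// A pre-embedded Q&A pair: the question's vector is stored alongside its answer.
interface QaPair {
  question: string;
  answer: string;
  vector: number[];
}

// Cosine similarity (normalized dot product), repeated here to stay self-contained.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the stored answer only when the best match clears a high bar;
// null means "no confident match" and the caller decides what to do next.
function answerFromQa(queryVector: number[], pairs: QaPair[], threshold = 0.92): string | null {
  let best: QaPair | null = null;
  let bestScore = -1;
  for (const p of pairs) {
    const s = cosine(queryVector, p.vector);
    if (s > bestScore) {
      bestScore = s;
      best = p;
    }
  }
  return best && bestScore >= threshold ? best.answer : null;
}
```

Because the answer is returned verbatim rather than generated, there is nothing for the model to hallucinate, which is exactly the deterministic tradeoff described above.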


The Cloudflare Free Tier Has Real Constraints

  • ~39,000 vector queries/month on Vectorize
  • ~50–100 LLM conversations/day on the free neuron budget

For a prototype, internal tool, or client demo — that’s plenty. For a customer-facing product with real traffic, you’ll hit the ceiling.

You implement a basic RAG at zero cost, then upgrade the specific piece that's limiting you instead of rebuilding from scratch.


RAG Isn’t the Only Answer

RAG is a tool to solve a context problem. Specifically, the problem of getting an LLM to answer questions about your content. I’m running it in production to serve real customers today — so I can tell you it works. But it’s one tool, not the only one.

The alternatives worth knowing:

  • Long-context windows — just send everything. Works if your corpus is small and cost isn’t a concern.
  • Prompt caching — efficient for static context that repeats across many queries.
  • Tool use — let the model call APIs for real-time data instead of embedding static documents.

You Can Start for Free

These tools are free or near-free. Cloudflare’s free tier is real. Workers AI is real. Vectorize is real. And the problems they can solve — customers not getting answers, support queues backed up, knowledge locked in documents nobody reads — those are real too.

Build the basic version. See if it solves the problem. Upgrade the piece that’s actually limiting you when you get there.

The code walkthrough is the next post — subscribe if you don’t want to miss it.


What’s the document-heavy problem in your business that people keep asking you about? That’s probably where to start.