Case Notes

Shipping a grounded RAG assistant in two weeks

A field report on scoping, retrieval quality, and the evals that let us put a retrieval-grounded assistant in front of real users — fast.

Nabeel Ghafoor · April 22, 2026 · 3 min read

A retrieval-augmented assistant can go from zero to in-front-of-real-users in about two weeks — but only if you spend the first days on retrieval and scope, not on prompts. This is a field report on how we sequence the work, where RAG projects usually fail, and why measuring retrieval before touching generation is the move that makes the timeline real.

Scope before you build

The fastest RAG projects start by narrowing, not building. We pin three things on day one: which questions the assistant must answer, which it must refuse, and what a good answer looks like. A bounded assistant that nails fifty real questions beats an open-ended one that is mediocre at everything — and it is the difference between a two-week ship and a two-quarter slog.

Scope is also a retrieval decision. Knowing the question space tells you what to index, how to chunk it, and what “relevant” means before you write a line of prompt.
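One way to make the day-one scope concrete is to write it down as data. The sketch below is illustrative only: the `ScopedQuestion` type, the `validate_scope` helper, and the sample questions and passage ids are all hypothetical, not taken from a real project.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ScopedQuestion:
    """One real question and what the assistant is expected to do with it."""
    text: str
    must_answer: bool            # False means the assistant should refuse
    reference_answer: str = ""   # what a good answer looks like
    source_ids: List[str] = field(default_factory=list)  # passages that support it


def validate_scope(questions: List[ScopedQuestion]) -> List[str]:
    """Return a list of problems; an empty list means the scope is usable."""
    problems = []
    for q in questions:
        if q.must_answer and not q.reference_answer:
            problems.append(f"no reference answer: {q.text!r}")
        if q.must_answer and not q.source_ids:
            problems.append(f"no source passages: {q.text!r}")
        if not q.must_answer and q.source_ids:
            problems.append(f"refusal has sources: {q.text!r}")
    return problems


# Hypothetical sample scope: one question to answer, one to refuse.
scope = [
    ScopedQuestion("How do I reset my password?", True,
                   "Use the reset link on the sign-in page.", ["kb-101"]),
    ScopedQuestion("What is your CEO's home address?", False),
]
print(validate_scope(scope))  # prints []
```

Because each answerable question names its source passages, the same list can later seed the retrieval eval: it already says what should be indexed and what counts as relevant.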

Most RAG failures are retrieval failures

When a RAG system gives a bad answer, the instinct is to blame the prompt. Usually the prompt is fine — the model never received the right context to begin with. If retrieval surfaces the wrong passages, no amount of prompt engineering rescues the answer. Garbage in, confident garbage out.

So we measure retrieval first, as its own component. For a set of real questions with known source passages, we check whether retrieval actually returns those passages in the top k results. That single number, retrieval recall at k, has predicted answer quality for us more reliably than anything we do downstream.
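As a minimal sketch of that measurement, the function below computes recall at k: the fraction of known source passages that show up in the top k retrieved ids. The names `recall_at_k`, `mean_recall_at_k`, the toy index, and the passage ids are assumptions for illustration; a real `retrieve` would call whatever search stack is in use.

```python
from typing import Callable, Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the known source passages that appear in the top k results."""
    if not relevant:
        raise ValueError("each eval question needs at least one known passage")
    hits = relevant.intersection(retrieved[:k])
    return len(hits) / len(relevant)


def mean_recall_at_k(
    eval_set: Dict[str, Set[str]],
    retrieve: Callable[[str], List[str]],
    k: int = 5,
) -> float:
    """Average recall at k over every question in the eval set."""
    if not eval_set:
        raise ValueError("eval set is empty")
    scores = [recall_at_k(retrieve(q), rel, k) for q, rel in eval_set.items()]
    return sum(scores) / len(scores)


# Hypothetical eval set: question -> ids of passages that should be retrieved.
eval_set = {
    "reset password": {"kb-101"},
    "refund window": {"kb-207", "kb-208"},
}

# Stand-in retriever backed by a fixed ranking; replace with a real search call.
toy_index = {
    "reset password": ["kb-101", "kb-330"],
    "refund window": ["kb-207", "kb-019", "kb-208"],
}


def retrieve(question: str) -> List[str]:
    return toy_index[question]


print(mean_recall_at_k(eval_set, retrieve, k=2))  # prints 0.75
print(mean_recall_at_k(eval_set, retrieve, k=3))  # prints 1.0
```

In the toy data, the second question only finds both of its passages at k=3, which is the kind of gap that chunking and ranking changes are meant to close.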

Measure retrieval, then generation

The sequence matters:

  1. Build a retrieval eval. Real questions, the passages that should answer them, and a score for whether those passages were retrieved. Tune chunking, embeddings, and ranking against this number — not against vibes.
  2. Only then tune generation. Once the model is reliably handed the right context, prompt and answer-format work actually pays off, because you are improving a system that has the information it needs.
  3. Add an end-to-end eval. Finally, score whole answers against known-good responses, so you can catch regressions across the full pipeline.
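Step three can be sketched as a small scoring loop. The source does not say how whole answers are scored, so token-level F1 against the known-good response is used here purely as an assumed stand-in metric; `token_f1`, `end_to_end_score`, `is_regression`, and the `tolerance` default are hypothetical names and values.

```python
import re
from collections import Counter
from typing import Callable, Dict, List


def _tokens(text: str) -> List[str]:
    """Lowercase alphanumeric tokens; punctuation is ignored."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(answer: str, reference: str) -> float:
    """Token-overlap F1 between a generated answer and a known-good response."""
    a, r = Counter(_tokens(answer)), Counter(_tokens(reference))
    overlap = sum((a & r).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(a.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def end_to_end_score(golden: Dict[str, str],
                     answer_fn: Callable[[str], str]) -> float:
    """Mean F1 of the full pipeline's answers over the golden set."""
    if not golden:
        raise ValueError("golden set is empty")
    scores = [token_f1(answer_fn(q), ref) for q, ref in golden.items()]
    return sum(scores) / len(scores)


def is_regression(new_score: float, baseline: float,
                  tolerance: float = 0.02) -> bool:
    """True when the new score falls below the baseline by more than tolerance."""
    return new_score < baseline - tolerance
```

Running `end_to_end_score` on every pipeline change and comparing it to the last accepted score with `is_regression` gives a simple gate. Token overlap is crude, so a rubric or model-graded score could be swapped in without changing the loop.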

Teams that skip step one spend weeks “fixing the prompt” on a system that was never given the right context. Teams that do it in order ship.

Grounding and refusals are features

A trustworthy assistant cites its sources and declines when it does not know. Both are product features, not nice-to-haves. Citations let users verify and build trust; refusals prevent the confident-wrong answers that destroy it. We build both in from the start — an answer with no supporting passage is a bug, and “I don't have that information” is a valid, correct response.
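A minimal sketch of how both behaviors might be enforced in code, assuming a retriever that returns scored passages and a generator that accepts them as context. The `Answer` type, `grounded_answer`, the `min_score` threshold, and the stub functions are hypothetical, not a description of a specific library.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

REFUSAL = "I don't have that information."

# (passage_id, passage_text, relevance_score)
Passage = Tuple[str, str, float]


@dataclass
class Answer:
    text: str
    citations: List[str]  # ids of the passages the answer relies on
    refused: bool


def grounded_answer(
    question: str,
    retrieve: Callable[[str], List[Passage]],
    generate: Callable[[str, List[str]], str],
    min_score: float = 0.5,  # assumed threshold; tune against the retrieval eval
) -> Answer:
    """Answer only when at least one passage clears the relevance threshold."""
    supporting = [p for p in retrieve(question) if p[2] >= min_score]
    if not supporting:
        # No supporting passage: refuse rather than answer unsupported.
        return Answer(REFUSAL, [], True)
    text = generate(question, [p[1] for p in supporting])
    return Answer(text, [p[0] for p in supporting], False)


# Stub components for illustration; real ones would call search and a model.
def stub_retrieve(question: str) -> List[Passage]:
    if "password" in question:
        return [("kb-101", "Use the reset link on the sign-in page.", 0.9)]
    return [("kb-330", "Office hours are 9 to 5.", 0.1)]


def stub_generate(question: str, context: List[str]) -> str:
    return context[0]


print(grounded_answer("How do I reset my password?", stub_retrieve, stub_generate))
print(grounded_answer("What is the CEO's salary?", stub_retrieve, stub_generate))
```

Because every non-refused `Answer` carries at least one citation by construction, an answer with no supporting passage can be treated as a bug and caught by a simple assertion in the end-to-end eval.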

Why two weeks is realistic

The timeline is not heroics; it is sequence. Tight scope removes the open-ended work. Retrieval-first removes the dead-end prompt tuning. Evals remove the guesswork that turns a one-week change into a one-month investigation. With those in place, a senior engineer — amplified by agents handling the scaffolding — can stand up a grounded, measured, citeable assistant and put it in front of real users inside a fortnight. Then the real work begins: watching how people actually use it, and sharpening it against what you learn.

Written by Nabeel Ghafoor, Senior Engineer at ByteTuned, leading production AI builds and modernizations.
