Hybrid Search vs Semantic Search: RAG Retrieval Strategies
Smart search is what separates demos from products
“Searching by meaning is just the beginning. The difference between a toy demo and a real product is how smart your search is.”
In the last chapter, you stored your vectors in a database. You can now ask “what’s most similar to my query?” and get results in milliseconds. That’s great — but it’s also the most basic retrieval strategy there is. In this chapter, you’ll learn the techniques that make retrieval actually good.
Think of it this way: basic similarity search gets you a working demo. The strategies in this chapter, hybrid search, MMR, re-ranking, and metadata filtering, are what get you to a product.
Naive similarity search and the top-K problem
The simplest retrieval strategy is: embed the user’s query, calculate its similarity to every stored chunk vector, and return the K most similar chunks (the “top-K”).
This works. For simple questions over small document sets, it works surprisingly well. But it has two blind spots.
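As a minimal sketch of this naive strategy, here is top-K retrieval over a toy in-memory collection (the chunk texts and vectors are invented for illustration; a real system would use stored embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, chunks, k=3):
    """Score every stored chunk against the query and return the K best."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return ranked[:k]

chunks = [
    {"text": "RAG combines retrieval with generation", "vector": [0.9, 0.1, 0.0]},
    {"text": "Vector databases store embeddings",      "vector": [0.7, 0.6, 0.1]},
    {"text": "Bananas are rich in potassium",          "vector": [0.0, 0.1, 0.9]},
]
results = top_k([1.0, 0.2, 0.0], chunks, k=2)
print([c["text"] for c in results])
```

Every chunk is scored, so this is linear in collection size; vector databases avoid that with approximate indexes, but the retrieval logic is the same.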
Blind spot 1: choosing K. If K is too small (say, 1), you might miss relevant context that was split across multiple chunks. If K is too large (say, 20), you flood the LLM with noise — chunks that are vaguely related but don’t actually help answer the question. And since LLMs have limited context windows, every irrelevant chunk pushes out space for a relevant one.
Blind spot 2: semantic-only misses. Imagine searching for “GPT-4 pricing” in a document about AI models. The chunk that contains the actual pricing table might use the exact phrase “GPT-4” but talk about it in a very different semantic context than your query. Pure semantic search might rank a chunk about pricing concepts higher than the chunk that contains the actual price. Keywords matter too. [src: pinecone_hybrid_search]
Hybrid search: the best of both worlds
Hybrid search combines two complementary strategies:
- Semantic search — finds chunks that are similar in meaning to your query using vector embeddings. Great for conceptual questions like “how does authentication work in this system?”
- Keyword search (BM25) — finds chunks that contain the exact words in your query. Great for specific lookups like “error code 403” or “GPT-4.”
BM25 is the algorithm that traditional search engines have used for decades. It scores documents based on how often your search terms appear, adjusted for document length and term rarity. It’s simple, fast, and shockingly effective for exact-match queries.
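To make “adjusted for document length and term rarity” concrete, here is a compact implementation of the standard Okapi BM25 formula over pre-tokenized documents (the example documents are invented; k1 and b are the commonly used defaults):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query terms with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # document frequency: how many docs contain each query term
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            # rare terms get a higher IDF weight
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            # term frequency, dampened and normalized by document length
            denom = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[t] * (k1 + 1) / denom
        scores.append(score)
    return scores

docs = [
    "gpt-4 pricing is listed in the table below".split(),
    "pricing concepts for ai models vary widely".split(),
    "error code 403 means the request was forbidden".split(),
]
print(bm25_scores(["gpt-4", "pricing"], docs))
```

The document containing both exact terms scores highest, which is precisely the behavior pure semantic search can miss.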
Hybrid search runs both strategies in parallel, then combines the scores. The typical approach is Reciprocal Rank Fusion (RRF): each result gets a score based on its rank in each list, and the combined scores determine the final ranking. A chunk that ranks highly in both semantic and keyword search will float to the top. [src: langchain_retrieval_docs]
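RRF itself is only a few lines. This sketch fuses two ranked lists of document IDs (the IDs and the constant k=60, a conventional default, are illustrative):

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: each item scores 1/(k + rank) per list it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # highest combined score first
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["c", "a", "b"]   # ranked by vector similarity
keyword  = ["a", "d", "c"]   # ranked by BM25
print(rrf([semantic, keyword]))
```

Note that “a” wins even though neither list ranked it first: appearing high in both lists beats appearing first in one, which is exactly the floating-to-the-top behavior described above.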
When to use hybrid search:
- ✅ Works for most production scenarios
- ✅ Captures both semantic similarity and exact keywords
- ✅ Usually better than pure semantic search when supported natively
MMR: don’t return five chunks that say the same thing
Here’s a problem that surprises people: you search for “what is RAG?” and your top 5 results are five chunks that all basically say the same thing in slightly different words. Each one individually is relevant, but together they waste four of your five context slots.
The analogy: You’re building a playlist for a road trip. You love rock music, so a naive algorithm gives you five rock songs. MMR gives you your favourite rock song, then a great pop song, then a rock song in a different style, then something unexpected. Each song is good on its own and adds something the others don’t.
Technically, MMR works by selecting results iteratively. It picks the most relevant chunk first. Then for each subsequent pick, it balances two scores: similarity to the query (relevance) and dissimilarity to the chunks already selected (diversity). A parameter called lambda controls the trade-off — higher lambda means more relevance, lower lambda means more diversity. [src: langchain_retrieval_docs]
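The iterative selection described above can be sketched as follows. This version takes precomputed similarities as plain lists (the similarity values are invented to show two near-duplicate chunks and one distinct chunk):

```python
def mmr(query_sim, pairwise_sim, k=3, lam=0.7):
    """Maximal Marginal Relevance selection.
    query_sim[i]       - similarity of chunk i to the query (relevance)
    pairwise_sim[i][j] - similarity between chunks i and j (redundancy)
    lam                - higher favors relevance, lower favors diversity
    """
    selected, candidates = [], list(range(len(query_sim)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            # penalize similarity to the most similar already-selected chunk
            redundancy = max((pairwise_sim[i][j] for j in selected), default=0.0)
            return lam * query_sim[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# chunks 0 and 1 are near-duplicates; chunk 2 is different but still relevant
query_sim = [0.95, 0.94, 0.80]
pairwise = [[1.0, 0.98, 0.2],
            [0.98, 1.0, 0.2],
            [0.2, 0.2, 1.0]]
print(mmr(query_sim, pairwise, k=2, lam=0.7))
```

With k=2, plain top-K would return the two near-duplicates (chunks 0 and 1); MMR picks chunk 0 and then skips its near-twin in favor of chunk 2.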
When to use MMR:
- ✅ Your corpus has repetitive or near-duplicate chunks
- ✅ You want broader context diversity in the top-K results
- ✅ You want to avoid sending five versions of the same idea
Re-ranking: a smarter second opinion
Here’s the key insight behind re-ranking: an embedding model encodes your query and each chunk independently — it never sees them together, so its similarity scores are only a rough approximation of relevance.
A re-ranker (also called a cross-encoder) is different. It takes your query and a chunk as a pair, processes them together, and outputs a single relevance score. Because it sees both texts side by side, it understands relevance much more deeply — it can catch nuances that independent embeddings miss. [src: pinecone_hybrid_search]
The catch: cross-encoders are slow. You can’t run one against a million chunks. So the strategy is a two-stage pipeline:
- Stage 1 — Fast retrieval: Use your vector database to grab the top 20 candidates using semantic or hybrid search. This takes milliseconds.
- Stage 2 — Precise re-ranking: Feed those 20 candidates through a cross-encoder to score them against the query. Pick the top 3. This takes a few hundred milliseconds.
The result: you get the speed of vector search and the precision of a cross-encoder. This is the standard architecture for production RAG systems that need high-quality retrieval. [src: langchain_retrieval_docs]
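The two-stage pipeline can be expressed as a small wrapper. Here the retrieval and cross-encoder steps are stand-in functions invented for illustration; in production, `retrieve_fn` would query your vector database and `cross_score_fn` would call an actual cross-encoder model:

```python
def rerank_pipeline(query, collection, retrieve_fn, cross_score_fn,
                    n_candidates=20, top_k=3):
    """Stage 1: fast, broad retrieval. Stage 2: slow, precise re-scoring."""
    candidates = retrieve_fn(query, collection, n_candidates)
    rescored = sorted(candidates,
                      key=lambda chunk: cross_score_fn(query, chunk),
                      reverse=True)
    return rescored[:top_k]

# Stand-ins for illustration only.
def retrieve_fn(query, collection, n):
    return collection[:n]  # pretend these came back ranked from a vector DB

def cross_score_fn(query, chunk):
    # toy relevance score: word overlap between query and chunk
    return len(set(query.split()) & set(chunk.split()))

docs = ["rag retrieval basics", "pricing for gpt-4", "gpt-4 pricing table explained"]
print(rerank_pipeline("gpt-4 pricing", docs, retrieve_fn, cross_score_fn, top_k=1))
```

The design point is the narrowing: the expensive scorer only ever sees `n_candidates` items, never the whole collection.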
When to use re-ranking:
- ✅ Retrieval quality is user-critical
- ✅ You can afford extra latency (often a few hundred ms)
- ✅ You need better precision on ambiguous queries
Metadata filtering: search smarter, not harder
Sometimes the best retrieval strategy isn’t about algorithms — it’s about narrowing the search space before you search.
Metadata filtering lets you attach structured data to each chunk (source document, date, category, author, language) and filter on it at query time. Instead of searching all 50,000 chunks in your database, you say “only search chunks from documents tagged as finance uploaded after January 2025.”
This is enormously powerful and surprisingly underused. If your user asks about “Q4 revenue,” you probably don’t need to search your engineering docs. Metadata filtering eliminates irrelevant results before the vector search even begins. [src: qdrant_docs]
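A minimal sketch of filter-then-search, using an in-memory list and a dot-product ranking (the chunks, vectors, and `dept` metadata key are all invented for illustration; real vector databases apply such filters inside the index):

```python
def filtered_search(query_vec, chunks, filters, k=2):
    """Apply metadata filters first, then vector-rank only the survivors."""
    pool = [c for c in chunks
            if all(c["meta"].get(key) == value for key, value in filters.items())]
    pool.sort(key=lambda c: sum(q * v for q, v in zip(query_vec, c["vector"])),
              reverse=True)
    return pool[:k]

chunks = [
    {"text": "Q4 revenue grew 12%",      "vector": [0.9, 0.1], "meta": {"dept": "finance"}},
    {"text": "Deploy guide for the API", "vector": [0.8, 0.3], "meta": {"dept": "engineering"}},
    {"text": "Q4 budget forecast",       "vector": [0.7, 0.2], "meta": {"dept": "finance"}},
]
hits = filtered_search([1.0, 0.0], chunks, {"dept": "finance"})
print([c["text"] for c in hits])
```

The engineering chunk never even gets scored, which is the whole point: irrelevant results are eliminated before the vector search begins.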
When to use metadata filtering:
- ✅ Your data spans multiple domains, dates, or sources
- ✅ You can narrow search scope before retrieval
- ✅ You need better precision and lower noise in production
Try it: compare strategies side by side
Enter a query and see how different retrieval strategies return different results. Watch how hybrid search catches things pure semantic search misses, and how re-ranking reorders results for better relevance.
The retrieval funnel
Here’s how the strategies layer together in a production system. You start with your full collection and progressively narrow down to the best results.
Not every system needs every stage. For a personal notes chatbot, naive similarity search with K=5 is probably fine. For a customer-facing product searching millions of documents, you want the full funnel. Match the complexity to the problem.
Project step: run your first query
Head to the Playground and run your first query against the chunks you stored in Chapter 4. Try the strategy toggle to switch between naive semantic search and hybrid search. Notice which results change and which stay the same. That difference is the whole point of this chapter.
Quick check
Why does hybrid search typically outperform pure semantic search?
What problem does re-ranking with a cross-encoder solve?
What you just built
You now know five retrieval strategies and when to use each one:
- Naive similarity search — simple, works for small collections
- Hybrid search — combines semantic and keyword search, almost always better
- MMR — ensures diverse results when your documents overlap
- Re-ranking — a precise second pass using cross-encoders
- Metadata filtering — narrows the search space before you even begin
Your chatbot project now has embedded chunks stored in a vector database, and you’ve run your first real query. The retrieval side of RAG is solid.
Next up: Chapter 6 — The Prompt Layer. Retrieved chunks are just raw ingredients. The prompt is the recipe that tells the AI what to cook — and getting it wrong can undo all the good retrieval work you just did.
Sources
- LangChain Retrieval Documentation — python.langchain.com/docs [src: langchain_retrieval_docs]
- Pinecone Learning Center — Hybrid Search — pinecone.io/learn/hybrid-search [src: pinecone_hybrid_search]
- Qdrant Documentation — Filtering — qdrant.tech/documentation [src: qdrant_docs]