Implementing Hybrid Semantic-Lexical Search in RAG Systems

In this article, you will learn how to implement a hybrid search strategy for RAG systems by combining BM25 lexical search with semantic search, fused together using Reciprocal Rank Fusion.

Topics we will cover include:

Why hybrid search outperforms either lexical or semantic search alone in retrieval-augmented generation systems.
How to implement BM25 lexical search and dense vector semantic search as independent retrieval engines in Python.
How to merge both rankings using Reciprocal Rank Fusion (RRF) to produce a final, balanced retrieval result.

Introduction

Implementing hybrid search strategies is a critical step in building modern RAG (Retrieval-Augmented Generation) systems, especially when shifting from prototype to production-ready solutions.

Semantic search, powered by dense vectors (embeddings, which are numerical representations of text), is very effective at capturing meaning, synonyms, and context. However, lexical, keyword-based search with approaches like BM25 covers a blind spot of semantic search: exact matching of specific keywords. Combining both is therefore a practical way to strengthen the retrieval stage of your RAG system.

Note: If you are unfamiliar with RAG systems, the “Understanding RAG” article series is a helpful primer for getting the most out of this read. In particular, it is recommended to first gain an understanding of vector databases through this article.

Step-by-Step Implementation

The first step is to ensure all the necessary external Python libraries are installed:

pip install rank_bm25 sentence-transformers requests

rank_bm25: an implementation of the BM25 lexical search algorithm for information retrieval (BM stands for “Best Matching”).
sentence-transformers: provides pre-trained language models for generating text embeddings. In a real setting, you may already have your own vector database containing many document embeddings and not need this, but we will use it here to simulate the construction of a toy vector database and illustrate hybrid search on it.
requests: used to fetch the raw dataset package from a public GitHub datasets repository prepared for this example.

With these ingredients at hand, we start by loading the dataset and storing the raw texts in a list (an in-memory list is sufficient because the dataset is small).

import requests
import zipfile
import io
import os

# Downloading and extracting the dataset from the compressed file
url = "https://github.com/gakudo-ai/open-datasets/raw/refs/heads/main/asia_documents.zip"
response = requests.get(url, timeout=30)
response.raise_for_status()  # Stop early if the download failed
with zipfile.ZipFile(io.BytesIO(response.content)) as z:
    z.extractall("asia_data")

# Loading documents and getting their filenames
documents = []
doc_names = []
# Sorting gives a deterministic document order across runs and platforms
for file in sorted(os.listdir("asia_data")):
    if file.endswith(".txt"):
        with open(f"asia_data/{file}", "r", encoding="utf-8") as f:
            documents.append(f.read())
            doc_names.append(file)

print(f"Loaded {len(documents)} documents for the knowledge base.")

The hybrid search process has three stages. The first two, lexical search and semantic search, are independent of each other and can run in parallel. The third fuses both rankings using a merging method called Reciprocal Rank Fusion (RRF).

Lexical Search with BM25

from rank_bm25 import BM25Okapi

# BM25Okapi expects the corpus as a list of token lists, one per document
tokenized_corpus = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized_corpus)

def search_bm25(query, top_k=3):
    tokenized_query = query.lower().split()
    # Getting scores (lexical relevance to the query) for all documents
    scores = bm25.get_scores(tokenized_query)
    # Ranking documents by score
    ranked_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked_indices[:top_k], scores

The lexical search process is encapsulated in a function called search_bm25(). It takes two arguments: a string containing the user’s query to the RAG system, and the number of top results to retrieve. The get_scores() method of the BM25Okapi object computes a lexical relevance score for each document, treated as a collection of tokens. Documents are then ranked by decreasing score, and the function returns the indices of the top k along with the full list of scores.
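
One caveat with whitespace splitting is that punctuation stays attached to words, so a query token such as "asia?" does not match the document token "asia". Below is a minimal sketch of a regex-based tokenizer that uses only the standard library. The name simple_tokenize is hypothetical and not part of rank_bm25; treat it as one possible option rather than a recommended default.

```python
import re

def simple_tokenize(text):
    # Hypothetical helper: lowercase the text and keep only runs of
    # word characters (letters, digits, underscore), dropping punctuation
    return re.findall(r"\w+", text.lower())

query = "What are the main economic activities in Southeast Asia?"
print(query.lower().split()[-1])   # asia?  (whitespace splitting)
print(simple_tokenize(query)[-1])  # asia   (regex tokenizer)
```

If you adopt a tokenizer like this, apply it both when building tokenized_corpus and when tokenizing the query inside search_bm25(), so that both sides are processed the same way.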

Semantic Search with Sentence Transformers

The semantic search engine first uses a sentence transformer model to obtain embedding vectors for the documents and the user query, then applies a vector similarity metric, here cosine similarity, to rank documents by semantic relevance and retrieve the top k:

from sentence_transformers import SentenceTransformer, util
import torch

# Loading the pre-trained embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Pre-compute embeddings for our corpus (our "Vector DB")
# You do not need this step if you already have an external vector database:
# you may read and import your document vectors instead
doc_embeddings = model.encode(documents, convert_to_tensor=True)

def search_semantic(query, top_k=3):
    query_embedding = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, doc_embeddings)[0]
    ranked_indices = torch.argsort(scores, descending=True).tolist()
    return ranked_indices[:top_k], scores.tolist()
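
To make the similarity metric concrete, here is a minimal pure-Python sketch of cosine similarity on toy three-dimensional vectors. The vectors and the cosine_similarity name are made up for illustration; real embeddings from all-MiniLM-L6-v2 have 384 dimensions, and util.cos_sim computes the same quantity on tensors.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths (L2 norms)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (assumed values, not produced by any model)
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "doc_a": [2.0, 0.0, 2.0],  # same direction as the query
    "doc_b": [0.0, 1.0, 0.0],  # orthogonal to the query
    "doc_c": [1.0, 1.0, 0.0],  # partially aligned with the query
}

scores = {name: cosine_similarity(query_vec, vec) for name, vec in doc_vecs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['doc_a', 'doc_c', 'doc_b']
```

Note that doc_a scores highest even though it is twice as long as the query vector: cosine similarity depends only on direction, not on magnitude.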

Fusing Results with Reciprocal Rank Fusion

With both retrieval engines in place, the final step is to merge their rankings using Reciprocal Rank Fusion. RRF assigns each document a score based on its rank position in each result list, then sums those scores across both lists to produce a unified ranking. Documents that rank highly in both systems receive the strongest combined scores.

def reciprocal_rank_fusion(bm25_indices, semantic_indices, k=60):
    rrf_scores = {}

    for rank, idx in enumerate(bm25_indices):
        rrf_scores[idx] = rrf_scores.get(idx, 0) + 1 / (k + rank + 1)

    for rank, idx in enumerate(semantic_indices):
        rrf_scores[idx] = rrf_scores.get(idx, 0) + 1 / (k + rank + 1)

    sorted_indices = sorted(rrf_scores, key=lambda i: rrf_scores[i], reverse=True)
    return sorted_indices

def hybrid_search(query, top_k=3):
    bm25_indices, _ = search_bm25(query, top_k=len(documents))
    semantic_indices, _ = search_semantic(query, top_k=len(documents))
    fused = reciprocal_rank_fusion(bm25_indices, semantic_indices)
    return fused[:top_k]
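
To see how the fusion behaves, here is a small self-contained sketch on made-up rankings. Each document's fused score is the sum, over the result lists, of 1 / (k + rank) with ranks starting at 1. This matches the k + rank + 1 expression above, because enumerate() starts counting at 0. The constant k (60 is a commonly used default) dampens the gap between adjacent ranks. The document IDs and the rrf name below are hypothetical.

```python
def rrf(rankings, k=60):
    # rankings: list of ranked lists of document IDs, best first
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Return IDs ordered by fused score, plus the scores themselves
    return sorted(scores, key=scores.get, reverse=True), scores

# Made-up rankings from two retrievers over documents A to D
lexical = ["A", "B", "C", "D"]
semantic = ["C", "A", "D", "B"]

fused, scores = rrf([lexical, semantic])
print(fused)  # ['A', 'C', 'B', 'D']
```

Document A comes out on top because it sits near the top of both lists (1/61 + 1/62), whereas C is first in only one of them (1/63 + 1/61). Also note that hybrid_search() asks both engines to rank the entire corpus, so documents with a BM25 score of zero still receive a rank; on larger corpora, fusing only the top N results from each engine may be a reasonable alternative.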

Running a Query

With the hybrid search pipeline assembled, you can run a query against the knowledge base:

query = "What are the main economic activities in Southeast Asia?"
results = hybrid_search(query, top_k=3)

print(f"Top {len(results)} results for query: '{query}'\n")
for rank, idx in enumerate(results):
    print(f"Rank {rank + 1}: {doc_names[idx]}")
    print(documents[idx][:300])
    print("---")

Summary

Hybrid search combines the complementary strengths of BM25 lexical retrieval and dense vector semantic search into a single, more robust retrieval pipeline. BM25 excels at exact keyword matching, while semantic search captures meaning and context. By fusing their rankings with Reciprocal Rank Fusion, the hybrid approach can surface relevant documents that either method alone might miss, making it a practical upgrade for production RAG systems.