Creating A Semantic Search Model With Sentence Transformers For A RAG Application

Implementing a Retrieval Augmented Generation (RAG) pipeline by fine-tuning your own semantic search model is a powerful approach to enhance the accuracy and relevance of question-answering systems.

This technique combines the strengths of both semantic search and generative AI, enabling the system to better understand user questions and generate more accurate and contextually relevant responses. By fine-tuning a semantic search model using Sentence Transformers, developers can tailor the model to their specific domain, improving the overall performance of the RAG pipeline.


What Is Semantic Search?

Semantic search is a data searching technique that goes beyond the traditional keyword-based search methods. It uses natural language processing and machine learning algorithms to improve the accuracy of search results by considering the searcher's intent and the contextual meaning of the terms used in their query. This approach aims to understand the underlying meaning of the search query and the content on web pages, rather than simply matching keywords. By doing so, semantic search can deliver more relevant results, even if they don't contain the exact words used in the original query.

To achieve this, semantic search employs various techniques, such as word-sense disambiguation, concept extraction, and query expansion. It also utilizes vector search and machine learning to return results that aim to match the user's intent and context. Semantic search is widely used in web search engines, such as Google, and is powered by technologies like the Knowledge Graph, which stores structured data about entities and their relationships. This allows search engines to understand the meaning behind search queries and provide more accurate and meaningful results.
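
To make the vector search idea concrete, here is a minimal sketch in plain Python. It assumes texts have already been converted to vectors (the 3-dimensional vectors below are invented for illustration; real embedding models produce hundreds of dimensions) and ranks them by cosine similarity to a query vector. The function names are hypothetical and are not part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vector, corpus_vectors, top_k=3):
    """Return (corpus index, score) pairs, best match first."""
    scored = [(index, cosine_similarity(query_vector, vector))
              for index, vector in enumerate(corpus_vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy vectors, invented for illustration only.
corpus_vectors = [
    [0.9, 0.1, 0.0],  # imagine a passage about warranties
    [0.1, 0.9, 0.0],  # imagine a passage about prices
    [0.0, 0.2, 0.9],  # imagine a passage about printing speed
]
query_vector = [0.8, 0.2, 0.1]  # imagine a question about warranties

ranking = rank_by_similarity(query_vector, corpus_vectors, top_k=2)
print(ranking)
```

The query vector points in nearly the same direction as the first corpus vector, so that entry ranks first even though no keywords are compared at any point.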

Semantic search is very useful to businesses that want to answer questions about their own domain knowledge, such as technical documentation, contracts, and product descriptions.

What Is Retrieval Augmented Generation (RAG)?


Retrieval Augmented Generation (RAG) is a technique that enhances the accuracy and reliability of generative AI models, such as Large Language Models (LLMs), by incorporating external data sources to provide more contextually accurate and up-to-date information. It integrates a retrieval component with a generative model, allowing the system to search and fetch relevant information from a database or knowledge base to supplement its internal knowledge when generating responses.

This approach ensures that the AI model can provide answers based on the most current, reliable facts and enables users to verify the sources of the information used in the model's responses. RAG helps to ground the AI model on external sources of knowledge, improving the quality of its output and providing a way to update the model without the need for extensive retraining.

RAG is a great strategy to mitigate potential hallucinations from generative AI models. RAG helps build a question answering system that has the best of both worlds: factual accuracy and human-like responses generated in natural language.
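
The overall RAG flow can be sketched in a few lines of Python. This is only an illustration of the retrieve-then-generate idea: "answer_with_rag", "fake_retrieve" and "fake_generate" are hypothetical names, and the two stand-in functions would be replaced by a real semantic search model and a real LLM call.

```python
def answer_with_rag(question, retrieve, generate, top_k=1):
    """Fetch supporting passages, then ask the generator to answer from them.

    Returns the generated answer together with the passages used,
    so that the sources can be shown to the user for verification.
    """
    passages = retrieve(question, top_k)
    context = "\n".join(passages)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return generate(prompt), passages

# Stand-in retriever (assumption): a real one would run semantic search.
def fake_retrieve(question, top_k):
    knowledge_base = [
        "Every HP LaserJet comes with a one-year HP commercial warranty "
        "(or HP Limited Warranty).",
    ]
    return knowledge_base[:top_k]

# Stand-in generator (assumption): a real one would call an LLM.
def fake_generate(prompt):
    return f"[LLM answer based on a prompt of {len(prompt)} characters]"

answer, sources = answer_with_rag(
    "How long is the warranty on the HP Color LaserJet Pro?",
    fake_retrieve,
    fake_generate,
)
print(answer)
print(sources)
```

Returning the retrieved passages alongside the answer is one simple way to let users check where the information came from.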

What Is The Sentence Transformers Library?


The Sentence Transformers library is a powerful Python framework designed for state-of-the-art text embeddings. It is built upon transformer neural networks like BERT, RoBERTa, XLM-R, and others, achieving top performance in various tasks including semantic search, paraphrase mining, semantic similarity comparison, clustering, and more. This library allows for easy fine-tuning of sentence embedding methods, enabling the creation of task-specific sentence embeddings tailored to meet specific needs. Learn more on the Sentence Transformers website.

The library offers a wide selection of pre-trained Sentence Transformers models for more than 50 languages, available on the Hugging Face platform. Users can also train or fine-tune their own models using the library, providing the flexibility to create custom models for unique use cases. The Sentence Transformers team recently released a new major version (v3) that considerably improves the capabilities of this library, especially its fine-tuning capabilities.

The Sentence Transformers library is fast, comprehensive, and well maintained, which is why we are using it in this tutorial.

Creating Your Own Semantic Search Model

Creating your own semantic search model is a great way to get accurate results while ensuring very low latency. This is even more true if you deploy your own semantic search model on a GPU.

First, let's make a small dataset containing our data. Create a 1-column CSV file (called "dataset.csv") containing the following technical documentation about HP printers (in a real-life scenario you will of course want to include many more examples):

"HP® LaserJets have unmatched printing speed, performance and reliability that you can trust. Enjoy Low Prices and Free Shipping when you buy now online."
"Every HP LaserJet comes with a one-year HP commercial warranty (or HP Limited Warranty)."
"HP LaserJet ; Lowest cost per page on mono laser printing. · $319.99 ; Wireless options available. · $109.00 ; Essential management features. · $209.00."

Each row can contain up to 512 tokens (roughly equivalent to 400 words), and to maximise accuracy it is recommended to stay below 128 tokens (roughly equivalent to 100 words). Now that we have our 3 pieces of documentation in our dataset, we can encode the data using our model with Sentence Transformers. Create a Python script with the following content (make sure that PyTorch and Sentence Transformers are installed):

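from sentence_transformers import SentenceTransformer
import csv
import torch

model_name = 'paraphrase-multilingual-mpnet-base-v2'
# Path of the file where the encoded dataset (the embeddings) will be saved.
encoded_model_path = ''
dataset_path = 'dataset.csv'

# Download (on first run) and load the pre-trained base model.
bi_encoder = SentenceTransformer(model_name)

# Read the 1-column CSV file: one passage per row.
passages = []
with open(dataset_path, newline='', encoding='utf-8') as csv_file:
    csv_reader = csv.reader(csv_file)
    for row in csv_reader:
        if row:  # skip empty lines
            passages.append(row[0])

# Encode every passage into an embedding tensor, then save the tensor to disk.
corpus_embeddings = bi_encoder.encode(
    passages, batch_size=32, convert_to_tensor=True, show_progress_bar=True)
torch.save(corpus_embeddings, encoded_model_path)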

This script downloads paraphrase-multilingual-mpnet-base-v2 and uses it as a base model to encode our data. You can choose among many available pre-trained models, depending on your requirements (model size, use case, supported languages, and so on). Depending on your hardware, you will want to adapt the "batch_size" parameter in order to speed up the encoding process.

Once created, you can use your model for inference with the following Python script:
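
For larger documentation, "dataset.csv" can also be prepared programmatically. The sketch below is one possible approach, not a required step: it splits each document into chunks of at most 100 words and writes one chunk per row. The word count is only a rough proxy for the 128-token recommendation (counting tokens with the model's tokenizer would be more precise), and "split_into_chunks" is a hypothetical helper name.

```python
import csv

def split_into_chunks(text, max_words=100):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

# Source documents: the three passages used in this article.
documents = [
    "HP® LaserJets have unmatched printing speed, performance and "
    "reliability that you can trust. Enjoy Low Prices and Free Shipping "
    "when you buy now online.",
    "Every HP LaserJet comes with a one-year HP commercial warranty "
    "(or HP Limited Warranty).",
    "HP LaserJet ; Lowest cost per page on mono laser printing. · $319.99 ; "
    "Wireless options available. · $109.00 ; Essential management "
    "features. · $209.00.",
]

# Short passages stay in one piece; longer ones become several rows.
chunks = [chunk
          for document in documents
          for chunk in split_into_chunks(document)]

# Note: this overwrites any existing dataset.csv in the current directory.
with open("dataset.csv", "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)
    for chunk in chunks:
        writer.writerow([chunk])
```

Splitting on word boundaries is the simplest option. Splitting on sentence or paragraph boundaries usually keeps each row more self-contained, at the cost of a little more code.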

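import csv
from sentence_transformers import SentenceTransformer, util
import torch

# The question must be encoded with the same model that encoded the dataset.
bi_encoder = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')
# Load the embeddings saved by the encoding script (same path as encoded_model_path).
semantic_search_model = torch.load('')

# Reload the passages so that each hit's corpus_id can be mapped back to its text.
passages = []
with open('app/custom_model/dataset.csv', newline='', encoding='utf-8') as csv_file:
    csv_reader = csv.reader(csv_file)
    for row in csv_reader:
        if row:  # skip empty lines, as in the encoding script
            passages.append(row[0])

question_embedding = bi_encoder.encode(
    "How long is the warranty on the HP Color LaserJet Pro?", convert_to_tensor=True)
# Rank the passages by similarity to the question and keep the 3 best.
hits = util.semantic_search(
    question_embedding, semantic_search_model, top_k=3)
hits = hits[0]  # results for the first (and only) query

result = {"search_results": [
    {"score": hit['score'], "text": passages[hit['corpus_id']]} for hit in hits]}
print(result)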

The above inference script returns results like the following (the second and third passages come from a larger dataset than the three rows shown earlier):

{
    "search_results": [
        {
            "score": 0.99,
            "text": "Every HP LaserJet comes with a one-year HP commercial warranty (or HP Limited Warranty)."
        },
        {
            "score": 0.74,
            "text": "All consumer PCs and printers come with a standard one-year warranty. Care packs provide an enhanced level of support and/or an extended period of coverage for your HP hardware. All commercial PCs and printers come with either a one-year or three-year warranty."
        },
        {
            "score": 0.68,
            "text": "In-warranty plan · Available in 2-, 3-, or 4-year extension plans · Includes remote problem diagnosis support and Next Business Day Exchange Service."
        }
    ]
}

In our inference script, the "top_k" parameter determines how many results we want to return. In the result, we show the matching text from the dataset with a similarity score (util.semantic_search uses cosine similarity by default). This score is important because it helps us decide whether we want to accept the response or not.
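
As a minimal sketch of how the score can be used to accept or reject results, the snippet below filters a list of hits against a threshold. The hits have the same shape as those returned by util.semantic_search (dictionaries with "corpus_id" and "score"), with scores taken from the example output; the corpus ids and the 0.7 threshold are arbitrary assumptions that should be tuned on your own data.

```python
# Hypothetical threshold: tune it on your own questions and documents.
MIN_SCORE = 0.7

def filter_hits(hits, min_score=MIN_SCORE):
    """Keep only the hits whose score reaches the threshold."""
    return [hit for hit in hits if hit["score"] >= min_score]

# Same shape as util.semantic_search results; corpus ids are made up here.
hits = [
    {"corpus_id": 0, "score": 0.99},
    {"corpus_id": 1, "score": 0.74},
    {"corpus_id": 2, "score": 0.68},
]

accepted = filter_hits(hits)
if accepted:
    print("Accepted corpus ids:", [hit["corpus_id"] for hit in accepted])
else:
    print("No passage is relevant enough to answer this question.")
```

When no hit passes the threshold, it is often better to tell the user that no answer was found than to pass a weak match on as context.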

Generating a Response in Natural Language With Generative AI

As you can see, the main limitation of semantic search is that the model returns the raw text from the dataset without directly answering the question. So we now want to give this text to a generative AI model as context in order to answer the initial question in natural language.

We can easily achieve this by leveraging an advanced LLM like GPT-4 on OpenAI or LLaMA 3 and ChatDolphin on NLP Cloud. You can either decide to keep the best result from the semantic search model and pass this as a context to the LLM, or keep several results. Here is a prompt example using the best result only:

Context: Every HP LaserJet comes with a one-year HP commercial warranty (or HP Limited Warranty).
Based on the above context, answer the following question: How long is the warranty on the HP Color LaserJet Pro?

This request returns something like this:

The warranty on the HP Color LaserJet Pro lasts at least 1 year.
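
The prompt can also be built programmatically, which makes it easy to pass several search results instead of one. The sketch below is one possible way to do it: "build_prompt" is a hypothetical helper, and with a single context it reproduces the prompt shown above. Sending the prompt to an LLM is left out, since that depends on the provider you choose.

```python
def build_prompt(question, contexts):
    """Put one 'Context:' line per passage, followed by the question."""
    context_block = "\n".join(f"Context: {context}" for context in contexts)
    return (
        f"{context_block}\n"
        f"Based on the above context, answer the following question: {question}"
    )

question = "How long is the warranty on the HP Color LaserJet Pro?"
best_result = (
    "Every HP LaserJet comes with a one-year HP commercial warranty "
    "(or HP Limited Warranty)."
)

# Prompt using the best search result only.
prompt = build_prompt(question, [best_result])
print(prompt)
```

Passing more results gives the LLM more material to work with, but it also makes the prompt longer and may introduce less relevant passages, so filtering on the score first is usually worthwhile.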

Semantic Search Model With Encoded Data VS Storing Embeddings In a Vector Database

In a Retrieval-Augmented Generation (RAG) system, creating a semantic search model with local encoded data or using a vector database are two interesting options.

When encoding our own data, we convert the data to tensors and then have the opportunity to load the data on a GPU. On the other hand, a vector database is a specialized database designed to store, index, and query these high-dimensional vectors efficiently.

For businesses looking to achieve very low latencies, it is recommended to encode your own data and load it on a GPU in order to reduce the computation time. However, this comes at the expense of flexibility, as your data has to be encoded again every time the dataset changes. If your underlying data changes very frequently, it might be simpler for you to extract the embeddings and incrementally store them in a vector database (like pgvector for example).
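
To illustrate the incremental approach, here is a minimal sketch in which only new passages are encoded and removed passages are dropped. It makes several assumptions: a plain dictionary stands in for the vector database, "fake_encode" stands in for a real encoder such as the bi-encoder used earlier, and all function names are hypothetical.

```python
import hashlib

def passage_key(passage):
    """Stable identifier derived from the passage content."""
    return hashlib.sha256(passage.encode("utf-8")).hexdigest()

def update_embeddings(passages, store, encode_fn):
    """Encode only passages missing from the store and drop stale ones.

    Returns the number of passages that had to be encoded.
    """
    wanted = {passage_key(passage): passage for passage in passages}
    # Remove embeddings of passages that are no longer in the dataset.
    for stale_key in set(store) - set(wanted):
        del store[stale_key]
    # Encode, in one batch, only the passages the store does not know yet.
    new_keys = [key for key in wanted if key not in store]
    if new_keys:
        vectors = encode_fn([wanted[key] for key in new_keys])
        for key, vector in zip(new_keys, vectors):
            store[key] = vector
    return len(new_keys)

# Stand-in encoder (assumption): returns a dummy 1-dimensional vector.
def fake_encode(texts):
    return [[float(len(text))] for text in texts]

store = {}  # stands in for a vector database table
first_run = update_embeddings(["passage A", "passage B"], store, fake_encode)
second_run = update_embeddings(
    ["passage A", "passage B", "passage C"], store, fake_encode)
print(first_run, second_run, len(store))
```

Because an edited passage gets a new content hash, it is treated as a new entry while its old version is removed, so only changed rows are encoded again.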


Conclusion

Retrieval Augmented Generation is crucial for businesses looking to answer questions about specific data like technical documentation, contracts, etc., as it considerably increases the accuracy of the results. RAG is a key component of a support chatbot, for example.

Sentence Transformers is a great library that can be used to create your own semantic search model based on your own data. When deployed on a GPU and coupled with an advanced generative AI model, such a model proves extremely powerful.

If you're not interested in creating and deploying your own semantic search model based on Sentence Transformers by yourself, you can easily do it in 1 click on NLP Cloud. Try semantic search on NLP Cloud now!

If you have questions about RAG, semantic search, and Sentence Transformers, please don't hesitate to ask us. It's always a pleasure to advise!

CTO at NLP Cloud