RAG: Question Answering On Domain Knowledge With Semantic Search And Generative AI

Answering questions based on domain knowledge (like internal documentation, contracts, books, etc.) is challenging. In this article, we explore an advanced technique called Retrieval-Augmented Generation (RAG) that achieves this with high accuracy by mixing semantic search with text generation models like ChatDolphin, LLaMA, ChatGPT, or GPT-4.


The Challenges Of Answering Questions On Domain Knowledge

Question answering on domain knowledge requires that you first send some context to the AI model, and then ask a question about it.

For example you could send the following context:

All NLP Cloud plans can be stopped anytime. You only pay for the time you used the service. In case of a downgrade, you will get a discount on your next invoice.

Now you might want to ask the following question:

When can plans be stopped?

The AI would answer something like this:

Anytime

For more details, see our documentation about question answering here.

The problem with this approach is that the size of your context (i.e. the size of your input text) is limited, so you cannot send your whole domain knowledge as context.

Let's say that you want to build a support chatbot that knows everything about your product documentation, so end users can ask any product-related question to the chatbot without contacting a real support agent. Most likely, your documentation will be made up of many thousands, or even millions, of words...

Let's explore how to overcome this limitation and perform question answering on very large documents.
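A common building block for this is chunking: splitting a large document into smaller overlapping pieces that each fit within the model's context limit. Here is a minimal word-based sketch (the chunk size and overlap values are arbitrary and should be tuned to your model):

```python
def chunk_text(text, max_words=100, overlap=20):
    """Split a long document into overlapping word-based chunks.

    Each chunk holds at most `max_words` words, and consecutive chunks
    share `overlap` words so that no sentence is cut off without context.
    """
    words = text.split()
    chunks = []
    step = max_words - overlap  # must stay positive: overlap < max_words
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks
```

Each chunk can then be embedded and indexed individually, which is exactly what makes semantic search on large documents possible.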

Semantic Search VS Generative AI

When it comes to question answering, two kinds of technology can be used: text generation (generative AI) and semantic search.

The first one, text generation, is basically what I just showed above. It usually relies on an advanced text generation model like ChatDolphin, LLaMA, ChatGPT, or GPT-4. Such a model is able to understand a human question and respond like a human too. However, it does not work on large documents. Fine-tuning a generative AI model on your domain knowledge would not work well either, as fine-tuning is not a good technique for adding knowledge to a model.

Semantic search is basically about searching documents the same way Google does, but within your own domain knowledge.

In order to achieve that, you need to convert your internal documents into vectors (also known as "embeddings"). Then you convert your question into a vector too, and perform a vector search (also known as "semantic similarity" search) in order to retrieve the part of your domain knowledge that is the closest to your question.
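To make the idea concrete, here is a minimal sketch of vector search using cosine similarity, with tiny hand-made 3-dimensional vectors standing in for real embeddings (which typically have hundreds of dimensions and come from an embedding model):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" for a question and two indexed documents.
question_vec = [0.9, 0.1, 0.0]
doc_vecs = {
    "warranty page": [0.8, 0.2, 0.1],  # close to the question vector
    "pricing page": [0.1, 0.9, 0.3],   # far from the question vector
}

# Vector search: return the document whose embedding is closest to the question.
best = max(doc_vecs, key=lambda name: cosine_similarity(question_vec, doc_vecs[name]))
# best -> "warranty page"
```

A real system computes the vectors with an embedding model and delegates the `max` step to a vector database, but the scoring principle is the same.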

See how to extract embeddings on NLP Cloud here.

A first solution is to store your embeddings in a dedicated vector store like PostgreSQL with the pgvector extension.

Another solution is to train your own semantic search model on your own domain knowledge and deploy it on a GPU (which is the solution we propose at NLP Cloud, because it offers the best response time). Once your vector DB is ready, or your model is created, you can ask questions in natural language, and your AI model will return the extract of your domain knowledge that best answers your question.

Semantic search is usually very fast and relatively cheap. It is also more reliable than the text generation fine-tuning strategy, so you will not face any AI hallucination problem. But it is not able to properly "answer" a question: it simply returns a piece of text that contains an answer. It is then up to the user to read the whole piece of text in order to find the answer to their question.

For more details, see our documentation about semantic search here.

The good news is that it is possible to combine semantic search and generative AI in order to achieve advanced results!

Question Answering Mixing Semantic Search And Generative AI

In order to answer questions on domain knowledge, the strategy we prefer at NLP Cloud is the following: first make a request with semantic search in order to retrieve the resources that best answer your question, and then use text generation to answer the question based on these resources, like a human would.
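This two-step flow can be sketched in a few lines of Python. In this toy example, a simple word-overlap retriever stands in for a real semantic search model, and the generation step is left as a comment since it requires a generative model:

```python
def retrieve(question, documents):
    """Return the document sharing the most words with the question (toy scoring)."""
    question_words = set(question.lower().split())
    return max(documents, key=lambda doc: len(question_words & set(doc.lower().split())))

# Toy domain knowledge, one passage per entry.
documents = [
    "All NLP Cloud plans can be stopped anytime.",
    "In case of a downgrade, you will get a discount on your next invoice.",
]

# Step 1: retrieve the passage most relevant to the question.
context = retrieve("When can plans be stopped?", documents)

# Step 2: send `context` plus the question to a generative model, which
# turns the retrieved passage into a short human-like answer ("Anytime").
```

In production, step 1 is a real semantic search request and step 2 is a question answering request, as shown in the rest of this article.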

Let's say that we are an HP printer reseller, and we want to answer our customers' questions on our website.

First we will need to calculate embeddings and store them in a vector database, or create our own semantic search model. Here it will be made of 3 examples only, but in real life you can include up to 1 million examples when using semantic search on NLP Cloud. We simply create a CSV file and put the following inside:

HP® LaserJets have unmatched printing speed, performance and reliability that you can trust. Enjoy Low Prices and Free Shipping when you buy now online.
Every HP printer comes with at least a one-year HP commercial warranty (or HP Limited Warranty). Some models automatically benefit from a three-year warranty, which is the case of the HP Color LaserJet Plus, the HP Color LaserJet Pro, and the HP Color LaserJet Expert.
HP LaserJet ; Lowest cost per page on mono laser printing. · $319.99 ; Wireless options available. · $109.00 ; Essential management features. · $209.00.
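If you prefer to build this CSV programmatically, here is a minimal sketch using Python's standard csv module (the passages are abridged here, and the in-memory buffer can be swapped for a real file):

```python
import csv
import io

# One passage of domain knowledge per row, in a single column (abridged examples).
passages = [
    "HP® LaserJets have unmatched printing speed, performance and reliability.",
    "Every HP printer comes with at least a one-year HP commercial warranty.",
    "HP LaserJet ; Lowest cost per page on mono laser printing.",
]

buffer = io.StringIO()  # use open("dataset.csv", "w", newline="") to write a real file
writer = csv.writer(buffer)  # csv.writer handles quoting of commas inside passages
for passage in passages:
    writer.writerow([passage])

csv_content = buffer.getvalue()
```

Using a CSV writer rather than string concatenation matters because passages often contain commas and quotes that need proper escaping.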

We then upload our CSV dataset to NLP Cloud and click "Create model". After a while, our own semantic search model containing our own domain knowledge will be ready and we will receive a private API URL in order to use it.

Let's ask a question to our brand new model using the NLP Cloud Python client:

import nlpcloud

# We use a fake model name and a fake API key for illustration purposes.
client = nlpcloud.Client("custom-model/5d8e6s8w5", "poigre5754gaefdsf5486gdsa56", gpu=True)
client.semantic_search("How long is the warranty on the HP Color LaserJet Pro?")

The model returns the following with a short response time:

{
    "search_results": [
        {
            "score": 0.99,
            "text": "Every HP printer comes with at least a one-year HP commercial warranty (or HP Limited Warranty). Some models automatically benefit from a three-year warranty, which is the case of the HP Color LaserJet Plus, the HP Color LaserJet Pro, and the HP Color LaserJet Expert."
        },
        {
            "score": 0.74,
            "text": "All consumer PCs and printers come with a standard one-year warranty. Care packs provide an enhanced level of support and/or an extended period of coverage for your HP hardware. All commercial PCs and printers come with either a one-year or three-year warranty."
        },
        {
            "score": 0.68,
            "text": "In-warranty plan · Available in 2-, 3-, or 4-year extension plans · Includes remote problem diagnosis support and Next Business Day Exchange Service."
        }
    ]
}

Now we retrieve the answer that has the highest score (we could also retrieve several answers): "Every HP printer comes with at least a one-year HP commercial warranty (or HP Limited Warranty). Some models automatically benefit from a three-year warranty, which is the case of the HP Color LaserJet Plus, the HP Color LaserJet Pro, and the HP Color LaserJet Expert."
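Programmatically, picking the top-scoring passage from the response above is a one-liner (the response dict is abridged here for readability):

```python
# Abridged version of the semantic search response shown above.
response = {
    "search_results": [
        {"score": 0.99, "text": "Every HP printer comes with at least a one-year HP commercial warranty..."},
        {"score": 0.74, "text": "All consumer PCs and printers come with a standard one-year warranty..."},
        {"score": 0.68, "text": "In-warranty plan..."},
    ]
}

# Do not assume the results are sorted: pick the entry with the highest score.
best = max(response["search_results"], key=lambda result: result["score"])
context = best["text"]
```

The `context` variable is exactly what we pass to the generative model in the next step.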

This response is correct, but it is not very user-friendly, since the user needs to read quite a long piece of text to get the answer. So now we ask the same question again, this time to our question answering endpoint, using the ChatDolphin model. We will use the semantic search response as the context:

import nlpcloud

client = nlpcloud.Client("chatdolphin", "poigre5754gaefdsf5486gdsa56", gpu=True)
client.question(
    """How long is the warranty on the HP Color LaserJet Pro?""",
    context="""Every HP printer comes with at least a one-year HP commercial warranty (or HP Limited Warranty). Some models automatically benefit from a three-year warranty, which is the case of the HP Color LaserJet Plus, the HP Color LaserJet Pro, and the HP Color LaserJet Expert."""
)

It returns the following answer:

{
    "answer": "The warranty lasts for three years."
}

Pretty good, isn't it?

Conclusion

Despite the recent progress made on generative AI models like ChatDolphin, LLaMA, ChatGPT, GPT-4, etc., the limited request size makes it impossible to use these great models directly on specific domain knowledge for question answering. And unfortunately, fine-tuning these models does not work well for such a use case either...

A good strategy is to implement a RAG system: first convert your documents into embeddings and store them in a vector database (or create your own semantic search model out of your documents) and use semantic search to retrieve the passages that best match your question, then use a regular question answering model based on generative AI in order to return a human-like answer to the initial question.

If you want to implement this strategy, don't hesitate to create your own semantic search model on NLP Cloud: see the related documentation here!

Mark
Application Engineer at NLP Cloud