Discover the Performance Gain with Retrieval Augmented Generation – The New Stack

Large Language Models (LLMs) are smart enough to understand context. They can answer questions, leveraging their vast training data to provide coherent and contextually relevant responses, whether the topic is astronomy, history or even physics. However, LLMs tend to hallucinate (deliver compelling yet false facts) when asked about topics outside the scope of their training data, or when they can't recall the details buried in that training data.

A new technique, Retrieval Augmented Generation (RAG), fills these knowledge gaps and reduces hallucinations by augmenting prompts with external data. Combined with a vector database (like MyScale), it can substantially boost performance on extractive question answering.

To this end, this article focuses on measuring the performance gain from RAG on the widely used MMLU dataset. We find that the performance of both commercial and open-source LLMs improves significantly when knowledge is retrieved from Wikipedia using a vector database. More interestingly, this holds even though Wikipedia is already part of these models' training sets.

You can find the code for the benchmark framework and this example here.

But first, let's describe Retrieval Augmented Generation (RAG).

Research projects aim to enhance LLMs like gpt-3.5 by coupling them with external knowledge bases (like Wikipedia), databases, or the internet to create more knowledgeable and contextually aware systems. For example, let's assume a user asks an LLM what Newton's most important result is. To help the LLM retrieve the correct information, we can search for Newton's Wikipedia page and provide it as context to the LLM.
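As a minimal sketch of this flow (the `search_wikipedia` and `ask_llm` helpers below are hypothetical stand-ins for the retrieval and generation components described later in this article):

```python
# Hypothetical helpers: search_wikipedia() would return the top-k most
# relevant passages from a knowledge base, and ask_llm() would send a
# prompt to any LLM and return its reply.
def answer_with_rag(question: str, top_k: int = 3) -> str:
    passages = search_wikipedia(question, top_k=top_k)  # 1. retrieve
    context = "\n\n".join(passages)                      # 2. augment
    prompt = (
        "Answer the question using the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return ask_llm(prompt)                               # 3. generate
```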

This method is called Retrieval Augmented Generation (RAG). Lewis et al., in Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, define Retrieval Augmented Generation as:

A type of language generation model that combines pre-trained parametric and non-parametric memory for language generation.

Moreover, the authors of this academic paper go on to state that they:

Endow pre-trained, parametric-memory generation models with a non-parametric memory through a general-purpose fine-tuning approach.

Note: Parametric-memory LLMs are massive self-reliant knowledge repositories like ChatGPT and Google's PaLM. Non-parametric memory LLMs leverage external resources that add additional context to parametric-memory LLMs.

Combining external resources with LLMs seems feasible, as LLMs are good learners and referring to specific external knowledge domains can improve truthfulness. But how much of an improvement does this combination bring?

Two major factors affect a RAG system: how much the LLM actually learns from the retrieved context, and how relevant and accurate that retrieved context is.

Both of these factors are hard to evaluate. The knowledge the LLM gains from the context is implicit, so the most practical way to assess this factor is to examine the LLM's answer. However, the accuracy of the retrieved context is also tricky to evaluate.

Measuring the relevance between paragraphs, especially in question answering or information retrieval, can be a complex task. The relevance assessment is crucial to determine whether a given section contains information directly related to a specific question. This is especially important in tasks that involve extracting information from large datasets or documents, like the WikiHop dataset.

Sometimes, datasets employ multiple annotators to assess the relevance between paragraphs and questions. Using multiple annotators to vote on relevance helps mitigate subjectivity and potential biases that can arise from individual annotators. This method also adds a layer of consistency and ensures that the relevance judgment is more reliable.

As a consequence of all these uncertainties, we developed an open-source, end-to-end evaluation of RAG systems. This evaluation considers different model settings, retrieval pipelines, knowledge base choices, and search algorithms.

We aim to provide valuable baselines for RAG system designs and hope that more developers and researchers join us in building a comprehensive and systematic benchmark. More results will help us disentangle these two factors and create a dataset closer to real-world RAG systems.

Note: Share your evaluation results at GitHub. PRs are very welcome!

In this article, we focus on a simple baseline evaluated on MMLU (the Massive Multitask Language Understanding dataset), a widely used benchmark for LLMs containing multiple-choice, single-answer questions on many subjects, such as history, astronomy, and economics.

We set out to determine whether an LLM can learn from extra context by having it answer these multiple-choice questions.
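For reference, the dataset is easy to inspect from Python. The sketch below loads one MMLU subject from a public Hugging Face copy (assuming the cais/mmlu dataset layout, where each record has a question, four choices, and the index of the correct answer):

```python
from datasets import load_dataset

# Assumes the "cais/mmlu" copy of MMLU on Hugging Face.
mmlu = load_dataset("cais/mmlu", "astronomy", split="test")

sample = mmlu[0]
choices = "\n".join(
    f"{letter}. {text}" for letter, text in zip("ABCD", sample["choices"])
)
print(sample["question"])
print(choices)
print("Correct answer:", "ABCD"[sample["answer"]])
```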

To achieve our aim, we chose Wikipedia as our source of truth because it covers many subjects and knowledge domains. We used the version cleaned by Cohere.ai on Hugging Face, which includes 34,879,571 paragraphs belonging to 5,745,033 titles. An exhaustive search over these paragraphs would take quite a long time, so we need an appropriate ANNS (Approximate Nearest Neighbor Search) algorithm to retrieve relevant documents. Specifically, we use the MyScale database with the MSTG vector index to retrieve the relevant documents.
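As a rough sketch of what such a retrieval query can look like (the table name, schema, and connection details below are placeholders and the exact DDL may differ; MyScale speaks the ClickHouse protocol, so the standard clickhouse-connect Python client can be used):

```python
import clickhouse_connect

# Placeholder connection details for a MyScale cluster.
client = clickhouse_connect.get_client(
    host="your-myscale-host", port=443, username="user", password="***"
)

# Build an MSTG vector index over the embedding column (done once).
# Assumes a table wikipedia(title String, text String, emb Array(Float32)).
client.command("ALTER TABLE wikipedia ADD VECTOR INDEX emb_idx emb TYPE MSTG")

# Approximate nearest-neighbor search: order rows by distance to the
# question embedding (768-dimensional for the mpnet model discussed below).
question_emb = [0.01] * 768  # in practice, the embedding of the question
vec_literal = ", ".join(map(str, question_emb))
rows = client.query(
    f"SELECT title, text, distance(emb, [{vec_literal}]) AS d "
    f"FROM wikipedia ORDER BY d LIMIT 5"
).result_rows
```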

Semantic search is a well-researched topic, with many models and detailed benchmarks available. Built on vector embeddings, semantic search gains the ability to recognize paraphrased expressions, synonyms, and contextual meaning.

Moreover, embeddings provide dense and continuous vector representations that enable the calculation of meaningful metrics of relevance. These dense metrics capture semantic relationships and context, making them valuable for assessing relevance in LLM information retrieval tasks.

Taking into account the factors mentioned above, we have decided to use the paraphrase-multilingual-mpnet-base-v2 model from Hugging Face to extract features for retrieval tasks. This model is part of the MPNet family, designed to generate high-quality embeddings suitable for various NLP tasks, including semantic similarity and retrieval.
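A minimal usage sketch follows (the question and paragraphs are made up for illustration; the same model must embed both the corpus and the query so that the similarity scores are comparable):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

question = "What is Isaac Newton's most important result?"
paragraphs = [
    "Newton formulated the laws of motion and universal gravitation.",
    "The Eiffel Tower was completed in 1889.",
]

# Encode the query and candidate paragraphs into 768-dimensional vectors.
q_emb = model.encode(question, convert_to_tensor=True)
p_emb = model.encode(paragraphs, convert_to_tensor=True)

# Higher cosine similarity = more relevant paragraph.
print(util.cos_sim(q_emb, p_emb))
```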

For our LLMs, we chose OpenAI's gpt-3.5-turbo and llama2-13b-chat quantized to 6 bits, among the most popular commercial and open-source models, respectively. The LLaMA 2 model is quantized with llama.cpp. We chose this 6-bit quantization setup because it is affordable to run without sacrificing much performance.
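As an illustration of how a 6-bit quantized model can be run locally (the model path is a placeholder for a llama.cpp-quantized q6_K file, and llama-cpp-python is just one of several ways to load it):

```python
from llama_cpp import Llama

# Placeholder path: llama2-13b-chat quantized to q6_K with llama.cpp's
# quantize tool.
llm = Llama(model_path="./llama-2-13b-chat.Q6_K.gguf", n_ctx=4096)

output = llm(
    "Q: Which planet is known as the Red Planet? A:",
    max_tokens=16,
    temperature=0.0,
)
print(output["choices"][0]["text"])
```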

Note: You can also try other models to test their RAG performance.

The following image describes how to formulate a simple RAG system:

Figure 1: Simple Benchmarking RAG

Note: Transform can be anything, as long as its output can be fed to the LLM and the LLM returns the correct answer. In our use case, Transform injects context into the question.

Our final LLM prompt is as follows:

```python
template = (
    "The following are multiple choice questions (with answers) with context:"
    "\n\n{context}Question: {question}\n{choices}Answer: "
)
```
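To illustrate the Transform step end to end, the sketch below fills this template with a hypothetical retrieved paragraph and question and sends it to gpt-3.5-turbo (assuming the openai Python client, v1 or later, and an OPENAI_API_KEY in the environment; the benchmark's actual harness may differ):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# `template` is defined in the snippet above; the context and question
# are made up for this illustration.
prompt = template.format(
    context="Isaac Newton formulated the law of universal gravitation.\n",
    question="Who formulated the law of universal gravitation?",
    choices="A. Kepler\nB. Newton\nC. Galileo\nD. Einstein\n",
)

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=1,      # we only need the answer letter
    temperature=0.0,
)
print(resp.choices[0].message.content)  # expected: "B"
```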

Now let's move on to the results.

Our benchmark test results are collated in Table 1 below.

But first, our summarized findings are:

- Adding retrieved context improves both models, even with only one extra context.
- More retrieved context does not always help; some scores drop as more documents are added.
- The smaller model benefits more from retrieved context than the larger one.

In these benchmarking tests, we compared performance with and without context. The test without context shows how well the model's internal knowledge alone can solve the questions; the test with context shows how much an LLM can learn from the retrieved context.

Note: Both llama2-13b-chat and gpt-3.5-turbo improve by around 3-5% overall, even with only one extra context.

The table shows that some numbers are negative, for example, when we insert context into the clinical-knowledge questions for gpt-3.5-turbo.

This might be because the knowledge base falls short here (Wikipedia does not contain much clinical knowledge), or because OpenAI's terms of use and guidelines strongly discourage, and may even prohibit, using its AI models for medical advice. Despite this, the improvement is quite evident for both models.

Notably, the gpt-3.5-turbo results suggest that the RAG system might be powerful enough to compete with other language models. Some of the reported numbers, such as those for prehistory and astronomy, approach the performance of gpt-4 once extra context tokens are provided, suggesting that RAG could be an alternative to fine-tuning as a path toward specialized Artificial General Intelligence (AGI).

Note: RAG is more practical than fine-tuning models as it is a plug-in solution and works with both self-hosted and remote models.

Figure 2: Performance Gain vs. the Number of Contexts

The benchmark above suggests that you should provide as much context as possible: in most cases, LLMs will learn from all of the supplied contexts and, theoretically, the model should give better answers as the number of retrieved documents increases. However, our benchmarking shows that some scores dropped as more contexts were retrieved.

By way of validating our benchmarking results, a Stanford University paper, Lost in the Middle: How Language Models Use Long Contexts, suggests that LLMs mostly attend to the head and tail of the context. Therefore, choose fewer but more accurate contexts from the retrieval system to augment your LLM.

The larger the LLM, the more knowledge it stores. Larger LLMs tend to have a greater capacity to store and understand information, which often translates to a broader knowledge base of generally understood facts. Our benchmarking tests tell the same story: the smaller LLMs lack knowledge and are hungrier for more knowledge.

Our results show that llama2-13b-chat gains significantly more from the added context than gpt-3.5-turbo, suggesting that retrieval injects more new knowledge into the smaller model. These results imply that gpt-3.5-turbo was largely given information it already knows, while llama2-13b-chat is still learning from the context.

Almost every LLM uses the Wikipedia corpus as part of its training data, meaning both gpt-3.5-turbo and llama2-13b-chat should already be familiar with the contexts added to the prompt. The questions this raises are: why do these models still benefit from being shown content they were trained on, and how much of that training data do they actually retain?

We currently don't have answers to these questions, so further research is needed.

Contribute to research to help others.

We can only cover a limited set of evaluations in this blog post, but we know more is needed. The results of every benchmark test matter, regardless of whether they replicate existing tests or report new findings from novel RAG systems.

With the aim of helping everyone create benchmark tests to test their RAG systems, we have open sourced our end-to-end benchmark framework. To fork our repository, check out our GitHub page.

This framework includes the following tools:

It's up to you to create your own benchmark. We believe RAG can be a possible solution to AGI. Therefore, we built this framework for the community to make everything trackable and reproducible.

PRs are welcome.

We have evaluated a small subset of MMLU with a simple RAG system built with different LLMs and vector search algorithms, and described our process and results in this article. We have also donated the evaluation framework to the community and are calling for more RAG benchmarks. We will continue running benchmarking tests and publishing the latest results on GitHub and the MyScale blog, so follow us on Twitter or join us on Discord to stay updated.


