Large Language Models (LLMs) are smart enough to understand context. They can answer questions, leveraging their vast training data to provide coherent and contextually relevant responses, whether the topic is astronomy, history, or physics. However, LLMs tend to hallucinate (deliver compelling yet false facts) when asked about topics outside the scope of their training data, or when they can't recall details from that data.
A new technique, Retrieval Augmented Generation (RAG), fills the knowledge gaps, reducing hallucinations by augmenting prompts with external data. Combined with a vector database (like MyScale), it substantially increases the performance gain in extractive question answering.
To this end, this article focuses on measuring the performance gain achieved with RAG on the widely used MMLU dataset. We find that the performance of both commercial and open-source LLMs can be significantly improved when knowledge is retrieved from Wikipedia using a vector database. More interestingly, this result holds even though Wikipedia is already in the training set of these models.
You can find the code for the benchmark framework and this example here.
But first, let's describe Retrieval Augmented Generation (RAG).
Research projects aim to enhance LLMs like gpt-3.5 by coupling them with external knowledge bases (like Wikipedia), databases, or the internet to create more knowledgeable and contextually aware systems. For example, let's assume a user asks an LLM what Newton's most important result is. To help the LLM retrieve the correct information, we can search for Newton's Wikipedia page and provide it as context to the LLM.
This method is called Retrieval Augmented Generation (RAG). Lewis et al., in Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, define Retrieval Augmented Generation as:
A type of language generation model that combines pre-trained parametric and non-parametric memory for language generation.
Moreover, the authors of this academic paper go on to state that they:
Endow pre-trained, parametric-memory generation models with a non-parametric memory through a general-purpose fine-tuning approach.
Note: Parametric-memory LLMs are massive self-reliant knowledge repositories like ChatGPT and Google's PaLM. Non-parametric memory LLMs leverage external resources that add additional context to parametric-memory LLMs.
Combining external resources with LLMs seems feasible as LLMs are good learners, and referring to specific external knowledge domains can improve truthfulness. But how much of an improvement will this combination be?
Two major factors affect a RAG system: how well the LLM learns from the retrieved context, and how accurate and relevant that retrieved context is.
Both of these factors are hard to evaluate. The knowledge the LLM gains from the context is implicit, so the most practical way to assess this factor is to examine the LLM's answer. However, the accuracy of the retrieved context is also tricky to evaluate.
Measuring the relevance between paragraphs, especially in question answering or information retrieval, can be a complex task. The relevance assessment is crucial to determine whether a given section contains information directly related to a specific question. This is especially important in tasks that involve extracting information from large datasets or documents, like the WikiHop dataset.
Sometimes, datasets employ multiple annotators to assess the relevance between paragraphs and questions. Using multiple annotators to vote on relevance helps mitigate subjectivity and potential biases that can arise from individual annotators. This method also adds a layer of consistency and ensures that the relevance judgment is more reliable.
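As a rough illustration (not the protocol of any specific dataset), majority voting over annotator labels can be sketched as:

```python
from collections import Counter

def majority_vote(labels):
    """Return the label chosen by the most annotators.

    Ties are broken in favor of the label seen first among the most
    common ones, which is one of several reasonable policies.
    """
    counts = Counter(labels)
    label, _ = counts.most_common(1)[0]
    return label

# Three annotators judge whether a paragraph is relevant to a question.
print(majority_vote(["relevant", "relevant", "irrelevant"]))  # relevant
```

With more annotators per item, the same function smooths out individual bias, which is exactly why datasets use multiple votes.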
As a consequence of all these uncertainties, we developed an open-source, end-to-end evaluation of the RAG system. This evaluation considers different model settings, retrieval pipelines, knowledge base choices, and search algorithms.
We aim to provide valuable baselines for RAG system designs and hope that more developers and researchers join us in building a comprehensive and systematic benchmark. More results will help us disentangle these two factors and create a dataset closer to real-world RAG systems.
Note: Share your evaluation results at GitHub. PRs are very welcome!
In this article, we focus on a simple baseline evaluated on the MMLU (Massive Multitask Language Understanding) dataset, a widely used benchmark for LLMs containing multiple-choice, single-answer questions on many subjects, such as history, astronomy, and economics.
We set out to find out whether an LLM can learn from extra context by letting it answer multiple-choice questions.
To achieve our aim, we chose Wikipedia as our source of truth because it covers many subjects and knowledge domains. And we used the version cleaned by Cohere.ai on Hugging Face, which includes 34,879,571 paragraphs belonging to 5,745,033 titles. An exhaustive search of these paragraphs would take quite a long time, so we need an appropriate ANNS (Approximate Nearest Neighbor Search) algorithm to retrieve relevant documents. Specifically, we use the MyScale database with the MSTG vector index to retrieve the relevant documents.
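As a minimal sketch of what such a retrieval query might look like, the following builds a MyScale-style SQL string for nearest-neighbor search. The table name `wiki` and columns `title`, `text`, and `emb` are hypothetical placeholders, not our actual schema; MyScale exposes ANNS (backed here by an MSTG index) through a SQL `distance()` function:

```python
def build_vector_search_sql(query_vector, top_k=10):
    """Build a MyScale-style SQL query for vector search.

    `wiki`, `title`, `text`, and `emb` are hypothetical names;
    adapt them to your own schema before running against a database.
    """
    vec = ", ".join(f"{x:.6f}" for x in query_vector)
    return (
        f"SELECT title, text, distance(emb, [{vec}]) AS d "
        f"FROM wiki ORDER BY d ASC LIMIT {top_k}"
    )

sql = build_vector_search_sql([0.1, 0.2, 0.3], top_k=5)
print(sql)
```

The resulting string would then be sent through a ClickHouse-compatible client; we only construct it here to keep the sketch self-contained.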
Semantic search is a well-researched topic, with many models and detailed benchmarks available. When incorporated with vector embeddings, semantic search gains the ability to recognize paraphrased expressions, synonyms, and contextual understanding.
Moreover, embeddings provide dense and continuous vector representations that enable the calculation of meaningful metrics of relevance. These dense metrics capture semantic relationships and context, making them valuable for assessing relevance in LLM information retrieval tasks.
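For instance, the relevance between a question and a paragraph can be scored as the cosine similarity of their dense embeddings. A minimal sketch with toy 3-d vectors (real embeddings would come from the retrieval model):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Parallel vectors score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 0.0, 0.0], [2.0, 0.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0
```

In a retrieval pipeline, paragraphs are ranked by this score against the question embedding, and the top-scoring ones become the context.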
Taking into account the factors mentioned above, we have decided to use the paraphrase-multilingual-mpnet-base-v2 model from Hugging Face to extract features for retrieval tasks. This model is part of the MPNet family, designed to generate high-quality embeddings suitable for various NLP tasks, including semantic similarity and retrieval.
For our LLMs, we chose OpenAI's gpt-3.5-turbo and llama2-13b-chat with 6-bit quantization. These are among the most popular commercial and open-source models, respectively. The LLaMA2 model is quantized with llama.cpp. We chose this 6-bit quantization setup because it is affordable without sacrificing performance.
Note: You can also try other models to test their RAG performance.
The following image describes how to formulate a simple RAG system:
Figure 1: Simple Benchmarking RAG
Note: Transform can be anything as long as it can be fed into the LLM, returning the correct answer. In our use case, Transform injects context into the question.
Our final LLM prompt is as follows:
```python
template = (
    "The following are multiple choice questions (with answers) with context:"
    "\n\n{context}Question: {question}\n{choices}Answer: "
)
```
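To illustrate, the Transform step fills this template with retrieved context before the prompt is sent to the LLM. The question and choices below are made-up examples, and the choice-formatting convention is one plausible layout rather than our exact implementation:

```python
template = (
    "The following are multiple choice questions (with answers) with context:"
    "\n\n{context}Question: {question}\n{choices}Answer: "
)

def transform(context, question, choices):
    """Inject retrieved context into the multiple-choice prompt."""
    choice_lines = "\n".join(
        f"{label}. {text}" for label, text in zip("ABCD", choices)
    )
    return template.format(
        context=context, question=question, choices=choice_lines + "\n"
    )

prompt = transform(
    context="Isaac Newton formulated the laws of motion and universal gravitation.\n",
    question="Who formulated the law of universal gravitation?",
    choices=["Kepler", "Newton", "Galileo", "Einstein"],
)
print(prompt)
```

The LLM is then expected to complete the prompt with a single answer letter, which we compare against the gold label.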
Now let's move on to the results.
Our benchmark test results are collated in Table 1 below.
But first, our summarized findings. In these benchmarking tests, we compared performance with and without context: the test without context shows how well the model's internal knowledge can solve questions, while the test with context shows how an LLM can learn from context.
Note: Both llama2-13b-chat and gpt-3.5-turbo improve by around 3-5% overall, even with only one extra context.
The table shows that some numbers are negative, for example, when we insert context into the clinical-knowledge questions for gpt-3.5-turbo.
This might be because Wikipedia does not have much information on clinical knowledge, or because OpenAI's terms of use and guidelines strongly discourage, and may even prohibit, using their AI models for medical advice. Despite this, the increase is quite evident for both models.
Notably, the gpt-3.5-turbo results suggest that a RAG system might be powerful enough to compete with larger language models. Some of the reported numbers, such as those on prehistory and astronomy, push towards the performance of gpt-4 with extra tokens, suggesting RAG could be an alternative to fine-tuning on the road to specialized Artificial General Intelligence (AGI).
Note: RAG is more practical than fine-tuning models as it is a plug-in solution and works with both self-hosted and remote models.
Figure 2: Performance Gain vs. the Number of Contexts
The benchmark above suggests that you need as much context as possible. In most cases, LLMs will learn from all the supplied contexts. Theoretically, the model should provide better answers as the number of retrieved documents increases. However, our benchmarking shows that some numbers dropped as more contexts were retrieved.
By way of validating our benchmarking results, a Stanford University paper titled Lost in the Middle: How Language Models Use Long Contexts suggests that LLMs mainly attend to the head and tail of the context. Therefore, choose fewer but more accurate contexts from the retrieval system to augment your LLM.
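In line with that finding, one simple mitigation is to keep only the few highest-scoring contexts. The scores below are illustrative, not from our benchmark:

```python
def select_contexts(scored_contexts, k=3):
    """Keep the k contexts with the highest retrieval scores.

    `scored_contexts` is a list of (score, text) pairs, where a higher
    score means more relevant. Fewer but more accurate contexts tend to
    work better than many noisy ones.
    """
    ranked = sorted(scored_contexts, key=lambda pair: pair[0], reverse=True)
    return [text for _, text in ranked[:k]]

docs = [(0.91, "A"), (0.40, "B"), (0.85, "C"), (0.77, "D")]
print(select_contexts(docs, k=2))  # ['A', 'C']
```

More elaborate schemes (such as reranking, or placing the strongest contexts at the head and tail of the prompt) build on the same idea.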
The larger the LLM, the more knowledge it stores. Larger LLMs tend to have a greater capacity to store and understand information, which often translates to a broader knowledge base of generally understood facts. Our benchmarking tests tell the same story: the smaller LLMs lack knowledge and are hungrier for more knowledge.
Our results show that llama2-13b-chat gains more from the context than gpt-3.5-turbo, suggesting that context injects more knowledge into a smaller LLM during information retrieval. Additionally, these results imply gpt-3.5-turbo was given information it already knew, while llama2-13b-chat is still learning from the context.
Almost every LLM uses the Wikipedia corpus as a training dataset, meaning both gpt-3.5-turbo and llama2-13b-chat should already be familiar with the contexts added to the prompt. Therefore, questions remain: why do these models still benefit from context they have likely already seen, and what exactly are they learning from it?
We currently don't have answers to these questions, so further research is needed.
Contribute to research to help others.
We can only cover a limited set of evaluations in this blog. But we know more is needed. The results of every benchmark test matter, regardless of whether they are replications of existing tests or some new findings based on novel RAGs.
With the aim of helping everyone create benchmark tests for their RAG systems, we have open-sourced our end-to-end benchmark framework. To fork our repository, check out our GitHub page.
This framework includes the following tools:
It's up to you to create your own benchmark. We believe RAG can be a possible solution to AGI, so we built this framework for the community to make everything trackable and reproducible.
PRs are welcome.
We have evaluated a small subset of MMLU with a simple RAG system built with different LLMs and vector search algorithms, and described our process and results in this article. We have also donated the evaluation framework to the community and are calling for more RAG benchmarks. We will continue to run benchmarking tests and post the latest results on GitHub and the MyScale blog, so follow us on Twitter or join us on Discord to stay updated.
Here is the original post:
Discover the Performance Gain with Retrieval Augmented Generation - The New Stack