Exploring RAG Applications Across Languages: Conversing with the Mishnah


In this post, I'm excited to share my journey of building MishnahBot, a Retrieval-Augmented Generation (RAG) application for interacting with rabbinic texts. It aims to give scholars and everyday users an intuitive way to query and explore the Mishnah interactively, whether that means quickly locating relevant source texts or summarizing a complex debate about religious law and extracting the bottom line.

I had the idea for such a project a few years back, but I felt the technology wasn't ripe yet. Now, with the advances in large language models and RAG capabilities, it is pretty straightforward.

Here is what our final product will look like; you can try it out here:

RAG applications are gaining significant attention for improving accuracy and harnessing the reasoning power of large language models (LLMs). Imagine being able to chat with your library, a collection of car manuals from the same manufacturer, or your tax documents: you ask questions and receive answers informed by a wealth of specialized knowledge.

There are two emerging trends in improving language model interactions: Retrieval-Augmented Generation (RAG) and increasing context length, potentially by allowing very long documents as attachments.

One key advantage of RAG systems is cost-efficiency. With RAG, you can handle large amounts of context without drastically increasing the cost of each query, which can otherwise get expensive. Additionally, RAG is more modular, allowing you to plug and play with different knowledge bases and LLM providers. On the other hand, increasing the context length directly in language models is an exciting development that can enable handling much longer texts in a single interaction.

For this project, I used AWS SageMaker for my development environment, AWS Bedrock to access various LLMs, and the LangChain framework to manage the pipeline. Both AWS services are user-friendly and charge only for the resources used, so I really encourage you to try it out yourselves. For Bedrock, you'll need to request access to Llama 3 70B Instruct and Claude Sonnet.

Let's open a new Jupyter notebook and install the packages we will be using:
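The exact package list isn't reproduced here; a plausible set, given the tools used later in this post (LangChain, ChromaDB, sentence-transformers embeddings, and Bedrock via boto3), looks something like this:

```python
# Run in a notebook cell. The package list is an assumption based on the tools used in this post.
!pip install langchain langchain-community chromadb sentence-transformers boto3
```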

The dataset for this project is the Mishnah, an ancient Rabbinic text central to Jewish tradition. I chose this text because it is close to my heart and also presents a challenge for language models since it is a niche topic. The dataset was obtained from the Sefaria-Export repository, a treasure trove of rabbinic texts with English translations aligned with the original Hebrew. This alignment facilitates switching between languages in different steps of our RAG application.

Note: The same process can be applied to any other collection of texts of your choosing. This example also demonstrates how RAG technology can be used across different languages, as shown here with Hebrew.

First, we need to download the relevant data. Since the full repository is quite large, we will use git sparse-checkout to fetch only what we need. Open a terminal window and run the following:
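The commands below are a sketch of the sparse checkout; the in-repo path to the Mishnah JSON files is an assumption, so adjust it to the actual layout of the Sefaria-Export repository:

```bash
# Clone only the repository metadata, then check out just the Mishnah JSON files
git clone --depth 1 --filter=blob:none --sparse https://github.com/Sefaria/Sefaria-Export.git
cd Sefaria-Export
git sparse-checkout set "json/Mishnah"   # the path inside the repo is an assumption
```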

And voilà! We now have the data files we need:

Now let's load the documents into our Jupyter notebook environment:
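As a minimal sketch, the loading step walks the downloaded tree and collects the English passages along with their location in the Mishnah. The directory layout, file names, and JSON schema below (a `text` field holding a list of chapters, each a list of passages) are assumptions; adapt them to the files you actually downloaded:

```python
import os
import json

def load_mishnah(base_dir="Sefaria-Export/json/Mishnah", language="English"):
    """Collect passages along with their tractate/chapter/mishnah coordinates."""
    passages = []
    for root, _, files in os.walk(base_dir):
        if language not in root:
            continue
        for fname in files:
            if not fname.endswith("merged.json"):  # file name is an assumption
                continue
            with open(os.path.join(root, fname), encoding="utf-8") as f:
                data = json.load(f)
            title = data.get("title", os.path.basename(root))
            for ch_idx, chapter in enumerate(data.get("text", []), start=1):
                for m_idx, passage in enumerate(chapter, start=1):
                    passages.append({
                        "tractate": title,
                        "chapter": ch_idx,
                        "mishnah": m_idx,
                        "text": passage,
                    })
    return passages

passages = load_mishnah()
print(f"Loaded {len(passages)} passages")
```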

And take a look at the data:
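A quick way to inspect what we loaded, assuming the `passages` list from the sketch above:

```python
import pandas as pd

# Tabular view of the passages for a quick sanity check
df = pd.DataFrame(passages)
df.head()
```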

Looks good; we can move on to the vector database stage.

Next, we vectorize the text and store it in a local ChromaDB. In one sentence, the idea is to represent each text as a dense vector (an array of numbers) such that texts that are semantically similar end up close to each other in vector space. This is the technology that will enable us to retrieve the relevant passages given a query.

We opted for a lightweight embedding model, all-MiniLM-L6-v2, which can run efficiently on a CPU. This model provides a good balance between performance and resource efficiency, making it suitable for our application. While state-of-the-art models like OpenAI's text-embedding-3-large may offer superior performance, they require substantial computational resources, typically running on GPUs.

For more information about embedding models and their performance, you can refer to the MTEB leaderboard, which compares various text embedding models across multiple tasks.

Here's the code we will use for vectorizing (it should only take a few minutes to run on this dataset on a CPU machine):
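A minimal sketch of that step, assuming the `passages` list built earlier and a recent LangChain version (import paths have moved between releases):

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document

# Wrap each passage as a LangChain Document, keeping its location as metadata
docs = [
    Document(
        page_content=p["text"],
        metadata={"tractate": p["tractate"], "chapter": p["chapter"], "mishnah": p["mishnah"]},
    )
    for p in passages
]

# all-MiniLM-L6-v2 is small enough to embed the whole corpus on a CPU
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Persist the vectors to a local ChromaDB directory so we only have to embed once
vectordb = Chroma.from_documents(docs, embedding=embeddings, persist_directory="chroma_db")
```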

With our dataset ready, we can now create our Retrieval-Augmented Generation (RAG) application in English. For this, we'll use LangChain, a powerful framework that provides a unified interface for various language model operations and integrations, making it easy to build sophisticated applications.

LangChain simplifies the process of integrating different components like language models (LLMs), retrievers, and vector stores. By using LangChain, we can focus on the high-level logic of our application without worrying about the underlying complexities of each component.

Here's the code to set up our RAG system:
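The post doesn't spell out which Bedrock model backs the English chain, so the sketch below uses Claude Sonnet; the model ID, region, and retrieval parameters are assumptions you should adjust to your own account:

```python
import boto3
from langchain_community.chat_models import BedrockChat
from langchain.chains import RetrievalQA

# Bedrock runtime client (the region is an assumption)
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Claude Sonnet via Bedrock (the model ID is an assumption)
llm = BedrockChat(
    client=bedrock,
    model_id="anthropic.claude-3-sonnet-20240229-v1:0",
    model_kwargs={"temperature": 0.0, "max_tokens": 1024},
)

# Retrieve the top-k passages most similar to the query from ChromaDB
retriever = vectordb.as_retriever(search_kwargs={"k": 3})

# Stuff the retrieved passages into the prompt and let the LLM answer
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,
)
```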

Alright, let's try it out! We will use a query related to the very first paragraphs of the Mishnah.
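The very first Mishnah (Berakhot 1:1) deals with when the evening Shema may be recited, so a question along those lines is a natural sanity check. The exact query from the original isn't reproduced here; a hypothetical invocation looks like this:

```python
result = qa_chain.invoke({"query": "From when may one recite the Shema in the evening?"})
print(result["result"])
for doc in result["source_documents"]:
    print(doc.metadata)  # which tractate/chapter/mishnah the answer draws on
```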

That seems pretty accurate.

Let's try a more sophisticated question:

Very nice.

For comparison, I also tried asking Claude directly, without our retrieval step. Here's what I got:

The response is long and beside the point, and the answer it gives is incorrect (reaping is the third type of work in the list, while selecting is the seventh). This is what we call a hallucination.

While Claude is a powerful language model, relying solely on an LLM generating responses from memorized training data, or even using internet searches, lacks the precision and control offered by a custom database in a Retrieval-Augmented Generation (RAG) application. Here's why:

This structured retrieval process ensures users receive the most accurate and relevant answers, leveraging both the language generation capabilities of LLMs and the precision of custom data retrieval.

Finally, we will address the challenge of interacting in Hebrew with the original Hebrew text. The same approach can be applied to any other language, as long as you are able to translate the texts to English for the retrieval stage.

Supporting Hebrew interactions adds an extra layer of complexity since embedding models and large language models (LLMs) tend to be stronger in English. While some embedding models and LLMs do support Hebrew, they are often less robust than their English counterparts, especially the smaller embedding models that likely focused more on English during training.

To tackle this, we could train our own Hebrew embedding model. However, another practical approach is to leverage a one-time translation of the text to English and use English embeddings for the retrieval process. This way, we benefit from the strong performance of English models while still supporting Hebrew interactions.

In our case, we already have professional human translations of the Mishnah into English. We will use these to ensure accurate retrievals while maintaining the integrity of the Hebrew responses. Here's how we can set up this cross-lingual RAG system:

For generation, we use Claude Sonnet since it performs significantly better on Hebrew text compared to Llama 3.

Here is the code implementation:
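The sketch below follows the approach described above: translate the Hebrew question to English with Llama 3, retrieve against the English embeddings, and have Claude Sonnet answer in Hebrew from the aligned Hebrew originals. It reuses the `llm` and `retriever` defined in the English setup. The Bedrock model IDs, the prompt wording, and the assumption that each document's metadata carries a `hebrew_text` field with its aligned Hebrew passage (a small extension of the earlier loading and vectorization sketches) are all mine, not taken from the original code:

```python
from langchain_community.llms import Bedrock

# Llama 3 70B Instruct handles the one-off query translation (the model ID is an assumption)
translator = Bedrock(
    client=bedrock,
    model_id="meta.llama3-70b-instruct-v1:0",
    model_kwargs={"temperature": 0.0},
)

def answer_in_hebrew(hebrew_question: str) -> str:
    # 1. Translate the Hebrew question to English for retrieval.
    #    The prompt deliberately does not start with a newline (Llama 3 Instruct is sensitive to that),
    #    and we keep only the first output line to drop any trailing chatter.
    translation_prompt = (
        "Translate the following Hebrew question to English. Respond with the translation only.\n"
        f"Question: {hebrew_question}\nTranslation:"
    )
    english_question = translator.invoke(translation_prompt).strip().split("\n")[0]

    # 2. Retrieve the most relevant passages using the English embeddings
    docs = retriever.invoke(english_question)

    # 3. Build the context from the aligned Hebrew originals stored in the metadata
    context = "\n\n".join(d.metadata["hebrew_text"] for d in docs)

    # 4. Ask Claude Sonnet (the `llm` defined earlier) to answer in Hebrew, grounded in those passages
    answer_prompt = (
        "Answer the following question in Hebrew, based only on the Mishnah passages provided.\n\n"
        f"Passages:\n{context}\n\nQuestion: {hebrew_question}\nAnswer:"
    )
    return llm.invoke(answer_prompt).content
```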

Let's try it! We will use the same question as before, but in Hebrew this time:

We got an accurate, one-word answer to our question. Pretty neat, right?

The translation step with Llama 3 Instruct posed several challenges. Initially, the model produced nonsensical results no matter what I tried. (Apparently, Llama 3 Instruct is very sensitive to prompts that start with a newline character!)

After resolving that issue, the model tended to output the correct response but then continue with additional irrelevant text, so truncating the output at the first newline character proved effective.

Controlling the output format can be tricky. Some strategies include requesting JSON output or providing examples with few-shot prompts.
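For example, one way to pin the translation step down is to ask for JSON and parse it; the prompt wording here is illustrative, not the one used in the project, and it reuses the `translator` defined above:

```python
import json

def translate_to_english(hebrew_question: str) -> str:
    prompt = (
        "Translate the following Hebrew question to English. "
        'Respond with JSON only, in the form {"translation": "..."}.\n'
        f"Question: {hebrew_question}"
    )
    raw = translator.invoke(prompt)
    # Pull out the first {...} span in case the model wraps the JSON in extra text
    return json.loads(raw[raw.index("{"): raw.rindex("}") + 1])["translation"]
```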

In this project, we also remove the vowel marks (nikud) from the Hebrew texts, since most Hebrew text online does not include vowels and we want the context for our LLM to resemble the text it saw during pretraining.
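A common way to do this is to strip the Unicode combining marks used for Hebrew vowel points and cantillation; the range below also catches a few related punctuation marks, which is usually fine for this purpose:

```python
import re

# Hebrew vowel points and cantillation marks (plus a few adjacent punctuation marks)
# occupy the Unicode range U+0591..U+05C7
NIKUD_PATTERN = re.compile(r"[\u0591-\u05C7]")

def remove_nikud(text: str) -> str:
    return NIKUD_PATTERN.sub("", text)

remove_nikud("בְּרֵאשִׁית")  # -> "בראשית"
```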

Building this RAG application has been a fascinating journey, blending the nuances of ancient texts with modern AI technologies. My passion for making the library of ancient rabbinic texts more accessible to everyone (myself included) has driven this project. This technology enables chatting with your library, searching for sources based on ideas, and much more. The approach used here can be applied to other treasured collections of texts, opening up new possibilities for accessing and exploring historical and cultural knowledge.

It's amazing to see how all this can be accomplished in just a few hours, thanks to the powerful tools and frameworks available today. Feel free to check out the full code on GitHub, and play with the MishnahBot website.

Please share your comments and questions, especially if you're trying out something similar. If you want to see more content like this in the future, do let me know!
