Automating Prompt Engineering with DSPy and Haystack

Teach your LLM how to talk through examples

One of the most frustrating parts of building gen-AI applications is the manual process of optimising prompts. In a post published by LinkedIn earlier this year, the team described what they learned after deploying an agentic RAG application. One of the main challenges was obtaining consistent quality: they spent four months tweaking various parts of the application, including the prompts, to mitigate issues such as hallucination.

DSPy is an open-source library that tries to parameterise prompts, turning prompt engineering into an optimisation problem. The original paper calls prompt engineering brittle and unscalable, and compares it to hand-tuning the weights of a classifier.

Haystack is an open-source library to build LLM applications, including RAG pipelines. It is platform-agnostic and offers a large number of integrations with different LLM providers, search databases and more. It also has its own evaluation metrics.

In this article, we will briefly go over the internals of DSPy, and show how it can be used to teach an LLM to prefer more concise answers when answering questions over an academic medical dataset.

This article from TDS provides a great in-depth exploration of DSPy; we will summarise it and reuse some of its examples.

In order to build an LLM application that can be optimised, DSPy offers two main abstractions: signatures and modules. A signature is a way to define the input and output of a system that interacts with LLMs. The signature is translated internally into a prompt by DSPy.
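For example, a minimal question-answering signature could look like this (a sketch; the class and field names are only illustrative):

```python
import dspy

class QuestionAnswering(dspy.Signature):
    """Answer the question."""

    question = dspy.InputField()
    answer = dspy.OutputField()
```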

When using the DSPy Predict module (more on this later), this signature is turned into the following prompt:
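The exact text is generated by DSPy internally and can vary between versions, but for the signature above it looks roughly like this:

```
Answer the question.

---

Follow the following format.

Question: ${question}
Answer: ${answer}

---

Question: {the question we pass in at runtime}
Answer:
```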

DSPy also has modules, which define predictors with parameters that can be optimised, such as the selection of few-shot examples. The simplest module is dspy.Predict, which does not modify the signature. Later in this article we will use the module dspy.ChainOfThought, which asks the LLM to provide its reasoning.

Things start to get interesting once we try to optimise a module (or, as DSPy calls it, compiling a module). When compiling a module, you typically need to specify three things: the module to be optimised, a training set, and a metric to evaluate against.

When using the dspy.Predict or the dspy.ChainOfThought modules, DSPy searches through the training set and selects the best examples to add to the prompt as few-shot examples. In the case of RAG, it can also include the context that was used to get the final response. It calls these examples demonstrations.

You also need to specify the type of optimiser you want to use to search through the parameter space. In this article, we use the BootstrapFewShot optimiser. How does this algorithm work internally? It is actually very simple and the paper provides some simplified pseudo-code:
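What follows is a sketch in the spirit of that pseudo-code rather than the paper's exact listing; run_with_trace and attach_demos are hypothetical helpers standing in for DSPy internals:

```python
def bootstrap_few_shot(student, trainset, metric, max_bootstrapped_demos=4):
    demonstrations = []
    for example in trainset:
        if len(demonstrations) >= max_bootstrapped_demos:
            break
        # Run the program on the training input and record the full trace
        prediction, predicted_traces = run_with_trace(student, example)  # hypothetical helper
        # Keep the example as a demonstration only if it passes the metric
        if metric(example, prediction, predicted_traces):
            demonstrations.append((example, prediction, predicted_traces))
    # The compiled program is the student with these demos attached to its predictors
    return attach_demos(student, demonstrations)  # hypothetical helper
```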

The search algorithm goes through every training input in the trainset, gets a prediction and then checks whether it passes the metric by calling self.metric(example, prediction, predicted_traces). If the metric passes, the example is added to the demonstrations of the compiled program.

The entire code can be found in this cookbook, with an associated Colab notebook, so we will only go through some of the most important steps here. For the example, we use a dataset derived from the PubMedQA dataset (both under the MIT license). It has questions based on abstracts of medical research papers and their associated answers. Some of the answers provided can be quite long, so we will be using DSPy to teach the LLM to prefer more concise answers, while keeping the accuracy of the final answer high.

After adding the first 1000 examples to an in-memory document store (which could be swapped for any other document store and retriever), we can now build our RAG pipeline:
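The full definition is in the cookbook; a minimal sketch using Haystack's in-memory components gives the idea (the OpenAI model choice and the abstracts variable holding the PubMedQA abstracts are assumptions):

```python
from haystack import Document, Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Index the first 1000 abstracts in an in-memory document store
document_store = InMemoryDocumentStore()
document_store.write_documents([Document(content=text) for text in abstracts[:1000]])

retriever = InMemoryBM25Retriever(document_store=document_store, top_k=3)

template = """Answer the question based on the given context.

Context:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}
Question: {{ question }}
Answer:"""

rag_pipeline = Pipeline()
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component("prompt_builder", PromptBuilder(template=template))
rag_pipeline.add_component("llm", OpenAIGenerator(model="gpt-3.5-turbo"))
rag_pipeline.connect("retriever.documents", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder.prompt", "llm.prompt")
```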

Let's try it out!
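Running the pipeline on a question from the dataset (a sketch, using the components defined above):

```python
question = "What effects does ketamine have on rat neural stem cells?"
response = rag_pipeline.run(
    {"retriever": {"query": question}, "prompt_builder": {"question": question}}
)
print(response["llm"]["replies"][0])
```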

The answer to the above question:

Ketamine inhibits the proliferation of rat neural stem cells in a dose-dependent manner at concentrations of 200, 500, 800, and 1000 µM. Additionally, ketamine decreases intracellular Ca(2+) concentration, suppresses protein kinase C (PKC) activation and phosphorylation of extracellular signal-regulated kinases 1/2 (ERK1/2) in rat neural stem cells. These effects do not seem to be mediated through caspase-3-dependent apoptosis.

We can see how the answers tend to be very detailed and long.

We start by creating a DSPy signature of the input and output fields:
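A sketch of such a signature (the exact wording of the field descriptions is an assumption, but asking for a short answer in the output description is the important part):

```python
import dspy

class GenerateAnswer(dspy.Signature):
    """Answer the question based on the given context."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="short and precise answer to the question")
```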

As we can see, we already specify in our description that we are expecting a short answer.

Then, we create a DSPy module that will be later compiled:
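A sketch of the module, reusing the GenerateAnswer signature and the Haystack retriever from above (the model configured for DSPy is an assumption):

```python
# Configure the LLM that DSPy calls under the hood (model choice is an assumption)
dspy.settings.configure(lm=dspy.OpenAI(model="gpt-3.5-turbo"))

class RAG(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        # Use the Haystack retriever defined earlier to fetch supporting abstracts
        results = retriever.run(query=question)
        context = [doc.content for doc in results["documents"]]
        return self.generate_answer(context=context, question=question)
```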

We use the Haystack retriever previously defined to search the documents in the document store (results = retriever.run(query=question)). The prediction step is done with the DSPy module dspy.ChainOfThought, which teaches the LLM to think step by step before committing to a response.

During compilation, the prompt that will be optimised looks like this:
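Again, the exact wording is produced by DSPy and may differ between versions; roughly, the ChainOfThought prompt for our signature looks like this:

```
Answer the question based on the given context.

---

Follow the following format.

Context: may contain relevant facts
Question: ${question}
Reasoning: Let's think step by step in order to ${produce the answer}. We ...
Answer: short and precise answer to the question

---

Context: ...
Question: ...
```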

Finally, we have to define the metric that we would like to optimise against. The evaluator has two parts: a semantic similarity score between the predicted answer and the ground-truth answer, computed with Haystack's evaluation framework, and a penalty for answers that are too long.
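The exact metric lives in the cookbook; a sketch of this kind of mixed metric, using Haystack's SASEvaluator for the similarity part and an assumed 20-word threshold for the length penalty, could look like this:

```python
from haystack.components.evaluators import SASEvaluator

sas_evaluator = SASEvaluator()
sas_evaluator.warm_up()

def mixed_metric(example, pred, trace=None):
    # Part 1: semantic similarity between the predicted and ground-truth answers
    similarity = sas_evaluator.run(
        ground_truth_answers=[example.answer],
        predicted_answers=[pred.answer],
    )["score"]
    # Part 2: penalise long answers (the 20-word threshold is an assumption)
    n_words = len(pred.answer.split())
    length_penalty = 1.0 if n_words <= 20 else max(0.0, 1.0 - (n_words - 20) / 20)
    return similarity * length_penalty
```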

Our dataset is composed of 20 examples in the trainset and 50 examples in the devset.
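DSPy expects these examples wrapped as dspy.Example objects; a sketch, assuming train_rows and dev_rows hold the question-answer pairs from the PubMedQA-derived dataset:

```python
trainset = [
    dspy.Example(question=row["question"], answer=row["answer"]).with_inputs("question")
    for row in train_rows  # 20 examples
]
devset = [
    dspy.Example(question=row["question"], answer=row["answer"]).with_inputs("question")
    for row in dev_rows  # 50 examples
]
```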

If we evaluate the current naive RAG pipeline with the code below, we get an average score of 0.49.
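A sketch of the evaluation step, using DSPy's Evaluate utility over the devset defined above:

```python
from dspy.evaluate import Evaluate

evaluate = Evaluate(devset=devset, metric=mixed_metric, num_threads=4, display_progress=True)
evaluate(RAG())  # the uncompiled pipeline scores around 0.49 on our devset
```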

Looking at some examples can give us some intuition on what the score is doing:

Question: Is increased time from neoadjuvant chemoradiation to surgery associated with higher pathologic complete response rates in esophageal cancer?

Predicted answer: Yes, increased time from neoadjuvant chemoradiation to surgery is associated with higher pathologic complete response rates in esophageal cancer.

Score: 0.78

But

Question: Is epileptic focus localization based on resting state interictal MEG recordings feasible irrespective of the presence or absence of spikes?

Predicted answer: Yes.

Score: 0.089

As we can see from the examples, if the answer is too short, it gets a low score because its similarity with the ground truth answer drops.

We then compile the RAG pipeline with DSPy:
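A sketch of the compilation step with the BootstrapFewShot optimiser, reusing the mixed_metric and trainset defined above:

```python
from dspy.teleprompt import BootstrapFewShot

optimizer = BootstrapFewShot(metric=mixed_metric)
compiled_rag = optimizer.compile(RAG(), trainset=trainset)
```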

After we do this and re-evaluate the compiled pipeline, the score is now 0.69!

Now it's time to get the final optimised prompt and add it into our Haystack pipeline.

We can see the few-shot examples selected by DSPy by looking at the demos field in the compiled_rag object:
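For instance, something along these lines prints the demonstrations attached to each predictor of the compiled module:

```python
# Each predictor in the compiled module carries its selected demonstrations
for name, predictor in compiled_rag.named_predictors():
    for demo in predictor.demos:
        print(name, demo)
```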

There are two types of examples provided in the final prompt: few-shot examples and bootstrapped demos. The few-shot examples are plain question-answer pairs.

The bootstrapped demos, on the other hand, contain the full trace of the LLM call, including the retrieved context and the reasoning it produced (stored in the rationale field).

All we need to do now is extract these examples found by DSPy and insert them in our Haystack pipeline:
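A sketch of that extraction step, turning the demos into a block of few-shot text for the Haystack prompt (toDict() is how dspy.Example exposes its fields; the formatting is an assumption):

```python
few_shot_text = ""
for demo in compiled_rag.generate_answer.demos:
    fields = demo.toDict()
    # Bootstrapped demos also carry context and rationale; here we only keep
    # the question-answer pairs for the few-shot block
    if "question" in fields and "answer" in fields:
        few_shot_text += f"Question: {fields['question']}\nAnswer: {fields['answer']}\n\n"
```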

Our new pipeline becomes:
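The full version is in the cookbook; a sketch that reuses the components from before and splices the few-shot block and a brevity instruction into the prompt template:

```python
new_template = (
    "Answer the question based on the given context. Keep the answer short and precise.\n\n"
    + few_shot_text
    + """Context:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}
Question: {{ question }}
Answer:"""
)

optimized_pipeline = Pipeline()
optimized_pipeline.add_component("retriever", InMemoryBM25Retriever(document_store=document_store, top_k=3))
optimized_pipeline.add_component("prompt_builder", PromptBuilder(template=new_template))
optimized_pipeline.add_component("llm", OpenAIGenerator(model="gpt-3.5-turbo"))
optimized_pipeline.connect("retriever.documents", "prompt_builder.documents")
optimized_pipeline.connect("prompt_builder.prompt", "llm.prompt")
```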

Let's check the same question we tried before.

Question:

What effects does ketamine have on rat neural stem cells?

Before:

Ketamine inhibits the proliferation of rat neural stem cells in a dose-dependent manner at concentrations of 200, 500, 800, and 1000 µM. Additionally, ketamine decreases intracellular Ca(2+) concentration, suppresses protein kinase C (PKC) activation and phosphorylation of extracellular signal-regulated kinases 1/2 (ERK1/2) in rat neural stem cells. These effects do not seem to be mediated through caspase-3-dependent apoptosis.

After:

Ketamine at higher concentrations inhibits the proliferation of rat neural stem cells, while not affecting apoptosis. Additionally, it decreases intracellular calcium concentration and suppresses PKC activation and ERK1/2 phosphorylation in these cells.

Job done!

In this post, we used DSPy to optimise the prompt used in a Haystack RAG pipeline. We did so with a custom metric based on Haystack's evaluation framework that penalised the LLM for long answers while keeping the similarity with the correct answer high. With this approach, we managed to improve our performance by almost 40% without having to do any manual prompt engineering.

