Linear Attention Is All You Need

This is the kind of thing anyone who's spent much time working with transformers and self-attention will have heard a hundred times. It's absolutely true, and we've all experienced it: as you try to increase the context size of your model, everything suddenly comes to a grinding halt. But at the same time, virtually every week it seems, there's a new state-of-the-art model with a new record-breaking context length. (Gemini has a context length of 2M tokens!)

There are lots of sophisticated methods, like RingAttention, that make training with incredibly long context lengths possible in large distributed systems, but what I'm interested in today is a simpler question.

How far can we get with linear attention alone?

This will be a bit of a whistle stop tour, but bear with me as we touch on a few key points before digging into the results.

We can basically summarise the traditional attention mechanism with two key points: every query is compared against every key, which is where the quadratic cost in sequence length comes from, and the resulting scores are passed through a softmax so that each row forms a weighting over the values.

This is expressed in the traditional form as:
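$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$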

It turns out that if we ask our mathematician friends, we can think about this slightly differently. The softmax can be thought of as one of many ways of describing the probability distribution relating tokens to each other. We can use any similarity measure we like (the dot product being one of the simplest) and, so long as we normalise it, we're fine.
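In this more general form, each output row is just a normalised, similarity-weighted sum of the values:

$$V_i' = \frac{\sum_{j} \text{sim}(Q_i, K_j)\, V_j}{\sum_{j} \text{sim}(Q_i, K_j)}$$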

It's a little sloppy to say this is attention, as in fact it's only the attention we know and love when the similarity function is the exponential of the dot product of queries and keys (given below), as we find in the softmax. But this is where it gets interesting: what if, instead of using this exact expression, we could approximate it?
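$$\text{sim}(Q_i, K_j) = \exp\!\left(\frac{Q_i^T K_j}{\sqrt{d_k}}\right)$$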

We can assume there is some feature map phi which gives us a result nearly the same as taking the exponential of the dot product. And crucially, writing the expression like this allows us to play with the order of matrix multiplication operations.
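In other words, we look for a feature map phi such that:

$$\text{sim}(Q_i, K_j) \approx \phi(Q_i)^T \phi(K_j)$$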

In the paper they propose the Exponential Linear Unit (ELU) as the feature map due to a number of useful properties:
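Concretely, the feature map is

$$\phi(x) = \text{elu}(x) + 1, \qquad \text{elu}(x) = \begin{cases} x & x > 0 \\ e^x - 1 & x \le 0 \end{cases}$$

It is cheap to compute, stays strictly positive (so the attention weights remain a valid, non-negative weighting even without an exponential), and, unlike a ReLU, keeps non-zero gradients for negative inputs.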

We won't spend too much more time on this here, but it is pretty well empirically verified as a fair approximation to the softmax function.

What this allows us to do is change the order of operations. We can take the product of our feature map of K with V first to make a KV block, then take the product with Q. The quadratic term is then over the model dimension rather than the sequence length.
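In matrix form this is just associativity of matrix multiplication:

$$\big(\phi(Q)\,\phi(K)^T\big)\,V = \phi(Q)\,\big(\phi(K)^T V\big)$$

The left-hand side costs O(N²d) in sequence length N and model dimension d; the right-hand side costs O(Nd²), which is linear in N.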

Putting this all together into the linear attention expression gives us:
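$$V_i' = \frac{\phi(Q_i)^T \left(\sum_j \phi(K_j)\, V_j^T\right)}{\phi(Q_i)^T \left(\sum_j \phi(K_j)\right)}$$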

Where we only need to compute the terms in the brackets once per query row.

(If you want to dig into how the causal masking fits into this and how the gradients are calculated, take a look at the paper. Or watch this space for a future blog.)

The mathematical case is strong, but personally, until I've seen some benchmarks I'm always a bit suspicious.

Let's start by looking at snippets of the code describing each of these terms. The softmax attention will look very familiar; we're not doing anything fancy here.
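Something like this minimal PyTorch sketch captures it (a single head, no masking or dropout, taking Q, K and V as given):

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    """Standard scaled dot-product attention for q, k, v of shape (batch, seq_len, d_k)."""
    d_k = q.size(-1)
    # Pairwise query-key scores: (batch, seq_len, seq_len) -- quadratic in sequence length
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)
    return weights @ v
```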

Then for the linear attention we start by getting the Query, Key and Value matrices, then apply the ELU(x) feature mapping to the Queries and Keys. Then we use einsum notation to perform the multiplications.
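A minimal sketch of that computation, again taking Q, K and V as given, using elu(x) + 1 as the feature map and leaving out the causal mask, might look like this:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Linearised attention for q, k, v of shape (batch, seq_len, d_k), no causal mask."""
    # Feature map: elu(x) + 1 keeps everything positive, standing in for exp(q . k)
    q = F.elu(q) + 1
    k = F.elu(k) + 1
    # KV block: sum over the sequence dimension first -> (batch, d_k, d_k)
    kv = torch.einsum('bnd,bne->bde', k, v)
    # Normaliser: phi(Q_i) . sum_j phi(K_j) -> (batch, seq_len)
    z = torch.einsum('bnd,bd->bn', q, k.sum(dim=1))
    # Output: phi(Q) (phi(K)^T V), normalised row by row -> (batch, seq_len, d_k)
    return torch.einsum('bnd,bde->bne', q, kv) / (z.unsqueeze(-1) + eps)
```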

Seeing this written in code is all well and good, but what does it actually mean experimentally? How much of a performance boost are we talking about here? It can be hard to appreciate the degree of speed-up going from a quadratic to a linear bottleneck, so I've run the following experiment.

We're going to take a single attention layer, with a fixed d_k model dimension of 64, and benchmark the time taken for a forward pass over a batch of 32 sequences. The only variable is the sequence length, spanning 128 up to 6,000 (the GPT-3 context length, for reference, is 2,048). Each run is repeated 100 times to get a mean and standard deviation, and experiments are run on an Nvidia T4 GPU.
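A rough timing harness along these lines, assuming the softmax_attention and linear_attention sketches above and a CUDA device (synchronising around each call so the GPU work is actually measured), might look like this:

```python
import time
import torch

def benchmark(attn_fn, seq_len, batch=32, d_k=64, runs=100, device='cuda'):
    """Return (mean, std) of forward-pass time in milliseconds."""
    q, k, v = (torch.randn(batch, seq_len, d_k, device=device) for _ in range(3))
    with torch.no_grad():
        attn_fn(q, k, v)  # warm-up before timing
        times = []
        for _ in range(runs):
            torch.cuda.synchronize()
            start = time.perf_counter()
            attn_fn(q, k, v)
            torch.cuda.synchronize()
            times.append((time.perf_counter() - start) * 1e3)
    times = torch.tensor(times)
    return times.mean().item(), times.std().item()

# A few sequence lengths spanning the range in the post (the exact grid is illustrative)
for n in [128, 256, 512, 1024, 2048, 4096, 6000]:
    print(n, benchmark(softmax_attention, n), benchmark(linear_attention, n))
```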

For such a simple experiment the results are pretty striking.

The results show that even for this incredibly small toy example we get a speed-up of up to 60x.

There are a few obvious take-aways here:

For completeness, don't mistake this as saying linear attention is 60x faster for small models in general. In reality, the feed-forward layers often make up a bigger chunk of the parameters in a Transformer, and the encoding/decoding is often a limiting component in terms of size as well. But within this tightly defined problem, it's pretty impressive!
