Understand SQL Window Functions Once and For All – Towards Data Science

Photo by Yasmina H on Unsplash

Window functions are key to writing SQL code that is both efficient and easy to understand. Knowing how they work and when to use them will unlock new ways of solving your reporting problems.

The objective of this article is to explain window functions in SQL step by step in an understandable way so that you dont need to rely on only memorizing the syntax.

Here is what we will cover:

Our dataset is simple, six rows of revenue data for two regions in the year 2023.

If we took this dataset and ran a GROUP BY sum on the revenue of each region, it would be clear what happens, right? It would result in only two remaining rows, one for each region, and then the sum of the revenues:

The way I want you to view window functions is very similar to this but, instead of reducing the number of rows, the aggregation will run in the background and the values will be added to our existing rows.

First, an example:

Notice that we dont have any GROUP BY and our dataset is left intact. And yet we were able to get the sum of all revenues. Before we go more in depth in how this worked lets just quickly talk about the full syntax before we start building up our knowledge.

The syntax goes like this:

Picking apart each section, this is what we have:

Dont stress over what each of these means yet, as it will become clear when we go over the examples. For now just know that to define a window function we will use the OVER keyword. And as we saw in the first example, thats the only requirement.

Moving to something actually useful, we will now apply a group in our function. The initial calculation will be kept to show you that we can run more than one window function at once, which means we can do different aggregations at once in the same query, without requiring sub-queries.

As said, we use the PARTITION BY to define our groups (windows) that are used by our aggregation function! So, keeping our dataset intact weve got:

Were also not restrained to a single group. Similar to GROUP BY we can partition our data on Region and Quarter, for example:

In the image we see that the only two data points for the same region and quarter got grouped together!

At this point I hope its clear how we can view this as doing a GROUP BY but in-place, without reducing the number of rows in our dataset. Of course, we dont always want that, but its not that uncommon to see queries where someone groups data and then joins it back in the original dataset, complicating what could be a single window function.

Moving on to the ORDER BY keyword. This one defines a running window function. Youve probably heard of a Running Sum once in your life, but if not, we should start with an example to make everything clear.

What happens here is that weve went, row by row, summing the revenue with all previous values. This was done following the order of the id column, but it couldve been any other column.

This specific example is not particularly useful because were summing across random months and two regions, but using what weve learned we can now find the cumulative revenue per region. We do that by applying the running sum within each group.

Take the time to make sure you understand what happened here:

Its quite interesting to notice here that when were writing these running functions we have the context of other rows. What I mean is that to get the running sum at one point, we must know the previous values for the previous rows. This becomes more obvious when we learn that we can manually chose how many rows before/after we want to aggregate on.

For this query we specified that for each row we wanted to look at one row behind and two rows ahead, so that means we get the sum of that range! Depending on the problem youre solving this can be extremely powerful as it gives you complete control on how youre grouping your data.

Finally, one last function I want to mention before we move into a harder example is the RANK function. This gets asked a lot in interviews and the logic behind it is the same as everything weve learned so far.

Just as before, we used ORDER BY to specify the order which we will walk, row by row, and PARTITION BY to specify our sub-groups.

The first column ranks each row within each region, meaning that we will have multiple rank ones in the dataset. The second calculation is the rank across all rows in the dataset.

This is a problem that shows up every now and then and to solve it on SQL it takes heavy usage of window functions. To explain this concept we will use a different dataset containing timestamps and temperature measurements. Our goal is to fill in the rows missing temperature measurements with the last measured value.

Here is what we expect to have at the end:

Before we start I just want to mention that if youre using Pandas you can solve this problem simply by running df.ffill() but if youre on SQL the problem gets a bit more tricky.

The first step to solve this is to, somehow, group the NULLs with the previous non-null value. It might not be clear how we do this but I hope its clear that this will require a running function. Meaning that its a function that will walk row by row, knowing when we hit a null value and when we hit a non-null value.

The solution is to use COUNT and, more specifically, count the values of temperature measurements. In the following query I run both a normal running count and also a count over the temperature values.

The normal_count column is useless for us, I just wanted to show what a running COUNT looked like. Our second calculation though, the group_count moves us closer to solving our problem!

Notice that this way of counting makes sure that the first value, just before the NULLs start, is counted and then, every time the function sees a null, nothing happens. This makes sure that were tagging every subsequent null with the same count we had when we stopped having measurements.

Moving on, we now need to copy over the first value that got tagged into all the other rows within that same group. Meaning that for the group 2 needs to all be filled with the value 15.0.

Can you think of a function now that we can use here? There is more than one answer for this, but, again, I hope that at least its clear that now were looking at a simple window aggregation with PARTITION BY .

We can use both FIRST_VALUE or MAX to achieve what we want. The only goal is that we get the first non-null value. Since we know that each group contains one non-null value and a bunch of null values, both of these functions work!

This example is a great way to practice window functions. If you want a similar challenge try to add two sensors and then forward fill the values with the previous reading of that sensor. Something similar to this:

Could you do it? It doesnt use anything that we havent learned here so far.

By now we know everything that we need about how window functions work in SQL, so lets just do a quick recap!

This is what weve learned:

Originally posted here:

Understand SQL Window Functions Once and For All - Towards Data Science

Related Posts

Comments are closed.