Optimizing Pandas Code: The Impact of Operation Sequence – Towards Data Science

PYTHON PROGRAMMING

Learn how to rearrange your code to achieve significant speed improvements.

Pandas offers a fantastic framework for operating on dataframes. In data science, we work with small, big and sometimes very big dataframes. While analyzing small ones can be blazingly fast, even a single operation on a big dataframe can take noticeable time.

In this article I will show that you can often shorten this time by changing something that costs practically nothing: the order of operations on a dataframe.

Imagine the following dataframe:
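A minimal sketch of such a dataframe, under stated assumptions: the column names a through y, the random integer values and their range of 0 to 99,999, and the seed are all illustrative choices, made only so that the columns and thresholds used later in the article make sense.

```python
import string

import numpy as np
import pandas as pd

# Assumed layout: 25 integer columns named "a" through "y".
N_ROWS = 1_000_000
columns = list(string.ascii_lowercase[:25])

# Fixed seed (an arbitrary choice) so repeated runs build the same data.
rng = np.random.default_rng(seed=42)

# Assumed value range 0..99_999, so thresholds such as 50_000 split the data.
df = pd.DataFrame(
    rng.integers(0, 100_000, size=(N_ROWS, len(columns))),
    columns=columns,
)

print(df.shape)
```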

With a million rows and 25 columns, it's big. Many operations on such a dataframe will take noticeable time on current personal computers.

Imagine we want to filter the rows, keeping those that satisfy the condition a < 50_000 and b > 3000, and to select five columns: take_cols = ['a', 'b', 'g', 'n', 'x']. We can do this in the following way:
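One way to write this, sketched against an assumed stand-in dataframe (random integers in columns a through y). The intermediate names df_cols and result_cols_first are hypothetical and chosen for illustration only.

```python
import string

import numpy as np
import pandas as pd

# Assumed stand-in data: one million rows, 25 integer columns "a".."y".
rng = np.random.default_rng(seed=42)
df = pd.DataFrame(
    rng.integers(0, 100_000, size=(1_000_000, 25)),
    columns=list(string.ascii_lowercase[:25]),
)

take_cols = ["a", "b", "g", "n", "x"]

# Step 1: keep only the five columns of interest.
df_cols = df[take_cols]

# Step 2: filter the rows of the narrower dataframe with a boolean mask.
result_cols_first = df_cols[(df_cols["a"] < 50_000) & (df_cols["b"] > 3000)]

print(result_cols_first.shape)
```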

In this code, we take the required columns first, and then we filter the rows. We can achieve the same result with the operations in the opposite order, first filtering the rows and then selecting the columns:
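A sketch of the reversed order, again on an assumed stand-in dataframe. The names df_rows and result_rows_first are hypothetical.

```python
import string

import numpy as np
import pandas as pd

# Assumed stand-in data: one million rows, 25 integer columns "a".."y".
rng = np.random.default_rng(seed=42)
df = pd.DataFrame(
    rng.integers(0, 100_000, size=(1_000_000, 25)),
    columns=list(string.ascii_lowercase[:25]),
)

take_cols = ["a", "b", "g", "n", "x"]

# Step 1: filter the rows of the full, 25-column dataframe.
df_rows = df[(df["a"] < 50_000) & (df["b"] > 3000)]

# Step 2: select the five columns from the already filtered rows.
result_rows_first = df_rows[take_cols]

print(result_rows_first.shape)
```

Both orders return the same rows and columns; only the amount of data each intermediate step has to handle differs.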

We can achieve the very same result by chaining Pandas operations. The corresponding pipelines are as follows:
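A sketch of two such chains, assuming DataFrame.filter for the column selection and DataFrame.query for the row condition. Whether these are the exact methods the original pipelines used is an assumption, as are the stand-in data and the variable names.

```python
import string

import numpy as np
import pandas as pd

# Assumed stand-in data: one million rows, 25 integer columns "a".."y".
rng = np.random.default_rng(seed=42)
df = pd.DataFrame(
    rng.integers(0, 100_000, size=(1_000_000, 25)),
    columns=list(string.ascii_lowercase[:25]),
)

take_cols = ["a", "b", "g", "n", "x"]
query = "a < 50000 and b > 3000"  # same condition, written as a query string

# Chain 1: select the columns, then filter the rows.
pipe_cols_first = df.filter(take_cols).query(query)

# Chain 2: filter the rows, then select the columns.
pipe_rows_first = df.query(query).filter(take_cols)

# Both chains should produce identical dataframes.
print(pipe_cols_first.equals(pipe_rows_first))
```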

Since df is big, the four versions will probably differ in performance. Which will be the fastest and which will be the slowest?

Let's benchmark these operations. We will use the timeit module:
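A minimal benchmarking sketch with timeit. The four function names, the repeat counts, and the stand-in data are assumptions; the timings depend on the machine and the pandas version, so no results are shown here.

```python
import string
import timeit

import numpy as np
import pandas as pd

# Assumed stand-in data: one million rows, 25 integer columns "a".."y".
rng = np.random.default_rng(seed=42)
df = pd.DataFrame(
    rng.integers(0, 100_000, size=(1_000_000, 25)),
    columns=list(string.ascii_lowercase[:25]),
)

take_cols = ["a", "b", "g", "n", "x"]
query = "a < 50000 and b > 3000"


def cols_then_rows():
    df_ = df[take_cols]
    return df_[(df_["a"] < 50_000) & (df_["b"] > 3000)]


def rows_then_cols():
    return df[(df["a"] < 50_000) & (df["b"] > 3000)][take_cols]


def pipe_cols_then_rows():
    return df.filter(take_cols).query(query)


def pipe_rows_then_cols():
    return df.query(query).filter(take_cols)


versions = [cols_then_rows, rows_then_cols, pipe_cols_then_rows, pipe_rows_then_cols]

NUMBER = 5  # calls per measurement (arbitrary choice)
REPEAT = 3  # measurements per version; the best one is reported

timings = {}
for func in versions:
    best = min(timeit.repeat(func, number=NUMBER, repeat=REPEAT))
    timings[func.__name__] = best / NUMBER  # seconds per call
    print(f"{func.__name__:<22}{timings[func.__name__] * 1000:10.2f} ms per call")
```

Reporting the minimum of several repeats is a common way to reduce noise from other processes running on the same machine.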
