Scripting around a pandas DataFrame can turn into an awkward pile of (not-so-)good old spaghetti code. My colleagues and I use this package a lot, and while we try to stick to good programming practices, like splitting code into modules and unit testing, sometimes we still get in each other's way by producing confusing code.
I have gathered some tips and pitfalls to avoid in order to make pandas code clean and infallible. Hopefully you'll find them useful too. We'll get some help from Robert C. Martin's classic Clean Code, applied specifically to the context of the pandas package. TL;DR at the end.

Let's begin by observing some faulty patterns inspired by real life. Later on, we'll try to rephrase that code in order to favor readability and control.
Pandas DataFrames are value-mutable [2, 3] objects. Whenever you alter a mutable object, it affects the exact same instance that you originally created, and its physical location in memory remains unchanged. In contrast, when you modify an immutable object (e.g. a string), Python creates a whole new object at a new memory location and swaps the reference to the new one.
This is the crucial point: in Python, objects get passed to functions by assignment [4, 5]. See the graph: the value of df has been assigned to the variable in_df when it was passed to the function as an argument. Both the original df and the in_df inside the function point to the same memory location (the numeric value in parentheses), even though they go by different variable names. During the modification of its attributes, the location of the mutable object remains unchanged. Now all other scopes can see the changes too, since they reference the same memory location.
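A minimal sketch of that pattern in code (the function name modify_df, the name column, and the name_len column are illustrative stand-ins for any function that mutates its argument):

```python
import pandas as pd

def modify_df(in_df: pd.DataFrame) -> pd.DataFrame:
    # This line mutates the caller's object, not a local copy
    in_df["name_len"] = in_df["name"].str.len()
    return in_df

df = pd.DataFrame({"name": ["Anna", "Bob"]})
modified_df = modify_df(df)

print(modified_df is df)     # True: both names point to the same object
print(df.columns.tolist())   # the original df gained "name_len" too
```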
Actually, since we have modified the original instance, it's redundant to return the DataFrame and assign it to a variable. This code has the exact same effect:
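A sketch of that equivalent, return-free version (names remain illustrative):

```python
import pandas as pd

def modify_df(in_df: pd.DataFrame) -> None:
    # Same mutation as before, but no return: the caller's df changes anyway
    in_df["name_len"] = in_df["name"].str.len()

df = pd.DataFrame({"name": ["Anna", "Bob"]})
modify_df(df)            # no assignment needed
# df = modify_df(df)     # careful: this would overwrite df with None
```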
Heads-up: the function now returns None, so be careful not to overwrite the df with None if you do perform the assignment: df = modify_df(df).
In contrast, if the object is immutable, its memory location will change during the modification, just like in the example below. Since the red string cannot be modified (strings are immutable), the green string is created on top of the old one, as a brand new object claiming a new location in memory. The returned string is not the same string, whereas the returned DataFrame was the exact same DataFrame.
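The string case from the graph can be reproduced in a couple of lines (the string values here are just placeholders):

```python
text = "red"
location = id(text)          # memory address of the original string

text = text + " and green"   # strings are immutable: a brand new object is built

print(id(text) != location)  # True: the name now points elsewhere
```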
The point is, mutating DataFrames inside functions has a global effect. If you don't keep that in mind, you may lose track of what gets appended to or removed from your DataFrame, and where it happens.

We'll fix that problem later, but here is another don't before we pass to the do's.
The design from the previous section is actually an anti-pattern called an output argument [1 p.45]. Typically, the inputs of a function are used to create an output value. If the sole point of passing an argument to a function is to modify it, so that the input argument changes its state, then it challenges our intuitions. Such behavior is called a side effect [1 p.44] of a function; side effects should be well documented and minimized, because they force the programmer to remember what goes on in the background, making the script error-prone.
When we read a function, we are used to the idea of information going into the function through arguments and out through the return value. We don't usually expect information to be going out through the arguments. [1 p.41]
Things get even worse if the function has a double responsibility: to modify the input and to return an output. Consider this function:
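Something along these lines (a sketch; the function and column names are illustrative):

```python
import pandas as pd

def find_max_name_length(df: pd.DataFrame) -> int:
    # The side effect: a new column quietly appears on the caller's DataFrame
    df["name_len"] = df["name"].str.len()
    return max(df["name_len"])

df = pd.DataFrame({"name": ["Anna", "Bob", "Charlie"]})
print(find_max_name_length(df))  # 7
print(df.columns.tolist())       # surprise: ['name', 'name_len']
```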
It does return a value, as you would expect, but it also permanently modifies the original DataFrame. The side effect takes you by surprise: nothing in the function signature indicated that our input data was going to be affected. In the next step, we'll see how to avoid this kind of design.
To eliminate the side effect, in the code below we create a new temporary variable instead of modifying the original DataFrame. The notation lengths: pd.Series indicates the datatype of the variable.
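A sketch of that side-effect-free variant (again with illustrative names):

```python
import pandas as pd

def find_max_name_length(df: pd.DataFrame) -> int:
    # Intermediate state stays local; the caller's DataFrame is untouched
    lengths: pd.Series = df["name"].str.len()
    return max(lengths)

df = pd.DataFrame({"name": ["Anna", "Bob", "Charlie"]})
print(find_max_name_length(df))  # 7
print(df.columns.tolist())       # still just ['name']
```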
This function design is better in that it encapsulates the intermediate state instead of producing a side effect.
Another heads-up: please be mindful of the difference between a deep and a shallow copy [6] of elements from the DataFrame. In the example above we modified each element of the original df["name"] Series, so the old DataFrame and the new variable have no shared elements. However, if you directly assign one of the original columns to a new variable, the underlying elements still share the same references in memory. See the examples:
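A minimal illustration (note that whether a write through the shallow variable propagates back to df depends on your pandas version and the copy-on-write mode, so the safe claim below is only about the deep copy):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Anna", "Bob"]})

shallow = df["name"]               # shares the underlying data with df
deep = df["name"].copy(deep=True)  # allocates new memory for the values

deep.iloc[0] = "Zoe"  # changing the deep copy never touches df
print(df.loc[0, "name"])  # Anna
```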
You can print out the DataFrame after each step to observe the effect. Remember that creating a deep copy allocates new memory, so it's good to reflect on whether your script needs to be memory-efficient.
Maybe for whatever reason you want to store the result of that length computation. It's still not a good idea to append it to the DataFrame inside the function, because of the side-effect breach as well as the accumulation of multiple responsibilities inside a single function.
I like the One Level of Abstraction per Function rule that says:
We need to make sure that the statements within our function are all at the same level of abstraction.
Mixing levels of abstraction within a function is always confusing. Readers may not be able to tell whether a particular expression is an essential concept or a detail. [1 p.36]
Also, let's employ the Single Responsibility Principle [1 p.138] from OOP, even though we're not focusing on object-oriented code right now.

Why not prepare your data beforehand? Let's split data preparation and the actual computation into separate functions:
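A sketch of that split; find_max_element is named in the text below, while create_name_len_col is a hypothetical name for the column-building helper:

```python
import pandas as pd
from typing import Collection

def create_name_len_col(series: pd.Series) -> pd.Series:
    # Pure computation: returns a new Series, leaves the input alone
    return series.str.len()

def find_max_element(collection: Collection) -> int:
    # Generic aggregation that works for any Collection, not just pandas
    return max(collection) if len(collection) else 0

df = pd.DataFrame({"name": ["Anna", "Bob", "Charlie"]})
df["name_len"] = create_name_len_col(df["name"])
print(find_max_element(df["name_len"]))  # 7
```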
The individual task of creating the name_len column has been outsourced to another function. It does not modify the original DataFrame and it performs one task at a time. Later we retrieve the max element by passing the new column to another dedicated function. Notice how the aggregating function is generic for Collections.
Let's brush the code up with the following steps:
The way we have split the code really makes it easy to go back to the script later, take the entire function and reuse it in another script. We like that!
There is one more thing we can do to increase the level of reusability: pass column names as parameters to functions. The refactoring may go a little bit over the top, but sometimes it pays off for the sake of flexibility or reusability.
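A sketch of that parameterized design (the function name create_len_col and its parameters are hypothetical):

```python
import pandas as pd

def create_len_col(df: pd.DataFrame, source_col: str, result_col: str) -> pd.DataFrame:
    # assign() returns a new DataFrame, so the caller's df stays untouched
    return df.assign(**{result_col: df[source_col].str.len()})

df = pd.DataFrame({"name": ["Anna", "Bob"]})
out = create_len_col(df, source_col="name", result_col="name_len")

print(out["name_len"].tolist())  # [4, 3]
print(df.columns.tolist())       # ['name'], the original is untouched
```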
Did you ever figure out that your preprocessing was faulty after weeks of experiments on the preprocessed dataset? No? Lucky you. I actually had to repeat a batch of experiments because of broken annotations, which could have been avoided if I had tested just a couple of basic functions.
Important scripts should be tested [1 p.121, 7]. Even if the script is just a helper, I now try to test at least the crucial, most low-level functions. Let's revisit the steps that we made from the start:
1. I am not happy to even think of testing this; it's very redundant and we have paved over the side effect. It also tests a bunch of different features: the computation of name length and the aggregation of the result for the max element. Plus, it fails. Did you see that coming?
2. This is much better: we have focused on a single task, so the test is simpler. We also don't have to fixate on column names like we did before. However, I think that the format of the data gets in the way of verifying the correctness of the computation.
3. Here we have cleaned up the desk. We test the computation function inside out, leaving the pandas overlay behind. It's easier to come up with edge cases when you focus on one thing at a time. I figured out that I'd like to test for None values that may appear in the DataFrame, and I eventually had to improve my function for that test to pass. A bug caught!
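A sketch of what such a test might look like (create_name_len_col is a hypothetical name for the computation function; fillna is one way to make it survive None values):

```python
import pandas as pd

def create_name_len_col(series: pd.Series) -> pd.Series:
    # fillna makes the function robust to missing values: None counts as length 0
    return series.fillna("").str.len()

def test_create_name_len_col():
    series = pd.Series(["name_1", "name_2", None])
    expected = pd.Series([6, 6, 0])
    pd.testing.assert_series_equal(
        create_name_len_col(series), expected, check_dtype=False
    )

test_create_name_len_col()
```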
4. We're only missing the test for find_max_element:
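A sketch of that last test, assuming the generic aggregation function from earlier:

```python
from typing import Collection

def find_max_element(collection: Collection) -> int:
    return max(collection) if len(collection) else 0

def test_find_max_element():
    assert find_max_element([3, 9, 1]) == 9
    assert find_max_element([7]) == 7
    assert find_max_element([]) == 0  # edge case: empty input

test_find_max_element()
```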
One additional benefit of unit testing that I never forget to mention is that it is a way of documenting your code: someone who doesn't know it (like you from the future) can easily figure out the inputs and expected outputs, including edge cases, just by looking at the tests. Double gain!
These are some tricks I found useful while coding and reviewing other people's code. I'm far from telling you that one or another way of coding is the only correct one: take what you want from it, and decide whether you need a quick scratch or a highly polished and tested codebase. I hope this thought piece helps you structure your scripts so that you're happier with them and more confident about their infallibility.
If you liked this article, I would love to know about it. Happy coding!
TL;DR
There's no one and only correct way of coding, but here are some inspirations for scripting with pandas:

Don'ts:
- don't mutate your DataFrame too much inside functions, because you may lose control over what gets appended to or removed from it, and where,
- don't write methods that mutate a DataFrame and return nothing, because that's confusing.
Dos:
- create new objects instead of modifying the source DataFrame and remember to make a deep copy when needed,
- perform only similar-level operations inside a single function,
- design functions for flexibility and reusability,
- test your functions, because this helps you design cleaner code, secure it against bugs and edge cases, and document it for free.
The graphs were created by me using Miro. The cover image was also created by me using the Titanic dataset and GIMP (smudge effect).
Read more from the original source:
Pandas: From Messy To Beautiful. This is how to make your pandas code | by Anna Zawadzka | Mar, 2024 - Towards Data Science