Incorporate natural language queries and operations into your Python data cleaning workflow. Red panda drawing donated by Karen Walker, the artist.
Many of the series operations we need to do in our pandas data cleaning projects can be assisted by AI tools, including by PandasAI. PandasAI takes advantage of large language models, such as that from OpenAI, to enable natural language queries and operations on data columns. In this post, we examine how to use PandasAI to query Series values, create new Series, set Series values conditionally, and reshape our data.
You can install PandasAI by entering pip install pandasai into a terminal or into Windows Powershell. You will also need to get a token from openai.com to send a request to the OpenAI API.
As the PandasAI library is developing rapidly, you can anticipate different results depending on the versions of PandasAI and pandas you are using. In this article, I use version 1.4.8 of PandasAI and version 1.5.3 of pandas.
We will work with data from the National Longitudinal Study of Youth (NLS) conducted by the United States Bureau of Labor Statistics. The NLS has surveyed the same cohort of high school students for over 25 years, and has useful data items on educational outcomes and weeks worked for each of those years, among many other variables. It is available for public use at nlsinfo.org. (The NLS public releases are covered by the United States government Open Data Policy, which permits both non-commercial and commercial use.)
We will also work with COVID-19 data provided by Our World in Data. That dataset has one row per country per day with number of new cases and new deaths. This dataset is available for download at ourworldindata.org/covid-cases, with a Creative Commons CC BY 4.0 license. You can also download all code and data used in this post from GitHub.
We start by importing the OpenAI and SmartDataframe modules from PandasAI. We also have to instantiate an llm object:
Next, we load the DataFrames we will be using and create a SmartDataframe object from the NLS pandas DataFrame:
Now we are ready to generate summary statistics on Series from our SmartDataframe. We can ask for the average for a single Series, or for multiple Series:
We can also summarize Series values by another Series, usually one that is categorical:
We can also create a new Series with the chat method of SmartDataframe. We do not need to use the actual column names. For example, PandasAI will figure out that we want the childathome Series when we write child at home:
We can use the chat method to create Series values conditionally:
PandasAI is quite flexible regarding the language you might use here. For example, the following provides the same results:
We can do calculations across a number of similarly named columns:
This will calculate the average of all weeksworked00-weeksworked22 columns and assign that to a new column called weeksworked.
We can easily impute values where they are missing based on summary statistics:
We can also use PandasAI to do some reshaping. Recall that the COVID-19 case data has new cases for each day for each country. Lets say we only want the first row of data for each country. We can do that the traditional way with drop_duplicates:
We can get the same results by creating a SmartDataframe and using the chat method. The natural language I use here is remarkably straightforward, Show first casedate and location and other values for each country:
Notice that PandasAI makes smart choices about the columns to get. We get the columns we need rather than all of them. We could have also just passed the names of the columns we wanted to chat. (PandasAI sorted the rows by iso_code, rather than by location, which is why the first row is different.)
Much of the work when using PandasAI is really just importing the relevant libraries and instantiating large language model and SmartDataframe objects. Once thats done, simple sentences sent to the chat method of the SmartDataframe are sufficient to summarize Series values and create new Series.
PandasAI excels at generating simple statistics from Series. We dont even need to remember the Series name exactly. Often the natural language we might use can be more intuitive than traditional pandas methods like groupby. The Show satmath average by gender value passed to chat is a good example of that.
Operations on Series, including the creation of a new Series, is also quite straightforward. We created a total number of children Series (childnum) by instructing the SmartDataframe to add the number of children living at home to the number of children not living at home. We didnt even provide the literal Series names, childathome and childnotathome respectively. PandasAI figured out what we meant.
Since we are passing natural language instructions to chat for our Series operations, there is no one right way to get what we want. For example, we get the same result when we passed evermarried is No when maritalstatus is Never-married, else Yes to chat as we did with if maritalstatus is Never-married set evermarried2 to No, otherwise Yes.
We can also do fairly extensive DataFrame reshaping with simple natural language instructions, as in the last command we provided. We add and other values to the instructions to get columns other than casedate. PandasAI also figures out that location makes sense as the index.
You can read more about how to use PandasAI and SmartDataframes here:
Or in the second edition of my book, Python Data Cleaning Cookbook:
Good luck with your data cleaning and I would love to hear how things are going!
Read the original post:
Read More..