Data cleaning is a critical part of any data analysis process. It's the step where you remove errors, handle missing data, and make sure that your data is in a format that you can work with. Without a well-cleaned dataset, any subsequent analyses can be skewed or incorrect.
This article introduces you to several key techniques for data cleaning in Python, using powerful libraries like pandas, numpy, seaborn, and matplotlib.
Before diving into the mechanics of data cleaning, let's understand its importance. Real-world data is often messy. It can contain duplicate entries, incorrect or inconsistent data types, missing values, irrelevant features, and outliers. All these factors can lead to misleading conclusions when analyzing data. This makes data cleaning an indispensable part of the data science lifecycle.
We'll cover the following data cleaning tasks: removing unnecessary columns, eliminating duplicate data, converting data types, and handling missing values.
Before getting started, let's import the necessary libraries. We'll be using pandas for data manipulation, and seaborn and matplotlib for visualizations.
We'll also import the datetime Python module for manipulating dates.
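A minimal set of imports for the steps that follow could look like this:

```python
# Data manipulation
import pandas as pd
import numpy as np

# Visualizations
import seaborn as sns
import matplotlib.pyplot as plt

# Date handling
import datetime as dt
```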
First, we'll need to load our data. In this example, we're going to load a CSV file using pandas. We also add the delimiter argument.
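For instance, assuming the data lives in a semicolon-delimited file called real_estate.csv (both the file name and the delimiter here are placeholders for your own data):

```python
# Load the CSV file; delimiter tells pandas how the columns are separated
df = pd.read_csv('real_estate.csv', delimiter=';')
```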
Next, it's important to inspect the data to understand its structure, what kind of variables we're working with, and whether there are any missing values. Since the data we imported is small, let's have a look at the whole dataset.
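For a small dataset, printing the whole DataFrame is enough:

```python
print(df)
```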
Looking at the printed output, you can immediately see that there are some missing values and that the date formats are inconsistent.
Now, let's take a look at the DataFrame summary using the info() method.
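That's a single call:

```python
df.info()
```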
From the info() output, we can see that square_feet is the only column without any NULL values, so we'll somehow have to handle the missing data. Also, the columns advertisement_date and sale_date are of the object data type, even though they should be dates.
The column location is completely empty. Do we need it?
We'll show you how to handle these issues. We'll start by learning how to delete unnecessary columns.
There are two columns in the dataset that we don't need in our data analysis, so we'll remove them.
The first column is buyer. We don't need it, as the buyer's name doesn't impact the analysis.
We're using the drop() method with the specified column name. We set the axis to 1 to specify that we want to delete a column. Also, the inplace argument is set to True so that we modify the existing DataFrame rather than creating a new DataFrame without the removed column.
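A sketch of that call:

```python
# Remove the buyer column; axis=1 targets columns, inplace=True modifies df directly
df.drop('buyer', axis=1, inplace=True)
```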
The second column we want to remove is location. While it might be useful to have this information, it's a completely empty column, so let's just remove it.
We take the same approach as with the first column.
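The same call, with location swapped in:

```python
# Remove the completely empty location column
df.drop('location', axis=1, inplace=True)
```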
Of course, you can remove these two columns simultaneously.
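Instead of the two separate calls above, you can pass a list of column names:

```python
# Remove both columns in a single call
df.drop(['buyer', 'location'], axis=1, inplace=True)
```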
Both approaches return the same DataFrame, now without the buyer and location columns.
Duplicate data can occur in your dataset for various reasons and can skew your analysis.
Let's detect the duplicates in our dataset. Here's how to do it.
The below code uses the method duplicated() to consider duplicates in the whole dataset. Its default setting is to consider the first occurrence of a value as unique and the subsequent occurrences as duplicates. You can modify this behavior using the keep parameter. For instance, df.duplicated(keep=False) would mark all duplicates as True, including the first occurrence.
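In its simplest form, that's:

```python
# Returns a boolean Series: True marks rows that repeat an earlier row
df.duplicated()
```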
In the output, the row with index 3 is marked as a duplicate because row 2, which has the same values, is its first occurrence.
Now we need to remove duplicates, which we do with the following code.
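A sketch of that step:

```python
# Keep the first occurrence of each row and drop the rest
df = df.drop_duplicates()
```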
The drop_duplicates() function considers all columns while identifying duplicates. If you want to consider only certain columns, you can pass them as a list to this function like this: df.drop_duplicates(subset=['column1', 'column2']).
As you can see, the duplicate row has been dropped. However, the indexing stayed the same, with index 3 missing. We'll tidy this up by resetting the indices.
This task is performed by using the reset_index() function. The drop=True argument is used to discard the original index. If you do not include this argument, the old index will be added as a new column in your DataFrame. By setting drop=True, you are telling pandas to forget the old index and reset it to the default integer index.
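For example:

```python
# Rebuild the default 0..n-1 index; drop=True discards the old index
df = df.reset_index(drop=True)
```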
For practice, try to remove duplicates from this Microsoft dataset.
Sometimes, data types might be incorrectly set. For example, a date column might be interpreted as strings. You need to convert these to their appropriate types.
In our dataset, we'll do that for the columns advertisement_date and sale_date, as they are shown as the object data type. Also, the dates are formatted differently across the rows. We need to make them consistent while converting them to dates.
The easiest way is to use the to_datetime() method. Again, you can do that column by column, as shown below.
When doing that, we set the dayfirst argument to True because some dates start with the day first.
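A sketch of the per-column conversion (on newer pandas versions, mixed date formats may additionally require format='mixed'):

```python
# Parse the inconsistent date strings into datetime64; dayfirst=True
# treats ambiguous dates like 05/06/2023 as 5 June 2023
df['advertisement_date'] = pd.to_datetime(df['advertisement_date'], dayfirst=True)
df['sale_date'] = pd.to_datetime(df['sale_date'], dayfirst=True)
```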
You can also convert both columns at the same time by using the apply() method with to_datetime().
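Something like this, where apply() passes each column to pd.to_datetime and forwards the keyword argument:

```python
cols = ['advertisement_date', 'sale_date']
df[cols] = df[cols].apply(pd.to_datetime, dayfirst=True)
```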
Both approaches give you the same result.
Now the dates are in a consistent format. However, not all the data has been converted: there's one NaT value in advertisement_date and two in sale_date, which means the date is missing.
Let's check whether the columns have been converted to dates by using the info() method.
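The same call as before:

```python
df.info()
```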
As you can see, both columns are now in datetime64[ns] format.
Now, try to convert the data from TEXT to NUMERIC in this Airbnb dataset.
Real-world datasets often have missing values. Handling missing data is vital, as certain algorithms cannot handle such values.
Our example also has some missing values, so let's take a look at the two most common approaches to handling missing data.
If the number of rows with missing data is insignificant compared to the total number of observations, you might consider deleting these rows.
In our example, the last row has no values except the square feet and advertisement date. We can't use such data, so let's remove this row.
Here's the code, where we indicate the row's index.
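The exact index depends on your data; using df.index[-1] targets the last row without hard-coding its label:

```python
# Drop the last row of the DataFrame
df.drop(df.index[-1], inplace=True)
```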
With the last row deleted, our DataFrame looks better. However, there is still some missing data, which we'll handle using another approach.
If you have significant missing data, a better strategy than deleting could be imputation. This process involves filling in missing values based on other data. For numerical data, common imputation methods involve using a measure of central tendency (mean, median, mode).
In our already changed DataFrame, we have NaT (Not a Time) values in the columns advertisement_date and sale_date. We'll impute these missing values using the mean() method.
The code uses the fillna() method to find and fill the null values with the mean value.
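A sketch for the two date columns (Series.mean() works on datetime64 columns and returns a Timestamp):

```python
# Replace NaT values with each column's mean date
df['advertisement_date'] = df['advertisement_date'].fillna(df['advertisement_date'].mean())
df['sale_date'] = df['sale_date'].fillna(df['sale_date'].mean())
```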
You can also do the same thing in one line of code. We use apply() to apply a function defined with lambda. As above, this function uses the fillna() and mean() methods to fill in the missing values.
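Something like this:

```python
cols = ['advertisement_date', 'sale_date']
# Fill NaT in each column with that column's own mean
df[cols] = df[cols].apply(lambda x: x.fillna(x.mean()))
```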
In both cases, the NaT values are replaced by each column's mean. Because the mean of datetime values generally carries a time component, our sale_date column now has times, which we don't need. Let's remove them.
We'll use the strftime() method, which converts the dates to their string representation in a specific format.
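Assuming a YYYY-MM-DD target format (the format string itself is a choice):

```python
# Format as date-only strings; note this turns the column back into object dtype
df['sale_date'] = df['sale_date'].dt.strftime('%Y-%m-%d')
```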