Streamline Data Pipelines: How to Use WhyLogs with PySpark for Data Profiling and Validation

Components of whylogs

Let's begin by understanding the important characteristics of whylogs.

This is all we need to know about whylogs. If you're curious to know more, I encourage you to check the documentation. Next, let's set things up for the tutorial.

We'll use a Jupyter notebook for this tutorial. To make our code work anywhere, we'll run JupyterLab in Docker. This setup installs all the needed libraries and gets the sample data ready. If you're new to Docker and want to learn how to set it up, check out this link.

Start by downloading the sample data (CSV) from here. This data is what well use for profiling and validation. Create a data folder in your project root directory and save the CSV file there. Next, create a Dockerfile in the same root directory.
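The Dockerfile itself isn't reproduced in this copy of the post, but a minimal sketch, assuming a jupyter/pyspark-notebook base image from Jupyter Docker Stacks (which the post refers to later) plus a whylogs install, could look like this:

```Dockerfile
# Sketch of a Dockerfile for this tutorial (assumed, not the original):
# the Jupyter Docker Stacks PySpark image already ships JupyterLab and PySpark.
FROM jupyter/pyspark-notebook:latest

# whylogs with the viz extra, used later for the constraints report
RUN pip install --no-cache-dir "whylogs[viz]"

# copy the sample dataset into the default Jupyter working directory
COPY data/patient_data.csv /home/jovyan/data/patient_data.csv
```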

This Dockerfile is a set of instructions to create a specific environment for the tutorial. Let's break it down:

By now your project directory should look something like this.
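Based on the files created so far, it should roughly be:

```
project-root/
├── Dockerfile
└── data/
    └── patient_data.csv
```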

Awesome! Now, let's build a Docker image. To do this, type the following command in your terminal, making sure you're in your project's root folder.
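The exact command isn't preserved in this copy; given the image name used below, it would be along these lines:

```bash
docker build -t pyspark-whylogs .
```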

This command creates a Docker image named pyspark-whylogs. You can see it in the Images tab of your Docker Desktop app.

Next step: let's run this image to start JupyterLab. Type another command in your terminal.
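Again, the original command isn't preserved here; assuming the standard JupyterLab port mapping described below, it would look something like:

```bash
docker run -p 8888:8888 pyspark-whylogs
```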

This command launches a container from the pyspark-whylogs image. It makes sure you can access JupyterLab through port 8888 on your computer.

After running this command, you'll see a URL in the logs that looks like this: http://127.0.0.1:8888/lab?token=your_token. Click on it to open the JupyterLab web interface.

Great! Everything's set up for using whylogs. Now, let's get to know the dataset we'll be working with.

We'll use a dataset about hospital patients. The file, named patient_data.csv, includes 100k rows with these columns:

As for where this dataset came from, don't worry. It was created by ChatGPT. Next, let's start writing some code.

First, open a new notebook in JupyterLab. Remember to save it before you start working.

We'll begin by importing the needed libraries.
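The original import cell isn't preserved in this copy; a sketch of the imports used in the profiling sections below, assuming whylogs' experimental PySpark API, would be:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, mean, to_date, when

# whylogs' PySpark integration lives in its experimental API
from whylogs.api.pyspark.experimental import (
    collect_column_profile_views,
    collect_dataset_profile_view,
)

# used later for the condition count metric
from whylogs.core.metrics.condition_count_metric import Condition
from whylogs.core.relations import Predicate
```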

Then, we'll set up a SparkSession. This lets us run PySpark code.
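A minimal setup (the application name is just a placeholder):

```python
# create (or reuse) a local SparkSession
spark = SparkSession.builder.appName("whylogs-pyspark-tutorial").getOrCreate()
```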

After that, we'll make a Spark dataframe by reading the CSV file. We'll also check out its schema.
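Something along these lines, assuming the CSV sits in the data folder created earlier:

```python
# read the patient data and let Spark infer the column types
df = spark.read.csv("data/patient_data.csv", header=True, inferSchema=True)
df.printSchema()
```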

Next, let's peek at the data. We'll view the first row in the dataframe.
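For instance:

```python
# show the first row without truncating long values
df.show(1, truncate=False)
```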

Now that we've seen the data, it's time to start data profiling with whylogs.

To profile our data, we will use two functions. First, there's collect_column_profile_views. This function collects detailed profiles for each column in the dataframe. These profiles give us stats like counts, distributions, and more, depending on how we set up whylogs.
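Calling the function on our dataframe might look like this (the variable name is ours):

```python
# profile every column of the Spark dataframe;
# returns a dict mapping column name -> ColumnProfileView
column_views_dict = collect_column_profile_views(df)
print(column_views_dict.keys())
```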

Each column in the dataset gets its own ColumnProfileView object in a dictionary. We can examine various metrics for each column, like their mean values.

whylogs will look at every data point and statistically decide whether or not that data point is relevant to the final calculation.

For example, let's look at the average height.
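Assuming the distribution metric exposes a mean component, as in recent whylogs releases, it can be read like this:

```python
# mean of the height column as recorded in the whylogs profile
column_views_dict["height"].get_metric("distribution").mean.value
```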

Next, we'll also calculate the mean directly from the dataframe for comparison.
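For example:

```python
# mean of the height column computed directly by Spark, for comparison
df.select(mean(col("height")).alias("mean_height")).show()
```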

But profiling columns one by one isn't always enough. So, we use another function, collect_dataset_profile_view. This function profiles the whole dataset, not just single columns. We can combine it with Pandas to analyze all the metrics from the profile.
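A sketch of that step, again using the experimental PySpark API:

```python
# profile the entire dataframe in one pass
profile_view = collect_dataset_profile_view(input_df=df)

# flatten the profile into a Pandas dataframe of per-column metrics
profile_df = profile_view.to_pandas()
profile_df.head()
```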

We can also save this profile as a CSV file for later use.
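For example (the file name and path are placeholders):

```python
# persist the flattened profile metrics for later use
profile_df.to_csv("/home/jovyan/patient_data_profile.csv")
```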

The folder /home/jovyan in our Docker container is from Jupyter's Docker Stacks (ready-to-use Docker images containing Jupyter applications). In these Docker setups, 'jovyan' is the default user for running Jupyter. The /home/jovyan folder is where Jupyter notebooks usually start and where you should put files to access them in Jupyter.

And that's how we profile data with whylogs. Next, we'll explore data validation.

For our data validation, we'll perform these checks:

Now, let's start. Data validation in whylogs starts from data profiling. We can use the collect_dataset_profile_view function to create a profile, like we saw before.

However, this function builds a profile with standard metrics such as averages and counts. But what if we need to check individual values in a column, rather than constraints that can be evaluated against aggregate metrics? That's where condition count metrics come in. They're like adding a custom metric to our profile.

Let's create one for the visit_date column to validate each row.
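The condition below relies on a helper called check_date_format, which isn't included in this copy of the post. A minimal sketch, assuming visit_date values are expected in YYYY-MM-DD format, could be:

```python
import datetime

def check_date_format(value) -> bool:
    # True only when the value parses as a YYYY-MM-DD date
    # (the expected format is an assumption; adjust it to your data)
    try:
        datetime.datetime.strptime(str(value), "%Y-%m-%d")
        return True
    except ValueError:
        return False
```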

visit_date_condition = {"is_date_format": Condition(Predicate().is_(check_date_format))}

Once we have our condition, we add it to the profile. We use a Standard Schema and add our custom check.

Then we re-create the profile with both standard metrics and our new custom metric for the visit_date column.
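A sketch of these two steps, using whylogs' declarative schema and standard resolver (module paths as in recent whylogs versions; treat them as assumptions):

```python
from whylogs.core.resolvers import STANDARD_RESOLVER
from whylogs.core.schema import DeclarativeSchema
from whylogs.core.specialized_resolvers import ConditionCountMetricSpec

# start from the standard metrics, then attach the condition count
# metric to the visit_date column only
schema = DeclarativeSchema(STANDARD_RESOLVER)
schema.add_resolver_spec(
    column_name="visit_date",
    metrics=[ConditionCountMetricSpec(visit_date_condition)],
)

# re-profile the dataframe with the extended schema
profile_view_v2 = collect_dataset_profile_view(input_df=df, schema=schema)
```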

With our profile ready, we can now set up our validation checks for each column.
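The full list of checks from the original post isn't preserved here; a sketch covering the two checks discussed in the findings, using whylogs' constraint factories, might look like this:

```python
from whylogs.core.constraints import ConstraintsBuilder
from whylogs.core.constraints.factories import condition_meets, greater_than_number

builder = ConstraintsBuilder(dataset_profile_view=profile_view_v2)

# every visit_date value should satisfy the custom date-format condition
builder.add_constraint(
    condition_meets(column_name="visit_date", condition_name="is_date_format")
)

# weights should be strictly positive
builder.add_constraint(greater_than_number(column_name="weight", number=0))
```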

constraints = builder.build()
constraints.generate_constraints_report()

We can also use whylogs to show a report of these checks.
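One way to do that is whylogs' notebook visualizer:

```python
from whylogs.viz import NotebookProfileVisualizer

visualization = NotebookProfileVisualizer()
visualization.constraints_report(constraints, cell_height=300)
```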

It'll be an HTML report showing which checks passed or failed.

Here's what we find: the visit_date format check and the weight check both report failures.

Let's double-check these findings in our dataframe. First, we check the visit_date format with PySpark code.
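The exact query isn't preserved in this copy; one way to count the rows whose visit_date fails to parse (assuming the YYYY-MM-DD format from earlier) is:

```python
# rows whose visit_date cannot be parsed as a date become null
df.withColumn(
    "null_check",
    when(to_date(col("visit_date"), "yyyy-MM-dd").isNull(), "null").otherwise("not_null"),
).groupBy("null_check").count().show(truncate=False)
```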

+----------+-----+
|null_check|count|
+----------+-----+
|not_null  |98977|
|null      |1023 |
+----------+-----+

It shows that 1,023 out of 100,000 rows don't match our date format. Next, the weight column.
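A simple way to reproduce this (again, the original query isn't preserved):

```python
# count rows with a non-positive weight
df.filter(col("weight") <= 0).groupBy("weight").count().show(truncate=False)
```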

+------+-----+
|weight|count|
+------+-----+
|0     |2039 |
+------+-----+

Again, our findings match what whylogs reported: almost 2,000 rows have a weight of zero. And that wraps up our tutorial. You can find the notebook for this tutorial here.
