Profiling is a software engineering task in which software bottlenecks are analyzed programmatically. This process includes analyzing memory usage, the number of function calls and the runtime of those calls. Such analysis is important because it provides a rigorous way to detect parts of a software program that may be slow or resource inefficient, ultimately allowing for the optimization of software programs.
Profiling has use cases across almost every type of software program, including those used for data science and machine learning tasks. This includes extraction, transformation and loading (ETL) and machine learning model development. You can use the Pandas library in Python to conduct profiling on ETL, including profiling Pandas operations like reading in data, merging data frames, performing groupby operations, typecasting and missing value imputation.
Identifying bottlenecks in machine learning software is an important part of our work as data scientists. For instance, consider a Python script that reads in data and performs several operations on it for model training and prediction. Suppose the steps in the machine learning pipeline are reading in data, performing a groupby, splitting the data for training and testing, fitting three types of machine learning models, making predictions for each model type on the test data, and evaluating model performance. For the first deployed version, the runtime might be a few minutes.
After a data refresh, however, imagine that the script's runtime increases to several hours. How do we know which step in the ML pipeline is causing the problem? Software profiling allows us to detect which part of the code is responsible so we can fix it.
Another example relates to memory. Consider the memory usage of the first version of a deployed machine learning pipeline. This script may run for an hour each month and use 100 GB of memory. In the future, an updated version of the model, trained on a larger data set, may run for five hours each month and require 500 GB of memory. This increase in resource usage is to be expected with an increase in data set size. Detecting such an increase may help data scientists and machine learning engineers decide if they would like to optimize the memory usage of the code in some way. Optimization can help prevent companies from wasting money on unnecessary memory resources.
Python provides useful tools for profiling software in terms of runtime and memory. One of the most basic and widely used is the timeit module, which offers an easy way to measure the execution times of software programs. The Python memory_profiler module allows you to measure the memory usage of individual lines of code in your Python script. You can easily implement both of these tools with just a few lines of code.
We will work with the credit card fraud data set and build a machine learning model that predicts whether or not a transaction is fraudulent. We will construct a simple machine learning pipeline and use Python profiling tools to measure runtime and memory usage. This data has an Open Database License and is free to share, modify and use.
To start, let's import the Pandas library and read our data into a Pandas data frame:
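A minimal sketch of this step (the filename creditcard.csv is an assumption):

```python
import pandas as pd

# Read the credit card fraud data into a data frame
# (the filename 'creditcard.csv' is an assumption)
df = pd.read_csv('creditcard.csv')
```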
Next, let's relax the display limits for columns and rows using the Pandas method set_option():
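For example:

```python
# Relax the display limits so all columns and rows are shown when printing
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
```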
Next, let's display the first five rows of data using the head() method:
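```python
# Display the first five rows of data
print(df.head())
```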
Next, to get an idea of how big this data set is, we can use the len() function to see how many rows there are:
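```python
# Count the number of rows in the data frame
print(len(df))
```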
And we can do something similar for counting the number of columns. We can access the columns attribute of our Pandas data frame object and use the len() function to count the number of columns:
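```python
# Count the number of columns
print(len(df.columns))
```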
We can see that this data set is relatively large: 284,807 rows and 31 columns. Further, it takes up 150 MB of space. To demonstrate the benefits of profiling in Python, we'll start with a small subsample of this data on which we'll perform ETL and train a classification model.
Let's proceed by generating a small subsample data set. Let's take a random sample of 10,000 records from our data. We will also pass a value for random_state, which will guarantee that we select the same set of records every time we run the script. We can do this using the sample() method on our Pandas data frame:
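A sketch of this step (the random_state value of 42 is an arbitrary choice):

```python
# Take a reproducible random sample of 10,000 records
df_sample = df.sample(10000, random_state=42)
```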
Next, we can write the subsample of our data to a new csv file:
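For example (the output filename is an assumption):

```python
# Write the subsample to a new csv file
df_sample.to_csv('creditcard_subsample10000.csv', index=False)
```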
Now we can start building out the logic for data preparation and model training. Let's define a method that reads in our csv file, stores it in a data frame and returns it:
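A sketch of such a method (the function and variable names throughout the rest of this walkthrough are illustrative, not the article's exact code):

```python
def read_data(filename):
    # Read a csv file into a Pandas data frame and return it
    df = pd.read_csv(filename)
    return df
```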
Next, let's define a function that selects a subset of columns in the data. The function will take a data frame and a list of columns as inputs and return a new data frame containing only the selected columns:
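Something like the following sketch:

```python
def data_prep(df, columns):
    # Return a new data frame containing only the selected columns
    df_subset = df[columns].copy()
    return df_subset
```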
Next, let's define a method that specifies the model inputs and output and returns these values:
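A possible sketch:

```python
def define_inputs_output(df, input_columns, output_column):
    # Separate the model inputs (X) from the output (y)
    X = df[input_columns]
    y = df[output_column]
    return X, y
```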
We can then define a method used for splitting data for training and testing. First, at the top of our script, let's import the train_test_split method from the model_selection module in Scikit-learn:
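```python
from sklearn.model_selection import train_test_split
```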
Now we can define our method for splitting our data:
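A sketch of the splitting method (the test size and random_state values are assumptions):

```python
def split_data(X, y):
    # Hold out 20 percent of the data for testing
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    return X_train, X_test, y_train, y_test
```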
Next, we can define a method that will fit a model of our choice to our training data. Let's start with a simple logistic regression model. We can import the logistic regression class from the linear_model module in Scikit-learn:
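```python
from sklearn.linear_model import LogisticRegression
```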
We will then define a method that takes our training data and an input that specifies the model type. We will use the model type parameter later on to define and train a more complex model:
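A sketch of this training method, with only logistic regression supported for now:

```python
def model_training(X_train, y_train, model_type):
    # Fit the requested model type to the training data;
    # additional model types will be added later
    if model_type == 'logistic_regression':
        model = LogisticRegression()
    else:
        raise ValueError(f"Unknown model type: {model_type}")
    model.fit(X_train, y_train)
    return model
```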
Next, we can define a method that takes our trained model and test data as inputs and returns predictions:
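For example:

```python
def predict(model, X_test):
    # Generate predictions on the test data
    y_pred = model.predict(X_test)
    return y_pred
```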
Finally, let's define a method that evaluates our predictions. We'll use average precision, which is a useful performance metric for imbalanced classification problems. An imbalanced classification problem is one where one of the targets has significantly fewer examples than the other target(s). In this case, most of the transaction data correspond to legitimate transactions, whereas a small minority of transactions are fraudulent:
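A sketch of the evaluation method:

```python
from sklearn.metrics import average_precision_score

def evaluate(y_test, y_pred):
    # Average precision summarizes the precision-recall trade-off,
    # which is informative for imbalanced classification problems
    ap = average_precision_score(y_test, y_pred)
    print("Average precision:", ap)
    return ap
```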
Now we have all of the logic in place for our simple ML pipeline. Let's execute this logic for our small subsample of data. First, let's define a main function that we'll use to execute our code. In this main function, we'll read in our subsampled data:
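A sketch of the start of this main function (the subsequent snippets continue its body; the filename is an assumption):

```python
def main():
    # Read in the subsampled data
    df = read_data('creditcard_subsample10000.csv')
```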
Next, we use the data prep method to select our columns. Let's select V1, V2, V3, Amount and Class:
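Continuing inside main():

```python
    # Select the columns we will work with
    df = data_prep(df, ['V1', 'V2', 'V3', 'Amount', 'Class'])
```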
Let's then define inputs and output. We will use V1, V2, V3 and Amount as inputs; the Class column will be the output:
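```python
    # Define the model inputs and output
    X, y = define_inputs_output(df, ['V1', 'V2', 'V3', 'Amount'], 'Class')
```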
We'll split our data for training and testing:
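```python
    # Split the data for training and testing
    X_train, X_test, y_train, y_test = split_data(X, y)
```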
Fit our model to the training data:
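```python
    # Fit a logistic regression model to the training data
    model = model_training(X_train, y_train, 'logistic_regression')
```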
Make predictions:
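```python
    # Make predictions on the test data
    y_pred = predict(model, X_test)
```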
And, finally, evaluate model predictions:
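```python
    # Evaluate the predictions with average precision
    evaluate(y_test, y_pred)
```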
We can then execute the main function with the following logic:
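```python
if __name__ == '__main__':
    main()
```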
Executing the script prints the average precision of our logistic regression model on the test data.
Now we can use some profiling tools to monitor memory usage and runtime.
Let's start by monitoring runtime. Let's import the default_timer from the timeit module in Python:
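```python
from timeit import default_timer as timer
```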
Next, let's see how long it takes to read our data into a Pandas data frame. We define start and end time variables and print the difference to see how much time has elapsed:
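Inside main(), this could look like the following sketch:

```python
    # Time how long it takes to read in the data
    start = timer()
    df = read_data('creditcard_subsample10000.csv')
    end = timer()
    print("Reading in the data took", end - start, "seconds")
```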
If we run our script, we see that it takes 0.06 seconds to read in our data.
Let's do the same for each step in the ML pipeline. We'll calculate the runtime for each step and store the results in a dictionary:
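A sketch of what the fully timed main function might look like (the dictionary keys are illustrative):

```python
def main():
    runtime_metrics = {}

    # Time reading in the data
    start = timer()
    df = read_data('creditcard_subsample10000.csv')
    runtime_metrics['read_data'] = timer() - start

    # Time column selection
    start = timer()
    df = data_prep(df, ['V1', 'V2', 'V3', 'Amount', 'Class'])
    runtime_metrics['data_prep'] = timer() - start

    # Time defining inputs and output
    start = timer()
    X, y = define_inputs_output(df, ['V1', 'V2', 'V3', 'Amount'], 'Class')
    runtime_metrics['define_inputs_output'] = timer() - start

    # Time the train/test split
    start = timer()
    X_train, X_test, y_train, y_test = split_data(X, y)
    runtime_metrics['split_data'] = timer() - start

    # Time model fitting
    start = timer()
    model = model_training(X_train, y_train, 'logistic_regression')
    runtime_metrics['model_training'] = timer() - start

    # Time prediction
    start = timer()
    y_pred = predict(model, X_test)
    runtime_metrics['predict'] = timer() - start

    # Time evaluation
    start = timer()
    evaluate(y_test, y_pred)
    runtime_metrics['evaluate'] = timer() - start

    print(runtime_metrics)
```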
Executing the script prints the runtime recorded for each step.
We see that reading in the data and fitting the model are the most time-consuming operations. Let's rerun this with the large data set. At the top of our main function, we change the file name to this:
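```python
    # Use the full data set instead of the subsample
    # (the filename 'creditcard.csv' is an assumption)
    df = read_data('creditcard.csv')
```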
And now, let's rerun our script.
We see that, when we use the full data set, reading the data into a data frame takes 1.6 seconds, compared to the 0.07 seconds it took for the smaller data set. Identifying the data-reading step as the source of the increased runtime is important for resource management. Understanding bottleneck sources like these can prevent companies from wasting resources like compute time.
Next, let's modify our model training method so that CatBoost is a model option:
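A sketch of the modified method, extending the earlier version with a CatBoost branch:

```python
from catboost import CatBoostClassifier

def model_training(X_train, y_train, model_type):
    # Fit the requested model type to the training data
    if model_type == 'logistic_regression':
        model = LogisticRegression()
    elif model_type == 'catboost':
        # CatBoost with its default parameters
        model = CatBoostClassifier()
    else:
        raise ValueError(f"Unknown model type: {model_type}")
    model.fit(X_train, y_train)
    return model
```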
Let's rerun our script, but now specifying a CatBoost model:
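In the sketch above, this means changing the model type passed inside main():

```python
    # Switch from 'logistic_regression' to 'catboost'
    model = model_training(X_train, y_train, 'catboost')
```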
Comparing the runtimes, we see that by using a CatBoost model instead of logistic regression, we increased our runtime from roughly two seconds to roughly 22 seconds, a more than tenfold increase from changing a single line of code. Imagine if this increase in runtime happened for a script that originally took 10 hours: runtime would increase to over 100 hours just by switching the model type.
Another important resource to keep track of is memory. We can use the memory_profiler module to monitor memory usage line by line in our code. First, let's install the memory_profiler module from the terminal using pip:
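```
pip install memory_profiler
```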
We can then simply add the @profile decorator before each function definition we want to profile. For example:
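A sketch of decorating one of our functions:

```python
from memory_profiler import profile

@profile
def read_data(filename):
    # Read a csv file into a Pandas data frame and return it
    df = pd.read_csv(filename)
    return df
```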
And so on.
Now, let's run our script using the logistic regression model type and look at the step where we fit the model. We see that the memory usage for fitting our logistic regression model is around 4.4 MB (line 61 of the profiler output).
Now, let's rerun this for CatBoost.
We see that memory usage for fitting our CatBoost model is 13.3 MB (line 64 of the profiler output). This corresponds to a threefold increase in memory usage. For our simple example, this isn't a huge deal, but if a company deploys a newer version of a production model and it goes from using 100 GB of memory to 300 GB, the increase can be significant in terms of resource cost. Further, having tools like this that can point to where the increase in memory usage is occurring is useful.
The code used in this post is available on GitHub.
Monitoring resource usage is an important part of software, data and machine learning engineering. Understanding runtime dependencies in your scripts, regardless of the application, is important in virtually all industries that rely on software development and maintenance. In the case of a newly deployed machine learning model, an increase in runtime can have a negative impact on business. A significant increase in production runtime can result in a diminished experience for a user of an application that serves real-time machine learning predictions.
For example, if the UX requirements are such that a user shouldn't have to wait more than a few seconds for a prediction result and this suddenly increases to minutes, the delay can result in frustrated customers who may eventually seek out a better or faster tool.
Understanding memory usage is also crucial because there are cases in which excessive memory usage isn't necessary. This excess can translate to thousands of dollars being wasted on memory resources that aren't needed. Consider our example of switching the logistic regression model for the CatBoost model. What mainly contributed to the increased memory usage was the CatBoost package's default parameters. These default parameters may result in unnecessary calculations being done by the package.
By understanding this dynamic, the researcher can modify the parameters of the CatBoost class. If this is done well, the researcher can retain the model accuracy while decreasing the memory requirements for fitting the model. Being able to quickly identify memory and runtime bottlenecks using these profiling tools is an essential skill for engineers and data scientists building production-ready software.