Data science is a practice that requires technical expertise in machine learning and code development. However, it also demands creativity (for instance, connecting dense numbers and data to real user needs) and lean thinking (like prioritizing the experiments and questions to explore next). In light of these needs, and to continuously innovate and create meaningful outcomes, it's essential to adopt processes and techniques that facilitate high levels of energy, drive and communication in data science development.
Pair programming can increase communication, creativity and productivity in data science teams. Pair programming is a collaborative way of working in which two people take turns coding and navigating on the same problem, at the same time, on the same computer connected with two mirrored screens, two mice and two keyboards.
At VMware Tanzu Labs, our data scientists practice pair programming with each other and with our client-side counterparts. Pair programming is more widespread in software engineering than in data science, and we see this as a missed opportunity. Let's explore the nuanced benefits of pair programming in the context of data science by looking at three aspects of the data science life cycle and how pair programming can help with each one.
When data scientists pick up a story for development, exploratory data analysis (EDA) is often the first step in which we start writing code. Arguably, among all components of the development cycle that require coding, EDA demands the most creativity from data scientists: The aim is to discover patterns in the data and build hypotheses around how we might be able to use this information to deliver value for the story at hand.
If new data sources need to be explored to deliver the story, we get familiar with them by asking questions about the data and validating what information they are able to provide to us. As part of this process, we scan sample records and iteratively design summary statistics and visualizations for reexamination.
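To make this concrete, here is a minimal sketch of the kind of first-pass summary a pair might put on the screen and then refine together. It is an illustration only: the sample records, the field names (`region`, `order_value`) and the helper functions are hypothetical, and it uses only the Python standard library.

```python
import statistics
from collections import Counter

# Hypothetical sample records; the field names are illustrative only.
records = [
    {"region": "north", "order_value": 120.0},
    {"region": "south", "order_value": 80.0},
    {"region": "north", "order_value": None},  # a missing value to surface
    {"region": "east", "order_value": 100.0},
    {"region": "south", "order_value": 60.0},
]


def summarize_numeric(rows, field):
    """Return basic summary statistics for a numeric field, skipping missing values."""
    values = [row[field] for row in rows if row[field] is not None]
    return {
        "count": len(values),
        "missing": len(rows) - len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }


def summarize_categorical(rows, field):
    """Return how often each category appears in a field."""
    return dict(Counter(row[field] for row in rows))


value_summary = summarize_numeric(records, "order_value")
region_counts = summarize_categorical(records, "region")
print(value_summary)
print(region_counts)
```

A summary like this is a starting point for conversation rather than a result. The missing-value count, for example, is exactly the kind of detail a partner might ask about before any hypothesis is built on the mean.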
Pairing in this context lets us discuss what is on the screen as soon as it appears, which sparks a continuous stream of second opinions and tweaks to the statistics and visualizations; we each build on the energy of our partner. Practicing this level of energetic collaboration in data science goes a long way toward building the creative confidence needed to generate a wider range of hypotheses, and it adds more scrutiny to synthesis when distinguishing between coincidence and correlation.
Based on what we learn about the data from EDA, we next try to summarize a pattern we've observed that is useful in delivering value for the story at hand. In other words, we build or train a model that concisely and sufficiently represents a useful and valuable pattern observed in the data.
Arguably, this part of the development cycle demands the most science from data scientists as we continuously design, analyze and redesign a series of scientific experiments. We iterate on a cycle of training and validating model prototypes and make a selection as to which one to publish or deploy for consumption.
Pairing is essential to facilitating lean and productive experimentation in model training and validation. With so many options of model forms and algorithms available, balancing simplicity and sufficiency is necessary to shorten development cycles, increase feedback loops and mitigate overall risk in the product team.
As a data scientist, I sometimes need to resist the urge to use a sophisticated, stuffy algorithm when a simpler model fits the bill. I have biases based on prior experience that influence the algorithms explored in model training.
Having my paired data scientist as my "data conscience" in model training helps me put on the brakes when I'm running a superfluous number of experiments, constructively challenges the choices made in algorithm selection and course-corrects me when I lose focus on training prototypes strictly in support of the current story.
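One way a pair might make the "simplest model that fits the bill" habit explicit is to agree on a selection rule before running experiments. The sketch below is an assumption of ours, not a prescribed method: it orders candidates from simplest to most complex and picks the first one whose validation error is within an agreed tolerance of the best. The two toy models, the data and the tolerance value are all hypothetical.

```python
def fit_mean(xs, ys):
    """Simplest candidate: always predict the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Next candidate: ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def mean_absolute_error(model, xs, ys):
    """Average absolute difference between predictions and targets."""
    return sum(abs(model(x) - y) for x, y in zip(xs, ys)) / len(xs)


def pick_simplest_sufficient(candidates, train, valid, tolerance):
    """Pick the first candidate (ordered simplest first) whose validation
    error is within `tolerance` of the best validation error."""
    scores = {}
    for name, fit in candidates:
        model = fit(*train)
        scores[name] = mean_absolute_error(model, *valid)
    best = min(scores.values())
    for name, _ in candidates:
        if scores[name] <= best + tolerance:
            return name, scores


# Hypothetical data: a roughly linear relationship.
train = ([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
valid = ([5, 6], [10.1, 11.9])
candidates = [("mean", fit_mean), ("line", fit_line)]  # simplest first

chosen, scores = pick_simplest_sufficient(candidates, train, valid, tolerance=0.5)
print(chosen, scores)
```

Writing the rule down does not replace the conversation, but it gives the navigator something specific to point to when the urge to try one more algorithm arises.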
In addition to aspects of pair programming that influence productivity in specific components of the development cycle such as EDA and model training/validation, there are also perhaps more mundane benefits of pairing for data science that affect productivity and reproducibility more generally.
Take the example of pipelining. Much of the code written for data science is sequential by nature. The metrics we discover and design in EDA are derived from raw data that requires sequential coding to clean and process. These same metrics are then used as key pieces of information (a.k.a. features) when we build experiments for model training. In other words, the code written to design these metrics is a dependency for the code written for model training. Within model training itself, we often try different versions of a previously trained model (which we have previously written code to build) by exploring different variations of input parameter values to improve accuracy. The components and dependencies described above can be represented as steps and segments in a logical, sequential pipeline of code.
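The sequence described above can be sketched as a handful of small functions, each one a pipeline segment that depends on the output of the previous one. This is a toy illustration under stated assumptions: the raw fields (`visits`, `spend`, `converted`), the derived metric and the threshold rule standing in for a trained model are all hypothetical.

```python
def clean(raw_rows):
    """Step 1: drop rows with missing fields and cast strings to numbers."""
    cleaned = []
    for row in raw_rows:
        if any(row.get(key) in (None, "") for key in ("visits", "spend", "converted")):
            continue
        cleaned.append({
            "visits": int(row["visits"]),
            "spend": float(row["spend"]),
            "converted": int(row["converted"]),
        })
    return cleaned


def build_features(clean_rows):
    """Step 2: derive the metric designed during EDA (spend per visit)."""
    return [
        {**row, "spend_per_visit": row["spend"] / row["visits"]}
        for row in clean_rows
        if row["visits"] > 0
    ]


def accuracy(feature_rows, threshold):
    """Step 3: score a toy stand-in for a model, which predicts conversion
    when spend per visit reaches the threshold."""
    correct = sum(
        int(row["spend_per_visit"] >= threshold) == row["converted"]
        for row in feature_rows
    )
    return correct / len(feature_rows)


def run_pipeline(raw_rows, thresholds):
    """Run the steps in order and compare variations of the input parameter."""
    features = build_features(clean(raw_rows))
    results = {t: accuracy(features, t) for t in thresholds}
    best = max(results, key=results.get)
    return best, results


# Hypothetical raw data, as it might arrive from a CSV export.
raw_rows = [
    {"visits": "4", "spend": "40.0", "converted": "1"},
    {"visits": "2", "spend": "5.0", "converted": "0"},
    {"visits": "", "spend": "12.0", "converted": "0"},  # dropped in clean()
    {"visits": "5", "spend": "60.0", "converted": "1"},
    {"visits": "3", "spend": "9.0", "converted": "0"},
]

best_threshold, results = run_pipeline(raw_rows, thresholds=[1.0, 5.0, 20.0])
print(best_threshold, results)
```

Because each segment takes the previous segment's output as its only input, the dependencies are visible in `run_pipeline` rather than living in the head of whoever wrote the code.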
Pairing in the context of pipelining brings shared accountability, driven by a sense of shared ownership of the codebase. While all data scientists know and understand the benefits of segmenting and modularizing code, when coding without a pair it is easy to slip into the habit of writing overly lengthy code blocks, losing track of similar code that has been copied, pasted and modified, and overlooking groups of code dependencies that are obvious only to the person coding. These habits create cobwebs in the codebase and put reproducibility at risk.
Enter your paired data scientist, who can raise a hand when it becomes challenging to follow the code, highlight groups of code to break up into pipeline segments and suggest blocks of repeated similar code to bundle into reusable functions. Note that this works bidirectionally: when practicing pairing, the data scientist who is typing is fully aware of the shared nature of code ownership and is proactively driven to make efforts to write reproducible code. Pairing is thus an enabler for creating and maintaining a reproducible data science codebase.
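As a small, hypothetical example of the kind of suggestion a pair might make, consider scaling logic that has been copied and modified for each column. The column names and values below are illustrative only; the point is the move from repeated blocks to one reusable function.

```python
# Before: the same scaling logic copied, pasted and modified per column.
#
#   age_min, age_max = min(ages), max(ages)
#   ages_scaled = [(a - age_min) / (age_max - age_min) for a in ages]
#   income_min, income_max = min(incomes), max(incomes)
#   incomes_scaled = [(i - income_min) / (income_max - income_min) for i in incomes]


# After: one function, written and reviewed once, reused everywhere.
def min_max_scale(values):
    """Scale values to the [0, 1] range; a constant column maps to all zeros."""
    low, high = min(values), max(values)
    if high == low:
        # Guard against division by zero, which the copied version lacked.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


ages = [20, 30, 40]
incomes = [30000, 45000, 60000]

ages_scaled = min_max_scale(ages)
incomes_scaled = min_max_scale(incomes)
print(ages_scaled, incomes_scaled)
```

Bundling the logic also gives the pair a single place to handle edge cases, such as the constant-column guard here, instead of fixing them in every copy.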
If pair programming is new to your data science practice, we hope this post encourages you to explore it with your team. At Tanzu Labs, we have introduced pair programming to many of our client-side data scientists and have observed that the cycles of continuous communication and feedback inherent in pair programming instill a way of working that sparks more creativity in data discovery, facilitates lean experimentation in model training and promotes better reproducibility of the codebase. And let's not forget that we do all of this to deliver outcomes that delight users and drive meaningful business value.
Here are some practical tips to get started with pair programming in data science:
See the original post here:
Why Data Science Teams Should Be Using Pair Programming - The New Stack