Framing Data Science Problems the Right Way From the Start – MIT Sloan

The failure rate of data science initiatives, often estimated at over 80%, is far too high. We have spent years researching the reasons for companies' low success rates and have identified one underappreciated issue: Too often, teams skip right to analyzing the data before agreeing on the problem to be solved. This lack of initial understanding dooms many projects from the very beginning.

Of course, this issue is not a new one. Albert Einstein is often quoted as having said, "If I were given one hour to save the planet, I would spend 59 minutes defining the problem and one minute solving it."

Consider how often data scientists need to "clean up the data" on data science projects, often as quickly and cheaply as possible. This may seem reasonable, but it ignores the critical "why" question: Why is there bad data in the first place? Where did it come from? Does it represent blunders, or are there legitimate data points that are just surprising? Will they occur in the future? How does the bad data impact this particular project and the business? In many cases, we find that a better problem statement is to find and eliminate the root causes of bad data.
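As a rough illustration of asking where bad data comes from before cleaning it, the following minimal sketch tallies flawed records by their origin. The record fields, the source names, and the plausibility rule for ages are all hypothetical assumptions made for this example, not details from any real project.

```python
from collections import Counter

# Hypothetical records; field names, sources, and values are illustrative only.
records = [
    {"source": "web_form", "age": 34},
    {"source": "web_form", "age": None},
    {"source": "call_center", "age": 210},
    {"source": "call_center", "age": 41},
    {"source": "call_center", "age": None},
    {"source": "partner_feed", "age": 29},
]


def classify(record):
    """Label a record 'missing', 'implausible', or 'ok' under assumed rules."""
    age = record["age"]
    if age is None:
        return "missing"
    if not 0 <= age <= 120:  # assumed plausibility range, not a standard
        return "implausible"
    return "ok"


def issues_by_source(rows):
    """Count each kind of issue per source, skipping clean records."""
    counts = Counter()
    for row in rows:
        label = classify(row)
        if label != "ok":
            counts[(row["source"], label)] += 1
    return counts


summary = issues_by_source(records)
for (source, label), n in sorted(summary.items()):
    print(f"{source}: {label} x{n}")
```

A tally like this does not explain why a source produces bad data, but it can point the team toward where to start asking, and it leaves open whether an "implausible" value is a blunder or a legitimate surprise.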

Too often, we see examples where people either assume that they understand the problem and rush to define it, or they don't build the consensus needed to actually solve it. We argue that a key to successful data science projects is to recognize the importance of clearly defining the problem and to adhere to proven principles in doing so. This problem is not confined to technology teams; we find that many business, political, management, and media projects, at all levels, also suffer from poor problem definition.

Data science uses the scientific method to solve often complex (or multifaceted) and unstructured problems using data and analytics. In analytics, the term "fishing expedition" refers to a project that was never framed correctly to begin with and involves trolling the data for unexpected correlations. This type of data fishing does not meet the spirit of effective data science but is prevalent nonetheless. Consequently, defining the problem correctly needs to be step one. We previously proposed an organizational bridge between data science teams and business units, to be led by an "innovation marshal," someone who speaks the language of both the data and management teams and can report directly to the CEO. This marshal would be an ideal candidate to assume overall responsibility for ensuring that the following proposed principles are applied.

Get the right people involved. To ensure that your problem framing has the correct inputs, you have to involve, from the beginning, all the key people whose contributions are needed to complete the project successfully. After all, data science is an interdisciplinary, transdisciplinary team sport. This team should include those who own the problem, those who will provide data, those responsible for the analyses, and those responsible for all aspects of implementation. Think of the RACI matrix: those responsible, accountable, to be consulted, and to be informed for each aspect of the project.
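One way to make the RACI idea concrete is to write the matrix down and check it for gaps. The sketch below is only an illustration: the project aspects, the role names, and the rule that each aspect needs at least one responsible party and exactly one accountable party are assumptions (the latter is a common RACI convention, not something stated in this article).

```python
# Hypothetical project aspects and roles; none of these names come from the article.
raci = {
    "problem statement": {"R": ["problem owner"], "A": ["project sponsor"],
                          "C": ["data scientist"], "I": ["IT lead"]},
    "data supply": {"R": ["data steward"], "A": ["project sponsor"],
                    "C": ["data scientist"], "I": []},
    "analysis": {"R": ["data scientist"], "A": [],
                 "C": ["problem owner"], "I": ["project sponsor"]},
    "implementation": {"R": [], "A": ["project sponsor"],
                       "C": ["data scientist"], "I": ["problem owner"]},
}


def find_gaps(matrix):
    """Return aspects that break the assumed rules:
    at least one responsible party and exactly one accountable party."""
    gaps = {}
    for aspect, roles in matrix.items():
        problems = []
        if not roles.get("R"):
            problems.append("no one responsible")
        if len(roles.get("A", [])) != 1:
            problems.append("needs exactly one accountable party")
        if problems:
            gaps[aspect] = problems
    return gaps


gaps = find_gaps(raci)
for aspect, problems in gaps.items():
    print(f"{aspect}: {', '.join(problems)}")
```

With the sample data, the check flags "analysis" and "implementation", the kind of omission that is easier to fix while the problem is still being framed than after the analysis is underway.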

Recognize that rigorously defining the problem is hard work. We often find that the problem statement changes as people work to nail it down. Leaders of data science projects should encourage debate, allow plenty of time, and document the problem statement in detail as they go. This ensures broad agreement on the statement before moving forward.

Don't confuse the problem and its proposed solution. Consider a bank that is losing market share in consumer loans and whose leadership team believes that competitors are using more advanced models. It would be easy to jump to a problem statement that looks something like "Build more sophisticated loan risk models." But that presupposes that a more sophisticated model is the solution to market share loss, without considering other possible options, such as increasing the number of loan officers, providing better training, or combating new entrants with more effective marketing. Confusing the problem and proposed solution all but ensures that the problem is not well understood, limits creativity, and keeps potential problem solvers in the dark. A better statement in this case would be "Research root causes of market share loss in consumer loans, and propose viable solutions." This might lead to more sophisticated models, or it might not.

Understand the distinction between a proximate problem and a deeper root cause. In our first example, the unclean data is a proximate problem, whereas the root cause is whatever leads to the creation of bad data in the first place. Importantly, "We don't know enough to fully articulate the root cause of the bad data problem" is a legitimate state of affairs, one that calls for a small-scale subproject.

Do not move past problem definition until it meets the following criteria:

Taking the time needed to properly define the problem can feel uncomfortable. After all, we live and work in cultures that demand results and are eager to "get on with it." But shortchanging this step is akin to putting the cart before the horse: It simply doesn't work. There is no substitute for probing more deeply, getting the right people involved, and taking the time to understand the real problem. All of us, data scientists, business leaders, and politicians alike, need to get better at defining the right problem the right way.

Roger W. Hoerl (@rogerhoerl) teaches statistics at Union College in Schenectady, New York. Previously, he led the applied statistics lab at GE Global Research. Diego Kuonen (@diegokuonen) is head of Bern, Switzerland-based Statoo Consulting and a professor of data science at the Geneva School of Economics and Management at the University of Geneva. Thomas C. Redman (@thedatadoc1) is president of New Jersey-based consultancy Data Quality Solutions and coauthor of The Real Work of Data Science: Turning Data Into Information, Better Decisions, and Stronger Organizations (Wiley, 2019).
