Show me your data quality scorecard and I'll tell you whether you will be successful a year from now.
Every day I talk to organizations ready to dedicate a tremendous amount of time and resources towards data quality initiatives doomed to fail.
It's no revelation that incentives and KPIs drive good behavior. Sales compensation plans are scrutinized so closely that they often rise to the topic of board meetings. What if we gave the same attention to data quality scorecards?
Even in their heyday, traditional data quality scorecards from the Hadoop era were rarely wildly successful. I know this because prior to starting Monte Carlo, I spent years as an operations VP trying to create data quality standards that drove trust and adoption.
Over the past few years, advances in the cloud and metadata management have made organizing silly amounts of data possible.
Data engineering processes are starting to trend towards the level of maturity and rigor of more longstanding engineering disciplines. And of course, AI has the potential to streamline everything.
While this problem isn't and probably will never be completely solved, I have seen organizations adopt best practices that are the difference between initiative success and having another kick-off meeting 12 months later.
Here are 4 key lessons for building data quality scorecards:
The surest way to fail any data-related initiative is to assume all data is of equal value. And the only way to determine what matters is to talk to the business.
Brandon Beidel at Red Ventures articulates a good place to start:
I'd ask:
Now, this may be easier said than done if you work for a sprawling organization with tens of thousands of employees distributed across the globe.
In these cases, my recommendation is to start with the business units that own your most business-critical data (if you don't know which those are, I can't help you!). Start a discussion on requirements and priorities.
Just remember: prove the concept first, scale second. You'd be shocked how many people do it the other way around.
One of the enduring challenges to this type of endeavor, in a nutshell, is that data quality resists standardization. Quality is, and should be, in the eye of the use case.
The six dimensions of data quality are a vital part of any data quality scorecard and an important starting point, but for many teams, that's just the beginning and every data product is different.
For instance, a financial report may need to be highly accurate, with some margin for timeliness, whereas a machine learning model may be the exact opposite.
From an implementation perspective, this means measuring data quality has typically been radically federated. Data quality is measured on a table-by-table basis by different analysts or stewards, with wildly different data quality rules given wildly different weights.
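To make the table-by-table weighting concrete, here is a minimal sketch of a per-table quality score as a weighted average of dimension scores. The dimension names, weights, and 0-100 scale are illustrative assumptions, not a standard:

```python
# Hypothetical sketch: one table's quality score as a weighted average of
# its dimension scores. Dimensions, weights, and the 0-100 scale are
# illustrative assumptions, not a prescribed standard.
from typing import Dict


def table_quality_score(dimension_scores: Dict[str, float],
                        weights: Dict[str, float]) -> float:
    """Weighted average of 0-100 dimension scores; weights need not sum to 1."""
    total_weight = sum(weights.get(d, 0.0) for d in dimension_scores)
    if total_weight == 0:
        raise ValueError("no weighted dimensions provided")
    return sum(s * weights.get(d, 0.0)
               for d, s in dimension_scores.items()) / total_weight


# A financial report might weight accuracy heavily; an ML feature table
# might flip the weights toward timeliness instead.
finance_weights = {"accuracy": 0.6, "completeness": 0.3, "timeliness": 0.1}
score = table_quality_score(
    {"accuracy": 98.0, "completeness": 90.0, "timeliness": 70.0},
    finance_weights,
)
```

Because each steward picks their own dimensions and weights, two "90" scores from different teams can mean very different things, which is exactly the federation problem described above.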
This makes sense to a degree, but so much gets lost in translation.
Data is multi-use and shared across use cases. Not only is one person's yellow quality score another person's green, but it's often incredibly difficult for data consumers to even understand what a yellow score means or how it's been graded. They also frequently miss the implications of a green table being fed data by a red one (you know, garbage in, garbage out).
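The garbage-in, garbage-out problem can be made visible by propagating scores through lineage. Below is a minimal sketch, assuming an acyclic table-to-parents graph and a min-cap rule (a table's effective score is capped by its worst ancestor); the table names, scores, and the min-cap rule itself are all illustrative assumptions:

```python
# Hypothetical sketch: propagate quality scores through lineage so a "green"
# table fed by a "red" upstream can't report green. The toy lineage graph,
# scores, and min-cap rule are illustrative assumptions (graph assumed acyclic).
upstreams = {                       # table -> list of parent tables
    "revenue_report": ["orders_clean"],
    "orders_clean": ["orders_raw"],
    "orders_raw": [],
}
local_scores = {"revenue_report": 95, "orders_clean": 92, "orders_raw": 40}


def effective_score(table: str) -> int:
    """A table is only as healthy as its worst ancestor (min-cap rule)."""
    parents = upstreams.get(table, [])
    if not parents:
        return local_scores[table]
    return min(local_scores[table],
               min(effective_score(p) for p in parents))
```

Here the report's local score of 95 is misleading on its own: the "red" raw table upstream drags its effective score down, which is the implication consumers frequently miss.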
Surfacing the number of breached rules is important, of course, but you also need to:
So then what else do you need? You need to measure the machine.
In other words, the components in the production and delivery of data that generally result in high quality. This is much easier to standardize. It's also easier to understand across business units and teams.
Airbnb's Midas is one of the more well-known internal data quality score and certification programs, and rightfully so. They lean heavily into this concept. They measure data accuracy, but reliability, stewardship, and usability actually comprise 60% of the total score.
Many data teams are still in the process of formalizing their own standards, but the components we have found to correlate most highly with data health include:
Yay, another set of processes we're required to follow! said no one ever.
Remember, the purpose of measuring data health isn't to measure data health. The point, as Clark at Airbnb put it, is to drive a preference for producing and using high quality data.
The best practices I've seen here are to have a minimum set of requirements for data to be on-boarded onto the platform (stick) and a much more stringent set of requirements to be certified at each level (carrot).
Certification works as a carrot because producers actually want consumers to use their data, and consumers will quickly discern and develop a taste for highly reliable data.
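The stick-and-carrot structure can be sketched as a small gate: a minimum onboarding bar, plus stricter requirements per certification tier. The check names, tier names, and thresholds below are illustrative assumptions, not Midas's actual criteria:

```python
# Hypothetical sketch: a stick (minimum onboarding checks) plus a carrot
# (stricter checks per certification tier). Check names and tiers are
# illustrative assumptions, not any program's actual criteria.
ONBOARDING = {"has_owner", "has_description"}      # stick: required to ship

TIERS = {  # carrot: each tier adds requirements on top of onboarding
    "bronze": {"freshness_sla"},
    "silver": {"freshness_sla", "quality_checks"},
    "gold": {"freshness_sla", "quality_checks", "usability_review"},
}


def certify(passed_checks: set) -> str:
    """Return the highest tier whose requirements are all satisfied."""
    if not ONBOARDING <= passed_checks:
        return "rejected"                          # fails the minimum bar
    best = "uncertified"
    for tier in ("bronze", "silver", "gold"):
        if TIERS[tier] <= passed_checks:
            best = tier
    return best
```

The design choice matters: onboarding gates are binary and enforced, while tiers are earned, which gives producers a visible incentive to climb toward the certification consumers will look for.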
Almost nothing in data management is successful without some degree of automation and the ability to self-serve. Airbnb discarded any scoring criteria that 1) wasn't immediately understandable and 2) couldn't be measured automatically.
Your organization must do the same. Even if it's the best scoring criteria that has ever been conceived, if you do not have a set of solutions that will automatically collect and surface it, into the trash bin it must go.
The most common ways I've seen this done are with data observability and quality solutions, and data catalogs. Roche, for example, does this and layers on access management as part of creating, surfacing and governing trusted data products.
Of course this can also be done by manually stitching together the metadata from multiple data systems into a homegrown discoverability portal, but just be mindful of the maintenance overhead.
Data teams have made big investments into their modern data and AI platforms. But to maximize this investment, the organization, both data producers and consumers, must fully adopt and trust the data being provided.
At the end of the day, whats measured is managed. And isnt that what matters?
Most Data Quality Initiatives Fail Before They Start. Here's Why. | by Barr Moses | Jul, 2024 - Towards Data Science