Why is data preparation so hard and are we getting worse at it?

It’s said that the majority of every analytical project’s time is taken up by data preparation tasks. But is this estimate trustworthy?

Most of us in the Business Analytics industry will tell you that about 60-80% of every analytical project’s budget and time is eaten up by data preparation tasks: project setup, data cleaning, and solving data quality issues. Only the remaining 20% is committed to the actual analysis. This statistic is one of the mantras repeated by the industry, the universities, the software vendors, the press, and the analysts themselves. We know it’s true so well that none of us bothers to actually back it up… with data. Here are just some of the public mentions in major tech and business media:

> Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.

The New York Times, 2014. No source given for the statistic.

> Analysts will still spend up to 80% of their time just trying to create the data set to draw insights.

Forrester, 2015. No source given for the statistic.

> Since the popular emergence of data science as a field, its practitioners have asserted that 80% of the work involved is acquiring and preparing data.

Harvard Business Review, reprinting the statistic from Forbes in 2016. Forbes cites a “survey of about 80 data scientists [that] was conducted for the second year in a row by CrowdFlower.”

These percentages differ slightly depending on who is using the statistic, and over the years I have observed that the time attributed to the actual analysis has been steadily dropping. While the lower bound of 60% has fallen out of fashion, a considerably more tragic 90% has made its way into some marketing collateral.

Have we got worse at [ETL](https://en.m.wikipedia.org/wiki/Extract,_transform,_load)? It is true that Big Data brought us Big Problems: funky data formats, a data quality liberation movement, and a daily growth rate that makes us all twitch. Still, wouldn’t Moore’s Law suggest that with time we get better at crunching numbers? The impressive progress in machine learning and advanced analytics, and the rise of sleek new analytical tools, have all been a major aid in data processing efforts. ETL engines are doing better than ever, and some great people invented Notepad++ and Python to help non-IT heads automate the boring stuff.

About two years ago I tracked down the source of this statistic: it comes from a report published by TDWI in 2003, “Evaluating ETL and Data Integration Platforms.” The study, conducted by Wayne Eckerson and Colin White, was based on questionnaires received from 741 respondents “who had both deployed ETL functionality and were either IT professionals or consultants at end user organizations.” The section suggestively titled “Why ETL is Hard” includes the spotlight quote: “According to most practitioners, ETL design and development work consumes 60 to 80 percent of an entire BI project.”

If in 2003 ETL took between 60 and 80% of the time of any analytical project, today’s quoted rate of 80 to 90% is not merely a lack of progress; it is a decline in our data preparation capabilities.

Why this downward trend? Based on the current state of software development, and having repeatedly seen the original TDWI-reported quote in print, PowerPoints, and brochures, I suspect that whatever the trend is, we don’t know it. Instead, everyone has been using the same statistic for the last 15 years to back up their gut feeling. It is a fascinating case of a study that over time has taken on a life of its own. It’s become that old pair of jeans that has stretched over time, but that you insist on wearing for that cosy feeling. It could be that our marketeers and journalists have gotten a bit ahead of themselves, and, like Alice chasing the white rabbit, we have all followed. The quotes I opened with come from widely acclaimed outlets: The New York Times, Forrester Research, Forbes, and Harvard Business Review. All of these sources take the 80-20 ratio for granted (except Forbes, which offers a shaky survey of about 80 people as proof). To stay impartial, I left out the marketing brochures of analytics vendors; check those yourself: they are a world of wonder of their own.

I am far from saying that we have mastered data preparation today. For all I know, we might be even worse at it now than in 2003. Yet challenging the status quo of something we take for granted – a statistic we use as affirmation of our beliefs – is always so much fun.