Why is data preparation so hard and are we getting worse at it?

Some statistics for the aperitif:

"Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets." (New York Times, 2014. No source given for the statistic.)

"Analysts will still spend up to 80% of their time just trying to create the data set to draw insights." (Forrester, 2015. No source given for the statistic.)

"Since the popular emergence of data science as a field, its practitioners have asserted that 80% of the work involved is acquiring and preparing data." (Harvard Business Review, reprinting the statistic from Forbes in 2016. Forbes cites a "survey of about 80 data scientists was conducted for the second year in a row by CrowdFlower.")

When I started this blog my idea was to focus on the types of problems that impede data analysis. There is this famous statistic that most of us in the Business Analytics industry have internalised as true: about 60-80% of every analytical project's budget and time is eaten up by project setup, data cleaning, solving data quality issues, etc., while the actual analysis is done with only the remaining 20%. The percentages differ slightly depending on who is using the statistic, and over the years I have observed that the time allocated to the analysis has been steadily dropping. While the lower bound of 60% has fallen out of fashion, a considerably more tragic 90% has made its way into some marketing collateral.

Have we got worse at ETL? It is true that Big Data brought us Big Problems: funky data formats, a data quality liberation movement, and daily growth rates that make us all twitch. Still, wouldn't Moore's Law suggest that with time we get better at crunching numbers? The impressive progress in machine learning and advanced analytics, and the rise of sleek new analytical tools, have all been a major aid in data processing efforts. ETL engines are doing better than ever, and some great people invented Notepad++ and Python to help non-IT heads automate the boring stuff.
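To give a taste of the "boring stuff" Python lets non-IT heads automate, here is a minimal, purely illustrative sketch of a typical data-preparation chore: tidying a messy CSV with only the standard library. The column names, sample values, and cleaning rules are invented for this example, not taken from any study mentioned above.

```python
import csv
import io

# Invented sample data: stray whitespace, a thousands separator,
# and an "N/A" placeholder -- the everyday mess of data preparation.
RAW = """name, signup_date ,revenue
 Alice ,2023-01-05,"1,200"
Bob, 2023-02-11 ,N/A
"""

def clean(text):
    """Return a list of tidy dicts from a messy CSV string."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    # Strip stray whitespace from the header names themselves.
    reader.fieldnames = [f.strip() for f in reader.fieldnames]
    rows = []
    for row in reader:
        raw_rev = row["revenue"].strip().replace(",", "")
        rows.append({
            "name": row["name"].strip(),
            "signup_date": row["signup_date"].strip(),
            # Treat "N/A" and empty strings as missing values.
            "revenue": float(raw_rev) if raw_rev not in ("", "N/A") else None,
        })
    return rows
```

Nothing here is clever; that is the point. The tedium the statistic describes is exactly this kind of trimming, coercing, and null-handling, repeated across every source.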

About two years ago I tracked down the source of this statistic: it comes from research published by TDWI in 2003, "Evaluating ETL and Data Integration Platforms." The study, conducted by Wayne Eckerson and Colin White, was based on questionnaires received from 741 respondents "who had both deployed ETL functionality and were either IT professionals or consultants at end user organizations." The section suggestively titled "Why ETL is Hard" includes the spotlight quote: "According to most practitioners, ETL design and development work consumes 60 to 80 percent of an entire BI project."

If in 2003 ETL took between 60 and 80% of the time of any analytical project, today's quoted rate of 80 to 90% is not just a lack of progress; it is a decline in our data preparation capabilities. Why this downward trend? Based on the current state of software development, and having repeatedly seen the original TDWI quote reprinted in articles, PowerPoints, and brochures, I suspect that whatever the trend is, we don't actually know it. Instead, everyone has been using the same statistic for the last 15 years to back their gut feeling.

It is a fascinating case of a study that has evolved over time into a life of its own. It has become that old pair of jeans that stretched over the years, but that you insist on wearing because of that cosy feeling. It could be that our marketeers and journalists have got a bit ahead of themselves, and, like Alice after the White Rabbit, we all followed. The quotes I opened with come from widely acclaimed outlets: The New York Times, Forrester Research, Forbes, and Harvard Business Review. All of these sources take the 80-20 ratio for granted (except Forbes, which offers a shady survey of about 80 people as proof). To stay impartial, I left out the marketing brochures of analytics vendors; check for yourself: they are a world of wonder of their own.

I am far from saying that we have mastered data preparation today. For all I know we might be even worse at it now than in 2003. Yet challenging the status quo of something we take for granted, a statistic we use as affirmation of our beliefs, is always so much fun.



3 thoughts on "Why is data preparation so hard and are we getting worse at it?"

  1. The issue is that while some time ago we had fewer tools and less power for calculations, we also had smaller expectations and less data. Today, with the explosion of the cloud, even a mid-sized company risks scattering all its data among many sources. Without someone controlling and enforcing rules of conduct it becomes impossible to match all the sources. So I would say it is not that ETL got worse; it is that preparation is more difficult than ever.

  2. Back in 2014, Sean Kandel (Stanford, now Trifacta) wrote this article for Harvard Business Review (https://hbr.org/2014/04/the-sexiest-job-of-the-21st-century-is-tedious-and-that-needs-to-change). He mentions some interviews conducted with data analysts, and I found this one available:

    Enterprise Data Analysis and Visualization: An Interview Study. Sean Kandel, Andreas Paepcke, Joseph Hellerstein, Jeffrey Heer. Proc. IEEE Visual Analytics Science & Technology (VAST), Oct 2012. Best Paper Honorable Mention.

    Link: http://vis.stanford.edu/files/2012-EnterpriseAnalysisInterviews-VAST.pdf

    Time was not quantified, though efforts were.
