Why is data preparation so hard and are we getting worse at it?

Some statistics for the aperitif:

“Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.” New York Times, 2014. No source for the statistics.

“Analysts will still spend up to 80% of their time just trying to create the data set to draw insights.” Forrester, 2015. No source for the statistics.

“Since the popular emergence of data science as a field, its practitioners have asserted that 80% of the work involved is acquiring and preparing data.” Harvard Business Review, reprinting the statistic from Forbes in 2016. Forbes, in turn, writes that a “survey of about 80 data scientists was conducted for the second year in a row by CrowdFlower.”


A recipe proved to solve every* VM’s proxy problems

Summary: Intro | Obtaining the proxy server address | Configuring proxy settings – checklist | Other tools

*Fedora / RHEL / CentOS

Initially I wanted to title this post Welcome to Proxy Hell because, at least at first, getting the proxy settings right on a VM can feel like a nightmare, especially if you have no idea where to start or, more depressingly, when none of your attempts to fix the problem seem to work. If you work in an office, you have almost inevitably come across proxies. It has become standard for companies to guard their network traffic with a proxy server. The idea is that the server acts as an intermediary between the private company network and the internet, which both hides the web traffic from outside eyes and can serve as a base for implementing access authentication and bandwidth control.

Perhaps the first time you consciously acknowledge the presence of a proxy is when your browser’s homepage, instead of directing you to Google.co.uk, goes to Google.in or Google.pl, and the Search button shows up in a different language than expected. That’s the location of your proxy server fooling Google. The second time you come across proxies is less amusing: this is when you start working with, or worse, configuring Virtual Environments and realise that even basic tasks, like accessing a webpage or installing a package, don’t work. For instance, if you followed the Hadoop clustering guide from my last post in an office environment, you wouldn’t have been able to get most of it working without setting up a proxy. Yet the guide conveniently skips that topic with a vague warning: make sure you’re not behind a proxy. So, what to do if you were?

Your first DIY Hadoop cluster

Summary: Intro | Linux VM Setup | VM Networking | Extending a Hadoop Cluster

At times I wish I had started my journey with Big Data earlier so that I could have entered the market in 2008-2009. Though Hadoopmania is still going strong in IT, those years were a golden era for Hadoop professionals. With any sort of Hadoop experience you could be considered for an £80,000 position. There was such a shortage of Hadoop skills in the job market that even a complete beginner could land a wonderfully overpaid job. Today you can’t just wing it at the interview; the market has matured and there are many talented and qualified people pursuing careers in Big Data. That said, years later, the demand for Hadoop knowledge is still on the rise, making it a profitable career choice for the foreseeable future.

Hadoop Salaries in the UK (Source: IT Jobs Watch)


While these days there seems to be a separation between analyst and administrator/developer roles on the market, I am of the opinion that each role has to be aware of the objectives of the other. That is: an analyst should understand the workings of a Hadoop cluster, just as a developer needs to understand the demand an analysis will put on the worker nodes. It’s very similar to a skilled Business Intelligence specialist who appreciates the impact a database design has on the speed of query processing and the availability of the system. That philosophy is the reason behind this post: getting to know Hadoop by configuring a cluster yourself. You could be creating a cluster simply because you want to see how it’s done, or perhaps you are looking to extend the processing power of your system with an extra server.