Last Thursday evening I had the opportunity to talk about Hadoop at a Glolent Global Talent community virtual meetup. Glolent connects remote IT workers across the globe and facilitates skill-sharing sessions that any member can join or present at. I rarely demonstrate Hadoop anymore, so I needed a couple of evenings to brush up on the fundamentals. The talk is a distilled product of that study. I approached Hadoop’s architecture from a historical perspective: I started the talk by introducing the root problem – the I/O bottleneck in processing Big Data – and positioned the Hadoop Distributed File System as its remedy. There was an obligatory intro to Hadoop’s original processing paradigm, MapReduce, illustrated with the classic word count example run over Shakespeare’s collected works. That was followed by a review of programming abstractions built on MapReduce and some alternative processing engines – with an emphasis on Spark. A 30-minute talking slot leaves very little time, so I had to cut any mention of resource or cluster management. You can judge for yourself how it turned out – I shared the recording below.
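For readers who missed the talk, the word count idea boils down to a map step that emits (word, 1) pairs, a shuffle that groups pairs by word, and a reduce step that sums each group. Here is a minimal, Hadoop-free Python sketch of that flow – the function names and sample lines are mine, chosen for illustration, not taken from the talk or from Hadoop’s API:

```python
from collections import defaultdict
import re

def mapper(line):
    # Map step: emit a (word, 1) pair for every word in a line of text.
    for word in re.findall(r"[a-z']+", line.lower()):
        yield word, 1

def shuffle(pairs):
    # Shuffle step: group all values by key, as Hadoop does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reducer(word, counts):
    # Reduce step: sum the counts collected for one word.
    return word, sum(counts)

lines = ["To be or not to be", "that is the question"]
mapped = (pair for line in lines for pair in mapper(line))
counts = dict(reducer(word, values) for word, values in shuffle(mapped))
print(counts["to"])  # prints 2
```

On a real cluster the mapper and reducer run in parallel across worker nodes and the shuffle moves data over the network, but the logic per record is exactly this simple – which is what made MapReduce such an approachable model.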
It’s said that the majority of every analytical project’s time is taken up by data preparation tasks. But is this estimate trustworthy?
Summary: Intro | Linux VM Setup | VM Networking | Extending a Hadoop Cluster
At times I wish I had started my journey with Big Data earlier so that I could have entered the market in 2008-2009. Though Hadoopmania is still going strong in IT, those years were a golden era for Hadoop professionals. With any sort of Hadoop experience you could be considered for an £80,000 position. There was such a shortage of Hadoop skills in the job market that even a complete beginner could land a wonderfully overpaid job. Today you can’t just wing it at the interview; the market has matured and there are many talented and qualified people pursuing careers in Big Data. That said, even after all these years, the demand for Hadoop knowledge is still on the rise, making it a profitable career choice for the foreseeable future.
Hadoop Salaries in the UK (Source: IT Jobs Watch)
While these days there seems to be a separation between analyst and administrator/developer roles on the market, I am of the opinion that either role has to be aware of the objectives of the other. That is: an analyst should understand the workings of a Hadoop cluster, just as a developer needs to understand the demands an analysis will put on the worker nodes. It’s very similar to a skilled Business Intelligence specialist who appreciates the impact a database design has on the speed of query processing and the availability of the system. That philosophy is the why behind this post: getting to know Hadoop by configuring a cluster yourself. You could be creating a cluster simply because you want to see how it’s done, or perhaps you are looking to extend the processing power of your system with an extra server. Continue reading “Your first DIY Hadoop cluster”