It’s said that the majority of every analytical project’s time is taken up by data preparation tasks. But is this estimate trustworthy?
Summary: Intro | A case of tl;dr | Where was the graph police? | A quick fix
Today’s “from scratch” example with D3 is a must-have element of any data visualisation portfolio: a line chart. Line charts are great for visualising changes in data over time. Just as in the previous posts in the series, my visualisation is a variation of a piece of code I found on the web. I started with a basic template created by Mike Bostock and then re-worked some of its elements to boost its usability and readability. As with the previous examples, all the code can be downloaded, reused and adjusted, and it scales up and down to accommodate an extra data series or the removal of one.
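As a taste of what the post builds, the heart of any D3 line chart is a pair of scales that map data values to pixel coordinates. The sketch below reimplements the idea behind d3.scaleLinear in plain JavaScript — the function name, sample series and chart dimensions are illustrative, not taken from the post:

```javascript
// Minimal stand-in for d3.scaleLinear: maps a data domain onto a pixel range.
function scaleLinear([d0, d1], [r0, r1]) {
  return (value) => r0 + ((value - d0) / (d1 - d0)) * (r1 - r0);
}

// Hypothetical sample series: four values plotted on a 500x300px chart.
const series = [12, 35, 28, 50];
const x = scaleLinear([0, series.length - 1], [0, 500]);
const y = scaleLinear([0, 50], [300, 0]); // SVG y grows downwards, so flip the range

// Each data point becomes an (x, y) pixel coordinate for the line path.
const points = series.map((v, i) => [x(i), y(v)]);
console.log(points[0]); // [0, 228]
```

In the real template, d3.scaleLinear does this job (with extras like tick generation), and d3.line turns the resulting points into an SVG path.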
Summary: Intro | A Simple Bar Chart | A Multi-Series Bar Chart
If this post was a painting, it would probably be one of Mark Ryden’s works: it seems I have just gone and done one very detailed blog post. The funny thing is that it’s about bar charts, and everything has already been said about bar charts. In fact, a bar chart is a graph so simple that this post should never have been written: yet the simplicity of a bar chart is actually its most dangerous trap. It’s very easy to overdo, and with so few elements it’s tempting to tweak or embellish at least some of them. So this blog post is, above all, about restraint. I will look at what constitutes a good bar chart – and why – what the best practices are, and how to fight the horror vacui of a simple plot. We will use D3.js and the blank canvas we built with zero coding skills in the last post to create a reusable template for a simple bar graph, and then for a multi-series bar graph. This is part of a data visualisation with D3 series, throughout which we will create a set of graphics that can be easily re-purposed for data visualisation projects.
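To make the “reusable template” idea concrete: the one non-obvious piece of a bar chart is the band scale that divides the chart width into evenly spaced bars. Below is a plain-JavaScript sketch of the idea behind d3.scaleBand — the category names, sizes and simplified padding rule are my own illustration; the real d3.scaleBand distinguishes inner and outer padding:

```javascript
// Minimal stand-in for d3.scaleBand: splits a pixel range into one band
// per category, leaving a fixed fraction of each step as padding.
function scaleBand(categories, [r0, r1], padding) {
  const step = (r1 - r0) / categories.length;
  const bandwidth = step * (1 - padding);
  const position = (cat) =>
    r0 + categories.indexOf(cat) * step + (step * padding) / 2;
  position.bandwidth = bandwidth; // width to use for each <rect>
  return position;
}

// Hypothetical categories on a 400px-wide chart, 25% padding.
const x = scaleBand(['A', 'B', 'C', 'D'], [0, 400], 0.25);
console.log(x('B'), x.bandwidth); // 112.5 75
```

Each bar is then drawn as a rectangle at `x(category)` with width `x.bandwidth`; everything else in a bar chart is axes and restraint.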
Summary: Intro | About D3.js | Initial Setup & Python Server | Canvas Setup
In the following series I will cover the basics of data visualisation. There are many data visualisation tools available on the market (both free and paid), so for an everyday analyst, knowing how to build graphs from scratch is not essential. However, most (if not all) of these pre-built tools fall short as soon as any customisation is required: it could be a graph type that is not supported, or a design that cannot be adjusted to follow the company branding guidelines. In such cases, knowing how to build something yourself is essential.
Summary: Intro | Obtaining the proxy server address | Configuring proxy settings – checklist | Other tools
*Fedora / RHEL / CentOS
At first I wanted to title this post Welcome to Proxy Hell, because – at least at first – getting the proxy settings right on a VM can feel like a nightmare, especially if you have no idea where to start or, more depressingly, when none of your attempts to fix the problem seems successful. If you work in an office, you have almost inevitably come across proxies. It has become standard for companies to guard their network traffic with a proxy server. The idea is that the server acts as an intermediary between the private company network and the internet, which both hides the web traffic from outside eyes and can serve as a base for implementing access authentication and bandwidth control.
Perhaps the first time you consciously acknowledge the presence of a proxy is when your browser’s homepage, instead of directing you to Google.co.uk, goes to Google.in or Google.pl, and the Search button shows up in a different language than expected. That’s your proxy server’s location fooling Google. The second time you come across proxies is less amusing: this is when you start working with, or worse, configuring virtual machines and realise that even basic tasks, like accessing a webpage or installing a package, don’t work. For instance, if you followed the Hadoop clustering guide from my last post in an office environment, you wouldn’t have been able to get most of it working without setting up a proxy. Yet the guide conveniently skips that topic with a vague warning: make sure you’re not behind a proxy. So, what to do if you are? Continue reading “A recipe proved to solve every* VM’s proxy problems”
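For the impatient: on most Linux VMs, the core of the fix is exporting the standard proxy environment variables, which curl, wget and many other command-line tools respect. The address below is a placeholder — substitute your own proxy host and port:

```shell
# Placeholder address — replace with your company's proxy host and port.
export http_proxy="http://proxy.example.com:8080"
export https_proxy="$http_proxy"
# Hosts that should bypass the proxy (local and internal traffic).
export no_proxy="localhost,127.0.0.1"

# Quick sanity check that the variables are set:
echo "$http_proxy"
```

Put these lines in ~/.bashrc or /etc/environment to make them persistent; some tools keep their own proxy settings too, which the post’s checklist walks through.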
Summary: Intro | Linux VM Setup | VM Networking | Extending a Hadoop Cluster
At times I wish I had started my journey with Big Data earlier, so that I could have entered the market in 2008–2009. Though Hadoopmania is still going strong in IT, those years were a golden era for Hadoop professionals. With any sort of Hadoop experience you could be considered for an £80,000 position. There was such a shortage of Hadoop skills in the job market that even a complete beginner could land a wonderfully overpaid job. Today you can’t just wing it at the interview; the market has matured and there are many talented and qualified people pursuing careers in Big Data. That said, years later, the demand for Hadoop knowledge is still on the rise, making it a profitable career choice for the foreseeable future.
Hadoop Salaries in the UK (Source: IT Jobs Watch)
While these days there seems to be a separation between analyst and administrator/developer roles on the market, I am of the opinion that each role has to be aware of the objectives of the other. That is: an analyst should understand the workings of a Hadoop cluster, just as a developer needs to understand the demands an analysis will put on the worker nodes. It’s very much like a skilled Business Intelligence specialist who appreciates the impact a database design has on the speed of query processing and the availability of the system. That philosophy is the why behind this post: getting to know Hadoop by configuring a cluster yourself. You could be creating a cluster simply because you want to see how it’s done, or perhaps you are looking to extend the processing power of your system with an extra server. Continue reading “Your first DIY Hadoop cluster”
Summary: Intro | Virtualisation Software | Cloudera’s QuickStart VM | Importing a VM
In this post, I will introduce Virtual Machines: a core tool of every data scientist. If you, like me, get to experiment with different technologies at work, you are familiar with Virtual Machines. VMs are the best way of testing something out without having to install it on your computer and risk messing up your working environment. In essence, a VM is like a mini (virtual!) computer you put on your computer; it has its own operating system, like Windows, Linux or macOS, and it usually comes with a bunch of pre-installed and configured tools, so that you don’t have to worry about any (or much) setup. So you might have a Windows machine installed on your actual Windows machine, and while these two share computing resources and disk space, they are separate instances of Windows. Plus, you can delete or change a virtual machine as you please, you can have many of them and, by definition, none of this has any impact on your original working environment.
A traditional first post, then!
I want this space to become a journal of my wanderings in the world of data analysis. While I’ve been working as a consultant for a few years now, there is a multitude of topics I have never tackled and technologies I know nothing about. The idea for this learning space is to tackle a different problem every week or two.
I am no writer, and my previous experience is in creating functional specifications, where every term had to be precise and the sentences kept short and simple. In my Business Analyst beginnings the documentation I produced was poor, but with time I saw my writing quality improve. With that in mind, I expect learning to ‘blog’ will be a learning curve of its own (but I’m keeping a positive outlook).
Eve the Analyst