It's gonna be #phun

I will be at the Oracle Code conference on the 6th of June in Brussels, redefining fun with Python on Oracle Database. The idea is to live demo working with tables, handling JSON files, (new) spatial data queries, data visualisation, and a couple of good practices for database application development with Python.

There will be snakes, ridiculous maps, and misbehaving queries. Come if you’re in town, it’s gonna be #phun


Continue reading →

Why is data preparation so hard and are we getting worse at it?

It’s said that the majority of every analytical project’s time is taken up by data preparation tasks. But is this estimate trustworthy?

Continue reading →

Butterfly effect: OECD’s data visualisation fail leads to media panic

Summary: Intro | A case of tl;dr | Where was the graph police? | A quick fix

This is a short story about a graph that could have been done better and an article that went awry. The Organisation for Economic Co-operation and Development (OECD), one of the most powerful research bodies there is, published an excellent report on the influence of robotics on the job market, whose conclusion was misinterpreted in an influential Polish newspaper, Krytyka Polityczna, and in other media. In this post I will analyze both the article and the report, and then theorize about what went wrong and who (or what) is to blame.

Continue reading →

One Magical Configuration Proved to Solve Every VM’s Proxy Problems

Summary: Intro | Obtaining the proxy server address | Configuring proxy settings – checklist | Other tools

*Fedora / RHEL / CentOS

Initially, I wanted to title this post Welcome to Proxy Hell because, at least at first, getting the proxy settings right on a VM can feel like a nightmare, especially if you have no idea where to start or, more depressingly, when none of your attempts to fix the problem seem to work. If you work in an office, you have almost inevitably come across proxies. It has become standard for companies to guard their network traffic with a proxy server. The idea is that the server acts as an intermediary between the private company network and the internet, which both hides the web traffic from outside eyes and can serve as a base for implementing access authentication and bandwidth control.

Continue reading →

Your First DIY Hadoop Cluster

Summary: Intro | Linux VM Setup | VM Networking | Extending a Hadoop Cluster

At times I wish I had started my journey with Big Data earlier so that I could have entered the market in 2008-2009. Though Hadoopmania is still going strong in IT, those years were a golden era for Hadoop professionals. With any sort of Hadoop experience you could be considered for an £80,000 position. There was such a shortage of Hadoop skills in the job market that even a complete beginner could land a wonderfully overpaid job. Today you can’t just wing it at the interview; the market has matured and there are many talented and qualified people pursuing careers in Big Data. That said, years later, the demand for Hadoop knowledge is still on the rise, making it a profitable career choice for the foreseeable future.

Continue reading →

My Computer AKA My First Big Data Machine

Summary: Intro | Virtualisation Software | Cloudera’s QuickStart VM | Importing a VM

In this post, I will introduce Virtual Machines: the core platform of every data scientist. If you, like me, get to experiment with different technologies at work, you are familiar with Virtual Machines. VMs are the best way to test something out without having to install it on your computer and risk messing up your working environment. In essence, a VM is like a mini (virtual!) computer you put on your computer; that computer has its own environment, like Windows, Linux or macOS, and it usually comes with a bunch of pre-installed and configured tools, so that you don’t have to worry about any (or much) setup. So you might have a Windows machine installed on your actual Windows machine, and while these two share computing resources and space, they are separate instances of Windows. Plus, you can delete or change the virtual machine as you please, you can have many of them and, by definition, this has no impact on your original working environment.

Continue reading →

Hello World

A traditional first post, then!

I want this space to become a journal of my wanderings in the world of data analysis. While I’ve been working as a consultant for a few years now, there is a multitude of topics I have never tackled and technologies I know nothing about. The idea for this learning space is to tackle a different problem every week or two.

I am no writer, and my previous experience is in creating functional specifications, where every term had to be precise and the sentences kept short and simple. In my Business Analyst beginnings the documentation I produced was poor, but with time I saw my writing quality go up. With that in mind, I expect ‘blogging’ to come with a learning curve of its own (but I am keeping a positive outlook).

Continue reading →
