Harmful statistics: John Oliver on forensic science

John Oliver goes full John Oliver on forensic science, and it’s a must-watch. Forensic science brought statistics into the courtroom, yet it is itself only rarely exposed to scientific scrutiny. The justice system’s failure to question forensic methodology, combined with a common fear of scientific jargon, has led to innocent people being convicted of crimes they haven’t committed. Courts have handed down life sentences based solely on bite marks, partial fingerprints, and hair collected at a crime scene, despite there being no proof that these methods are infallible.

Watch John Oliver’s Last Week Tonight segment below, and make sure to read the excellent investigative article by Jordan Smith on The Intercept on which the show is based.


The search is long and the goal is elusive: Data Scientist (Desperately) Wanted

***A long holiday comes to an end – back to business!***

Summary: Wanted: Data Scientist | A bird in the hand is worth two in the bush | A little stir | The infallible art of taking steps back

In this article I will look at how organisations can engineer their own Data Science team without losing their minds in the process or spending big money. As more and more companies want to be data-driven, they join the frantic search for the right staff to fuel these initiatives. Finding a data-fluent resource is not easy: according to McKinsey, “by 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.” The hunt is on for a skill set that is still relatively new to the market and is only starting to be taught at universities. My belief is that fishing for a Data Scientist ‘superstar’ is often counterproductive and inevitably leads to the realisation that one person cannot do it all. Instead, investing in appropriate training for existing staff can bring long-lasting benefits to the company.

Data Scientists, just like mythical sea monsters, are hard to come by and feared by many. Image source: Carta Marina, ca. 1544, via Wikipedia

Continue reading “The search is long and the goal is elusive: Data Scientist (Desperately) Wanted”

Taking down the NPS Score: KO by Probability

Summary: Intro | Measuring the importance of gossip | Too good to be true | Arbitrary measures produce arbitrary results | tl;dr

Disclaimer: The probability computation of the article is a recap of a talk delivered by Professor Mark Whitehorn at the University of Dundee in 2015, and at PASS Business Analytics Conference in San Jose, CA in 2014. Opinions expressed are my own.

Aren’t we past the NPS hype yet? Such was my thinking until a random article came up in my feed: as one of its core objectives, a tech giant was planning to improve its Net Promoter Score by 2020. A quick internet search told me there are some companies very excited about increasing their NPS. Google Trends suggests interest in the Net Promoter methodology has been growing steadily since 2004; a mortal blow to my presumption. There is something problematic about the Net Promoter methodology that I’d like to talk about: on the one hand an indicator of outstanding business delivery, on the other a potentially dangerous framework for workforce assessment. This article decomposes the NPS algorithm, reviews its criticism, and tests its validity from a probability perspective. I have based the scenario and the probability computation on an excellent talk delivered by Professor Mark Whitehorn. If you happen to be a manager, a person whose performance is scored with NPS, someone who is into probability computations, or simply someone who likes debunking managerial fads, then this is a tale for you.
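For readers who haven’t met it before, the Net Promoter Score bins 0–10 survey responses into promoters (9–10), detractors (0–6), and passives (7–8), and reports the percentage of promoters minus the percentage of detractors. Here is a minimal Python sketch of that calculation, with a purely hypothetical response mix (not taken from the talk) to illustrate how unstable the score can be on small samples:

```python
import random

def nps(scores):
    """Standard NPS: % promoters (9-10) minus % detractors (0-6), on a -100..100 scale."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)

# Hypothetical population of 100 survey responses.
population = [10] * 30 + [9] * 20 + [8] * 25 + [7] * 10 + [5] * 15

print(nps(population))  # the "true" score for this mix: 35.0

# Draw a few random samples of 20 respondents: the score swings wildly,
# which is exactly the kind of fragility a probability argument exposes.
print([nps(random.sample(population, 20)) for _ in range(5)])
```

Even with a perfectly stable underlying opinion, a 20-person sample can move the score by tens of points; this is the sort of issue the probability computation in the full post examines.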

Continue reading “Taking down the NPS Score: KO by Probability”

It’s gonna be #phun

I will be at the Oracle Code conference on the 6th of June in Brussels, redefining fun with Python on the Oracle database. The idea is to live-demo working with tables, handling JSON files, (new) spatial data queries, data visualisation, and a couple of good practices for database application development with Python.

There will be snakes, ridiculous maps, and misbehaving queries. Come if you’re in town, it’s gonna be #phun
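To give a flavour of the demo, here is a minimal sketch of querying an Oracle table from Python with the cx_Oracle driver; the credentials, DSN, and table name are placeholders, not the actual setup used in the talk:

```python
import cx_Oracle

# Placeholder credentials and DSN: replace with your own user/password/host/service.
connection = cx_Oracle.connect("scott", "tiger", "localhost:1521/orclpdb1")
cursor = connection.cursor()

# Bind variables keep the query safe and let Oracle reuse the execution plan.
cursor.execute(
    "SELECT city, population FROM demo_cities WHERE population > :min_pop",
    min_pop=100000,
)
for city, population in cursor:
    print(city, population)

cursor.close()
connection.close()
```

The same cursor-based pattern extends to the JSON and spatial queries on the agenda: the SQL changes, the Python plumbing stays the same.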


Why is data preparation so hard and are we getting worse at it?

Some statistics for the aperitif:

“Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.” (New York Times, 2014; no source given for the statistic.)

“Analysts will still spend up to 80% of their time just trying to create the data set to draw insights.” (Forrester, 2015; no source given for the statistic.)

“Since the popular emergence of data science as a field, its practitioners have asserted that 80% of the work involved is acquiring and preparing data.” (Harvard Business Review, 2016, reprinting the statistic from Forbes; Forbes cites a survey of about 80 data scientists conducted for the second year in a row by CrowdFlower.)

Continue reading “Why is data preparation so hard and are we getting worse at it?”

Butterfly effect: OECD’s graph hiccup leads to media panic

Summary: Intro | A case of tl;dr | Where was the graph police? | A quick fix

This is a short story about a graph that could have been done better and an article that has gone awry. The Organisation for Economic Co-operation and Development (OECD), one of the most powerful research bodies there is, has published an excellent report on the influence of robotics on the job market, the conclusion of which was misinterpreted in an influential Polish newspaper, Krytyka Polityczna, and in other media. In this post I will analyse both the article and the report, and then theorise on what went wrong and who (or what) is to blame.

Getting Philosophical About a Line Chart. Data Visualisation from Scratch P.3

Today’s “from scratch” example with D3 is a must-have element of any data visualisation portfolio: a line chart. Line charts are great for visualising changes in data over time. Just as in the previous posts in the series, my visualisation is a variation on a piece of code I found on the web. I started with a basic template created by Mike Bostock and then re-worked some of its elements to boost its usability and readability. As with the previous examples, all code can be downloaded, reused, and adjusted, and it scales up and down to include an extra data series or to remove one.

Continue reading “Getting Philosophical About a Line Chart. Data Visualisation from Scratch P.3”

Stories from a Bar (Chart). Data Visualisation from Scratch P.2

Summary: Intro | A Simple Bar Chart | A Multi-Series Bar Chart

If this post were a painting, it would probably be one of Mark Ryden’s works: it seems I have just gone and written one very detailed blog post. The funny thing is that it’s about bar charts, and everything has already been said about bar charts. In fact, a bar chart is a graph so simple that this post should never have been written: yet the simplicity of a bar chart is actually its most dangerous trap. It’s very easy to overdo, and with so few elements it’s tempting to tweak or enhance at least some of them. So this blog post is, above all, about resistance. I will look at what constitutes a good bar chart (and why), what the best practices are, and how to fight the horror vacui of a simple plot. We will use D3.js and the blank canvas we built with zero coding skills in the last post to create a reusable template for a simple bar graph, and then for a multi-series bar graph. This is part of a data visualisation with D3 series, throughout which we will create a set of graphics that can be easily re-purposed for data visualisation projects.

Continue reading “Stories from a Bar (Chart). Data Visualisation from Scratch P.2”

D3 Canvas Setup with 0 Coding Skills. Data Visualisation from Scratch P.1

Summary: Intro | About D3.js | Initial Setup & Python Server | Canvas Setup

In the following series I will cover the basics of data visualisation. There are many data visualisation tools available on the market (free and paid), so for an everyday analyst the knowledge of how to build graphs from scratch is not essential. However, most (if not all) of these pre-built tools fall short as soon as any customisation is required: it could be a graph type that is not supported, or a design that cannot be adjusted to follow the company’s branding guidelines. Therefore, there are cases where knowing how to build something yourself is essential.
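As for the “Python Server” mentioned in the summary: it typically refers to serving the HTML and JavaScript files locally, since browsers generally block D3’s data loading (d3.csv, d3.json) straight from the filesystem. A minimal sketch using Python’s built-in http.server module (the exact setup in the full post may differ):

```python
# Minimal local web server for D3 development: run this from the folder that
# contains index.html, then open http://localhost:8000 in a browser.
from http.server import HTTPServer, SimpleHTTPRequestHandler

server = HTTPServer(("localhost", 8000), SimpleHTTPRequestHandler)
print("Serving on http://localhost:8000 (Ctrl+C to stop)")
server.serve_forever()
```

The equivalent one-liner, `python -m http.server 8000`, does the same job.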

Continue reading “D3 Canvas Setup with 0 Coding Skills. Data Visualisation from Scratch P.1”

A recipe proved to solve every* VM’s proxy problems

Summary: Intro | Obtaining the proxy server address | Configuring proxy settings – checklist | Other tools

*Fedora / RHEL / CentOS

At first I wanted to title this post Welcome to Proxy Hell, because getting the proxy settings right on a VM can, at least at first, feel like a nightmare. Especially if you have no idea where to start or, more depressingly, when none of your attempts to fix the problem seem to work. If you work in an office, you have almost inevitably come across proxies. It has become standard for companies to guard their network traffic with a proxy server. The idea is that the server acts as an intermediary between the private company network and the internet, which both hides the web traffic from outside eyes and can serve as a base for implementing access authentication and bandwidth control.

Perhaps the first time you consciously acknowledge the presence of a proxy is when your browser’s homepage, instead of directing you to Google.co.uk, goes to Google.in or Google.pl, and the Search button shows up in a different language than expected. That’s your proxy server’s location that has just fooled Google. The second time you come across proxies is less amusing: it’s when you start working with, or worse, configuring virtual environments and realise that even basic tasks, like accessing a webpage or installing a package, don’t work. For instance, if you had followed the Hadoop clustering guide from my last post in an office environment, you wouldn’t have been able to get most of it working without setting up a proxy. Yet the guide conveniently skips that topic with a vague warning: make sure you’re not behind a proxy. So, what to do if you were? Continue reading “A recipe proved to solve every* VM’s proxy problems”