Articles in category Data Science

Making an Interactive Line Chart in D3.js v.5

Static graphs are a big improvement over no graphs, but we can all agree that static information is not particularly engaging. On the web there is no presenter to talk over a picture. It is the role of a visualisation to grab the reader’s attention and get its point across. Making a graph interactive is a good step towards making it easier to understand. This post is an addendum to the previous tutorial on how to make a line chart. It will explore two techniques for making the previous project interactive.

Making a Line Chart in D3.js v.5

The time has come to step up our game and create a line chart from scratch. And not just any line chart: a multi-series graph that can accommodate any number of lines. Besides handling multiple lines, we will work with time and linear scales, axes, and labels – or rather, have them work for us. There is plenty to do, so I suggest you fire up your D3 server and let’s get cracking.

Advanced Bar Chart in D3.js v.5

Or should I say, more advanced than the construction from the previous post? This part of the tutorial will cover scales and axes. Let the fun begin!

Simple Bar Chart in D3.js v.5

This is actually happening! I’ve pulled myself together (the key to more time is less Netflix, people) and written up a couple of examples in D3.js version 5 (yes, version 5!) that should get people started in the transition over to the tricky number 5. The guide assumes that you have some basics in D3 (you have an idea about SVG, DOM, HTML, and CSS), or better yet that you come from an earlier version. In this chapter we’ll create a simple bar chart. The objectives of the day are: loading data from a CSV, setting up the data format, and drawing the data. As basic as that! Next time we will tackle scales and grids.

Make sure to check out my library for more fun examples!

New blocks on bl.ocks

You can now take a look at my D3 projects on bl.ocks.org! I will soon add some more visualisations in version 5 of the D3 library.

Merging Historical Maps in D3.js v.5

Spoiling you as usual, I have another exciting D3 example for today: merging historical maps! I’ve been meaning to cover this topic ever since I developed a similar project for my Master’s thesis 3 years ago. Merging maps is a worthy challenge for every D3 enthusiast as it requires a number of things to be aligned: the data format should be compatible with D3.js, and the maps should be drawn in the same projection and cover the same time period, as country or regional boundaries are far from static. I will demonstrate the idea by mashing up two maps: a digitised map of the Second Polish Republic from 1934 with European boundaries from 1939.

Creating fun shapes in D3.js

I’m happy to announce that more SVG fun is coming! I’ve been blown away by the stats on my previous D3-related posts and it really motivated me to keep going with this series. I fell in love with D3.js for the way it transforms storytelling. I want to get better with advanced D3 graphics so I figured I would start by getting the basics right. So today you will see me doodling around with some basic SVG elements. The goal is to create a canvas and add onto it a rectangle, a line, and a radial shape.

Drawing radial shapes in D3.js

This post demystifies one of the most feared vector functions available in D3.js: the radial line, or d3.radialLine(). Radial lines are constructed with only two attributes: an angle and a radius. The product of the function is a line, but unlike the basic line function, there are no x and y co-ordinates. I fundamentally misunderstood the radial line logic the first time I used it – in fact I had to bring in my boyfriend one late Thursday evening to help me get it right. This guide should help you avoid my mistakes.
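To make the angle-and-radius idea concrete, here is a minimal sketch in plain Python (not D3 itself) of how a radial point maps to the x and y position that ends up on screen. It assumes the convention D3 documents for radial lines: the angle is in radians, 0 points to 12 o'clock, and positive angles run clockwise. The name `point_radial` is made up for this illustration.

```python
import math


def point_radial(angle, radius):
    """Convert an (angle, radius) pair to SVG-style (x, y) coordinates.

    Assumed convention: angle in radians, 0 at 12 o'clock, positive
    angles clockwise, and y growing downwards as it does in SVG.
    """
    x = radius * math.sin(angle)
    y = -radius * math.cos(angle)
    return x, y


# A quarter turn at a constant radius of 10: straight up, then to the right.
for angle in (0, math.pi / 4, math.pi / 2):
    print(point_radial(angle, 10))
```

Thinking of the angle as a clock hand rather than the usual mathematical angle (counter-clockwise from the x axis) is the part that is easiest to get wrong.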

Recordings: Data Security Webinar Series for Protegrity

A couple of posts back I shared a sneak peek of my webinar on data security for data monetization initiatives. That was one of many sessions we ran for APAC as Protegrity entered that market in 2018. All sessions were recorded and now I’m happy to say we’ve published excerpts of those videos on YouTube. There are bits on Big Data protection, best security practices for analytical workflows, sessions on hybrid- and multi-cloud environments, as well as spotlights on specific technologies (AWS S3, OneDrive, Salesforce, Elastic Map Reduce, Tableau). The videos are 2 to 7 minutes long and go straight to the demo. I do sympathise with those of you who don’t have the time or patience to sit through a one-hour webinar – I hope you like this compact format. Check out the playlist below and get to know the superpowers of Protegrity tech!

Recording: Introduction to Hadoop for Glolent community meetup

Last Thursday evening I had the opportunity to talk about Hadoop at a Glolent Global Talent community virtual meetup. Glolent connects remote IT workers across the globe and facilitates skill-sharing sessions that any member can join or present at...

Making a Map in D3.js v.5

A pretty specific title, huh? The versioning is key in this map-making how-to. D3.js version 5 has gotten serious with the Promise class, which resulted in some subtle syntax changes that proved big enough to cause confusion among D3.js old dogs and newcomers alike. This post guides you through creating a simple map in this specific version of the library. If you’d rather dive deeper into the art of making maps in D3, try the classic guides produced by Mike Bostock.

R | Point-in-polygon, a mathematical cookie-cutter

Point-in-polygon is a textbook problem in geographical analysis: given a list of geocoordinates, return those that fall within the boundary of an area. You could feed the algorithm a list of cities across the globe and it will recognise which of them belong to Sri Lanka and which to a completely random shape you drew on planet Earth. It applies to many scenarios: analyses that aren’t based on administrative boundaries, situations in which polygons change over time, or problems that aren’t geographical at all, like computer graphics. Not so long ago, I turned to point-in-polygon to generate a set of towns and villages to plot on a map of Poland from 1933. Such a list has not been made available on the web and I wasn’t super keen on typing out thousands of locations. Instead, I used that mathematical cookie-cutter to extract only those locations from today’s Poland, Ukraine, Belarus, and Russia that were present within the interwar Poland boundaries. In this post I will show how to perform a point-in-polygon analysis in R and possibly automate a significant chunk of data preparation for map visualisations.
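The post itself works in R; purely to illustrate the idea, here is a minimal Python sketch of one common way to answer the point-in-polygon question, the even-odd ray casting rule. It treats coordinates as flat x/y pairs, which is an assumption that only holds when points and polygon share one coordinate system. The function name and the sample data are made up for this example.

```python
def point_in_polygon(point, polygon):
    """Even-odd ray casting: count how many polygon edges a horizontal
    ray starting at the point crosses. An odd count means inside.

    point   -- an (x, y) tuple
    polygon -- a list of (x, y) vertices, in order, without repeating the first
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x position where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical boundary and locations: keep only the points inside the square.
boundary = [(0, 0), (4, 0), (4, 4), (0, 4)]
locations = {"A": (2, 2), "B": (5, 2), "C": (1, 3)}
kept = [name for name, xy in locations.items() if point_in_polygon(xy, boundary)]
print(kept)
```

Points lying exactly on an edge are a known grey area of this rule, so a real analysis would normally lean on a tested spatial library instead.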

Changing dataset projection with OGR2OGR

Previously we used OGR2OGR to extract a couple of features from a large geographical dataset. OGR2OGR can do so much more – today we’ll look at its reprojecting capabilities. Reprojection is a mathematical translation of a dataset’s coordinate reference system to another one, like Albers to Mercator. Sometimes the geographical data we receive has to be reprojected to conform to our other datasets before further use. My first encounter with mismatching projections came about while working on my Master’s thesis project. I struggled with a point-in-polygon function that was supposed to filter a set of points based on a geographical boundary and stubbornly just wouldn’t return anything. I soon found out that my map was digitised in the EPSG:3857 projection (AKA Web Mercator, used by Google Maps) and my village coordinates used the WGS84 coordinate system. That’s how OGR2OGR and I met.
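To show why the two datasets could never overlap, here is a small Python sketch of the standard spherical Web Mercator formula that turns WGS84 longitude and latitude (in degrees) into EPSG:3857 metres. It is only an illustration of the maths, not a replacement for OGR2OGR, and the function name is made up for this example.

```python
import math

# Radius of the sphere used by Web Mercator (the WGS84 semi-major axis), in metres.
EARTH_RADIUS_M = 6378137.0


def wgs84_to_web_mercator(lon_deg, lat_deg):
    """Project WGS84 degrees to EPSG:3857 metres (spherical Mercator)."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


# A point given in degrees ends up millions of metres away from the origin,
# so degrees and metres cannot be compared directly.
print(wgs84_to_web_mercator(21.0, 52.2))
```

A point-in-polygon test that mixes the two systems compares values of around 21 and 52 with values in the millions, which is why nothing is ever returned.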

Extracting countries from GeoJSON with OGR2OGR

More often than not, geographical data visualisation is performed on a single country or a cluster of countries rather than on all 195 of them. Just as typically, acquired datasets have more features than what’s needed for the analysis. While D3.js allows for filtering the datasets so that we have full control over the visualisation’s output, the size of the original datasets can slow down your website load times. To reduce this impact, datasets can be cropped beforehand. This post will explain how to shrink a standard Eurostat geographical dataset to just a handful of countries with OGR2OGR.

D3.js v5: Promise syntax & examples

Here is my take on promises, the latest addition to D3.js syntax. The other week I attempted to brush up on my D3.js skills and got stuck at the most basic task of printing CSV data on my HTML webpage. This is how I learnt that version 5 of D3.js substituted asynchronous callbacks with promises – and irreversibly changed the way we work with data sets. Getting your head around promises can take time, especially if you – like me – aren’t a JavaScript programming pro. In this post I’ll share my lessons learnt and provide some guidance for the ones lost in the world of promises.

Automate the Boring Stuff with CMD

25 years after being introduced to the Windows toolkit, CMD still has it. This post collects a couple of everyday file manipulation scenarios that can be accomplished with the command-line interpreter.

The Windows command prompt is a command-line interface...

GDPR in 10 Steps: a Guide for Small Businesses

By now every business owner in Europe would have heard about GDPR: if it didn’t hit them in the news or through social circles, the swarm of pop-ups and emails announcing policy updates would have been telling enough. GDPR awareness might be mainstream, but it comes a tad too late to believe its practice is correspondingly widespread. Timing aside, putting GDPR into action proves confusing as the regulators provide little guidance on GDPR’s practical application. Among the most puzzled are small companies. GDPR dictates they bear the same responsibilities as governments or corporations, pressuring them to make do with less subject-matter knowledge and a smaller budget for the lawyers to get their heads round the regulation.

This checklist summarises the principles behind GDPR from which each business can derive its data protection strategy. I should note that I am not a lawyer but a data security consultant: nevertheless it is my belief that abiding by these principles should guarantee that a business operates legally and securely.

Interview for Przegląd: Our Data is a Commodity as Any Other

Earlier this month I had a chance to speak with the Polish magazine Przegląd about today’s data economy, the evolution of marketing over the last couple of decades, from the invention of the database to machine learning, and how it all relates to the Cambridge Analytica scandal from March. The article is available in Polish on Przegląd’s website (behind a paywall) and is loosely based on an article I previously published in English on the blog.

Data Science Conference: Lands of Loyal 2018

I’m just back from Scotland where – besides having a lovely time hanging out with friends – I attended the annual Lands of Loyal (LOL) Data Science conference in Alyth. The event is an informal get-together for the Data Science & Business Intelligence graduates from the University of Dundee. This time my talk was among those selected for the day: I decided to rework my last blog post on Personally Identifiable Data (and not!) into a 20-minute presentation and packed it with questions (some of which I cannot answer) regarding our identity on the web.

We are data points: identity on the web post-Cambridge Analytica scandal

Summary: Protected by law (*when there is a law) | Many faces of PII | Here be dragons: data outside the PII realm

There is a silver lining in the Cambridge Analytica + Facebook scandal in that it started a debate about our privacy rights online. Our virtual house was invaded: the government came in and took our identities away. Putting aside the question of whether it was us who invited the aggressor*, today we will examine the core of the scandal: the idea of identity on the web. What is it exactly that bugs us about this case? What is it that we are standing for by deleting Facebook? To channel our outrage, let’s review what constitutes personal data in the light of the law, what slipped past regulation, and whether our online footprint should have us worried.

Right to Explanation: a Right that Never Was (in GDPR)

The conversation around the Right to Explanation reminded me of the Mandela Effect. Just as Mandela’s death is believed by many to have happened before his real time of death, the Right to Explanation is falsely attributed to GDPR’s collection of laws. An offshoot from early GDPR conversations, the rule has now developed its own literature on the internet. Posts suggesting that the law threatens Artificial Intelligence have flooded Google, while uncertainty-fueled paranoia has taken over LinkedIn. Is it internet misinformation at its finest or is there more to the discussion? I suggest we review what a Right to Explanation is and why an absent law is causing such a stir on the world wide web.

Geek Christmas: 10 best books to entertain a Data Scientist

Oh, the yearly Christmas epiphany: shopping for people we’d like to think we know is hard. To help those looking for a perfect gift for their Data Scientist friend, or those wanting to indulge themselves with a good read, here is a roundup of books that would let no data geek down. The selection is subjective: I deliberately left out the classics, and focused on less-obvious choices that are guaranteed to entertain and enlighten.

Colourful nonsense: what does your data visualisation actually say?

The democratisation of data visualisation tools brought us two major advancements: we can make great analytical products, faster. We can deceive more easily, too.

Previously a domain ruled by statisticians and IT departments, analytics has now opened up to anyone with a laptop. For marketers and managers alone, BI apps such as Tableau, QlikView, or MS Excel have become a commodity. Tools have matured too: programming fluency has been superseded by drag-and-drop interfaces. A visually stunning chart is literally a click away. The software intelligently picks the graph type and the colour scheme for us. For the more ambitious users the adjustment options are plenty, although within the range of pre-programmed configurations. While some of these visual endeavours lead to great analytical products, some result in colourful nonsense.

About managers who love aggregations a bit too much

Summary: Business instinct | When sums add up | Data-driven decision patching

This is a story about companies that like aggregations a bit too much. Data-driven decision making seems to be the new holy grail in management, but can the numbers always be trusted? What is key in data-savvy businesses: the people, the right technology, or – spoiler alert – is it something more fundamental? These questions become particularly urgent in the new economy as failing to embrace data can be a major growth impediment or worse, a death sentence for the business.

Harmful statistics: John Oliver on forensic science

John Oliver goes John Oliver on forensic science and it’s a must-watch. Forensic science brought statistics to the courts, but it is itself only rarely exposed to scientific scrutiny. The justice system’s failure to question forensic methodology and a common fear of scientific jargon have led to innocent people being convicted of crimes they haven’t committed. Courts have given life sentences based solely on bite marks, partial fingerprints, and hair collected at the crime scene, despite there being no proof that these methods are infallible.

The search is long and the goal is elusive: Data Scientist (Desperately) Wanted

Summary: Wanted: Data Scientist | A bird in the hand is worth two in the bush | A little stir | The infallible art of taking steps back

In this article I will look at how organisations can engineer their own Data Science team without losing their minds in the process or spending big money. As more and more companies want to be data-driven, they join the frantic search for the right staff to fuel these initiatives. Finding a data-fluent resource is not easy: according to McKinsey, “by 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.” The hunt is on for a skill set that is still relatively new to the market, and it is only starting to be taught at universities. My belief is that fishing for a Data Scientist ‘superstar’ is often counterproductive and inevitably leads to the realisation that one person cannot do it all. Instead, investing in appropriate training of the current staff can lead to long-lasting benefits for the company.

Taking down the NPS Score: KO by Probability

Summary: Intro | Measuring the importance of gossip | Too good to be true | Arbitrary measures produce arbitrary results | tl;dr

Disclaimer: The probability computation of the article is a recap of a talk delivered by Professor Mark Whitehorn at the University of Dundee in 2015, and at PASS Business Analytics Conference in San Jose, CA in 2014. Opinions expressed are my own.

Aren’t we post-NPS hype yet? Such was my thinking until a random article came up on my feed: as one of its core objectives, a tech giant was planning to improve its Net Promoter Score by 2020. A quick internet search told me there are some companies very excited about increasing their NPS. Google Trends suggests interest in the Net Promoter methodology has been on a steady rise since 2004; a mortal blow to my presumption. There is something problematic about the Net Promoter methodology that I’d like to talk about: on the one hand an indicator of outstanding business delivery, on the other a possibly dangerous framework for workforce assessment. This article decomposes the NPS algorithm, reviews its criticism, and tests its validity from the probability perspective. I have based the scenario and the probability computation on an excellent talk delivered by Professor Mark Whitehorn. If you happen to be a manager, a person whose performance is scored with NPS, you are into probability computations, or you simply like debunking managerial fads, then this is a tale for you.
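For readers who have not met the metric before, here is a minimal Python sketch of the standard Net Promoter Score calculation: respondents rate a company from 0 to 10, those answering 9 or 10 count as promoters, those answering 0 to 6 count as detractors, and the score is the percentage of promoters minus the percentage of detractors. The function name and sample ratings are made up for this illustration; the probability argument from the talk is not reproduced here.

```python
def net_promoter_score(ratings):
    """Return the NPS (between -100 and 100) for a list of 0-10 ratings."""
    if not ratings:
        raise ValueError("at least one rating is required")
    promoters = sum(1 for r in ratings if r >= 9)   # ratings of 9 or 10
    detractors = sum(1 for r in ratings if r <= 6)  # ratings of 0 to 6
    # Ratings of 7 and 8 ("passives") only affect the denominator.
    return 100 * (promoters - detractors) / len(ratings)


# Hypothetical survey: three promoters and one detractor out of four answers.
print(net_promoter_score([10, 10, 10, 5]))  # prints 50.0
```

Note how a 6 and a 0 weigh exactly the same, while a 7 or an 8 never helps the score: the cut-offs do a lot of the work.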

It's gonna be #phun

I will be at the Oracle Code conference on the 6th of June in Brussels, redefining fun with Python on Oracle database. The idea is to live demo working with tables, handling JSON files, (new) spatial data queries, data visualisation, and a couple of good practices of database application development with Python.

There will be snakes, ridiculous maps, and misbehaving queries. Come if you’re in town, it’s gonna be #phun

Why is data preparation so hard and are we getting worse at it?

It’s said that the majority of every analytical project’s time is taken up by data preparation tasks. But is this estimate trustworthy?

Butterfly effect: OECD’s data visualisation fail leads to media panic

Summary: Intro | A case of tl;dr | Where was the graph police? | A quick fix

This is a short story about a graph that could have been done better and an article that has gone awry. The Organisation for Economic Co-operation and Development (OECD), one of the most powerful research bodies there is, has published an excellent report on the influence of robotics on the job market, the conclusion of which was misinterpreted in an influential Polish newspaper, Krytyka Polityczna, and in other media. In this post I will analyze both the article and the report, to then theorize on what has gone wrong and who (or what) is to blame.

Your First DIY Hadoop Cluster

Summary: Intro | Linux VM Setup | VM Networking | Extending a Hadoop Cluster

At times I wish I had started my journey with Big Data earlier so that I could enter the market in 2008-2009. Though Hadoopmania is still going strong in IT, these years were a golden era for Hadoop professionals. With any sort of Hadoop experience you could be considered for an £80,000 position. There was such a shortage of Hadoop skills in the job market that even a complete beginner could land a wonderfully overpaid job. Today you can’t just wing it at the interview; the market has matured and there are many talented and qualified people pursuing careers in Big Data. That said, years later, the demand for Hadoop knowledge is still on the rise, making it a profitable career choice for the foreseeable future.

My Computer AKA My First Big Data Machine

Summary: Intro | Virtualisation Software | Cloudera’s QuickStart VM | Importing a VM

In this post, I will introduce Virtual Machines: the core platform of every data scientist. If you, like me, get to experiment with different technologies at work, you are familiar with Virtual Machines. VMs are the best way of testing something out without having to install it on your computer and risking messing up your working environment. In essence, a VM is like a mini (virtual!) computer you put on your computer; that computer has its own environment, like Windows, Linux or macOS, and it would usually come with a bunch of pre-installed and configured tools, so that you don’t have to worry about any (or much) setup. So you might have a Windows machine installed on your actual Windows machine, and while these two share computing resources and space, they are separate instances of Windows. Plus, you can delete or change the virtual machine as you please and have as many as you like; by definition, this has no impact on your original working environment.