Point-in-polygon is a textbook problem in geographical analysis: given a list of geocoordinates, return those that fall within the boundary of an area. You could feed the algorithm a list of all the world’s cities and it will recognise which of them belong to Sri Lanka and which to a completely random shape you drew on planet Earth. It applies to many scenarios: analyses that aren’t based on administrative boundaries, situations in which polygons change over time, or problems that aren’t geographical at all, like computer graphics. Not so long ago, I turned to point-in-polygon to generate a set of towns and villages to plot on a map of Poland from 1933. Such a list has not been made available on the web and I wasn’t super keen on typing out thousands of locations. Instead, I used that mathematical cookie-cutter to extract only those locations from today’s Poland, Ukraine, Belarus, and Russia that lay within the interwar Poland boundaries. In this post I will show how to perform a point-in-polygon analysis in R and possibly automate a significant chunk of data preparation for map visualisations.
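To make the cookie-cutter idea concrete, here is a minimal sketch of one common way to test a point against a polygon: the ray-casting (even-odd) rule. It is written in plain Python rather than R and is not necessarily the method the full post uses. The function names, the square boundary, and the three towns are all made up for illustration, and coordinates are treated as flat x/y values, which is only a reasonable simplification for small areas.

```python
def point_in_polygon(point, polygon):
    """Return True if point (x, y) lies inside polygon, a list of (x, y) vertices.

    Ray casting: count how many polygon edges a horizontal ray going
    right from the point crosses. An odd count means the point is inside.
    Points lying exactly on the boundary may land on either side.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]  # wrap around to close the ring
        # Only edges that straddle the horizontal line through the point matter.
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical data: a square "boundary" and three made-up towns.
boundary = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
towns = {"A": (1.0, 1.0), "B": (5.0, 2.0), "C": (3.5, 3.9)}

inside_names = [name for name, xy in towns.items()
                if point_in_polygon(xy, boundary)]
```

With this toy data, towns A and C fall inside the square and B is cut away. A real boundary would be a much longer vertex list read from a shapefile or GeoJSON file.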
Previously we used OGR2OGR to extract a couple of features from a large geographical dataset. OGR2OGR can do much more, and today we’ll look at its reprojection capabilities. Reprojection is a mathematical transformation of a dataset’s coordinates from one coordinate reference system to another, like Albers to Mercator. Sometimes the geographical data we receive has to be reprojected to conform to our other datasets before further use. My first encounter with mismatching projections came about while working on my master’s thesis project. I struggled with a point-in-polygon function that was supposed to filter a set of points based on a geographical boundary and stubbornly just wouldn’t return anything. I soon found out that my map had been digitised in the EPSG:3857 projection (AKA Web Mercator, used by Google Maps) while my village coordinates used the WGS84 coordinate system. That’s how OGR2OGR and I met.
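The mismatch described above is easier to see with numbers. The sketch below is not what OGR2OGR does internally and is not taken from the post. It only applies the standard spherical Web Mercator formulas to convert WGS84 longitude and latitude (in degrees) into EPSG:3857 metres. The function name and the approximate coordinates for Warsaw are assumptions for illustration.

```python
import math

# Radius of the sphere used by Web Mercator (the WGS84 semi-major axis), in metres.
EARTH_RADIUS_M = 6378137.0


def wgs84_to_web_mercator(lon_deg, lat_deg):
    """Convert WGS84 longitude/latitude in degrees to EPSG:3857 x/y in metres."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(
        math.tan(math.pi / 4 + math.radians(lat_deg) / 2)
    )
    return x, y


# Approximate location of Warsaw, used purely as an example.
lon, lat = 21.01, 52.23
x, y = wgs84_to_web_mercator(lon, lat)
print(f"WGS84: ({lon}, {lat})  ->  EPSG:3857: ({x:.0f}, {y:.0f})")
```

The same place is a pair of two-digit numbers in degrees and a pair of seven-digit numbers in metres. A point-in-polygon test that compares one against the other will find nothing inside the boundary, which matches the symptom described above.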
Data and analytics are fundamentally changing the way business is done. Make sure you’re ahead of the game by enabling your strategic data monetization initiatives while keeping your customer and commercial data secure.
More often than not, geographical data visualisation is performed on a single country or a cluster of countries rather than on all 195 of them. Just as typically, acquired datasets have more features than are needed for the analysis. While D3.js allows for filtering the datasets so that we have full control over the visualisation’s output, the size of the original datasets can slow down your website’s load times. To reduce this impact, datasets can be cropped beforehand. This post will explain how to shrink a standard Eurostat geographical dataset to just a handful of countries with OGR2OGR.
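To illustrate what cropping a dataset means, here is a rough sketch in plain Python rather than OGR2OGR. It keeps only the features of a GeoJSON FeatureCollection whose country code is on a wanted list. The property name `CNTR_ID`, the function name, and the tiny sample collection are assumptions for illustration, so check the attribute names in your own file before relying on them.

```python
import json


def keep_countries(feature_collection, wanted, key="CNTR_ID"):
    """Return a new FeatureCollection with only the features whose
    properties[key] is in `wanted`. `key` is a hypothetical attribute name."""
    kept = [
        feature
        for feature in feature_collection["features"]
        # GeoJSON allows "properties": null, so fall back to an empty dict.
        if (feature.get("properties") or {}).get(key) in wanted
    ]
    return {"type": "FeatureCollection", "features": kept}


# Tiny made-up dataset; geometries are left null to keep the example short.
sample = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"CNTR_ID": "PL"}, "geometry": None},
        {"type": "Feature", "properties": {"CNTR_ID": "FR"}, "geometry": None},
        {"type": "Feature", "properties": {"CNTR_ID": "DE"}, "geometry": None},
    ],
}

cropped = keep_countries(sample, wanted={"PL", "DE"})
kept_codes = [f["properties"]["CNTR_ID"] for f in cropped["features"]]
cropped_json = json.dumps(cropped)  # smaller payload to ship to the browser
```

Doing the crop once, ahead of time, means the browser downloads only the countries that end up on the map instead of filtering the full file on every page load.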
Protegrity webinar: Building secure analytical models – Registration Link
Wednesday 6th August, 10AM CET / 4PM Singapore time
The replay will be available shortly after the session.
25 years after being introduced to the Windows toolkit, CMD still has it. This post collects a couple of everyday file manipulation scenarios that can be accomplished with the command-line interpreter.
The Windows command prompt is a command-line interface for file and process management on Windows. A big deal in the 90s, today the tool is not overwhelmingly popular among data scientists, or any Windows users for that matter. But this old-school tool still proves useful for basic file manipulation. It might not have the capabilities of, say, Python, but when you cannot use a programming language or you are looking for a challenge, CMD will always be there for you. Recently, I helped a friend automate some tedious copy-and-paste operations and reduced his workload by days. Our collaboration is documented below, along with some code snippets.
*The title is a reference to the awesome Automate the Boring Stuff with Python by Al Sweigart.
By now every business owner in Europe will have heard about GDPR: if it didn’t reach them in the news or through social circles, the swarm of pop-ups and emails announcing policy updates would have been telling enough. GDPR awareness might be mainstream, but it comes a tad too late to believe its practice is correspondingly widespread. Timing aside, putting GDPR into action proves confusing, as the regulators provide little guidance on its practical application. Among the most puzzled are small companies. GDPR dictates that they bear the same responsibilities as governments or corporations, pressuring them to make do with less subject-matter knowledge and a smaller budget for the lawyers to get their heads round the regulation.
This checklist summarises the principles behind GDPR from which each business can derive its data protection strategy. I should note that I am not a lawyer but a data security consultant; nevertheless, it is my belief that abiding by these principles should help a business operate legally and securely.
Earlier this month I had a chance to speak with the Polish magazine Przegląd about today’s data economy, the evolution of marketing over the last couple of decades, from the invention of the database to machine learning, and how it all relates to the Cambridge Analytica scandal from March. The article, loosely based on one I previously published in English on the blog, is available in Polish on Przegląd’s website (behind a paywall).
I’m just back from Scotland where, besides having a lovely time hanging out with friends, I attended the annual Lands of Loyal (LOL) Data Science conference in Alyth. The event is an informal get-together for the Data Science & Business Intelligence graduates of the University of Dundee. This time my talk was among those selected for the day: I decided to rework my last blog post on Personally Identifiable Data (and not!) into a 20-minute presentation and packed it with questions (some of which I cannot answer) regarding our identity on the web.