Still using Go/Ruby/Python/Node to do IP geolocation? Time to think about how PostgreSQL can take the load for you! (N.B. probably only relevant if you use Postgres in your stack, of course…)

Firstly, IP Geolocation is the process of finding the geographical location (i.e. where in the world) an IP address is allocated. Typically if a user is accessing your service from an IP address that geolocates to “Camden, London, UK” you can be reasonably confident that the user is in the Camden area of London.

Secondly, IP Geolocation is not 100% reliable. IP address allocations “move around”, so it’s important to treat the results with a small pinch of salt. If you try to resolve to a more granular level (e.g. city instead of country) you will introduce more uncertainty. Results can also become stale quickly if you do not keep your source data up to date.

What are we going to do?

To demonstrate IP Geolocation we are going to create some tables to (a) store and update IP address ranges and (b) store geographical reference data.

We can then write a query that takes an IP address as a parameter and gives you the country code it’s allocated to. We will also look at an update process to keep our IP address data up to date.

We are going to focus on Country level resolution, but it is fairly straightforward to take these ideas and apply them to City level data.

Getting Started

We can use PostgreSQL to geolocate IP addresses thanks to the hard work of others:

  • ip4r is a PostgreSQL extension by RhodiumToad that makes querying and indexing IP addresses very easy. Despite the name, it also supports IPv6.

  • GeoLite2 by MaxMind is a free IP geolocation database that supports city and country resolution. This is the de facto IP reference data. At the start of 2019 the original GeoLite data was sunset and is now legacy, so we need to work off the new GeoLite2 structure.

  • GeoNames by … GeoNames - a Creative Commons licensed geographical database (where places are in the world). Since the move to GeoLite2, MaxMind includes a GeoNames identifier which you need to cross-reference to work out the city/country.

The first step is to download and install the ip4r Postgres extension. It is distributed as source code, so you need to run make and make install for the extension; typically the latter will require root privileges.

Activating the ip4r Extension

Because we need to create a new table with field types provided by this new extension, we need to make sure it’s activated in our database. So, with a postgres user role that has permissions to create an extension, select your database and run:

create extension if not exists ip4r;

Assuming this ran without issue, we’re ready to set up our tables.

Table Setup

We need a table to store our geographical reference data (from GeoNames, but Maxmind kindly package it up in their data, too) and then a table to store our IP addresses. We’re going to create 2 tables for our IP data so we can bulk load our raw data and only update rows that are new or have been updated.

create table if not exists ref_geonames_countries (
    geoname_id bigint primary key,
    locale_code varchar,
    continent_code varchar,
    continent_name varchar,
    country_iso_code varchar,
    country_name varchar,
    is_in_european_union boolean  -- boo Brexit (my view only)
);

These column headings lazily reference the CSV provided by MaxMind because it’s easier to bulk load that way. Note this provides names in English - if you want regionalised/local names or names in other languages you’ll need to source the data directly from GeoNames.
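To make the mapping concrete, here’s a quick sketch of what a row in that locations CSV looks like when parsed with Python’s csv module. The header matches the table columns above; the data row is an illustrative sample, not pulled from a live download:

```python
import csv
import io

# Header mirrors the ref_geonames_countries columns; the data row is a
# made-up sample in the GeoLite2-Country-Locations-en.csv layout.
sample = (
    "geoname_id,locale_code,continent_code,continent_name,"
    "country_iso_code,country_name,is_in_european_union\n"
    "2635167,en,EU,Europe,GB,United Kingdom,0\n"
)

row = next(csv.DictReader(io.StringIO(sample)))
print(row["country_iso_code"], row["country_name"])  # GB United Kingdom
```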

-- Data gets bulk loaded here
create table if not exists ref_ip_blocks_tmp (
  iprange iprange,
  geonameid int,
  registered_country_geoname_id int,
  represented_country_geoname_id int,
  is_anonymous_proxy bool,
  is_satellite_provider bool
);


-- Incrementally update this table
create table if not exists ref_ip_blocks (
   iprange        iprange primary key,
   geonameid      integer
);
create index ref_ip_blocks_ip4r_idx on ref_ip_blocks using gist(iprange);

The iprange column is unique by the nature of it being a primary key. This is good for us as we want to use the ON CONFLICT upsert feature in Postgres to insert new rows or update changed rows when we refresh our reference data.
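If ON CONFLICT is new to you, here’s a minimal sketch of the upsert behaviour. It uses SQLite via Python’s stdlib purely because SQLite (3.24+) shares Postgres’s ON CONFLICT … DO UPDATE syntax and runs anywhere; the table is a toy stand-in for ref_ip_blocks:

```python
import sqlite3

# Toy stand-in for ref_ip_blocks (requires SQLite >= 3.24, which any
# recent Python bundles).
conn = sqlite3.connect(":memory:")
conn.execute("create table blocks (iprange text primary key, geonameid integer)")

conn.execute("insert into blocks values ('10.0.0.0/8', 111)")
# Loading the same range again updates the row instead of erroring.
conn.execute(
    "insert into blocks values ('10.0.0.0/8', 222) "
    "on conflict (iprange) do update set geonameid = excluded.geonameid"
)

rows = conn.execute("select iprange, geonameid from blocks").fetchall()
print(rows)  # [('10.0.0.0/8', 222)]
```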

Getting Data

All the reference data we need comes courtesy of MaxMind. We’re going to download and unzip it:

wget ""
unzip -o

We’ll now have the following CSV files:

  • GeoLite2-Country-Locations-en.csv
  • GeoLite2-Country-Blocks-IPv4.csv
  • GeoLite2-Country-Blocks-IPv6.csv
These three files are all we need to load up our reference data and get geolocating.

Loading Data

First up we’ll load the GeoNames reference data. We haven’t bothered with any upserting here. From a psql session in your relevant database:

truncate table ref_geonames_countries;
copy ref_geonames_countries from '/path/to/CSV/Files/GeoLite2-Country-Locations-en.csv' HEADER CSV DELIMITER E',';

If you’re doing this on a system that is continuously online you’ll want to do it in a more transactionally safe manner (i.e. temporary tables, then a drop/rename or truncate/insert). Note that COPY FROM appends data, so we want to clear it down first.

Next up is our IP reference data, which is a tad more complex:

truncate table ref_ip_blocks_tmp;
copy ref_ip_blocks_tmp FROM '/path/to/CSV/Files/GeoLite2-Country-Blocks-IPv4.csv' HEADER CSV DELIMITER E',';
copy ref_ip_blocks_tmp FROM '/path/to/CSV/Files/GeoLite2-Country-Blocks-IPv6.csv' HEADER CSV DELIMITER E',';

-- ref_ip_blocks_tmp now contains a clean, up to date set of IPv4 and IPv6 data

insert into ref_ip_blocks (iprange, geonameid)
select iprange, geonameid::integer
from ref_ip_blocks_tmp
where geonameid != 0
on conflict (iprange)
    do update set geonameid = excluded.geonameid;

This effectively adds and refreshes the country code allocations for IP ranges. To take a look at how many allocations there are per country, you can run this query:

select geonameid, country_name, count(*) as count
from ref_ip_blocks
    join ref_geonames_countries on geonameid = geoname_id
group by geonameid, country_name
order by count desc;

The top of the table is pretty predictable:

| geonameid | country_name   | count |
|-----------|----------------|-------|
|   6252001 | United States  | 99934 |
|   3017382 | France         | 28527 |
|   2635167 | United Kingdom | 25639 |
|   2921044 | Germany        | 23590 |
|   6251999 | Canada         | 17149 |
|   2750405 | Netherlands    | 16746 |

The tail probably isn’t surprising, but still interesting to look at:

| geonameid | country_name                   | count |
|-----------|--------------------------------|-------|
|   1873107 | North Korea                    |     8 |
|   4031074 | Tokelau                        |     7 |
|   2081918 | Federated States of Micronesia |     7 |
|   4034749 | Wallis and Futuna              |     7 |
|   2078138 | Christmas Island               |     6 |

For the data I’ve used, there are unfortunately around 10,000 ranges that do not have a geonameid attached to them. We can see the country they are registered in by looking at the ref_ip_blocks_tmp table. I’m not familiar enough with IP addressing to understand why this is.

Now, Let’s Geolocate an IP

Now we can run a join query to pull the country name/code of a given IP address. For this I picked a random Facebook server IP via Google:

select iprange, country_name, country_iso_code
from ref_ip_blocks
    join ref_geonames_countries on geonameid = geoname_id
where iprange >>= ''::ip4r;

What on Earth is >>=, you might ask? It’s a CIDR/network operator in core Postgres that means “contains or equals”. In this context we’re asking if the iprange value CONTAINS the IP.

And hey presto:

| iprange        | country_name  | country_iso_code |
|----------------|---------------|------------------|
|                | United States | US               |

I’d advise sticking to the 2-character ISO code when geolocating rows; it’s more compact and can be almost universally joined onto other reference data. There is a 3-character ISO code, but I’ve personally always worked with 2 chars.
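As a mental model for >>=, the same “contains or equals” check can be expressed with Python’s stdlib ipaddress module (the addresses below are from the reserved TEST-NET documentation ranges, not real geolocation data):

```python
import ipaddress

def contains_or_equals(cidr: str, ip: str) -> bool:
    """Rough Python analogue of >>=: does the range contain the address?"""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr)

# Addresses from the RFC 5737 documentation ranges, purely illustrative.
print(contains_or_equals("192.0.2.0/24", "192.0.2.44"))    # True
print(contains_or_equals("192.0.2.0/24", "198.51.100.7"))  # False
```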

A typical use case here will be batch geolocating a table of data with IP addresses in it. You can do an UPDATE FROM in Postgres to join the relevant tables:

update my_table_with_ips
set country_iso_code = ref_geonames_countries.country_iso_code
from ref_ip_blocks
    inner join ref_geonames_countries on geonameid = geoname_id
where iprange >>= my_table_with_ips.ip_address_column::ip4r
    and my_table_with_ips.country_iso_code is null;

Staying up to date

If you’re relying on reasonably accurate geolocation, you’d do well to put a scheduled job in place NOW that keeps your reference data up to date. Maxmind update the CSV file on the First Tuesday of every month. If you have historic data to geolocate, you’re probably out of luck as (fairly) Maxmind don’t publicise archives of this free data.