Still using Go/Ruby/Python/Node to do IP geolocation? Time to think about how PostgreSQL can take the load for you! (N.B. probably only relevant if you use Postgres in your stack, of course…)
Firstly, IP Geolocation is the process of finding the geographical location (i.e. where in the world) an IP address is allocated. Typically if a user is accessing your service from an IP address that geolocates to “Camden, London, UK” you can be reasonably confident that the user is in the Camden area of London.
Secondly, IP Geolocation is not 100% reliable. IP address allocations “move around”, so it’s important to treat the results with a small pinch of salt. If you try to resolve to a more granular location (e.g. city instead of country) you will introduce more uncertainty. Results can also become out of date quickly if you do not keep your source data up to date.
What are we going to do?
To demonstrate IP Geolocation we are going to create some tables to (a) store and update IP address ranges and (b) store geographical reference data.
We can then write a query that takes an IP address as a parameter and gives you the country code it’s allocated to. We will also look at an update process to keep our IP address data up to date.
We are going to focus on Country level resolution, but it is fairly straightforward to take these ideas and apply them to City level data.
We can use PostgreSQL to geolocate IP addresses thanks to the hard work of others:
- ip4r is a PostgreSQL extension by RhodiumToad that makes querying and indexing of IP addresses very easy. Despite the name, it also supports IPv6.
- GeoLite2 by MaxMind is a free IP geolocation database that supports city and country resolution. This is the de facto IP reference data. At the start of 2019 the original GeoLite data was sunset and is now legacy, so we need to work off the newer GeoLite2 data.
- GeoNames by … GeoNames - a Creative Commons licensed geographical database (where places are in the world). Since the move to GeoLite2, MaxMind includes a GeoNames identifier which you need to cross-reference to work out the city/country.
The first step is to download and install the ip4r Postgres extension. It is distributed as source code, so you need to make and make install the extension; typically the latter will require root privileges.
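A sketch of the build, assuming you have the Postgres development headers installed and pg_config on your PATH (ip4r uses the standard PGXS build machinery):

```
git clone https://github.com/RhodiumToad/ip4r.git
cd ip4r
make
sudo make install   # copies the extension files into Postgres's extension directory
```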
Activating the ip4r Extension
Because we need to create a new table with field types provided by this new extension, we need to make sure it’s activated in our database. So, with a postgres user role that has permissions to create an extension, select your database and run:
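```sql
CREATE EXTENSION ip4r;
```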
Assuming this ran without issue, we’re ready to setup our tables.
We need a table to store our geographical reference data (from GeoNames, but Maxmind kindly package it up in their data, too) and then a table to store our IP addresses. We’re going to create 2 tables for our IP data so we can bulk load our raw data and only update rows that are new or have been updated.
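A sketch of the geographical reference table. The table name ref_geonames is my choice; the columns mirror the headings of MaxMind's GeoLite2-Country-Locations-en.csv at the time of writing, so verify them against your own download:

```sql
-- Geographical reference data (GeoNames rows, as packaged by MaxMind).
CREATE TABLE ref_geonames (
    geonameid            integer PRIMARY KEY,
    locale_code          text,
    continent_code       text,
    continent_name       text,
    country_iso_code     text,
    country_name         text,
    is_in_european_union boolean
);
```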
These column headings lazily reference the CSV provided by Maxmind because it’s easier to bulk load that way. Note this provides names in English - if you want regionalised/local names or names in other languages you’ll need to source the data directly from GeoNames.
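And a sketch of the two IP tables: a staging table mirroring the Blocks CSVs (the post refers to it as ref_ip_blocks_tmp), and a final lookup table keyed on ip4r's iprange type (the name ref_ip_blocks and the decision to keep only the geoname_id are my choices):

```sql
-- Staging table: columns mirror the GeoLite2-Country-Blocks-*.csv headings,
-- so we can bulk COPY straight into it.
CREATE TABLE ref_ip_blocks_tmp (
    network                        text,
    geoname_id                     integer,
    registered_country_geoname_id  integer,
    represented_country_geoname_id integer,
    is_anonymous_proxy             boolean,
    is_satellite_provider          boolean
);

-- Lookup table, keyed on the ip4r iprange type.
CREATE TABLE ref_ip_blocks (
    iprange    iprange PRIMARY KEY,
    geoname_id integer
);

-- A GiST index supports the containment operators (>>= etc.) efficiently.
CREATE INDEX ref_ip_blocks_iprange_gist ON ref_ip_blocks USING gist (iprange);
```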
The iprange column is unique by the nature of it being a primary key. This is good for us, as we want to use the ON CONFLICT upsert feature in Postgres to insert new rows or update changed rows when we refresh our reference data.
All the reference data we need comes courtesy of Maxmind. We’re going to download and unzip it:
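A sketch of the download step. Since 2019 MaxMind requires a (free) account, so the permalink below needs your own license key substituted in:

```
curl -o GeoLite2-Country-CSV.zip \
  "https://download.maxmind.com/app/geoip_download?edition_id=GeoLite2-Country-CSV&license_key=YOUR_LICENSE_KEY&suffix=zip"
unzip GeoLite2-Country-CSV.zip
```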
We’ll now have the following CSV files:
```
GeoLite2-Country-Blocks-IPv4.csv
GeoLite2-Country-Blocks-IPv6.csv
GeoLite2-Country-Locations-en.csv
```
These three files are all we need to load up our reference data and get geolocating.
First up, we’ll load the GeoNames reference data. We haven’t bothered with any upserting here. From a psql session in your relevant database:
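A sketch of the load, assuming the hypothetical ref_geonames table name and that the CSV sits in psql's working directory:

```sql
-- COPY appends, so clear the table down first.
TRUNCATE ref_geonames;

-- \copy runs client-side, so no server filesystem access is needed.
\copy ref_geonames FROM 'GeoLite2-Country-Locations-en.csv' WITH (FORMAT csv, HEADER true)
```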
If you’re doing this on a system that is continuously online you’ll want to do it in a more transactionally safe manner (i.e. temporary tables, then a drop/rename or truncate/insert). Note that COPY FROM appends data, so we want to clear it down first.
Next up is our IP reference data, which is a tad more complex:
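A sketch of that load, using the hypothetical table names above (ref_ip_blocks_tmp as the staging table, ref_ip_blocks as the lookup table): bulk load both Blocks CSVs into staging, then upsert into the lookup table via ON CONFLICT.

```sql
-- Refresh the staging table from both CSVs.
TRUNCATE ref_ip_blocks_tmp;
\copy ref_ip_blocks_tmp FROM 'GeoLite2-Country-Blocks-IPv4.csv' WITH (FORMAT csv, HEADER true)
\copy ref_ip_blocks_tmp FROM 'GeoLite2-Country-Blocks-IPv6.csv' WITH (FORMAT csv, HEADER true)

-- Upsert into the lookup table; the CIDR text casts straight to iprange.
-- Rows without a geoname_id are skipped here.
INSERT INTO ref_ip_blocks (iprange, geoname_id)
SELECT network::iprange, geoname_id
FROM ref_ip_blocks_tmp
WHERE geoname_id IS NOT NULL
ON CONFLICT (iprange)
    DO UPDATE SET geoname_id = EXCLUDED.geoname_id;
```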
This effectively adds and refreshes the country code allocations for IP ranges. To take a look at how many allocations per country you can run this query:
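Something along these lines, again assuming the hypothetical ref_ip_blocks/ref_geonames table names:

```sql
SELECT g.geonameid, g.country_name, count(*)
FROM ref_ip_blocks b
JOIN ref_geonames g ON g.geonameid = b.geoname_id
GROUP BY g.geonameid, g.country_name
ORDER BY count(*) DESC;
```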
The top of the table is pretty predictable:
| geonameid | country_name   | count |
|-----------|----------------|-------|
| 6252001   | United States  | 99934 |
| 3017382   | France         | 28527 |
| 2635167   | United Kingdom | 25639 |
| 2921044   | Germany        | 23590 |
| 6251999   | Canada         | 17149 |
| 2750405   | Netherlands    | 16746 |
The tail probably isn’t surprising, but still interesting to look at:
| geonameid | country_name                   | count |
|-----------|--------------------------------|-------|
| 1873107   | North Korea                    | 8     |
| 4031074   | Tokelau                        | 7     |
| 2081918   | Federated States of Micronesia | 7     |
| 4034749   | Wallis and Futuna              | 7     |
| 2078138   | Christmas Island               | 6     |
For the data I’ve used, there are unfortunately around 10,000 ranges that do not have a geonameid attached to them. We can see the country they are registered in by looking at the ref_ip_blocks_tmp table. I’m not familiar enough with IP addressing to understand why this is.
Now, Let’s Geolocate an IP
Now we can run a join query to pull the country name/code of a given IP address. For this I picked a random Facebook server IP via Google:
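A sketch of the lookup, assuming the hypothetical ref_ip_blocks/ref_geonames table names; the concrete address is purely illustrative, and the range it falls into will depend on your data snapshot:

```sql
SELECT b.iprange, g.country_name, g.country_iso_code
FROM ref_ip_blocks b
JOIN ref_geonames g ON g.geonameid = b.geoname_id
WHERE b.iprange >>= '18.104.22.168';
```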
What on Earth is >>=, you might ask? It’s a CIDR/network operator in core Postgres that means “contains or equals”. In this context we’re asking if the iprange value CONTAINS the IP.
And hey presto:
| iprange          | country_name  | country_iso_code |
|------------------|---------------|------------------|
| 18.104.22.168/18 | United States | US               |
I’d advise sticking to the 2 character ISO code when geolocating rows; it’s more compact and can be almost universally joined onto other reference data. There is a 3 character ISO code, but I’ve personally always worked with 2 chars.
A typical use case here will be batch geolocating a table of data with IP addresses in it. You can do an UPDATE … FROM in Postgres to join the relevant tables:
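A sketch of the batch update. The target table events, its text ip column, and its country_iso_code column are all hypothetical, as are the ref_ip_blocks/ref_geonames names; ip4r's ipaddress type handles the cast from text:

```sql
UPDATE events e
SET country_iso_code = g.country_iso_code
FROM ref_ip_blocks b
JOIN ref_geonames g ON g.geonameid = b.geoname_id
WHERE b.iprange >>= e.ip::ipaddress
  AND e.country_iso_code IS NULL;   -- only fill in rows not yet geolocated
```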
Staying up to date
If you’re relying on reasonably accurate geolocation, you’d do well to put a scheduled job in place NOW that keeps your reference data up to date. Maxmind update the CSV files on the first Tuesday of every month. If you have historic data to geolocate, you’re probably out of luck as (fairly enough) Maxmind don’t publicise archives of this free data.
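A sketch of such a job as a crontab entry. The refresh_geoip.sh script is hypothetical - it would re-run the download, COPY, and upsert steps above - and the schedule simply targets the first Wednesday of each month, the day after MaxMind's release:

```
# Run at 03:00 on Wednesdays, but only when the day-of-month is 7 or less,
# i.e. the first Wednesday of the month. (% must be escaped in crontab.)
0 3 * * 3 [ "$(date +\%d)" -le 7 ] && /usr/local/bin/refresh_geoip.sh
```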