Still using Go/Ruby/Python/Node to do IP geolocation? Time to think about how PostgreSQL can take the load for you! (N.B. probably only relevant if you use Postgres in your stack, of course…)
Firstly, IP Geolocation is the process of finding the geographical location (i.e. where in the world) an IP address is allocated. Typically if a user is accessing your service from an IP address that geolocates to “Camden, London, UK” you can be reasonably confident that the user is in the Camden area of London.
Secondly, IP Geolocation is not 100% reliable. IP address allocations “move around” so it’s important to treat the results with a small pinch of salt. If you try and resolve location to a more granular location (i.e. city instead of country) you will introduce more uncertainty. Results can also become out of date quickly if you do not keep your source data up to date.
What are we going to do?
To demonstrate IP Geolocation we are going to create some tables to (a) store and update IP address ranges and (b) store geographical reference data.
We can then write a query that takes an IP address as a parameter and gives you the country code it’s allocated to. We will also look at an update process to keep our IP address data to up to date.
We are going to focus on Country level resolution, but it is fairly straightforward to take these ideas and apply them to City level data.
Getting Started
We can use PostgreSQL to geolocate IP addresses thanks to the hard work of others:
-
ip4r
is a PostgreSQL extension by RhodiumToad that makes querying and indexing of IP addresses very easy. Despite the name it also does support IPv6 -
GeoLite2
by MaxMind is a free IP geolocation database that supports city and country resolution. This is the defacto IP reference data. At the start of 2019 theGeoLite
data was sunset and is now legacy, so we need to work off the newGeoLite2
structure -
GeoNames
by … GeoNames - a creative commons licensed geographical database (where places are in the world). Since the move to GeoLite2, Maxmind includes a GeoNames identifier which you need to cross reference to work out the city/country.
The first step is to download and install the ip4r Postgres extension. It is distrubuted as source code, so you need to make
and make install
the extension; typically the latter will require root privileges.
Activating the ipr4 Extension
Because we need to create a new table with field types provided by this new extension, we need to make sure it’s activated in our database. So, with a postgres user role that has permissions to create an extension, select your database and run:
create extension if not exists ip4r;
Assuming this ran without issue, we’re ready to setup our tables.
Table Setup
We need a table to store our geographical reference data (from GeoNames, but Maxmind kindly package it up in their data, too) and then a table to store our IP addresses. We’re going to create 2 tables for our IP data so we can bulk load our raw data and only update rows that are new or have been updated.
create table if not exists ref_geonames_countries (
geoname_id bigint primary key,
locale_code varchar,
continent_code varchar,
continent_name varchar,
country_iso_code varchar,
country_name varchar,
is_in_european_union boolean -- boo Brexit (my view only)
);
These column headings lazily reference the CSV provided by Maxmind because it’s easier to bulk load that way. Note this provides names in English - if you want regionalised/local names or names in other languages you’ll need to source the data direction from GeoNames.
-- Data gets bulk loaded here
create table if not exists ref_ip_blocks_tmp
(
iprange iprange,
geonameid int,
registered_country_geoname_id int,
represented_country_geoname_id int,
is_anonymous_proxy bool,
is_satellite_provider bool
);
-- Incremently update this table
create table if not exists ref_ip_blocks
(
iprange iprange primary key,
geonameid integer
);
create index ref_ip_blocks_ip4r_idx on ref_ip_blocks using gist(iprange);
The iprange
column is unique by the nature of it being a primary key. This is good for us as we want to use the ON CONFLICT
upsert feature in Postgres to insert new rows or update changes rows when we refresh our reference data.
Getting Data
All the reference data we need comes courtesy Maxmind. We’re going to download and unzip it:
wget "https://geolite.maxmind.com/download/geoip/database/GeoLite2-Country-CSV.zip"
unzip -o GeoLite2-Country-CSV.zip
We’ll now have the following CSV files:
GeoLite2-Country-Blocks-IPv4.csv
GeoLite2-Country-Blocks-IPv6.csv
GeoLite2-Country-Locations-en.csv
These three files are all we need to load up our reference data and get geolocating.
Loading Data
First up we’ll load the GeoNames reference data. We haven’t bothered with any upserting here. From a psql
session in your relevant database:
truncate table ref_geonames_countries;
copy ref_geonames_countries from '/path/to/CSV/Files/GeoLite2-Country-Locations-en.csv' HEADER CSV DELIMITER E',';
If you’re doing this on a system that is continusouly online you’ll want to do it in a more transactionally safe manner (i.e. temporary tables, then a drop/rename or truncate/insert). Note that COPY FROM
appends data, so we want to clear it down first.
Next up is our IP reference data, which is a tad more complex:
truncate table ref_ip_blocks_tmp;
copy ref_ip_blocks_tmp FROM '/path/to/CSV/Files/GeoLite2-Country-Blocks-IPv4.csv' HEADER CSV DELIMITER E',';
copy ref_ip_blocks_tmp FROM '/path/to/CSV/Files/GeoLite2-Country-Blocks-IPv6.csv' HEADER CSV DELIMITER E',';
-- ref_ip_blocks_tmp now contains a clean, up to date set of IPv4 and IPv6 data
insert into ref_ip_blocks (iprange, geonameid)
select
iprange, geonameid::integer
from
ref_ip_blocks_tmp
where
geonameid != 0
on conflict (iprange)
do update set geonameid = excluded.geonameid;
This effectively adds and refreshes the country code allocations for IP ranges. To take a look at how many allocations per country you can run this query:
SELECT
geonameid,
country_name,
count(1)
FROM
ref_ip_blocks
JOIN ref_geonames_countries ON geonameid = geoname_id
GROUP BY
geonameid,
country_name
ORDER BY
count DESC
The top of the table is pretty predictable:
| geonameid | country_name | count |
|-----------|----------------|-------|
| 6252001 | United States | 99934 |
| 3017382 | France | 28527 |
| 2635167 | United Kingdom | 25639 |
| 2921044 | Germany | 23590 |
| 6251999 | Canada | 17149 |
| 2750405 | Netherlands | 16746 |
The tail probably isn’t surprising, but still interesting to look at:
| geonameid | country_name | count |
|-----------|--------------------------------|-------|
| 1873107 | North Korea | 8 |
| 4031074 | Tokelau | 7 |
| 2081918 | Federated States of Micronesia | 7 |
| 4034749 | Wallis and Futuna | 7 |
| 2078138 | Christmas Island | 6 |
For the data I’ve used, there are unfortunately around 10,000 ranges that do not have geonameid
attached to them. We can see the country they are registered in by looking at the ref_ip_blocks_tmp
table. I’m not familiar enough with IP addressing to understand why this is.
Now, Let’s Geolocate an IP
Now we can run a join query to pull the country name/code of a given IP address. For this I picked a random Facebook server IP via Google:
select
iprange,
country_name,
country_iso_code
from
ref_ip_blocks
join ref_geonames_countries on geonameid = geoname_id
where
iprange >>= '69.63.176.0'::ip4r;
What on Earth is >>=
, you might ask? It’s a CIDR/network operator in core Postgres that means “contains or equals”. In this context we’re asking if the iprange
value CONTAINS the IP 69.63.176.0
.
And hey presto:
| iprange | country_name | country_iso_code |
|----------------|---------------|------------------|
| 69.63.128.0/18 | United States | US |
I’d advise sticking to the 2 character ISO code when geolocating rows; it’s more compact and can be almost universally joined onto other reference data. There is a 3 character ISO code, but I’ve personally always worked with 2 chars.
A typical use case here will be batch geolocating a table of data with IP addresses in. You can do an UPDATE FROM
in Postgres to join the relevant tables:
update
my_table_with_ips
set
my_table_with_ips.country_iso_code = country_iso_code
from
ref_ip_blocks
inner join ref_geonames_countries on geonameid = geoname_id
where
and iprange >>= my_table_with_ips.ip_address_column::ip4r
and country_iso_code is null;
Staying up to date
If you’re relying on reasonably accurate geolocation, you’d do well to put a scheduled job in place NOW that keeps your reference data up to date. Maxmind update the CSV file on the First Tuesday of every month. If you have historic data to geolocate, you’re probably out of luck as (fairly) Maxmind don’t publicise archives of this free data.