IP Geolocation in PostgreSQL

Written on 10 May 2019

Still using Go/Ruby/Python/Node to do IP geolocation? Time to think about how PostgreSQL can take the load for you! (N.B. probably only relevant if you use Postgres in your stack, of course…)

Firstly, IP Geolocation is the process of finding the geographical location (i.e. where in the world) an IP address is allocated. Typically if a user is accessing your service from an IP address that geolocates to “Camden, London, UK” you can be reasonably confident that the user is in the Camden area of London.

Secondly, IP Geolocation is not 100% reliable. IP address allocations “move around” so it’s important to treat the results with a small pinch of salt. If you try and resolve location to a more granular location (i.e. city instead of country) you will introduce more uncertainty. Results can also become out of date quickly if you do not keep your source data up to date.

What are we going to do?

To demonstrate IP Geolocation we are going to create some tables to (a) store and update IP address ranges and (b) store geographical reference data.

We can then write a query that takes an IP address as a parameter and gives you the country code it’s allocated to. We will also look at an update process to keep our IP address data to up to date.

We are going to focus on Country level resolution, but it is fairly straightforward to take these ideas and apply them to City level data.

Getting Started

We can use PostgreSQL to geolocate IP addresses thanks to the hard work of others:

  • ip4r is a PostgreSQL extension by RhodiumToad that makes querying and indexing of IP addresses very easy. Despite the name it also does support IPv6

  • GeoLite2 by MaxMind is a free IP geolocation database that supports city and country resolution. This is the defacto IP reference data. At the start of 2019 the GeoLite data was sunset and is now legacy, so we need to work off the new GeoLite2 structure

  • GeoNames by … GeoNames - a creative commons licensed geographical database (where places are in the world). Since the move to GeoLite2, Maxmind includes a GeoNames identifier which you need to cross reference to work out the city/country.

The first step is to download and install the ip4r Postgres extension. It is distrubuted as source code, so you need to make and make install the extension; typically the latter will require root privileges.

Activating the ipr4 Extension

Because we need to create a new table with field types provided by this new extension, we need to make sure it’s activated in our database. So, with a postgres user role that has permissions to create an extension, select your database and run:

create extension if not exists ip4r;

Assuming this ran without issue, we’re ready to setup our tables.

Table Setup

We need a table to store our geographical reference data (from GeoNames, but Maxmind kindly package it up in their data, too) and then a table to store our IP addresses. We’re going to create 2 tables for our IP data so we can bulk load our raw data and only update rows that are new or have been updated.

create table if not exists ref_geonames_countries (
    geoname_id bigint primary key,
    locale_code varchar,
    continent_code varchar,
    continent_name varchar,
    country_iso_code varchar,
    country_name varchar,
    is_in_european_union boolean  -- boo Brexit (my view only)
);

These column headings lazily reference the CSV provided by Maxmind because it’s easier to bulk load that way. Note this provides names in English - if you want regionalised/local names or names in other languages you’ll need to source the data direction from GeoNames.

-- Data gets bulk loaded here
create table if not exists ref_ip_blocks_tmp
(
  iprange iprange,
  geonameid int,
  registered_country_geoname_id int,
  represented_country_geoname_id int,
  is_anonymous_proxy bool,
  is_satellite_provider bool

);

-- Incremently update this table
create table if not exists ref_ip_blocks
(
   iprange        iprange primary key,
   geonameid      integer
);
create index ref_ip_blocks_ip4r_idx on ref_ip_blocks using gist(iprange);

The iprange column is unique by the nature of it being a primary key. This is good for us as we want to use the ON CONFLICT upsert feature in Postgres to insert new rows or update changes rows when we refresh our reference data.

Getting Data

All the reference data we need comes courtesy Maxmind. We’re going to download and unzip it:

wget "https://geolite.maxmind.com/download/geoip/database/GeoLite2-Country-CSV.zip"
unzip -o GeoLite2-Country-CSV.zip

We’ll now have the following CSV files:

GeoLite2-Country-Blocks-IPv4.csv
GeoLite2-Country-Blocks-IPv6.csv
GeoLite2-Country-Locations-en.csv

These three files are all we need to load up our reference data and get geolocating.

Loading Data

First up we’ll load the GeoNames reference data. We haven’t bothered with any upserting here. From a psql session in your relevant database:

truncate table ref_geonames_countries;
copy ref_geonames_countries from '/path/to/CSV/Files/GeoLite2-Country-Locations-en.csv' HEADER CSV DELIMITER E',';

If you’re doing this on a system that is continusouly online you’ll want to do it in a more transactionally safe manner (i.e. temporary tables, then a drop/rename or truncate/insert). Note that COPY FROM appends data, so we want to clear it down first.

Next up is our IP reference data, which is a tad more complex:

truncate table ref_ip_blocks_tmp;
copy ref_ip_blocks_tmp FROM '/path/to/CSV/Files/GeoLite2-Country-Blocks-IPv4.csv' HEADER CSV DELIMITER E',';
copy ref_ip_blocks_tmp FROM '/path/to/CSV/Files/GeoLite2-Country-Blocks-IPv6.csv' HEADER CSV DELIMITER E',';

-- ref_ip_blocks_tmp now contains a clean, up to date set of IPv4 and IPv6 data

insert into ref_ip_blocks (iprange, geonameid)
select
    iprange, geonameid::integer
from
    ref_ip_blocks_tmp
where
    geonameid != 0
on conflict (iprange)
    do  update  set geonameid = excluded.geonameid;

This effectively adds and refreshes the country code allocations for IP ranges. To take a look at how many allocations per country you can run this query:

SELECT
    geonameid,
    country_name,
    count(1)
FROM
    ref_ip_blocks
    JOIN ref_geonames_countries ON geonameid = geoname_id
GROUP BY
    geonameid,
    country_name
ORDER BY
    count DESC

The top of the table is pretty predictable:

| geonameid | country_name   | count |
|-----------|----------------|-------|
|   6252001 | United States  | 99934 |
|   3017382 | France         | 28527 |
|   2635167 | United Kingdom | 25639 |
|   2921044 | Germany        | 23590 |
|   6251999 | Canada         | 17149 |
|   2750405 | Netherlands    | 16746 |

The tail probably isn’t surprising, but still interesting to look at:

| geonameid | country_name                   | count |
|-----------|--------------------------------|-------|
|   1873107 | North Korea                    |     8 |
|   4031074 | Tokelau                        |     7 |
|   2081918 | Federated States of Micronesia |     7 |
|   4034749 | Wallis and Futuna              |     7 |
|   2078138 | Christmas Island               |     6 |

For the data I’ve used, there are unfortunately around 10,000 ranges that do not have geonameid attached to them. We can see the country they are registered in by looking at the ref_ip_blocks_tmp table. I’m not familiar enough with IP addressing to understand why this is.

Now, Let’s Geolocate an IP

Now we can run a join query to pull the country name/code of a given IP address. For this I picked a random Facebook server IP via Google:

select
    iprange,
    country_name,
    country_iso_code
from
    ref_ip_blocks
        join ref_geonames_countries on geonameid = geoname_id
where
    iprange >>= '69.63.176.0'::ip4r;

What on Earth is >>=, you might ask? It’s a CIDR/network operator in core Postgres that means “contains or equals”. In this context we’re asking if the iprange value CONTAINS the IP 69.63.176.0.

And hey presto:

| iprange        | country_name  | country_iso_code |
|----------------|---------------|------------------|
| 69.63.128.0/18 | United States | US               |

I’d advise sticking to the 2 character ISO code when geolocating rows; it’s more compact and can be almost universally joined onto other reference data. There is a 3 character ISO code, but I’ve personally always worked with 2 chars.

A typical use case here will be batch geolocating a table of data with IP addresses in. You can do an UPDATE FROM in Postgres to join the relevant tables:

update
    my_table_with_ips
set
    my_table_with_ips.country_iso_code = country_iso_code
from
    ref_ip_blocks
    inner join ref_geonames_countries on geonameid = geoname_id
where
        and iprange >>= my_table_with_ips.ip_address_column::ip4r
        and country_iso_code is null;

Staying up to date

If you’re relying on reasonably accurate geolocation, you’d do well to put a scheduled job in place NOW that keeps your reference data up to date. Maxmind update the CSV file on the First Tuesday of every month. If you have historic data to geolocate, you’re probably out of luck as (fairly) Maxmind don’t publicise archives of this free data.

Copyright © 2006 - 2019 Kester Dobson

WIP