GeoIP database
Create MaxMind-compatible GeoIP databases for IP geolocation from scratch, using only freely available address registry information and Whois-based Geofeeds.
Many internet services still seem to rely on outdated or practically unavailable GeoIP databases from MaxMind for IP geolocation. While MaxMind does not disclose its sources, IPv4 block-to-country registry information and so-called Geofeeds are freely available. So it is time to create a new version from scratch!
MaxMind-compatible GeoIP.dat databases are still widely used by various tools, for example for access restriction or analytics.
However, support for them has since been dropped completely.
Because authoritative IPv4 address space registry information is used directly, no conversion from a third-party geolocation provider is involved, which could otherwise taint the contents or the licensing.
A regularly updated database export is also available for download and direct online lookup.
Comparison
To validate the results, the reported country code for every IPv4 address is compared against an old 2015 MaxMind database and a recent 2024 DB-IP database, ignoring reserved blocks and continent code fallbacks. Shown is the number of IPs in millions: how often all three lookups give the same result, where two agree, and where one has a unique result.
Most notably, the great majority of addresses get the same value from all three sources (or from both, when comparing only two databases). Otherwise, the databases perform relatively similarly, and further interpretation would need an ultimately authoritative baseline to tell which one is most accurate or which ranges are especially important or volatile.
Usage and Quickstart
The whole processing pipeline is based on a Makefile with a collection of small Python scripts.
Apart from a recent Python environment (tested with 3.8 and 3.10), no additional requirements are needed.
TL;DR: Running make will automatically download all needed files (i.e., IPv4 address space registry information as well as third-party Geofeeds) and convert them into MaxMind-formatted CSV files:
make -j
python3 -m src.parse_locations --in-file data/countryInfo.txt --out-file out/GeoLite2-Country-Locations-en.csv
python3 -m src.parse_transfers --in-files data/transfers-*.json --out-file data/transfers.json
python3 -m src.parse_delegations --format address-space --source iana --in-file data/ipv4-address-space.csv --out-file data/ipv4-address-space.json
python3 -m src.parse_delegations --format delegated --source afrinic --in-file data/delegated-afrinic.txt --out-file data/delegated-afrinic.json
python3 -m src.parse_delegations --format delegated --source apnic --in-file data/delegated-apnic.txt --out-file data/delegated-apnic.json
python3 -m src.parse_delegations --format delegated --source arin --in-file data/delegated-arin.txt --out-file data/delegated-arin.json
python3 -m src.parse_delegations --format delegated --source lacnic --in-file data/delegated-lacnic.txt --out-file data/delegated-lacnic.json
python3 -m src.parse_delegations --format delegated --source ripe --in-file data/delegated-ripe.txt --out-file data/delegated-ripe.json
python3 -m src.parse_whois --in-file data/whois-afrinic.db.gz --out-file data/whois-afrinic.json
python3 -m src.parse_whois --in-file data/whois-apnic.db.gz --out-file data/whois-apnic.json
python3 -m src.parse_whois --in-file data/whois-arin.db.gz --out-file data/whois-arin.json
python3 -m src.parse_whois --in-file data/whois-lacnic.db.gz --out-file data/whois-lacnic.json
python3 -m src.parse_whois --in-file data/whois-ripe.db.gz --out-file data/whois-ripe.json
python3 -m src.fetch_geofeeds --in-files data/whois-*.json --out-file data/geofeeds.json --out-dir data/
python3 -m src.parse_geofeeds --in-file data/geofeeds.json --out-file data/geofeed.json
python3 -m src.merge_delegations --transfers-file data/transfers.json --in-files data/delegated-*.json data/geofeed.json --out-file data/delegated.json
python3 -m src.merge_countries --format range --address-space-file data/ipv4-address-space.json --in-file data/delegated.json --out-file out/ranges.json
python3 -m src.merge_countries --format net --address-space-file data/ipv4-address-space.json --in-file data/delegated.json --out-file out/networks.json
python3 -m src.dump_csv --format range --location-file out/GeoLite2-Country-Locations-en.csv --in-file out/ranges.json --out-file out/geoip-ranges.csv
python3 -m src.dump_csv --format net --location-file out/GeoLite2-Country-Locations-en.csv --in-file out/networks.json --out-file out/geoip-networks.csv
python3 -m src.dump_csv --format geoip2 --location-file out/GeoLite2-Country-Locations-en.csv --in-file out/networks.json --out-file out/GeoLite2-Country-Blocks-IPv4.csv
python3 -m src.dump_csv --format legacy --location-file out/GeoLite2-Country-Locations-en.csv --in-file out/ranges.json --out-file out/GeoIP.csv
While other artifacts could also be of interest, most importantly, this will create:
- GeoLite2-Country-Locations-en.csv: Most recent MaxMind2-compatible country information, originating from the GeoNames countryInfo export.
- GeoLite2-Country-Blocks-IPv4.csv: IPv4 network to country mappings as MaxMind GeoLite CSV database. GeoName IDs refer to the locations file that is created alongside it.
- GeoIP.csv: All-in-one IPv4 range to country database in MaxMind “legacy” format. Suitable for generating a .dat file in the next step.
If, for example, the geoip-bin Ubuntu or Debian package is installed, make release will create a corresponding GeoIP.dat MaxMind “legacy” database:
make release
/usr/lib/geoip/geoip-generator -o out/GeoIP.dat out/GeoIP.csv
The result of geoip-generator can be checked directly with the geoiplookup tool:
geoiplookup -f out/GeoIP.dat 1.2.3.4
GeoIP Country Edition: AU, Australia
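A similar check can be run against out/GeoIP.csv without any GeoIP tooling. The following is only a minimal sketch: it assumes the usual MaxMind “legacy” column layout (start address, end address, both again as integers, country code, country name), and the helper names load_ranges and lookup as well as the two sample rows are illustrative and not part of this repository.

```python
import bisect
import csv
import io
import ipaddress


def load_ranges(lines):
    """Read (start_int, end_int, country_code) tuples, sorted by start."""
    rows = []
    for rec in csv.reader(lines):
        # Assumed columns: start_ip, end_ip, start_int, end_int, code, name
        rows.append((int(rec[2]), int(rec[3]), rec[4]))
    rows.sort()
    return rows


def lookup(rows, address):
    """Return the country code for an IPv4 address, or None if uncovered."""
    value = int(ipaddress.IPv4Address(address))
    # Binary search for the last range whose start is <= value.
    idx = bisect.bisect_right(rows, (value, float("inf"), "")) - 1
    if idx >= 0 and rows[idx][0] <= value <= rows[idx][1]:
        return rows[idx][2]
    return None


# Illustrative sample rows; a real run would pass open("out/GeoIP.csv").
sample = io.StringIO(
    '"1.0.0.0","1.0.0.255","16777216","16777471","AU","Australia"\n'
    '"1.0.1.0","1.0.3.255","16777472","16778239","CN","China"\n'
)
rows = load_ranges(sample)
print(lookup(rows, "1.0.0.1"))
```

Sorting once and using a binary search keeps each lookup logarithmic in the number of ranges, which matters once the full file is loaded.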
Running make clean removes all output files; make reallyclean also removes cached input files so that they will be freshly downloaded again.
For more information on the involved input and output files, see below.
Background
The approach is quite straightforward: Parse address space country delegations, merge results by extending ranges, enrich with country codes and names, and dump into the different CSV formats.
Additionally, ISPs or similar institutions can publish so-called Geofeeds via Whois entries, which are used to refine the results.
Multiple simple standalone Python 3 scripts are involved, which are orchestrated by a single call to make for a simple unified interface, parallelism (make -j), and caching.
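To illustrate the "merge results by extending ranges" step, here is a minimal sketch. It is not the code from src.merge_countries; the function name merge_ranges and the (start, end, code) tuple layout are assumptions made for this example, and conflicts between overlapping ranges with different codes are deliberately left out.

```python
def merge_ranges(ranges):
    """Merge consecutive (start, end, code) ranges that share a code."""
    merged = []
    for start, end, code in sorted(ranges):
        if merged and merged[-1][2] == code and merged[-1][1] + 1 >= start:
            # Directly adjacent or overlapping with the same code: extend.
            prev_start, prev_end, _ = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), code)
        else:
            merged.append((start, end, code))
    return merged


example = [(0, 255, "AU"), (256, 511, "AU"), (512, 767, "CN"), (1024, 1279, "CN")]
print(merge_ranges(example))
```

The two adjacent AU ranges collapse into one entry, while the two CN ranges stay separate because there is a gap between them.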
IP Address Space Registry Input Files
Everything is based on the following authoritative input files, mostly representing the current address space country delegations:
- GeoNames country database export, for ISO codes and unique identifiers used by the GeoLite2 CSV files.
- IANA address space registry, which maps every /8 IPv4 network to the registry that is responsible for further subnetwork delegations of it.
- Reports of which IP network has been delegated to which country, i.e., the actual GeoIP information. Involves the five worldwide regional registries, namely AFRINIC (Africa), APNIC (Asia Pacific), ARIN (North America), LACNIC (Latin America), and RIPE (Europe).
  - ftp://ftp.afrinic.net/pub/stats/afrinic/delegated-afrinic-extended-latest
  - ftp://ftp.apnic.net/pub/stats/apnic/delegated-apnic-extended-latest
  - ftp://ftp.arin.net/pub/stats/arin/delegated-arin-extended-latest
  - ftp://ftp.lacnic.net/pub/stats/lacnic/delegated-lacnic-extended-latest
  - ftp://ftp.ripe.net/pub/stats/ripencc/delegated-ripencc-extended-latest
- Just like the per-registry delegations, transfer logs are fetched from the respective FTP servers. These indicate when blocks are moved from one registry to another and are used to resolve (very rare) conflicts.
- The third type of data downloaded from the regional NICs is Whois database dumps. Whois entries are searched for Geofeed URLs, which may be published there, for example by route owners.
- CSV-based Geofeeds found via Whois are crawled from various locations. For more information, see the corresponding section below.
All input sources will be fetched automatically during the first run.
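The delegated-*-extended-latest files are pipe-separated text. As a rough sketch of how IPv4 records could be read from them (not the actual src.parse_delegations code), the following assumes the common field order registry|cc|type|start|value|date|status, where value is the number of addresses for IPv4 records. The function name parse_delegated and the sample lines are illustrative only.

```python
import ipaddress


def parse_delegated(lines):
    """Yield (first, last, country) for IPv4 records of a delegated file."""
    for line in lines:
        fields = line.strip().split("|")
        # Skip comments, the version header, and summary lines.
        if line.startswith("#") or len(fields) < 7 or fields[2] != "ipv4":
            continue
        _registry, country, _type, start, count, _date, status = fields[:7]
        # Only keep records that are actually delegated to a country.
        if country in ("", "*") or status not in ("allocated", "assigned"):
            continue
        first = ipaddress.IPv4Address(start)
        last = first + (int(count) - 1)  # value is an address count
        yield first, last, country


# Illustrative sample input, not real registry data.
sample = [
    "# comment line",
    "apnic|*|ipv4|*|2|summary",
    "apnic|AU|ipv4|1.0.0.0|256|20110811|assigned|A0000001",
    "apnic||ipv4|1.0.4.0|1024||available|",
]
for first, last, country in parse_delegated(sample):
    print(first, last, country)
```

Because the value field is a count rather than a prefix length, a record does not have to align with a single CIDR block, which is one reason the pipeline keeps a start-end range representation.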
GeoIP Database Output Files
The following GeoIP database files are provided as output and cover the most recent address space registry information in a MaxMind-compatible format:
- GeoLite2-Country-Locations-en.csv
- GeoLite2-Country-Blocks-IPv4.csv
- GeoIP.csv
- GeoIP.dat
See the above section for more details and how to run the corresponding scripts. In addition, the following files might also be of interest:
- networks.json, ranges.json: The whole IPv4 address space as JSON in address/mask or start-end notation, respectively. Maps all addresses to either a country code, a registry (if not reported as country-delegated), or RESERVED (e.g., local or multicast networks).
- geoip-networks.csv, geoip-ranges.csv: All-in-one CSV exports of the whole address space in address/mask or start-end notation, respectively. This format has each entry already resolved to country or continent Alpha-2 ISO codes (or ZZ). As these files contain the “most” information, they are the recommended ones for further processing.
Note that multiple netmask-based notations might be needed to represent a single address range. The range format is thus more expressive and leads to fewer “duplicate” consecutive entries (currently 137343 vs. 175817 in total).
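The difference between the two notations can be reproduced with Python's standard ipaddress module. This is a small sketch for illustration only; the helper name range_to_networks is made up for this example and the pipeline itself may convert ranges differently.

```python
import ipaddress


def range_to_networks(first, last):
    """Split an inclusive start-end range into the covering CIDR networks."""
    nets = ipaddress.summarize_address_range(
        ipaddress.IPv4Address(first), ipaddress.IPv4Address(last)
    )
    return [str(net) for net in nets]


# One range entry that needs two address/mask entries.
print(range_to_networks("1.0.1.0", "1.0.3.255"))
# ['1.0.1.0/24', '1.0.2.0/23']
```

A range that does not start on a matching power-of-two boundary cannot be written as a single network, so it is split into several consecutive entries with the same country.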
Automatically updated database exports are also available for download.
Custom Extension: Region Codes
The resulting “extended” databases cover the whole IPv4 address space, instead of only the networks for which country information is available. This is done by setting at least the continent code as country code fallback, corresponding to the responsible registry for that block. Also, reserved ranges (for example local or multicast networks) are marked as such. Despite the additional information, there are overall fewer entries than in the “original” after optimizing and merging the results.
However, the GeoIP.csv and GeoIP.dat “legacy” formats cannot profit from this and still cover the usual concrete countries only:
The hardcoded set of country codes (and names) in the underlying libGeoIP library does not allow for using additional continents, regions, or reserved codes.
Using any of the other output formats can thus give better results.
On the other hand, this ensures that there is no ambiguity when working with Alpha-2 ISO codes, for example regarding AF for either Africa or Afghanistan.
Geofeed: IP Geolocation Feeds
Country codes do not always indicate where an address is actually used, but rather the location of the institution that is responsible for it. In addition, an ISP, for example, can serve multiple regions while having multiple blocks at its disposal; this seems to be relatively common, e.g., in Europe. The location where an IP prefix is provisioned can thus change on short notice, while the internet registry information is relatively static and not optimized for timely dynamic updates.
As defined by RFC 8805, Geofeeds are simple CSV files that assign ISO country codes to IP address prefixes.
Service providers can thereby publish the current state as a plain file that can be downloaded from an arbitrary web server.
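As a rough illustration of the file format, the sketch below reads Geofeed lines of the form prefix,country,region,city,postal and keeps only IPv4 prefixes with a two-letter country code. It is not the code from src.parse_geofeeds; the function name parse_geofeed is hypothetical, and the sample lines use documentation prefixes rather than real feed data.

```python
import csv
import ipaddress


def parse_geofeed(lines):
    """Yield (network, country) pairs from RFC 8805 style CSV lines."""
    # Drop blank lines and '#' comments before handing over to the CSV reader.
    content = (ln for ln in lines if ln.strip() and not ln.lstrip().startswith("#"))
    for rec in csv.reader(content):
        try:
            net = ipaddress.ip_network(rec[0].strip(), strict=False)
        except ValueError:
            continue  # not a valid prefix, ignore the line
        country = rec[1].strip().upper() if len(rec) > 1 else ""
        if net.version == 4 and len(country) == 2:
            yield net, country


# Illustrative sample lines, not taken from a real feed.
sample = [
    "# example feed",
    "192.0.2.0/24,US,US-AL,Alabaster,",
    "2001:db8::/32,DE,,,",
    "198.51.100.0/25,de,,,",
    "garbage,XX",
]
for net, country in parse_geofeed(sample):
    print(net, country)
```

Feeds are published by third parties, so skipping malformed lines instead of aborting is a deliberate choice in this sketch.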
Per RFC 9092, Geofeed URLs are announced in the remarks or geofeed RPSL (“Whois”) attributes.
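A minimal sketch of how such URLs could be pulled out of a single Whois object follows. The regular expression is an assumption based on the two attribute forms named above (a geofeed: attribute, or a remarks: attribute whose value starts with "Geofeed" followed by an HTTPS URL); the name find_geofeed_urls and the sample object are illustrative, and the actual src.parse_whois script may work differently.

```python
import re

# Matches "geofeed: <url>" as well as "remarks: Geofeed <url>".
GEOFEED_RE = re.compile(
    r"^(?:geofeed:\s*|remarks:\s*geofeed:?\s+)(https://\S+)",
    re.IGNORECASE | re.MULTILINE,
)


def find_geofeed_urls(rpsl_object):
    """Return all Geofeed URLs announced in one RPSL object."""
    return GEOFEED_RE.findall(rpsl_object)


# Illustrative object using a documentation prefix and example domains.
sample = """\
inetnum:        192.0.2.0 - 192.0.2.255
netname:        EXAMPLE-NET
remarks:        Geofeed https://example.com/geofeed.csv
geofeed:        https://example.net/feed.csv
remarks:        unrelated note about https://example.org
"""
print(find_geofeed_urls(sample))
```

Anchoring the pattern at the start of a line avoids picking up URLs that merely appear somewhere in free-text remarks.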
The approach that uses Geofeed data for refining geolocation results is as follows:
- Download Whois database dumps from the five worldwide regional internet registries
- Parse route and IPv4 block Whois entries for feed URLs
- Download Geofeeds from the 3rd party servers if not already locally cached
- Parse CSV feeds for prefixes and countries, which must intersect with the announced routes
- Use the results in the same way as “static” assignments, but let them take precedence during conflict resolution
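The last two steps can be sketched as follows. This is an illustration under assumptions, not the repository's merge logic: the function names accept_feed_entries and resolve are hypothetical, "intersect" is interpreted as a plain overlap test from the standard ipaddress module, and precedence is reduced to preferring a Geofeed country code whenever one exists.

```python
import ipaddress


def accept_feed_entries(route, entries):
    """Keep only feed entries that overlap the route that announced the feed."""
    route_net = ipaddress.ip_network(route)
    return [
        (net, code)
        for net, code in entries
        if ipaddress.ip_network(net).overlaps(route_net)
    ]


def resolve(static_code, geofeed_code):
    """Prefer the Geofeed country code over the static registry assignment."""
    return geofeed_code or static_code


# Illustrative data using documentation prefixes.
entries = [("192.0.2.128/25", "DE"), ("198.51.100.0/24", "FR")]
print(accept_feed_entries("192.0.2.0/24", entries))
print(resolve("US", "DE"), resolve("US", None))
```

Restricting a feed to the routes that announce it prevents a third-party file from relabeling address space its publisher does not hold.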