How to obtain IP geolocation data?
IP geolocation data specify the geographical location (or geolocation) of networked devices. Many of these devices, especially non-mobile ones, do not have GPS or any other means of determining their position. Even if a device knows its own position, it does not necessarily communicate it, for example for privacy reasons.
IP geolocation data are important in many applications ranging from network security to social sciences (see our other blog for details). Different accuracies may be targeted: region level, city level, or exact latitude and longitude coordinates. But where do these data really come from? Here we answer this on two levels. First, we outline some of the methods used to generate these data from scratch, in order to show how sophisticated the task is. Next, we describe how to obtain these data in practice, which is a much simpler business.
On the methods of determining IP geolocations
In this section we briefly describe some of the known methods for obtaining IP geolocation data. This overview is based on the open scientific literature; as a provider, we do not reveal our own methods. It is far from complete: our aim is just to illustrate the complex nature of the problem and the value of the data.
The simplest approach is to build databases from user-entered information. This is clearly an ad hoc approach and not very accurate; however, it can yield relevant supplementary information. To go further, active, probe-based measurements have to be implemented. In this context, the following terminology is used. Targets are hosts whose unknown geolocation is to be determined. To be measured, they may be required to respond to probes. Landmarks are network infrastructure with accurately known geolocations. These are sometimes referred to as active landmarks. In contrast to these, monitors or passive landmarks are network resources with a known geographic location and the ability to send ping measurements to both landmarks and targets (B. Eriksson et al., 2012).
As for methods, IP2Geo was amongst the first measurement-based approaches to assign geolocations to IP addresses (V. Padmanabhan and L. Subramanian, 2001). It uses three algorithms to achieve its goal: GeoPing, GeoCluster, and GeoTrack. GeoPing correlates latency values (such as round-trip time, RTT) with geographical distances. Its granularity depends on the available landmarks, and the results can be distorted by routing loops and other ping issues. GeoCluster, a more accurate approach, primarily uses data from Border Gateway Protocol routing tables and address prefixes obtained from service providers in order to divide the whole IP address space into clusters. Additional sources of information, such as registry data and WHOIS information, are also used extensively in this process. The clusters are then assigned geographical information, and the location of a target is obtained from its cluster membership. This approach suffers from the inaccuracy of the input data. For example, WHOIS records typically contain the address of the registrant's headquarters, whereas autonomous systems in particular can be geographically widely distributed. Finally, the main idea of GeoTrack is to examine Fully Qualified Domain Names for city names, airport codes, and other geographical indicators in order to infer geolocation. A difficulty here is that there is no standard for how this information appears in the names. The obtained data are augmented with route tracing (traceroute) to reveal all intermediate stations on the way to the target. Traceroute, however, relies on UDP or ICMP packets, which many routers drop.
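To make the GeoPing idea more concrete, here is a minimal sketch in Python of a nearest-neighbour search in "delay space": the target is assigned the location of the landmark whose RTT vector, measured from the same probing hosts, is most similar to its own. The landmark names, coordinates, and RTT values are made up for illustration, and the function names are our own; this is a sketch of the principle, not the published implementation.

```python
import math

# Hypothetical delay map (all values invented for illustration):
# for each landmark with a known location, the RTTs in milliseconds
# measured from the same three probing hosts, in a fixed order.
DELAY_MAP = {
    "landmark-seattle": {"location": (47.61, -122.33), "rtts_ms": [12.0, 68.0, 80.0]},
    "landmark-chicago": {"location": (41.88, -87.63), "rtts_ms": [52.0, 25.0, 30.0]},
    "landmark-newyork": {"location": (40.71, -74.01), "rtts_ms": [75.0, 18.0, 9.0]},
}


def delay_distance(rtts_a, rtts_b):
    """Euclidean distance between two RTT vectors of equal length."""
    if len(rtts_a) != len(rtts_b):
        raise ValueError("RTT vectors must come from the same set of probing hosts")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(rtts_a, rtts_b)))


def geoping_estimate(target_rtts_ms, delay_map=DELAY_MAP):
    """Return (landmark name, location) of the landmark nearest in delay space."""
    best = min(
        delay_map,
        key=lambda name: delay_distance(target_rtts_ms, delay_map[name]["rtts_ms"]),
    )
    return best, delay_map[best]["location"]


if __name__ == "__main__":
    # RTTs to the target measured from the same three probing hosts.
    print(geoping_estimate([70.0, 20.0, 12.0]))
```

The sketch also shows why the granularity depends on the available landmarks: the estimate can only ever be the location of one of them.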
Constraint-based geolocation (B. Gueye et al., 2004) uses estimates of the minimum and maximum distance between the target host and the landmarks. These estimates rest on physical considerations about signal propagation in optical fibers and towards satellites. The actual location is determined from the intersection of the circles defined by these distances. Unfortunately, as it uses ping data extensively, it is affected by firewalls, proxies, etc. A more advanced approach in this direction is Octant, a modular framework that uses Bézier curves and also takes into account additional information, such as demographic data on areas.
There are several other approaches in the literature; see e.g. the work of R. Koch et al. (2013), which also gives a more detailed overview of the methods recapitulated above. IP geolocation is a subject of active research that continually faces new challenges. As end users, we can conclude that accurate IP geolocation data are hard to obtain and thus valuable. However, we need these data and do not want to become scientists. So let us see how to get them quickly and easily.
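The distance-constraint idea can be illustrated with a short Python sketch. It assumes that signals in optical fiber travel at roughly two thirds of the speed of light (about 200 km per millisecond), so half of the RTT gives an upper bound on the distance to a landmark; a candidate location is feasible only if it lies inside every landmark's circle. This is a deliberately simplified sketch: it uses only the physical upper bound, ignores minimum distances, and does not include any per-landmark calibration. The landmark coordinates and RTT values are invented.

```python
import math

EARTH_RADIUS_KM = 6371.0
# Assumption: propagation speed in fiber of about 2/3 c, i.e. ~200 km per ms.
FIBER_SPEED_KM_PER_MS = 200.0


def max_distance_km(rtt_ms):
    """Upper bound on the distance to a landmark implied by a round-trip time."""
    one_way_ms = rtt_ms / 2.0
    return one_way_ms * FIBER_SPEED_KM_PER_MS


def haversine_km(p, q):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    lat1, lon1 = math.radians(p[0]), math.radians(p[1])
    lat2, lon2 = math.radians(q[0]), math.radians(q[1])
    a = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_feasible(candidate, measurements):
    """True if the candidate lies inside the circle of every landmark.

    measurements: iterable of ((lat, lon) of landmark, RTT in ms to the target).
    """
    return all(
        haversine_km(candidate, landmark) <= max_distance_km(rtt_ms)
        for landmark, rtt_ms in measurements
    )


if __name__ == "__main__":
    # Invented measurements: landmarks near Paris and Frankfurt, 4 ms RTT each.
    measurements = [((48.86, 2.35), 4.0), ((50.11, 8.68), 4.0)]
    print(is_feasible((49.61, 6.13), measurements))   # candidate near Luxembourg
    print(is_feasible((40.42, -3.70), measurements))  # candidate near Madrid
```

In practice the feasible region is the intersection of all such circles, and the more landmarks respond with low RTTs, the smaller that region becomes.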
Obtaining IP geolocation data: practice
Owing to the evident practical value of IP geolocation data and the significant effort needed to maintain a proper geoIP database, it is no surprise that there are many providers on the market. If you need fresh and more accurate data, you have to go for a commercial solution.
Many such solutions are available, and we cannot undertake to review or compare them all. And, after all, we are here to recommend our solution: the IP Geolocation API by WhoisXML API, Inc. It is a reasonably priced solution offered primarily in the form of a RESTful API, which is handy to use in many environments. We provide sample code to illustrate its use. As for accuracy, we identify locations with a precision of up to the city and postal code, and the postal code can narrow the location down to an area within the city. The same holds true for the latitude and longitude: they normally point to the center of the city. You can test for free whether our data fit your aims, and even use the service regularly: we have a free subscription that allows 1,000 queries a month. If you need more, you can choose from a variety of subscription plans to meet your requirements. We also offer the possibility of downloading our entire database.
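As a rough illustration of how a RESTful lookup of this kind is typically called, here is a minimal Python sketch using only the standard library. The endpoint URL and the parameter names `apiKey` and `ipAddress` are assumptions on our part for the purpose of the example; please check them against the current API documentation before use. The function names are hypothetical, and the sketch makes no assumption about the fields of the returned JSON.

```python
import json
import urllib.parse
import urllib.request

# Assumption: endpoint and parameter names; verify them in the API documentation.
API_ENDPOINT = "https://ip-geolocation.whoisxmlapi.com/api/v1"


def build_query_url(ip_address, api_key, endpoint=API_ENDPOINT):
    """Build the GET URL for a single IP lookup (parameter names are assumed)."""
    query = urllib.parse.urlencode({"apiKey": api_key, "ipAddress": ip_address})
    return f"{endpoint}?{query}"


def lookup(ip_address, api_key, timeout=10.0):
    """Send the request and return the decoded JSON response as Python data."""
    url = build_query_url(ip_address, api_key)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Replace the placeholder with your own key before running.
    result = lookup("8.8.8.8", "YOUR_API_KEY")
    print(json.dumps(result, indent=2))
```

Keeping URL construction separate from the network call makes it easy to inspect the request or to swap in a different HTTP client later.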