Posted by: Ian | August 28, 2009

IP Identification, a workflow

Basic premise

The foundation for this is that I want the lookup to be as fast as possible: this is a web-based service, a piece of middleware, and it cannot be a bottleneck.

For reasons I’ll cover later, I don’t want to just do dynamic “Whois” lookups: I have a requirement for an “augmented” dataset to work from, which means a long-term cache (i.e. a database) somewhere.

Relying on a remote backend database for live queries adds in two dependencies (network being present & database being up) and adds lag into the system (the network handshake to connect to the remote system, and querying the database).

By creating a large data object from the database, and “freezing” that to the local disk, I forgo the immediacy of direct database queries (the frozen copy is only as fresh as its last rebuild) in favour of a manifold improvement in response times.
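A minimal sketch of the freeze/thaw idea, assuming Python's pickle module as the serialiser (the post does not say which language or format is actually used). The function names, the file name and the sample record are made up for illustration.

```python
import pickle
import tempfile
from pathlib import Path


def freeze(data, path):
    """Serialise the in-memory object to a file on the local disk."""
    with open(path, "wb") as fh:
        pickle.dump(data, fh, protocol=pickle.HIGHEST_PROTOCOL)


def thaw(path):
    """Load the frozen object back into memory; no network or database involved."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Hypothetical snapshot, standing in for rows pulled from the database.
snapshot = {"192.0.2.0/24": {"name": "Example University"}}

frozen_path = Path(tempfile.gettempdir()) / "ip_lookup.pickle"
freeze(snapshot, frozen_path)   # done by the offline rebuild
lookup = thaw(frozen_path)      # done once, when the query service starts
```

The query service only ever calls thaw(), so a database outage affects the rebuild, not the lookups.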

Why not a simple “Whois” lookup?

If I’m looking for the name of an institution, one could assume that looking up the client’s IP number in the Whois tables would tell me where they are from. However, this is not 100% accurate.

Taking a sample list of hosts (1,400 Repositories listed in ROAR), I find that 233 give me cause for concern. There are a few reasons for this:

  • Some institutions actually have their repositories hosted outside their institution, so my testing system resolves to the hosting company, not the institution.
  • The “name” of the institution is not always easy to find (I’ve seen it on line 2 of the description; buried in the address field; in the orgname fields; etc.)
  • The Whois records are not always accurate (institutions change their name, merge, name the network after the part of the organisation it serves, etc.)
  • Some Whois servers are difficult to interact with (the Taiwanese one, for example)
  • There are situations where an outpost of one organisation is located within another (medical researchers within a hospital; Research Council or UKOLN staff in universities within the UK)

Given these issues, there is a need to augment the whois record… and provide a separate table of exceptions.

But still, a first-cut unhappiness level of under 17% (233 of 1,400 hosts) across the whole international stage is pretty good.

Augmenting the data, knowing where to look in the whois data-return, and using real consumer IP numbers will all improve the accuracy levels.

The Data object

We know we want a data object, frozen to the disk.

In essence, we take the reference data from the database, merge in any “authoritative” data EDINA has available (we have subscribers, and they tell us the correct name for their institution), and build a big nested data object designed to make finding records as easy as possible in a number of ways (by IP, by name, etc.; it’s easily extendable). We then freeze that to the disk.
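One way such a nested object could be laid out, sketched in Python with the standard ipaddress module. The field names (cidr, name, whois_name), the two indexes and the sample rows are assumptions for illustration, not the actual EDINA schema, and the linear scan in find_by_ip is a simplification that a real service would likely replace with a faster structure.

```python
import ipaddress


def build_data_object(whois_rows, authoritative_names):
    """Merge reference rows with authoritative names and index them two ways."""
    data = {"by_network": {}, "by_name": {}}
    for row in whois_rows:
        network = ipaddress.ip_network(row["cidr"])
        # A name supplied by a subscriber overrides the name found in Whois.
        name = authoritative_names.get(row["cidr"], row["name"])
        record = {"cidr": row["cidr"], "name": name, "whois_name": row["name"]}
        data["by_network"][network] = record
        data["by_name"].setdefault(name.lower(), []).append(record)
    return data


def find_by_ip(data, address):
    """Return the record for the most specific network containing the address."""
    ip = ipaddress.ip_address(address)
    matches = [net for net in data["by_network"] if ip in net]
    if not matches:
        return None
    return data["by_network"][max(matches, key=lambda net: net.prefixlen)]


# Hypothetical sample data, using documentation-only address ranges.
whois_rows = [
    {"cidr": "192.0.2.0/24", "name": "EXAMPLE-NET"},
    {"cidr": "198.51.100.0/24", "name": "Hosting Co"},
]
authoritative_names = {"192.0.2.0/24": "Example University"}

data = build_data_object(whois_rows, authoritative_names)
```

Adding another way in (by country, say) would mean adding one more top-level key and filling it inside the same loop.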

I also see a need for a periodic script to verify the data we have – and flag any changes in the whois records for a human to examine.
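The core of such a verification script might look like the sketch below: compare what we hold against a fresh Whois pull and list the differences for a person to look at. The record layout, the default field list and the sample data are all hypothetical.

```python
def flag_changes(stored, fresh, fields=("name",)):
    """Return (cidr, field, old, new) tuples for a human to review."""
    flagged = []
    for cidr, old in stored.items():
        new = fresh.get(cidr)
        if new is None:
            # The allocation has vanished from the fresh data altogether.
            flagged.append((cidr, "record", "present", "missing"))
            continue
        for field in fields:
            if old.get(field) != new.get(field):
                flagged.append((cidr, field, old.get(field), new.get(field)))
    return flagged


# Hypothetical data: what we hold, and what a fresh Whois pull returned.
stored = {
    "192.0.2.0/24": {"name": "Example College"},
    "198.51.100.0/24": {"name": "Hosting Co"},
    "203.0.113.0/24": {"name": "Gone Ltd"},
}
fresh = {
    "192.0.2.0/24": {"name": "Example University"},
    "198.51.100.0/24": {"name": "Hosting Co"},
}

to_review = flag_changes(stored, fresh)
```

Nothing is changed automatically here; the function only produces the list.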

Queries

[Figure: OARJ identification workflow]

When a query is made, the first call is against the local data object. Hopefully we have cached the information, and can return the (possibly augmented) data we have.

If there is no local data, then we need to look up Whois dynamically. Having successfully found some information, we insert it into the data object for the next query, and add it to a list for the “update” tool to harvest when rebuilding the data (let the offline rebuild tool deal with the database lag; keep the query tool as streamlined as possible).

Having got something, we then query the Exceptions dataset, and then we can make the information available to the end user.
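The query path described above, sketched in Python. To keep it short, everything is keyed by the bare address string rather than by network, and the live Whois call is passed in as a function; identify, fake_whois and the sample data are all hypothetical names for illustration.

```python
def identify(address, local_data, exceptions, pending, whois_lookup):
    """Resolve an IP: local object first, then Whois, then the exceptions table."""
    record = local_data.get(address)
    if record is None:
        record = whois_lookup(address)      # slow path: live Whois query
        if record is None:
            return None
        local_data[address] = record        # cache for the next query
        pending.append((address, record))   # left for the offline rebuild tool
    # An exception entry overrides whatever the cache or Whois returned.
    return {**record, **exceptions.get(address, {})}


calls = []  # records each live lookup, to show the cache working


def fake_whois(address):
    """Stand-in for a real Whois client."""
    calls.append(address)
    return {"name": "Hosting Co"} if address == "198.51.100.7" else None


local_data = {"192.0.2.17": {"name": "Example University"}}
exceptions = {"198.51.100.7": {"name": "Example Institute"}}
pending = []

first = identify("198.51.100.7", local_data, exceptions, pending, fake_whois)
second = identify("198.51.100.7", local_data, exceptions, pending, fake_whois)
```

The second call never reaches fake_whois: the first one cached the raw Whois record, and the exception is applied on the way out each time.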

Maintenance

The data needs maintenance: Whois records change, IP allocations alter, new records need to be added.

Additions are easy: they are simply new data being inserted (though one needs to check there are no overlaps in IP allocations).
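The overlap check could be as small as this sketch, which assumes allocations are held as CIDR strings and leans on the overlaps method of Python's ipaddress networks. The function name and the sample ranges are illustrative.

```python
import ipaddress


def find_overlaps(existing_cidrs, new_cidr):
    """Return the existing networks that overlap the proposed new allocation."""
    new_net = ipaddress.ip_network(new_cidr)
    overlaps = []
    for cidr in existing_cidrs:
        net = ipaddress.ip_network(cidr)
        # Only compare like with like: IPv4 against IPv4, IPv6 against IPv6.
        if net.version == new_net.version and net.overlaps(new_net):
            overlaps.append(cidr)
    return overlaps


# Hypothetical existing allocations (documentation-only ranges).
existing = ["192.0.2.0/24", "198.51.100.0/24"]

clash = find_overlaps(existing, "192.0.2.128/25")
clear = find_overlaps(existing, "203.0.113.0/24")
```

An empty result means the new record can simply be inserted; anything else wants a closer look before the data is rebuilt.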

Updates are harder: I suspect they will require a human to review, and approve/reject the change.

This needs careful thought, and planning.
