looking for hostname geographic hint validation

We are currently working on an algorithm that automatically detects
geographic hints inside of hostnames. At this point we are seeking
operators who can validate some of our inferences. Please contact me
if you can validate one of the inferences below or can provide us with one
we have missed.

not in every case is iata helpful for yahoo.

There is lax.yahoo.com and sjc.yahoo.com, but that's really only true
for a few limited peering-points.
for non-US, most of the actual data centres have names related to the
country. in US often more city related, but even that's a bit hairy with
places like 'mud.yahoo.com'
peering points are still somewhat more random, may be city, country, or
partner related ['the' is in london, for example]

Dear Bradley,

So basically you're asking others to do your homework for you ?

The only useful purpose your list serves is to demonstrate why people shouldn't try to build fancy algorithms that rely on an entirely unreliable datasource.

All you end up with are hacked together algorithms that contain a whole load of assumptions and will be obsolete by the time you release version 1.0 because people will have changed their naming conventions a million times.

For example, picking one example from your list ....

<iata>([^a-z]+[a-z]+\d*){3}.ic.ac.uk

ic.ac.uk = Imperial College. A well known and respected ivory towers institution in the UK. The vast majority of their campus sites are located in London and only one or two outside London in South East England.

It is therefore very unlikely they'll be using IATA codes; in fact, last time I checked they were using conventions such as hostname.doc.ic.ac.uk, hostname.ch.ic.ac.uk.

Far from being IATA codes, the intermediate subdomains actually refer to departments (DepartmentOfComputing and CHemistry in the two I quoted).

Sorry to rain on your parade, but someone had to say it.
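
To make the objection concrete, here is a minimal Python sketch of how a pattern like the one quoted above could be applied. It assumes that `<iata>` expands to exactly three lowercase letters and that the dots in the `.ic.ac.uk` suffix are literal; the original list does not define either. The lookup table, the function name `infer_city` and the hostnames are hypothetical.

```python
import re

# Assumption: <iata> means "any three lowercase letters"; the original
# list does not say how the placeholder is expanded.
IATA = r"(?P<iata>[a-z]{3})"

# The quoted pattern, with the dots of the suffix treated as literal.
PATTERN = re.compile(IATA + r"([^a-z]+[a-z]+\d*){3}\.ic\.ac\.uk")

# Tiny illustrative lookup table, not a full IATA database.
KNOWN_CODES = {"lon": "London", "lax": "Los Angeles", "sjc": "San Jose"}


def infer_city(hostname):
    """Return a city guess for hostname, or None if no guess is made."""
    match = PATTERN.fullmatch(hostname.lower())
    if match is None:
        return None  # hostname does not have the expected shape
    # A shape match is not enough: the three letters must be a known code.
    return KNOWN_CODES.get(match.group("iata"))


# Hypothetical hostnames, for illustration only.
for name in ("lon-gw1.net.core.ic.ac.uk", "foo.doc.ic.ac.uk"):
    print(name, "->", infer_city(name))
```

Under these assumptions a name of the form hostname.doc.ic.ac.uk does not fit the three-label shape, so it produces no guess rather than a wrong one. A department label that happens to sit in the first position and collide with a real airport code would still be misread, which is why operator confirmation matters.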

Dear Bradley,

So basically you're asking others to do your homework for you ? :wink:

Actually no, I'm asking people to do something which I can not.

While it is true I could test against a manual inference, I would simply
be checking one inference against another. Agreement would only prove
that the algorithm does what I expect. Only the operators, who actually
know what they are doing, can give me the ground truth I need to test my
inferences against reality.

For example, picking one example from your list ....

<iata>([^a-z]+[a-z]+\d*){3}.ic.ac.uk

Far from being IATA codes, the intermediate subdomains actually refer to
departments (DepartmentOfComputing and CHemistry in the two I quoted).

Sorry to rain on your parade, but someone had to say it. :wink:

You are most likely right, but I am not looking for perfection. I am
hoping for an inference that will get me within 10 km of the actual
city most of the time.
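
A "within 10 km" check could be sketched as a great-circle distance comparison. This is only an illustration: the thread does not say how distance is measured, so the haversine formula, the function names and the default threshold parameter are assumptions.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def within_threshold(inferred, actual, threshold_km=10.0):
    """True if the inferred (lat, lon) is within threshold_km of the actual one."""
    return haversine_km(*inferred, *actual) <= threshold_km
```

Each argument to `within_threshold` is a `(latitude, longitude)` pair in decimal degrees, for example the coordinates of the inferred city and of the location an operator reports.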

Given the validation I have so far, out of the 19,611 hostnames for which
a location is inferred and for which I have validation data, we infer the
city correctly 93% of the time.

While there is work left to do, it is far from the lost cause you
present.

> We are currently working on an algorithm that automatically detects
> geographic hints inside of hostnames. At this point we are seeking
> operators who can validate some of our inferences. Please contact me
> if you can validate one of the inferences below or can provide us with one
> we have missed.
>
> ###########################################
> # Inferences
> ###########################################
>
> <iata> International Air Transport Association airport code
>        (IATA airport code - Wikipedia)
> <icao> International Civil Aviation Organization airport code
>        (ICAO airport code - Wikipedia)
> <clli> COMMON LANGUAGE Location Identifier Code
>        (CLLI code - Wikipedia)
> <city name> largest populated city with the given name
> for example "sandiego" is "San Diego, CA, US"
> <iata>.yahoo.com
>
not in every case is iata helpful for yahoo.

There is lax.yahoo.com and sjc.yahoo.com, but that's really only true
for a few limited peering-points.
for non-US, most of the actual data centres have names related to the
country. in US often more city related, but even that's a bit hairy with
places like 'mud.yahoo.com'

Hey, MUD made sense at the time; it's the "Mid US Datacenter". :stuck_out_tongue:
(now, good luck fitting that into any pattern scheme...)

peering points are still somewhat more random, may be city, country, or
partner related ['the' is in london, for example]

THE makes sense; everyone knows TeleHouse East.

I actually didn't even know about the IATA acronym
until this thread, so I can honestly say it didn't enter
into the naming discussions; I dare say there's a lot
of other networks out there in a similar situation.
Hitting 93% accuracy is actually pretty mindblowing
from my perspective, given how random some of
the naming choices are. ^_^;

Matt

This is the number of times we think we have an answer and it is wrong.
It does not include the number of times we failed to find an answer that
is there. Although we have plans to search for nonstandard names in the
future, we currently do not look for them and so can't get them wrong.
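
The distinction here is between how often a guess is right when one is made and how often a guess is made at all. A minimal sketch of the two measures follows; the hostnames and cities are made up, and the names `score`, `precision` and `coverage` are hypothetical labels, not terms used by the project.

```python
def score(inferred, truth):
    """Compare inferred cities with operator-supplied ground truth.

    inferred: hostname -> city guess, or None when no guess was made
    truth:    hostname -> actual city
    """
    # Hostnames with ground truth for which a guess was actually made.
    guessed = [h for h in truth if inferred.get(h) is not None]
    correct = [h for h in guessed if inferred[h] == truth[h]]
    # Share of guesses that were right (the kind of figure quoted as 93%).
    precision = len(correct) / len(guessed) if guessed else 0.0
    # Share of validated hostnames for which any guess was made.
    coverage = len(guessed) / len(truth) if truth else 0.0
    return precision, coverage


# Made-up data: two right guesses, one wrong guess, one hostname skipped.
truth = {
    "a.example.net": "London",
    "b.example.net": "London",
    "c.example.net": "Chennai",
    "d.example.net": "Omaha",
}
inferred = {
    "a.example.net": "London",
    "b.example.net": "London",
    "c.example.net": "Chicago",
    "d.example.net": None,
}

precision, coverage = score(inferred, truth)
print(f"precision={precision:.2f} coverage={coverage:.2f}")
```

A hostname with a nonstandard name, where no guess is attempted, lowers only the second number; it cannot count as a wrong answer in the first.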

> Hitting 93% accuracy is actually pretty mindblowing
> from my perspective, given how random some of
> the naming choices are. ^_^;

This is the number of times we think we have an answer and it is wrong.

Ah, so that would include cases like thinking CH1 and CHE might
be nearby, rather than halfway around the planet, but wouldn't include
things like MUD, where there wouldn't even be a guess at an answer.

It does not include the number of times we failed to find an answer that
is there. Although we have plans to search for nonstandard names in the
future, we currently do not look for them and so can't get them wrong.

Thanks for the clarification around the number--makes much
more sense now. :slight_smile:

Matt