RE: How to get a list of research and academic ISP ?

Dear All,

Thank you very much for the numerous and quick replies to my email. I must
say that the NANOG list is really highly responsive.

I needed some time to digest your comments and try some new ideas. I am now
sharing the preliminary results with you and would welcome further comments.

The problem was (and still is) to find a good heuristic to distinguish
between commercial (COM) and educational/research/academic (EDU) ASes.

*EDU_Abilene*

My first approach (see my original email) was to extract a list of all
destinations announced by Abilene. (The assumption is that Abilene generally
does not announce commercial prefixes.) This results in a list, call it
"EDU_Abilene", of 1333 ASes.

*EDU_description*

Some of you suggested looking at the names and descriptions of ASes. I used
the AS list available at:

http://www.multicasttech.com/status/asn_expand.txt

and searched the last column ("Organization") for the following strings:

"Universit|Univerz|Universida|research|education|science|scientif|academic|college|institut|laborator|school|ecole|edu|R&D|library|academy|Etudes"

This approach finds 1796 "educational" ASes, call this set
"EDU_description".

Of course, these two lists overlap, but less than I expected. In particular:

len(EDU_Abilene)=1333

len(EDU_description)=1796

union(EDU_Abilene, EDU_description)=2269

intersection(EDU_Abilene, EDU_description)=860
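
The description search and the overlap figures above can be sketched roughly as follows. This is only an illustration under assumptions: the line layout ('<asn> <name> <organization>'), the case-insensitive matching, the function names, and the sample data (AS numbers from the documentation range) are all made up here and are not taken from asn_expand.txt.

```python
import re

# Keyword pattern from the message; case-insensitive matching is an assumption.
EDU_PATTERN = re.compile(
    r"Universit|Univerz|Universida|research|education|science|scientif|"
    r"academic|college|institut|laborator|school|ecole|edu|R&D|library|"
    r"academy|Etudes",
    re.IGNORECASE,
)


def edu_by_description(lines):
    """Return the set of AS numbers whose organization text matches EDU_PATTERN.

    Each line is assumed to look like '<asn> <name> <organization ...>';
    the real column layout of the AS list may differ.
    """
    edu = set()
    for line in lines:
        fields = line.split(None, 2)
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # skip headers, blank lines and malformed lines
        asn, _name, organization = fields
        if EDU_PATTERN.search(organization):
            edu.add(int(asn))
    return edu


def overlap_report(a, b):
    """Return the sizes of both sets, their union and their intersection."""
    return {
        "a": len(a),
        "b": len(b),
        "union": len(a | b),
        "intersection": len(a & b),
    }


# Made-up sample data, for illustration only.
sample_descriptions = [
    "64496 EXAMPLE-U Example University",
    "64497 EXAMPLE-NET Example Networks Inc.",
    "64498 EXAMPLE-LAB National Laboratory of Examples",
]
edu_description = edu_by_description(sample_descriptions)
edu_abilene = {64496, 64499}  # stand-in for the Abilene-derived list
report = overlap_report(edu_abilene, edu_description)
print(report)

# Inclusion-exclusion check on the figures reported above.
assert 1333 + 1796 - 860 == 2269
```

Note that the short alternative "edu" also matches inside unrelated words such as "Procedures", which is one likely source of false positives in a substring search like this.
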

For many reasons, these lists are far from precise. For instance,
EDU_Abilene contains AS 7132 (AT&T) and AS 8075 (Microsoft). Therefore I
need further data sets or a filtering methodology. This raises some questions:

1) What other EDU networks (preferably with BGP tables available on the web)
can I take as examples of ASes that (generally) do not announce commercial
prefixes? Based on them I could construct lists similar in spirit to
EDU_Abilene. I guess the more, the better.

2) Do you know of other lists similar to
http://www.multicasttech.com/status/asn_expand.txt ? Maybe a longer
description or a website associated with an AS would help the method I use
to create EDU_description. Do you think the strings I use in my search are
appropriate?

*AS relationships*

Another approach is to exploit the AS relationships. Most of you agree that
EDU ASes are usually not providers for COM customers. This suggests a way to
detect false positives in EDU_Abilene and EDU_description (or in their
union): for every EDU node, count its COM customers, i.e., the EDU
provider --- COM customer relationships. I used the AS graphs with inferred
relationships provided by CAIDA (http://as-rank.caida.org/data/2006/). This
method works well for finding good candidates for false positives, but they
should not be blindly accepted. For instance, AS 7132 (AT&T), a member of
EDU_Abilene, has the highest number of COM customers (615) and should
obviously belong to COM. In contrast, AS 11537 (Abilene), a big component of
the EDU backbone, has 66 COM customers! In general, there are about 50 EDU
nodes with more than 10 COM customers each.
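
A minimal sketch of the customer count described above, under stated assumptions: relationship lines are taken to have the form 'provider|customer|-1' for a provider-to-customer link (the files actually used may be laid out differently), every AS outside the EDU set is treated as COM, and the function names and sample AS numbers are hypothetical.

```python
from collections import Counter


def com_customer_counts(relationship_lines, edu_ases):
    """Count, for each EDU AS, its customers that are not in the EDU set.

    Assumed line format: 'provider|customer|-1' for a provider-to-customer
    link. Lines with any other relationship code, blank lines and
    '#' comments are skipped.
    """
    counts = Counter()
    for line in relationship_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split("|")
        if len(parts) < 3 or parts[2] != "-1":
            continue
        provider, customer = int(parts[0]), int(parts[1])
        if provider in edu_ases and customer not in edu_ases:
            counts[provider] += 1
    return counts


def suspects(counts, threshold=10):
    """EDU ASes with more than `threshold` COM customers, largest first."""
    return [(asn, n) for asn, n in counts.most_common() if n > threshold]


# Made-up sample data, for illustration only.
edu = {64496, 64497}
sample_links = [
    "# provider|customer|relationship",
    "64496|64500|-1",  # EDU provider, COM customer: counted
    "64496|64501|-1",  # EDU provider, COM customer: counted
    "64496|64497|-1",  # EDU provider, EDU customer: not counted
    "64497|64502|0",   # not a provider-to-customer link: skipped
]
counts = com_customer_counts(sample_links, edu)
print(suspects(counts, threshold=1))
```

The output is only a list of candidates for manual review; as the Abilene example shows, a high count alone does not prove that an AS is commercial.
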

3) What other "automatic" or "manual" approaches would you suggest? Or
improvements of the ones just described?

I will appreciate even the briefest comments and suggestions,

Maciej Kurant

Hello;

Dear All,

Thank you very much for the numerous and quick replies to my email. I must say that the NANOG list is really highly responsive.

I needed some time to digest your comments and try some new ideas. I am now sharing the preliminary results with you and would welcome further comments.

The problem was (and still is) to find a good heuristic to distinguish between commercial (COM) and educational/research/academic (EDU) ASes.

I would suggest you need to think a little about what exactly you want

- a list of _all_ academic ASNs? (that will be tough, and you will have to deal with corner cases, and you will not fully automate it)
- a list of _some_ academic ASNs? (you have that now - so are you worried about completeness or size or ... ?)
- a list containing _no_ academic ASNs? (again, this will be tough)
or something else?

Note, too, that these lists will change with time.

*EDU_Abilene*

My first approach (see my original email) was to extract a list of all destinations announced by Abilene. (The assumption is that Abilene generally does not announce commercial prefixes.) This results in a list, call it “EDU_Abilene”, of 1333 ASes.

*EDU_description*

Some of you suggested looking at the names and descriptions of ASes. I used the AS list available at:

http://www.multicasttech.com/status/asn_expand.txt

and searched the last column ("Organization") for the following strings:

"Universit|Univerz|Universida|research|education|science|scientif|academic|college|institut|laborator|school|ecole|edu|R&D|library|academy|Etudes"

This approach finds 1796 "educational" ASes, call this set “EDU_description”.

Of course, these two lists overlap, but less than I expected. In particular:

len(EDU_Abilene)=1333

len(EDU_description)=1796

union(EDU_Abilene, EDU_description)=2269

intersection(EDU_Abilene, EDU_description)=860

For many reasons, these lists are far from precise. For instance, EDU_Abilene contains AS 7132 (AT&T) and AS 8075 (Microsoft). Therefore I need further data sets or a filtering methodology. This raises some questions:

1) What other EDU networks (preferably with BGP tables available on the web) can I take as examples of ASes that (generally) do not announce commercial prefixes? Based on them I could construct lists similar in spirit to EDU_Abilene. I guess the more, the better.

There are lots - look at the ones that Abilene peers with

http://international.internet2.edu/partners/
http://abilene.internet2.edu/peernetworks/international.html

2) Do you know of other lists similar to http://www.multicasttech.com/status/asn_expand.txt ? Maybe a longer description or a website associated with an AS would help the method I use to create EDU_description. Do you think the strings I use in my search are appropriate?

Try
http://bgp.potaroo.net/as1221/asnames.txt

Note that there are errors all over the place here; these lists will not agree perfectly. My lists come from the rwhois data, but I correct for obvious errors (some of which I have sent back to the list maintainers). There are others I am sure that I have not caught, and my corrections are undoubtedly not perfect. I am sure that the other maintainers of such lists could tell similar tales.

You could start polling rwhois yourself, and I would in doubtful cases.

*AS relationships*

Another approach is to exploit the AS relationships. Most of you agree that EDU ASes are usually not providers for COM customers. This suggests a way to detect false positives in EDU_Abilene and EDU_description (or in their union): for every EDU node, count its COM customers, i.e., the EDU provider --- COM customer relationships. I used the AS graphs with inferred relationships provided by CAIDA (http://as-rank.caida.org/data/2006/). This method works well for finding good candidates for false positives, but they should not be blindly accepted. For instance, AS 7132 (AT&T), a member of EDU_Abilene, has the highest number of COM customers (615) and should obviously belong to COM. In contrast, AS 11537 (Abilene), a big component of the EDU backbone, has 66 COM customers! In general, there are about 50 EDU nodes with more than 10 COM customers each.

Not a bad approach.

3) What other “automatic” or “manual” approaches would you suggest? Or improvements of the ones just described?

Again, I don't know what you are trying to do. What I have found useful is what you are doing - make lots of lists, and cross reference, and
see what passes multiple tests.
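
One way to read "make lots of lists and cross reference" is as a simple vote across the candidate lists. The sketch below is only an illustration: the function name, the list names, the vote threshold and the sample AS numbers are hypothetical and are not part of anyone's actual tooling.

```python
from collections import Counter


def cross_reference(candidate_lists, min_votes=2):
    """Return the sorted ASNs that appear in at least `min_votes` lists.

    `candidate_lists` maps a list name to an iterable of AS numbers;
    duplicates inside one list count only once.
    """
    votes = Counter()
    for asns in candidate_lists.values():
        votes.update(set(asns))
    return sorted(asn for asn, n in votes.items() if n >= min_votes)


# Made-up sample data, for illustration only.
lists = {
    "abilene": [64496, 64497, 64499],
    "description": [64496, 64498],
    "few_com_customers": [64496, 64497],
}
print(cross_reference(lists))
print(cross_reference(lists, min_votes=3))
```

Raising min_votes trades completeness for precision, so which threshold makes sense depends on which of the lists discussed earlier (all, some, or no academic ASNs) is the goal.
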

I will appreciate even the briefest comments and suggestions,

Maciej Kurant

Hope this helps.

Regards
Marshall

You might have a look at:

http://www.caida.org/publications/papers/2006/revealingas/revealingas.pdf

The algorithm produces a lot of false negatives for non-English speaking countries that don't use .edu uniformly, but is otherwise an excellent place to start...

TV