Aren’t they in a former church or something? I vaguely remember their location being significant for some reason or another. So location may weigh heavily.
They are, and I’ve got dark fiber in there. We’ve reached out…
The Internet Archive's primary office is located at 300 Funston in San Francisco. The building was a Christian Science church, so it has the Roman columns you would expect of a church / library. You can see it on Google Street View at:
https://www.google.com/maps/place/300+Funston+Ave,+San+Francisco,+CA+94118
Although they serve content out of this site, their primary site for bandwidth is at 2512 Florida Ave, Richmond, CA.
IA does have satellite offices around the world for scanning, etc., but the public-facing servers are located at these two sites.
Tim
What about introducing some cache offloading, the way CDNs do (Google, Facebook, Netflix, Akamai, etc.)?
I think it could be rolled out pretty quickly, with minimal labor, at least for the heavy content.
Maybe some open-source communities could help as well, and the same scheme could then be applied to other non-profits.
But it should be something smoother, like nginx caching, rather than the pile of rsync/ssh scripts many Linux mirrors run.
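Just to illustrate what that kind of pull-through caching amounts to (a real node would use nginx's proxy_cache rather than anything hand-rolled; the origin URL, port, and cache directory below are placeholders, and there is no revalidation or error handling):

import hashlib
import os
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

ORIGIN = "https://web.archive.org"   # placeholder origin
CACHE_DIR = "./ia-cache"             # placeholder on-disk cache location

class PullThroughCache(BaseHTTPRequestHandler):
    def do_GET(self):
        key = hashlib.sha256(self.path.encode()).hexdigest()
        cached = os.path.join(CACHE_DIR, key)
        if not os.path.exists(cached):
            # Cache miss: fetch once from the origin and keep a local copy.
            with urllib.request.urlopen(ORIGIN + self.path) as resp, open(cached, "wb") as out:
                out.write(resp.read())
        with open(cached, "rb") as f:
            body = f.read()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    os.makedirs(CACHE_DIR, exist_ok=True)
    ThreadingHTTPServer(("", 8080), PullThroughCache).serve_forever()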
Surely someone has already thought through the idea of a community CDN?
Perhaps along the lines of pool.ntp.org? What became of that discussion?
Maybe the Tor network could be repurposed to cover the same ground.
Mark.
I believe Tor is not efficient at all for these purposes. Privacy has a very high overhead.
Several schemes exist:
1) The ISP announces, in some way, the subnets it wants to have served from its cache.
1.A) The Apple cache way: a simple HTTP(S) request maps specific IPs to the ISP's cache. Not secure at all.
1.B) BGP + DNS, the most common way. The ISP peers with the CDN, and the CDN returns the IPs of the ISP's cache nodes in DNS responses.
That means, for example, that content.archive.org would carry the local node's A/AAAA records (by the way, where is IPv6 for the Archive?) for
customers of the ISP hosting that node, or for anybody peering with it.
Huge drawback: archive.org would need to provision TLS certificates for web.archive.org on each local node, which is bad and probably a no-go.
Yes, I know schemes exist where the certificate is not actually present on the local node and some "precalculated" result is used instead, but that is too complex.
1.C) BGP + HTTP redirect. If an ISP peers with archive.org, users in all announced subnets get a 302 or similar HTTP redirect.
The next one is almost the same and much better, but it would require small modifications to the content engine or frontend balancers.
1.D) BGP + HTTP rewrite. If the ISP <*same as before*>, the URL is rewritten within the content itself (a sketch follows after this list),
e.g. http://web.archive.org/web/20200511193226/https://git.kernel.org/torvalds/t/linux-5.7-rc5.tar.gz will appear as
http://emu.st.node.archive.org/web/20200511193226/https://git.kernel.org/torvalds/t/linux-5.7-rc5.tar.gz
or
http://archive-org.proxy.emu.st/web/20200511193226/https://git.kernel.org/torvalds/t/linux-5.7-rc5.tar.gz
With the second option, the ISP can handle the TLS certificate itself.
2) BGP announcement of archive.org subnets locally. Prone to leaks, requires TLS certificates, etc.; a no-go.
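For concreteness, a rough sketch of what the 1.D rewrite could look like; the subnet list, the local-node hostname, and the helper name are illustrative only, not anything archive.org actually runs:

import ipaddress
import re

# Subnets the ISP announced over the peering session (illustrative values).
ISP_SUBNETS = [ipaddress.ip_network("203.0.113.0/24"),
               ipaddress.ip_network("2001:db8::/32")]
LOCAL_NODE = "archive-org.proxy.emu.st"   # hostname borrowed from the example above

def rewrite_links(html: str, client_ip: str) -> str:
    """Rewrite web.archive.org links only for clients inside the ISP's announced subnets."""
    addr = ipaddress.ip_address(client_ip)
    if not any(addr in net for net in ISP_SUBNETS):
        return html                       # client is not behind the peering ISP
    return re.sub(r"https?://web\.archive\.org/", "http://%s/" % LOCAL_NODE, html)

# A client in 203.0.113.0/24 gets the kernel tarball link pointed at the local node:
print(rewrite_links(
    '<a href="http://web.archive.org/web/20200511193226/'
    'https://git.kernel.org/torvalds/t/linux-5.7-rc5.tar.gz">linux-5.7-rc5</a>',
    "203.0.113.42"))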
You could still modify some of these schemes and come up with options that no one has implemented yet.
For example, do everything through JavaScript (CDNs cannot afford to, because of the way they work):
the website generates content links dynamically, and for that the client requests some /config.json file
(which is dynamically generated and cached for a while); to IPs that have a local node we hand back the URL of the local node, and to the rest the
default URL.
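A minimal sketch of the server side of that /config.json idea, assuming the same hypothetical subnet-to-node mapping as above (none of these names or numbers are real):

import ipaddress
import json

# ISP subnet -> base URL of that ISP's local cache node (made-up values).
LOCAL_NODES = {
    ipaddress.ip_network("203.0.113.0/24"): "http://archive-org.proxy.emu.st",
}
DEFAULT_BASE = "https://web.archive.org"

def config_json(client_ip: str, max_age: int = 300) -> str:
    """Build the /config.json body; the frontend JavaScript prefixes content links with content_base."""
    addr = ipaddress.ip_address(client_ip)
    base = next((url for net, url in LOCAL_NODES.items() if addr in net), DEFAULT_BASE)
    return json.dumps({"content_base": base, "max_age": max_age})

print(config_json("203.0.113.42"))   # client behind the ISP -> local node URL
print(config_json("198.51.100.7"))   # anyone else -> default archive.org URL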
Yes, Jeff Ubois and I have been discussing it with Brewster.
There was significant effort put into this some eighteen or twenty years ago, backed mostly by the New Zealand government… Called the “Internet Capacity Development Group.” It had a NOC and racks full of servers in a bunch of datacenters, mostly around the Pacific Rim, but in Amsterdam and Frankfurt as well, I think. PCH put quite a lot of effort into supporting it, because it’s a win for ISPs and IXPs to have community caches with local or valuable content that they can peer with. There’s also a much higher hit-rate (and thus efficiency) to caching things the community actually cares about, rather than whatever random thing a startup is paying Akamai or Cloudflare or whatever to push, which may never get viewed at all. It ran well enough for about ten years, but over the long term it was just too complex a project to survive at scale on community support alone. It was trending toward more and more of the hard costs being met by PCH’s donors, and less and less by the donors who were supporting the content publishers, which was the goal.
The newer conversation is centered around using DAFs (donor-advised funds) to support it on behalf of non-profit content like the Archive, Wikipedia, etc., and that conversation seems to be gaining some traction, unfortunately because there is now a smaller number of really wealthy people who need places to shove all their extra money. Not how I’d have liked to get here.
-Bill
I think this is a simple equation.
1) Minimal cost of implementation and technical support.
I think this was the main problem earlier; 10 years ago the level
of software automation available today did not exist.
2) A win for operators.
This used to be trivial, by running squid as a simple cache; now, with HTTPS, that is
no longer possible.
3) The proud badge of being a supporter of non-profit projects and charity work.
(Whether donations can be written off against tax, etc. depends on the laws of your country.)
The thing is that if you're an 800 pound gorilla, you probably have enough
things that would benefit from being cached to make it worthwhile.
I'd expect that the Internet Archive is probably mostly long-tail hits with not
much hot content. Has anybody modeled how much cache space it would take to
significantly improve the bandwidth situation?
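One crude way to start answering that: draw requests from an assumed Zipf-like popularity curve and measure the hit rate of an LRU cache at different sizes. All the numbers below (corpus size, request count, skew) are made up for illustration; the point is only that a flat, long-tailed popularity curve needs a much larger cache for the same offload:

import random
from collections import OrderedDict

def lru_hit_rate(n_objects: int, n_requests: int, cache_slots: int, skew: float) -> float:
    """Hit rate of an LRU cache under Zipf(skew) popularity over n_objects items."""
    weights = [1.0 / (rank ** skew) for rank in range(1, n_objects + 1)]
    cache = OrderedDict()                  # object id -> None, ordered by recency
    hits = 0
    for obj in random.choices(range(n_objects), weights=weights, k=n_requests):
        if obj in cache:
            hits += 1
            cache.move_to_end(obj)         # refresh recency on a hit
        else:
            cache[obj] = None
            if len(cache) > cache_slots:
                cache.popitem(last=False)  # evict the least recently used object
    return hits / n_requests

# Made-up corpus of 1M objects, 200k requests, mildly skewed (long-tail) popularity.
for fraction in (0.001, 0.01, 0.1):
    slots = int(1_000_000 * fraction)
    rate = lru_hit_rate(1_000_000, 200_000, slots, skew=0.8)
    print("cache holding %.1f%% of the corpus -> hit rate ~%.2f" % (fraction * 100, rate))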