www.gigablast.com

Feel free to clue me in on this, please... ;-)

What is www.gigablast.com? And why is it constantly performing "questionable" queries (mostly HTTP) against every IP that I have access to check?

I get a couple of thousand hits (mostly questionable requests for non-existent URLs) from that IP (66.154.103.75). Anyone else seeing/questioning this?

Completewhois shows listings in a few RBLs, but not the more popular ones.

-Jim P.

:-) Let me add something before everyone on NANOG reminds me that gigablast is a search engine..... I know what they do, but what I don't understand is why they are searching my systems for URLs that have never existed there. It's as though they are doing random word searches in hopes of striking it lucky. They are "crawling" for URLs like this (unfortunately most people won't see these because their spam blockers will block all the exclamation points):

/Hj!!lpMall
/BuscaP!!gina
/!!!!!!-!!!!!!
/P!!ginasAbandonadas
/HilfeIndex
/CategoryCategory
/Aktuelle!!nderungen
/EfterladteSider
/SystemPagesInDanishGroup
/!!rvaLapok
/ForSide
/!!!!!!!!!!!!
/!!!!!!-!!!
/StartSeite
/!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
/Hj!!lpTilHenvisninger
/!!!!!!-!!!!!!!!!!!!
/ExplorerCeWiki
/Xslt!!!!!!!!!!!!
/P!!ginaInicial
/SenesteRettelser
/!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
/Pr!!f!!rencesUtilisateur
/WikiHomePage
/HilfeZuParsern
/AiutoModello
/GewenstePaginas
/HilfeZu!!berschriften

-Jim P.

Jim Popovitch wrote:

Google is your friend?
They're a search engine. robots.txt and forget it.
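Something like this in your document root should do it, assuming they
honor the robots exclusion standard at all. Note that the user-agent
name is an assumption on my part -- check the User-agent string in your
access logs to be sure:

    # robots.txt -- "Gigabot" is a guess at their crawler's name;
    # verify it against the User-agent field in your logs.
    User-agent: Gigabot
    Disallow: /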

Malcolm

Jim Popovitch wrote:

That's exactly it... they are doing site indexing... if you like google,
you'll need to like them! =P

I personally wouldn't worry about anything in the logs unless you start
seeing attempts to probe and exploit .cgi and executable files...

-Payam

:-) Let me add something before everyone on NANOG reminds me that
gigablast is a search engine..... I know what they do, but what I don't
understand is why they are searching my systems for URLs that have
never existed there. It's as though they are doing random word searches
in hopes of striking it lucky. They are "crawling" for URLs like this
(unfortunately most people won't see these because their spam blockers
will block all the exclamation points):

[list of random path names snipped]

  This seems to be a very wrong and bad thing to do. Google searches URLs
because a human gives it permission to do so, for example by linking to that
URL. (What purpose does a link have other than to be something to click on?)

  What gigablast seems to be doing, on the other hand, is trying to open
every window in a house in the hopes that it will find one that's open. It
has no invitation or permission to do this, and I would consider such
behavior inappropriate.

  You do not have the right to make requests of other people's computers
without their permission. You can certainly argue implied permission in many
cases -- for example, if Ford registers the domain ford.com, and assigns an
IP address to 'www.ford.com', you can certainly argue that they have invited
the public to access that URL because that's the normal reason people create
such things. However, you have no implied permission to try numerous
combinations of random paths on the end of that in the hopes that you'll
find something Ford did not invite you into.

  DS

That's assuming whoever designed their software actually adheres
to robots.txt. The robots exclusion standard recommends that crawlers
honor it, but there are some who don't; it's operationally "optional".
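For reference, "adhering" amounts to very little on the crawler's side,
which is exactly why it's optional in practice. A compliant bot does
something like the following before each fetch (a sketch using Python's
standard urllib.robotparser; the host and the "Gigabot" name are
placeholders, not verified facts about their software):

    import urllib.robotparser

    # fetch and parse the site's robots.txt once per host
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")   # placeholder host
    rp.read()

    # a well-behaved crawler asks this before every single request;
    # nothing enforces it -- a rude crawler simply skips the check
    if rp.can_fetch("Gigabot", "http://www.example.com/SomePage"):
        print("allowed -> fetch it")
    else:
        print("disallowed -> skip it")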

I can't find a single reference to what standards "GigaBlast"
adheres to, or any technical data about how their engine works.
The way their site is designed, it looks like a total fly-by-night
operation.

If "GigaBlast" is supposedly "indexing" his site, they have to be
basing their GET requests on something (the equivalent of a normal
browser's Referer header; but again, who knows if they pass that
along?). The requests Jim is seeing appear to be garbage, similar
to spam composition, not based on actual references/indexes. I
could be outright wrong here.

Additionally, how does this solve the issue of Jim's bandwidth,
CPU, memory, if not his time, being wasted for HTTP requests which
shouldn't necessarily even be arriving at his boxes (which is what
he's essentially complaining about)? "So filter upstream, or on
the machine itself". Okay, that's a solution, but it doesn't address
incoming traffic (just responses).
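For what it's worth, both questions are answerable from the logs: did
66.154.103.75 ever fetch /robots.txt, and does it send a Referer? A
rough sketch, assuming Apache combined-format logs (the log path is
hypothetical; the IP is the one Jim reported):

    #!/usr/bin/env python
    # Did the crawler ever fetch /robots.txt, and what Referers and
    # User-agents does it send?  Assumes Apache "combined" format.
    import re

    CRAWLER_IP = "66.154.103.75"            # the IP Jim reported
    LOG = "/var/log/apache2/access.log"     # hypothetical path

    # combined: ip ident user [time] "request" status size "referer" "agent"
    line_re = re.compile(
        r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:\S+) (\S+)[^"]*"'
        r' \d+ \S+ "([^"]*)" "([^"]*)"')

    hits = robots = 0
    referers, agents = set(), set()

    with open(LOG) as f:
        for line in f:
            m = line_re.match(line)
            if not m or m.group(1) != CRAWLER_IP:
                continue
            hits += 1
            path, referer, agent = m.group(2), m.group(3), m.group(4)
            if path == "/robots.txt":
                robots += 1
            if referer not in ("", "-"):
                referers.add(referer)
            agents.add(agent)

    print("%d requests, %d for /robots.txt" % (hits, robots))
    print("User-agents seen: %s" % (", ".join(sorted(agents)) or "(none)"))
    print("Distinct non-empty Referers: %d" % len(referers))

If the /robots.txt count is zero and every Referer is empty, that tells
you a lot about how much they care about convention.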

It appears that some of the queries are valid for an older site that existed in the past. That site was a wiki, and some of the Giga hits are for internationalized versions of the default help/support pages. This is fine and acceptable behavior on their part (IMHO). The fact that they are querying something that no longer exists is something I can deal with. The strangeness is that some of their crawling is looking for URLs with multiple exclamation points; those URLs never existed. This may be indicative of a character translation on my system or theirs. BUT, the net net is that I no longer feel a need to be concerned about them.
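In fact, the doubled exclamation points fit a simple theory: the wiki's
page names are UTF-8, every non-ASCII character in them is two bytes,
and something in the chain (on their side or mine, I can't tell) is
replacing each high-bit byte with "!". A quick sketch in Python; the
"!" substitution is a guess, and the page names are MoinMoin-style
defaults chosen to match the logged paths:

    # Suspected mangling: encode a wiki page name as UTF-8 and replace
    # every high-bit byte with "!".  Accented and non-Latin characters
    # are two or more bytes in UTF-8, hence the doubled "!".
    def mangle(name):
        return "".join(chr(b) if b < 0x80 else "!" for b in name.encode("utf-8"))

    print(mangle("HjælpTilHenvisninger"))    # -> Hj!!lpTilHenvisninger
    print(mangle("PréférencesUtilisateur"))  # -> Pr!!f!!rencesUtilisateur
    print(mangle("AktuelleÄnderungen"))      # -> Aktuelle!!nderungen

All three outputs match paths from the list above character for
character, which is about as strong as circumstantial evidence gets.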

Thanks all,

-Jim P.

DS wrote:

>   What gigablast seems to be doing, on the other hand, is trying to open
> every window in a house in the hopes that it will find one that's open.

Just looking at the text strings in the URLs, my off-the-top-of-my-head
guess was that those were URLs it saw in email spam. They looked very
similar to a lot of the ASCII garbage that gets generated by spammers
trying to get through Bayesian filters. It seemed plausible to me (not a
good idea, of course, but the sort of thing that happens) that they might
have been grepping web pages for URLs and had run across an archive of spam.

                                -Bill

Jim Popovitch wrote, in a message of 32 lines:

The strangeness is that some of their crawling is looking for URLs
with multiple exclamation points; those URLs never existed. This may
be indicative of a character translation on my system or theirs.

From my experience (and I talked with people - or at least intelligent
bots - at Gigablast), their HTML parser is seriously broken and it
generates non-existing URLs quite often. For instance, an anchor like
<a href="http://www.example.fr/">Caf&eacute;</a> will make their
crawler ask for "/Cafe".
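I have no inside knowledge of their parser, but here is a sketch of the
kind of bug that would produce that behavior: a naive link extractor
whose fallback, when an href doesn't look the way it expects, treats
the anchor text (accents stripped) as a relative URL. Everything below
is illustrative; only example.fr and the "/Cafe" result come from the
report above:

    # Pure speculation about the bug class, not their actual code: a
    # link extractor whose fallback treats the anchor text as a
    # relative URL whenever the href doesn't parse as it expects.
    import re
    import unicodedata

    def asciify(s):
        # Café -> Cafe: strip accents, drop anything non-ASCII
        return unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode()

    def buggy_links(html):
        out = []
        for m in re.finditer(r'<a\b[^>]*?(?:href="([^"]*)")?[^>]*>(.*?)</a>',
                             html, re.I | re.S):
            href, text = m.group(1), m.group(2)
            if href:
                out.append(href)
            else:
                # the broken fallback: anchor text becomes a URL
                out.append("/" + asciify(text).strip())
        return out

    print(buggy_links('<a href="http://www.example.fr/">Café</a>'))  # the real URL
    print(buggy_links('<a href=http://www.example.fr/>Café</a>'))    # -> ['/Cafe']

An unquoted (but perfectly common) href is enough to defeat the naive
pattern, and out comes a request for a page that never existed.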

I reported the problem months ago but got nothing except a standard
"Thanks for telling us".