Michel Py wrote:
1. Reduce the efficiency of Bayesian-like filters: The trouble
with this kind of email is that it a) is of sufficient
length, b) contains only "real" words, and c) contains none of
the words regularly used by spammers, such as the v. word.
Paul Jakma wrote:
Good Bayesian filters do not score on single words alone,
they also score on "phrases" (i.e. multiple words). Random
strings of words will result in neutral scores (presuming
those words are also used in non-spam), while the phrases
will be slightly higher. Re-used gibberish (i.e. apparently
random) strings of words will result in "phrases" from
that gibberish having high scores.
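For illustration, the word-plus-phrase scoring described above could be sketched roughly as follows. This is a minimal sketch, not any particular filter's implementation: the names `features`, `PhraseFilter` and `feature_score`, the use of two-word phrases only, and the add-one smoothing are all assumptions made for the example.

```python
from collections import Counter


def features(text):
    """Return single words plus two-word "phrases" (bigrams)."""
    words = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams


class PhraseFilter:
    """Hypothetical toy filter that counts features per message class."""

    def __init__(self):
        self.spam = Counter()   # feature -> number of spam messages containing it
        self.ham = Counter()    # feature -> number of non-spam messages containing it
        self.n_spam = 0
        self.n_ham = 0

    def train(self, text, is_spam):
        feats = set(features(text))  # count each feature once per message
        if is_spam:
            self.spam.update(feats)
            self.n_spam += 1
        else:
            self.ham.update(feats)
            self.n_ham += 1

    def feature_score(self, feat):
        """Smoothed spamminess of one feature; 0.5 is neutral."""
        p_spam = (self.spam[feat] + 1) / (self.n_spam + 2)
        p_ham = (self.ham[feat] + 1) / (self.n_ham + 2)
        return p_spam / (p_spam + p_ham)


if __name__ == "__main__":
    f = PhraseFilter()
    # The same gibberish string re-used across several spams.
    for msg in ["lantern orbit pickle buy now",
                "lantern orbit pickle click here",
                "lantern orbit pickle free offer"]:
        f.train(msg, is_spam=True)
    # The individual words also turn up in ordinary mail.
    for msg in ["lantern on the porch",
                "orbit of the lantern",
                "pickle and lantern jar"]:
        f.train(msg, is_spam=False)
    print(f.feature_score("lantern"))        # single word seen on both sides
    print(f.feature_score("lantern orbit"))  # phrase seen only in spam
```

In this toy setup a word that appears equally often in spam and non-spam scores neutral, while a phrase from the re-used gibberish, which appears only in spam, scores well above neutral.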
Indeed; notice I did write "Bayesian-like" and not "Bayesian" and never
mentioned anything about good ones or not-as-good ones.
Also, a good Bayesian filter should prune its database
regularly of phrases (including one-word phrases) that
have not had their score updated recently, further
reducing "pollution" by random words and phrases.
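The pruning idea can be sketched in the same spirit. This is only an illustration under stated assumptions: the class name `AgingCounts`, the in-memory dict, and the `max_age` cutoff in seconds are hypothetical, and a real filter would choose its own storage and expiry policy.

```python
import time


class AgingCounts:
    """Hypothetical feature database that remembers when each entry was last updated."""

    def __init__(self):
        # feature -> [spam_count, ham_count, last_update_timestamp]
        self.counts = {}

    def update(self, feat, is_spam, now=None):
        now = time.time() if now is None else now
        entry = self.counts.setdefault(feat, [0, 0, now])
        entry[0 if is_spam else 1] += 1
        entry[2] = now  # refresh the timestamp on every update

    def prune(self, max_age, now=None):
        """Drop features not updated within max_age seconds; return how many were dropped."""
        now = time.time() if now is None else now
        stale = [f for f, e in self.counts.items() if now - e[2] > max_age]
        for f in stale:
            del self.counts[f]
        return len(stale)


if __name__ == "__main__":
    db = AgingCounts()
    db.update("xqzt", is_spam=True, now=0)           # random word, seen once
    db.update("cheap pills", is_spam=True, now=0)
    db.update("cheap pills", is_spam=True, now=100)  # keeps recurring
    print(db.prune(max_age=50, now=120), sorted(db.counts))
```

A one-off random word stops being refreshed and eventually ages out, while a phrase that keeps recurring in spam keeps its entry.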
Noise is just noise. The spam-specific stuff will
still be statistically significant, hopefully.
I understand this too. However, I think the point you are missing here
is the difference between "what could be done" and "what people have".
The fact of the matter is that spam messages including a bunch of random
dictionary words have had, and still have, a much higher penetration rate
than messages that don't feature them. The proof is in the pudding. And as
I said earlier, expect the "bunch of dictionary words" to mutate into a
more sophisticated animal that includes correct grammar.
What you and I do or could do (on a small scale) in terms of spam
filtering is largely irrelevant. If spammers were smart they would not
send us (collectively) spam to begin with, as the only thing it achieves
is to get us pissed off and make us implement more filtering.
In the end, the only thing that matters is neither what we could do about
filtering nor how much spam _we_ get, but how many spams
joe-six-pack gets per day. WRT this, although it is true that we have
made tremendous progress in terms of filtering, it is equally true that
the spammers have made tremendous progress in defeating our
counter-measures, resulting in end-users getting unprecedented and still
increasing amounts of spam.
The metric here is _not_ that we successfully filter 90% or
95% or 99.99% of spam; this is meaningless. The meaningful metric is
how many spams joe-six-pack gets a day.
There is no difference between a) joe-six-pack getting 50 spams a day
and us canceling 450 a day and b) joe-six-pack getting 50 spams a day
and us canceling 9950 a day. Actually, there might be one: the spammers
laughing their bottoms off thinking that filtering 9950 spams per day
per user costs us 100 times more than it takes them to send 10000 spams
per user per day.
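To put numbers on the two cases above: in a) 500 spams are sent and 90% are blocked, in b) 10000 are sent and 99.5% are blocked, yet 50 land in the inbox either way. A small sketch of that arithmetic, where the helper name `filter_stats` is made up for the example:

```python
def filter_stats(delivered, blocked):
    """Return (total sent, fraction blocked) for one user on one day."""
    sent = delivered + blocked
    return sent, blocked / sent


if __name__ == "__main__":
    for label, delivered, blocked in [("a", 50, 450), ("b", 50, 9950)]:
        sent, rate = filter_stats(delivered, blocked)
        print(f"{label}) sent={sent} blocked={rate:.1%} delivered={delivered}")
```

The blocked percentage improves from one case to the other while the number the end-user actually sees does not move.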