Michel Py wrote:
Indeed; notice I did write "Bayesian-like" and not "Bayesian" and
never mentioned anything about good ones or not-as-good ones.
Paul Jakma wrote:
Right, but if we're going to talk about bayesian filtering
in general, there's little sense in constraining the
discussion to "not-as-good" bayesian filters. The not-as-good
filters are obviously doomed to extinction, if they do not
improve and become good ones.
Paul,
I hope you forgive my bluntness, but this is the worst argument you have
ever made in the hundreds of postings I had the privilege to exchange
with you on other mailing lists over the years.
Especially on _this_ mailing list, if you were right, Microsoft would be
extinct.
Michel.
Michel Py wrote:
Paul,
I hope you forgive my bluntness, but this is the worst argument you
have ever made in the hundreds of postings I had the privilege to
exchange with you on other mailing lists over the years.
Bluntness forgiven ;).
How about you put the "obviously doomed to extinction" part down to
subtle humour (though that's a cover-up on my part; really it was due
to severe jetlag). My point still stands though: if we're going to
discuss Bayesian filtering, there is no point deliberately
constraining ourselves to poor implementations of it when considering
weaknesses of / attacks on Bayesian filtering.
Michel Py wrote:
Especially on _this_ mailing list, if you were right, Microsoft
would be extinct.

We can't stop people using technically poor implementations. In the
case of spam filtering, this might well be done ISP-side, not
client-side, and hence the filtering solution might be chosen by more
technically astute minds, rather than by joe-six-pack.
Anyway, read the rest of my post. Text-stuffing is not per se a
problem for Bayesian filtering. So long as an email still contains
phrases which are sufficiently good indicators of either spam or
non-spam, it will be classed as such. Obviously though, in the face
of such attacks, the Bayesian filter will no longer count general
phrases as being signs of non-spam. However, spam mail will _always_
differentiate itself somehow; it must do so to deliver its "spam"
payload, be it URL, image, whatever.
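
To make the text-stuffing point concrete, here is a minimal sketch of a
naive-Bayes-style scorer. It is not any particular filter's
implementation: the function names, the smoothing, and the choice to
combine only the few tokens furthest from neutral are all assumptions
made for illustration. Once the stuffed words have been seen in both
spam and non-spam, they drift towards 0.5 and stop counting, while the
payload token (a hypothetical URL here) keeps its weight.

```python
import math
from collections import Counter


def train(messages):
    """messages: iterable of (text, is_spam). Returns token -> P(spam | token)."""
    spam, ham = Counter(), Counter()
    n_spam = n_ham = 0
    for text, is_spam in messages:
        tokens = set(text.lower().split())  # count each token once per message
        if is_spam:
            spam.update(tokens)
            n_spam += 1
        else:
            ham.update(tokens)
            n_ham += 1
    probs = {}
    for tok in set(spam) | set(ham):
        # Laplace-smoothed fraction of spam / non-spam messages containing tok
        p_s = (spam[tok] + 1) / (n_spam + 2)
        p_h = (ham[tok] + 1) / (n_ham + 2)
        probs[tok] = p_s / (p_s + p_h)
    return probs


def spam_score(text, probs, top_n=5):
    """Combine the top_n tokens furthest from neutral (0.5) into one score."""
    ps = [probs.get(tok, 0.5) for tok in set(text.lower().split())]
    ps.sort(key=lambda p: abs(p - 0.5), reverse=True)
    log_odds = sum(math.log(p) - math.log(1 - p) for p in ps[:top_n])
    return 1 / (1 + math.exp(-log_odds))
```

In this sketch, stuffing a spam with words that also occur in ordinary
mail only adds tokens near 0.5, which contribute nothing to the
combined log-odds.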
Also, one thing I like to do is add X-RBL-Warning: headers and have
the Bayesian filter consider them as part of its analysis. In time,
this will cause the different DNSBLs I use (by means of the header)
to be perfectly weighted according to the statistical probability of
each DNSBL being "correct" in indicating a mail as spam.
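
One way to feed such headers to a token-based filter might look like
the sketch below. It assumes (this is not specified above) that each
X-RBL-Warning: header value starts with the name of the DNSBL zone
followed by a colon; the `rbl:` token prefix and the function name are
likewise made up for illustration. Each listing becomes an ordinary
token, so the filter learns a separate weight per DNSBL from the mail
it is trained on.

```python
import email


def tokens_with_rbl(raw_message):
    """Return the set of body tokens plus one token per X-RBL-Warning header."""
    msg = email.message_from_string(raw_message)
    tokens = set()
    for value in msg.get_all("X-RBL-Warning", []):
        # Assumed header layout: "<dnsbl zone>: <free-form text>"
        zone = value.split(":", 1)[0].strip().lower()
        if zone:
            tokens.add("rbl:" + zone)
    body = msg.get_payload()
    if isinstance(body, str):  # single-part messages only, in this sketch
        tokens.update(body.lower().split())
    return tokens
```

A DNSBL that frequently lists legitimate senders would then end up
with a token probability near neutral, and one that is rarely wrong
would end up as a strong spam indicator.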
regards,