OT: question re. the Volume of unwanted email (fwd)

Hi Folks,

Someone on the cybertelecom list raised a question about the real costs of
handling spam (see below) in terms of computer resources, transmission,
etc. This dovetailed a discussion I had recently with several former BBN
colleagues - where someone pointed out that email is not a very high
percentage of total internet traffic, compared to all the multimedia and
video floating around these days.

Since a lot of the arguments about spam hinge on the various costs it
imposes on ISPs, it seems like it would be a good thing to get a handle on
quantitative data.

It occurs to me that a lot of people on this list might have that sort of
quantitative data - so... any comments?

Regards,

Miles Fidelman

Miles Fidelman wrote:

Since a lot of the arguments about spam hinge on the various costs it
imposes on ISPs, it seems like it would be a good thing to get a handle on
quantitative data.

While there is a cost to ISPs regarding spam, the highest cost is still on the recipient: end users who are outraged by their children getting pornography in email, or who have trouble finding their legitimate email due to the sheer volume of spam that fills their inbox. There are cases where emails are so far out of RFC 822 compliance that mail clients lock up or crash when attempting to read the message. Time is expended across the board in handling, blocking, verifying, or deleting spam. In this day and age, time is often more valuable than money, and the assigned value is dependent on the individual. Unfortunately, end users cannot just highlight and hit delete on spam. They must look at almost every email to verify that it is spam and not a business or personal email. The misleading subject lines and forgeries are making this even more necessary.

-Jack

jbates@brightok.net (Jack Bates) writes:

While there is a cost to ISPs regarding spam, the highest cost is still
on the recipient. End users who are outraged by their children getting
pornography in email, or having trouble finding their legitimate emails
due to the sheer volume of spam that fills their inbox.

yes.

lartomatic=# select date(entered),count(*)
             from spam
             where date(entered)>now()-'20 days'::interval
             group by date(entered)
             order by date(entered) desc;
    date | count

Someone on the cybertelecom list raised a question about the real costs
of handling spam (see below) in terms of computer resources,
transmission, etc. This dovetailed a discussion I had recently with
several former BBN colleagues - where someone pointed out that email is
not a very high percentage of total internet traffic, compared to all
the multimedia and video floating around these days.

The major cost items I've seen are increased bandwidth costs (measured
rate), equipment, filtering software/services, and personnel. These costs
vary depending on the size of the organization and the kinds of service
the organization provides (as a dramatic example, the cost burden is
proportionally higher for an email house like pobox than it would be for
yahoo). There are other indirect costs too; lots of organizations have
stopped sharing backup MX services because of problems with asymmetrical
filtering, which can translate into more outages, which can lead to ...

My feeling is that any organization with at least one full-time spam
staffer could probably come up with a minimal cost estimate of $.01 per
message. End-users with measured rate services (eg, cellular) can also
reach similar loads with little effort. But due to the variables and
competitive concerns, you'll probably have to go door-to-door with a
non-disclosure agreement to get people to cough up their exact costs,
assuming they are tracking it.
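
To put the $.01 per message estimate in perspective, here is a rough back-of-the-envelope sketch. The per-message figure comes from the estimate above; the daily volume is purely hypothetical and not a measurement from anyone in this thread.

```python
def annual_spam_cost_cents(messages_per_day, cost_per_message_cents=1, days=365):
    """Return the yearly handling cost in cents for a steady daily spam volume."""
    return messages_per_day * cost_per_message_cents * days


# Hypothetical volume: 100,000 spam messages per day at 1 cent each.
cents = annual_spam_cost_cents(100_000)
print(f"${cents / 100:,.2f} per year")
```

Working in whole cents keeps the arithmetic exact; plug in your own volume to see how quickly a one-cent cost adds up.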

There has been much to-do about spam of late. Figures from Canarie show
that SMTP transmissions account for about .5% of the volume of Internet
traffic. This may be typical of backbone networks, or not. Commercial
networks are jealous of revealing information of this nature.

The backbone utilization isn't going to be relevant unless it is high
enough to affect the price of offering the connection. The mailstore is
where the pressure is at. Companies and users who sink capital and time
into unnecessary maintenance have always been the victims. These costs
also have secondary effects, like permanently delaying rate reductions
(sorry your tuition went up again, but we had to buy another cluster),
which in turn affects other parties, but the bulk of the pressure is
wherever the mailstore is at.

value is dependent on the individual. Unfortunately, end users cannot
just highlight and hit delete on spam. They must look at almost every

Isn't "highlight and hit delete" exactly what has been implemented since
Mozilla 1.3 and works with almost perfect accuracy after you give it a few
dozen messages to build up the "good and bad" database with?

Pete

Petri Helenius wrote:

Isn't "highlight and hit delete" exactly what has been implemented since
Mozilla 1.3 and works with almost perfect accuracy after you give it a few
dozen messages to build up the "good and bad" database with?

Actually, I find that 1.3 and 1.4 still have issues with determining spam. While they are fairly decent, one still has to go through looking for false positives. The other issue is that spammers have been doing a good job of designing emails to fool filters. I'm starting to see more and more spam designed to defeat Bayesian filters. By including "good" words in their emails, they either make good words spammy, so that you get more false positives, or they make their email clean enough that it still lands in your inbox. The worst part is that spam is quickly becoming unreadable, so that legitimate, readable emails are the ones more likely to be filtered.
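
As a rough illustration of the "good word" trick, here is a minimal sketch of how a naive Bayes style filter might combine per-token spam probabilities. The combining rule is the textbook one, not necessarily what Mozilla implements, and every token probability below is made up for the example.

```python
from math import prod


def combined_spam_probability(token_probs):
    """Combine per-token spam probabilities, treating tokens as independent."""
    spam = prod(token_probs)
    ham = prod(1.0 - p for p in token_probs)
    return spam / (spam + ham)


# Hypothetical per-token probabilities (1.0 = only ever seen in spam).
spammy_tokens = [0.99, 0.95, 0.90]
good_tokens = [0.05, 0.10, 0.02, 0.05]  # "good" words padded in by the spammer

plain = combined_spam_probability(spammy_tokens)
padded = combined_spam_probability(spammy_tokens + good_tokens)
print(f"plain spam: {plain:.4f}, padded with good words: {padded:.4f}")
```

With these made-up numbers the padded message scores well below 0.5, which is the effect described above: enough "good" words drag an otherwise obvious spam under the threshold.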

-Jack

On the upside, this means replacing the spam filter with a spell checker
will move us toward 100% accuracy! :)
-Paul

Actually, I find that 1.3 and 1.4 still have issues with determining
spam. While fairly decent, one still has to go through looking for false
positives. The other issue is that spammers have been doing a good job
at designing emails to fool filters. I'm starting to see more and more
spam designed to defeat Bayesian filters. By including "good" words in
their emails, they either make good words spammy so that you get more
FP's or they make their email clean enough that it's still in your
inbox. The worst part of it is that spam is quickly becoming unreadable,
so that legitimate emails that are readable are the emails more likely
filtered.

I hope I never get your "legitimate" email. :) After about 100 messages I practically
stopped visiting the Junk folder, because no false positives
occurred. Just for the sake of this message, I peeked into the folder and scrolled
through the last ~300 messages, and all were spam.

About one in 50 spam messages does not get flagged, and this stream has already gone
through basic checks, such as requiring the sender to have a legitimate domain name.

So I'm a happy camper, and I hope that legislation catches up with spammers
before they figure out a surefire way to defeat Bayesian filters.

Pete

Jack Bates wrote:

Petri Helenius wrote:

Isn't "highlight and hit delete" exactly what has been implemented since
Mozilla 1.3 and works with almost perfect accuracy after you give it a few
dozen messages to build up the "good and bad" database with?

Actually, I find that 1.3 and 1.4 still have issues with determining spam. While they are fairly decent, one still has to go through looking for false positives. The other issue is that spammers have been doing a good job of designing emails to fool filters. I'm starting to see more and more spam designed to defeat Bayesian filters. By including "good" words in their emails, they either make good words spammy, so that you get more false positives, or they make their email clean enough that it still lands in your inbox. The worst part is that spam is quickly becoming unreadable, so that legitimate, readable emails are the ones more likely to be filtered.

I have not found this to be the case. While I don't manage an abuse
mailbox, I do manage a busy mailing list. The mailing list address and
administrative addresses have been picked up by spammers and are
probably now on all those "millions of email addresses" CDs. The
mailing list address and administrative addresses are also both
regularly forged (used to send spam) so I get all the undeliverable
spams mixed in with all the undeliverable actual list email.

Until I started using the Bayesian filters in Mozilla, weeding thru the
spam to find the actual administrative emails that needed my attention
was a very big chore, and my false positive rate utilizing JHD (just hit delete) was
fairly high. Now Mozilla filters for me, and has a much lower false
positive rate.

Note, I fed Mozilla's Bayesian filters two folders, each containing over
1000 emails, one full of spam and one full of legitimate administrative
email, to train it to learn what was and wasn't acceptable email. Hand
sorting until I had these two seed folders took a fair bit of time, but
it was clearly worth it!
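
For anyone curious what "feeding it two folders" amounts to, here is a minimal sketch of training per-token spam probabilities from a spam corpus and a legitimate corpus. It is a generic illustration with add-one smoothing, not Mozilla's actual code; the function names and the tiny sample messages are hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Very crude tokenizer: lowercase, split on whitespace."""
    return text.lower().split()


def train(spam_msgs, ham_msgs):
    """Return {token: estimated probability that a message containing it is spam}."""
    # Count each token at most once per message.
    spam_counts = Counter(t for m in spam_msgs for t in set(tokenize(m)))
    ham_counts = Counter(t for m in ham_msgs for t in set(tokenize(m)))
    n_spam, n_ham = len(spam_msgs), len(ham_msgs)
    probs = {}
    for token in set(spam_counts) | set(ham_counts):
        # Add-one smoothing so unseen tokens never yield exactly 0 or 1.
        p_spam = (spam_counts[token] + 1) / (n_spam + 2)
        p_ham = (ham_counts[token] + 1) / (n_ham + 2)
        probs[token] = p_spam / (p_spam + p_ham)
    return probs


# Stand-ins for the two seed folders (real folders would hold 1000+ messages each).
spam_folder = ["buy cheap pills now", "cheap pills online"]
ham_folder = ["list subscription request", "please unsubscribe me from list"]
probs = train(spam_folder, ham_folder)
print(probs["cheap"], probs["list"])
```

The more messages in each seed folder, the less any single odd message skews the per-token estimates, which is presumably why the large hand-sorted folders paid off.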

The Bayesian filters are the main reason I'm using Mozilla. Eudora does
some things much better than Mozilla, but I can't live without the spam
filters anymore!

jc

It occurs to me that a lot of people on this list might have that sort of
  quantitative data - so... any comments?

  Regards,

  Miles Fidelman

For my little corner:
http://mrtg.snark.net/spam/

It seems >1:1 is the norm these days, at least at my scale.

matto

--mghali@snark.net------------------------------------------<darwin><
   Flowers on the razor wire/I know you're here/We are few/And far
   between/I was thinking about her skin/Love is a many splintered
   thing/Don't be afraid now/Just walk on in. #include <disclaim.h>

Interesting pattern. Kind of looks like "cutting z's." :)

curtis

just me said:

How do you get your mail delivery attempts to occur so linearly? :)

I think something's busted with your mrtg script...

Here's the stats for one of the smtp boxes in our cluster (83% rejection
rate...and it's +/- 1% across the other boxes in the cluster):

Postfix log summaries for Jun 18

Grand Totals

Andy Dills wrote:

How do you get your mail delivery attempts to occur so linearly? :)

I think something's busted with your mrtg script...

Depends on which stats he wants. He's showing the total since midnight in the graph instead of the count since the last run.

-Jack

Yeah, mea culpa :)

Don't know why you have your graphs set up that way, unless you have no
other way of reporting aggregate scores for the day...

http://people.ee.ethz.ch/~oetiker/webtools/mrtg/reference.html

"In the absence of 'gauge' or 'absolute' options, MRTG treats variables as
counters and calculates the difference between the current and the
previous value and divides that by the elapsed time between the last two
readings to get the value to be plotted."

Sounds like you have the 'gauge' option set where you shouldn't...unless that
is exactly how you want the graphs to behave, in which case I'll shut up
and respect your right to run mrtg any way you want. :)
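
To make the difference concrete, here is a simplified sketch of the two behaviors described in the quoted documentation: a counter is plotted as the change since the previous reading divided by elapsed time, while a gauge is plotted as the raw reading. This is an illustration only, not MRTG's code; it ignores counter wrap and resets, and the sample readings are invented.

```python
def counter_rates(readings):
    """Plot values for counter mode: delta between readings / elapsed seconds."""
    rates = []
    for (t0, v0), (t1, v1) in zip(readings, readings[1:]):
        rates.append((v1 - v0) / (t1 - t0))
    return rates


def gauge_values(readings):
    """Plot values for gauge mode: each reading is plotted as-is."""
    return [value for _, value in readings]


# Invented (timestamp_seconds, running_total_since_midnight) readings,
# taken every 5 minutes.
readings = [(0, 0), (300, 30), (600, 90), (900, 90)]

print(counter_rates(readings))  # messages per second between polls
print(gauge_values(readings))   # the running total itself, a rising line
```

Feeding a running total into gauge mode gives the steadily climbing graph that started this sub-thread; counter mode on the same data gives a rate over time.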

Andy

Not a lot to break; here's the script in its entirety:

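#!/usr/local/bin/bash

# First value for MRTG: log lines recording local deliveries
grep -c mailer=local /var/log/maillog
# Second value for MRTG: log lines matching the spam-related patterns
grep -E -c 'uce@ftc|reject|njabl' /var/log/maillog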
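
For comparison, the same two counts can be sketched in Python, which makes it easy to derive a spam-to-legitimate ratio in one pass. The patterns are the ones from the script above; the function name and the sample log lines are made up for illustration and are not real maillog output.

```python
import re

# Same patterns as the two grep calls; everything else here is illustrative.
DELIVERED = "mailer=local"
REJECTED = re.compile(r"uce@ftc|reject|njabl")


def count_mail(lines):
    """Return (delivered, rejected) line counts for an iterable of log lines."""
    delivered = rejected = 0
    for line in lines:
        if DELIVERED in line:
            delivered += 1
        if REJECTED.search(line):
            rejected += 1
    return delivered, rejected


# Made-up log lines, only to exercise the function.
sample = [
    "sendmail[101]: to=<user>, mailer=local, stat=Sent",
    "sendmail[102]: ruleset=check_relay, reject=550 blocked",
    "sendmail[103]: reject=553 see njabl",
    "sendmail[104]: to=<other>, mailer=esmtp, stat=Sent",
]
delivered, rejected = count_mail(sample)
print(f"delivered={delivered} rejected={rejected}")
```

Like `grep -c`, this counts matching lines rather than matches, so a line that hits two patterns is still counted once.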

A lot of mail traffic on my box is mailing lists; perhaps that's why
the graphs look so smooth.

matto

Yeah, mea culpa :)

  Don't know why you have your graphs set up that way, unless you have no
  other way of reporting aggregate scores for the day...

  http://people.ee.ethz.ch/~oetiker/webtools/mrtg/reference.html

  "In the absence of 'gauge' or 'absolute' options, MRTG treats variables as
  counters and calculates the difference between the current and the
  previous value and divides that by the elapsed time between the last two
  readings to get the value to be plotted."

  Sounds like you have the 'gauge' option set where you shouldn't...unless that
  is exactly how you want the graphs to behave, in which case I'll shut up
  and respect your right to run mrtg any way you want. :)

My configuration lets me see daily totals as well as rate vs.
time-of-day pretty easily. Using "absolute", the only thing I'd be
able to see is a running total. I like the ability to compare traffic
between days, as well as see when the bulk of my mail is delivered;
any anomalous traffic is pretty easy to spot.

matto

You might find this useful.

http://zebulon.miester.org/spam/

Justin