[My first response was direct to Ross. This has been paraphrased slightly
to make it useful (hopefully) to NANOG...]
Please don't get me wrong, I applaud your efforts, because you're right -
email is huge, and most customers don't have a reasonable idea of what
service to expect in terms of mail delivery between providers.
("What do you mean it's not there yet! I sent it 10 minutes ago!")
My point is that posting on NANOG saying 'give me an account please'
for the purpose of keeping *your* customers happy strikes me as, well,
interesting. You have the resources to monitor your own mail systems by
watching your outbound mail queue. Every daemon I know of has ways of
monitoring the outbound queue, and verifying that you're definitely
offloading mail to advertised MXs. I noticed an example for Sendmail was
quoted on the list a short while ago.
This is stuff you can influence - it's your systems. That's where I'd
expect you to concentrate your efforts.
By extension, it's not unreasonable for this information to be
scripted and monitored via a web interface - nagios? - and made
available to your upper-echelon support staff. Hell, at one of the
ISPs I worked for - as a Tier 1 and 2 support tech - I had shell access
to one of the unix boxes and a command-line script which would tell me
how much mail was in the queue. If this remained low, I could verify
there wasn't a problem. If it spiked, then I escalated a query to the
NOC to find out what the story was.
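For what it's worth, the sort of queue check I mean can be sketched in a
few lines of Python. This is only a rough illustration, not the script I
actually used: it assumes a Sendmail-style spool where each queued
message has one control file whose name starts with "qf", and the
function names, the spike factor and the floor are all made up for the
example.

```python
import os


def count_queued(spool_dir, prefix="qf"):
    """Count queue control files in spool_dir.

    Assumption: one file starting with `prefix` exists per queued
    message, as in a Sendmail-style spool directory.
    """
    with os.scandir(spool_dir) as entries:
        return sum(
            1 for entry in entries
            if entry.is_file() and entry.name.startswith(prefix)
        )


def queue_status(depth, baseline, spike_factor=3.0, floor=100):
    """Return "OK" while the queue is near its usual size, else "ESCALATE".

    The threshold is the larger of `floor` and `baseline * spike_factor`,
    so a quiet server does not alarm on a handful of messages.
    Both defaults are hypothetical and would need tuning.
    """
    threshold = max(floor, baseline * spike_factor)
    return "ESCALATE" if depth > threshold else "OK"
```

A support tech (or a nagios check) would call count_queued() on the
spool directory and feed the result to queue_status() along with
whatever the normal queue depth is for that box.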
At the ISP I work for currently we don't even have that sort of
information. If mail gets delayed we troubleshoot *without* it.
We're an ISP with 500,000 customers and have a team of ~15
technical specialists whose expertise approaches that of a junior NOC
engineer. They successfully deal with all manner of technical queries
and they can call the NOC directly to find out if there's anything odd
going on server-side. They also clearly explain to any Tier 1s (and any
customers) they speak to that email is not a guaranteed service, and is
delayed from time to time, and there's nothing we can do except make
sure that *our systems* are working as well as possible.
Who's to say that your monitoring won't be thrown off by problems at
$third_party? Parsing headers is a good way to identify total delivery
times, but anything beyond your own MXs is outside of your control
anyway, so beyond casual interest I see little value in actually
knowing exactly what's broken at AOL and Gmail, etc. (Isn't this AOL's
and Gmail's problem, not yours?)
Get queue monitoring. Script it to make the details available via the Web
to your senior tech support staff. And remind your support guys that
email is not guaranteed, and that you'll do your level best to keep
things running smoothly, but that once the mail leaves the network it's
outside of your control.
So once you've verified that it has in fact left your network, your job is
(Disclaimer: Comments are mine and mine alone, and do not represent my
employer or any previous employer for that matter.)
Let me see if I can explain your entire email.
Ensuring that email flows freely between our mail complex and other top
mail provider complexes is a support issue, correct. Actually setting up
the system to monitor it and to ensure the support people get the data
they need is operations/engineering.
We like automating a lot of our procedures as our mail complex isn't
staffed 24/7. Right now we have a script that monitors incoming mail
sent from probes across the US. It monitors how long it takes the email
to first hit the IronPorts, then how long it takes to hit the
Brightmail, then how long it takes to hit the MTAs. Our script uses
POP3 to grab the email and parse the headers we send from the probes
(or in this case from the complex to the POP accounts). Yes, I do
realize some are webmail (AOL, MSN, Gmail), but even a lot of the
webmail providers do have POP3 servers.
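To give an idea of the header parsing involved, here is a minimal Python
sketch. It is not our production script: it assumes the per-hop
timestamps can be read from the standard Received: headers (date after
the last semicolon), and the function names hop_times and hop_delays are
invented for the example. Fetching the message over POP3 is left out;
the functions just take the raw message text.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def hop_times(raw_message):
    """Return the Received: timestamps of a message, oldest hop first."""
    msg = message_from_string(raw_message)
    stamps = []
    for header in msg.get_all("Received", []):
        # The timestamp follows the last ';' in a Received: header.
        _, sep, date_part = header.rpartition(";")
        if not sep:
            continue  # no date section, skip this hop
        try:
            stamps.append(parsedate_to_datetime(date_part.strip()))
        except (TypeError, ValueError):
            continue  # unparseable date, skip this hop
    # Each relay prepends its header, so the newest hop comes first.
    stamps.reverse()
    return stamps


def hop_delays(stamps):
    """Seconds spent between each pair of consecutive hops."""
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]
```

The per-hop delays depend on every relay having a sane clock, so a
skewed clock on one box shows up as a negative or inflated delay rather
than a real delivery problem.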
Our intent here is not only to verify that the email got there but
that it got there in a reasonable time (let's face it, email is becoming
a more important part of life/business today).
As far as teaching the support guys to go look at the mail queue,
would you honestly want them to be doing that? We have over 65 mail
machines, and should I trust them with checking them every 10 min? Since
we are not staffed 24/7, what happens if we have all gone home? The
way we have it set up, if the mail never reaches the complex tier-1 gets
a page; 15 minutes later, if the problem still isn't solved, tier-2 gets
a page. I believe automating the system, rather than trusting a staff
member to check it and praying that it doesn't break during the night,
is a much better way of doing it.
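The escalation rule itself is simple enough to sketch in Python. Again,
this is an illustration under assumptions rather than our actual paging
code: the ESCALATION table and the tiers_to_page name are hypothetical,
and actually sending the page is left to whatever pager gateway is in
use.

```python
from datetime import datetime, timedelta

# (delay since the probe went missing, who gets paged at that point)
ESCALATION = [
    (timedelta(minutes=0), "tier-1"),
    (timedelta(minutes=15), "tier-2"),
]


def tiers_to_page(outage_started, now, already_paged=()):
    """Return the tiers that are due a page and have not had one yet."""
    elapsed = now - outage_started
    return [
        tier for delay, tier in ESCALATION
        if elapsed >= delay and tier not in already_paged
    ]
```

Run from cron every minute or so while a probe is overdue, this pages
tier-1 immediately and tier-2 once the problem has lasted 15 minutes,
without relying on anyone being awake to notice.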