My company is in the process of evaluating several mail solutions, scalable
to 150k to 200k mailboxes. One thing we'd like to do is run the message
store over Gig-E on network attached storage. Two of the vendors we've been
looking at claim performance issues running this solution over NFS. Does
anyone know of a carrier-class mail solution that will run well on NAS?
Patrick Hollowell
Sr. Network Engineer
CTC Internet Services
phollowell@vnet.net
800-377-3282 x3527
You might want to check out CriticalPath. We are in the process of
designing a mail system using an EMC Clariion and CriticalPath's software
suite. Sounds like a similar setup.
jas
Take a look at NetApp. My company (unfortunately) signed an NDA with
NetApp, but they've posted that Yahoo uses NetApp for their e-mail:
http://www.netapp.com/partners/catalog.cgi/company/28
Rumor has it (no, I'm not violating my NDA) that Hotmail also uses NetApp.
-Plenty- of other large sites use 'em for e-mail. Just call a NetApp
salesperson and ask for the list. It's impressive. I do believe
it contains some carrier-class implementations.
Mike
You may want to check Stalker Software's Communigate Pro.
www.stalker.com
(I'm not sure what it can do over NFS but it can support 200k
mailboxes on a single server).
--vadim
Ok, this is my beef with NetApp. We have a NetApp F720 with a single disk shelf. The F720 "brain box" has two power supply units that slide into the back on the right and left side, each having its own 48V DC connectors (about the size of LS1010 power supplies).
Now the disk shelf. It has two power supplies that slide into the front and connect into the backplane with no connectors on the front. A fixed connector assembly is located on the back. Just one. One!
No A side. No B side. Just one +/-/GND. We gave NetApp a call and their workaround was "you could use a diode and connect both A & B wires to the unit". Uhh... thanks. They also told us their design engineer had already been slapped on the hand, and they are working on their next version.
It was an interesting gotcha for our server engineer, who wasn't too familiar with DC power plants.
Now, does anyone know of a diode that can do 10A at 48V? Any EE's out there?
Mike Johnson wrote:
> You may want to check Stalker Software's Communigate Pro.
> www.stalker.com
> (I'm not sure what it can do over NFS but it can support 200k
> mailboxes on a single server).
> --vadim
Is the number of mailboxes the key metric? What breaks sendmail + "a
very big disk"? Isn't it the traffic?
Chris
The two biggest problems with very-high-volume servers and sendmail are:
1) You *really* need to use multiple queues and some sort of aging scheme,
so mail backlogged for dead hosts gets out of your main queue. If a queue
gets too full, Sendmail exhibits bad O(N**2) behavior in sorting/running
the queue.
2) If you are serving mailboxes (as opposed to a Listserv-type machine where
the mail *leaves*), what can kill you isn't the sendmail, but the local
delivery program and POP/IMAP checks. You get enough bozo users who have
set Eudora to check for new mail every 2 minutes, you'll get bogged down
no matter HOW fast Sendmail itself is.
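The queue-aging scheme from point 1 can be sketched roughly as follows: periodically sweep messages that have sat in the main queue too long into a "slow" queue, so sorting and running the main queue stays cheap. The paths and the age threshold here are invented for illustration; in a real sendmail setup you'd run a separate, less frequent queue-runner against the slow queue.

```python
# Sketch of a queue-aging sweep: move stale queue files out of the main
# queue into a slow queue. All paths and the threshold are hypothetical.
import os
import shutil
import time

MAIN_QUEUE = "/var/spool/mqueue"        # hypothetical main queue path
SLOW_QUEUE = "/var/spool/mqueue-slow"   # hypothetical aged-mail queue
MAX_AGE = 4 * 3600                      # move anything older than 4 hours

def age_queue(main=MAIN_QUEUE, slow=SLOW_QUEUE, max_age=MAX_AGE):
    """Move queue files older than max_age from main to slow;
    return the list of filenames moved."""
    now = time.time()
    moved = []
    for name in os.listdir(main):
        path = os.path.join(main, name)
        if now - os.path.getmtime(path) > max_age:
            shutil.move(path, os.path.join(slow, name))
            moved.append(name)
    return moved
```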
> Is the number of mailboxes the key metric? What breaks sendmail + "a
> very big disk"? Isn't it the traffic?
Not that traffic isn't important, but you can get a lot more data from
inferences based on the number of mailboxes. You can speculate
on the traffic, for instance. You can also look at your disks.
In my opinion, delivery of messages to mailboxes is easy. Yes, you
have to tune sendmail or switch to qmail or postfix to deal with the
amount of messages, but delivery is easy. Retrieving them is a
bit more difficult and takes some planning. Using mbox format with
a POP daemon that copies the entire spool? Allowing 25MB spools?
Welcome to doubling your disk space and adding more RAM to your
servers. Using maildir and a decent IMAP server? Then you can
use NFS spool storage and multiple IMAP servers with some fun load
balancing.
But, if you look at the total number of mailboxes, you can figure
out how much disk space you'll need, approximate how many messages
per second will be sent and received, and figure out how many
users will be retrieving messages per second.
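The kind of inference described above can be made concrete as a back-of-envelope calculation. Every per-user number here (quota, daily message volume, POP check interval) is an assumption invented for illustration, not a figure from the thread:

```python
# Back-of-envelope capacity estimate from a mailbox count.
# All per-user averages below are assumed values for illustration.
MAILBOXES = 200_000
QUOTA_MB = 25                  # assumed per-mailbox spool quota
MSGS_PER_USER_PER_DAY = 20     # assumed sent+received messages per user
CHECK_INTERVAL_S = 600         # assumed average POP check interval

disk_gb = MAILBOXES * QUOTA_MB / 1024                   # worst-case spool disk
msgs_per_sec = MAILBOXES * MSGS_PER_USER_PER_DAY / 86_400
pop_checks_per_sec = MAILBOXES / CHECK_INTERVAL_S
```

Under these assumptions you'd be budgeting for roughly 5 TB of worst-case spool, a few dozen deliveries per second, and a few hundred POP checks per second, which is why the retrieval side, not delivery, dominates the planning.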
I am wondering if this belongs on a networks list, though...
Chris
Mike
> Is the number of mailboxes the key metric? What breaks sendmail + "a
> very big disk"? Isn't it the traffic?
> The two biggest problems with very-high-volume servers and sendmail are:
> 1) You *really* need to use multiple queues and some sort of aging scheme,
> so mail backlogged for dead hosts gets out of your main queue. If a queue
> gets too full, Sendmail exhibits bad O(N**2) behavior in sorting/running
> the queue.
> 2) If you are serving mailboxes (as opposed to a Listserv-type machine where
> the mail *leaves*), what can kill you isn't the sendmail, but the local
> delivery program and POP/IMAP checks. You get enough bozo users who have
> set Eudora to check for new mail every 2 minutes, you'll get bogged down
> no matter HOW fast Sendmail itself is.
Your second point should in fact be split in two.
1. You're going to have a hard time handling the number of incoming POP
connections, yes. That's true, and there's nothing you can do about it
except scale your server farm accordingly or deny consecutive
connections within a 5- or 10-minute period.
2. The more mailboxes you have, the slower the entire popping process will
be. The reason is very simple: each POP process will spawn and read your
mailbox directory. In the case where you have delivered all your mail to
mailboxes all sitting in the same directory, it will take more and more
time to scan the directory to find your mailbox. One way to fix this
issue would be to use a hashing scheme to split the set of actual
mailboxes into a subdirectory structure. You could get something like
johndoe@yourdomain.com would have his mailbox in
/export/mailboxes/j/o/h/n/johndoe.mbox
so in /export/mailboxes, in order to find the j directory, you only have
about 36 directory entries or so.
Although this example is not good in the case where you accept usernames
with 3 or fewer characters.
It's not hard to right-pad any short usernames before hashing. For instance, the username "bo" might hash as "bo__" and thus would end up in the directory "/export/mailboxes/b/o/_/_/bo.mbox". If you allow non-alphanumerics you'll want to translate those to something innocuous as well, or a name such as "bo.lee" will cause problems.
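The padded scheme above can be sketched in a few lines. The base path, depth, and ".mbox" suffix follow the examples in the thread; mapping every non-alphanumeric character to "_" is one possible choice of "something innocuous", assumed here for illustration:

```python
# Sketch of the padded hashed-mailbox layout described above.
# Non-alphanumerics are translated to "_" (an assumed convention),
# and short names are right-padded before taking the directory levels.
import os

def mailbox_path(username, base="/export/mailboxes", depth=4):
    """Return the hashed on-disk mailbox path for a username."""
    clean = "".join(c if c.isalnum() else "_" for c in username.lower())
    padded = clean.ljust(depth, "_")       # right-pad short usernames
    dirs = list(padded[:depth])            # one directory per character
    return os.path.join(base, *dirs, clean + ".mbox")
```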
Well, hashing like that works well from the standpoint that it's very easy
for the software to find the mailbox. It's going to make things like backups
very costly, though, because of all the recursive directories. Also, you're
going to end up with some directories very imbalanced, since there are more
frequently occurring names.
If you're going to use NFS, you probably want to use something like maildir
format, which is NFS-safe but becomes very costly as the number of messages
increases. A lot of that has to do with the performance of the remote NFS
server - the underlying filesystem's performance in reading large directories
will make a BIG difference as far as that goes. NetApps have excellent
large-directory performance, fwiw.
If you're looking for large scalability AND high performance, my preferred
solution would be to have a relational database as the backend, but don't
store any messages in it - simply pointers to their location on disk. Then
store the messages without regard to intended username in a hashed directory
structure. The pop3 server then gets the list of new messages from the
database server, which could just be a list of filenames. Then, the pop3
server simply has to open the message to return it - it doesn't have to do an
opendir(). Also, if you use the filename as the UIDL returned, there's no
need to even stat() the file, again saving you a whole nfs call. The
obvious downside is that you can't do a:
rm -f /users/j/o/h/n/johndoe.mbx
But, with 200k mailboxes, you should have an automated way to do that anyway.
Thanks,
Matt
> >One way to fix this
> >issue would be to use a hashing scheme to split the amount of actual
> >mailboxes into a subdirectory structure. You could get something like
> >
> >johndoe@yourdomain.com would have his mailbox in
> >
> >/export/mailboxes/j/o/h/n/johndoe.mbox
> >
> >so in /export/mailboxes, in order to find the j directory, you only have
> >about 36 directories entries or so.
> >
> >Although this example is not good in the case where you accept usernames
> >with 3 or less characters.
>
> It's not hard to right-pad any short usernames before hashing. For
> instance, the username "bo" might hash as "bo__" and thus would end up in
> the directory "/export/mailboxes/b/o/_/_/bo.mbox". If you allow
> non-alphanumerics you'll want to translate those to something innocuous as
> well, or a name such as "bo.lee" will cause problems.
> Well, hashing like that works well from the standpoint that it's very easy
> for the software to find the mailbox. It's going to make things like backups
> very costly, though, because of all the recursive directories. Also, you're
> going to end up with some directories very imbalanced, since there are more
> frequently occurring names.
In order to remedy this rather easily, you can always run the username
through a hashing function and use the first 'n' letters of the hash to
figure out which directory the mail(box|dir) is in. That also prevents
problems with non-alphanumeric characters such as "."
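A sketch of that suggestion: hash the username and use the first n hex digits of the digest as the directory, which spreads mailboxes evenly regardless of name popularity. The choice of MD5 and a single hashed directory level are assumptions for illustration:

```python
# Sketch of hash-based mailbox directories: the directory name comes
# from the digest, not the username, so distribution is uniform and
# odd characters like "." never appear in the path components.
import hashlib
import os

def hashed_dir(username, base="/export/mailboxes", n=2):
    """Return a mailbox path under a digest-derived directory."""
    digest = hashlib.md5(username.encode()).hexdigest()
    return os.path.join(base, digest[:n], username + ".mbox")
```

With n=2 you get 256 evenly filled directories; bump n if your filesystem's large-directory performance demands it.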
> If you're going to use NFS, you probably want to use something like maildir
> format. - which is nfs-safe but becomes very costly as the number of messages
> increase. A lot of that has to do with the performance of the remote nfs
> server - the underlying filesystem's performance in reading large directories
> will make a BIG difference as far as that goes. Netapps have excellent
> large-directory performance, fwiw.
> If you're looking for large scalability AND high performance, my preferred
> solution would be to have a relational database as the backend, but don't
> store any messages in it - simply pointers to their location on disk. Then
> store the messages without regard to intended username in a hashed directory
> structure. The pop3 server then gets the list of new messages from the
> database server, which could just be a list of filenames. Then, the pop3
> server simply has to open the message to return it - it doesn't have to do an
> opendir(). Also, if you use the filename as the UIDL returned, there's no
> need to even stat() the file, again saving you a whole nfs call. The
> obvious downside is that you can't do a :
> rm -f /users/j/o/h/n/johndoe.mbx
> But, with 200k mailboxes, you should have an automated way to do that anyway.
It also makes backups a nightmare. In that case, you'll have to shut down
the entire mail system before you can back up, or you'll have a database
image which won't represent the actual data you have on your NAS.
No, no, don't do that. Given the scale of something like this, I'd expect
you'd be running on something like Oracle that supports the concept of "hot
backups". The table spaces are put into a quiesced state, and all writes are
done to memory and to recovery logs. Once the backup is finished, you take
it out of hot backup and it then writes all the pending transactions to the
database files. That way, the database files are stable, and you also back up
the recovery logs to something with real-time access (like another nfs
server). In the event you have a catastrophic database failure, you recover
from tape (or if you have the space, you have a copy of the dbf files
elsewhere), and run all the transaction logs - it takes about 5 minutes per
hour of transactions. Then your database is brought up to the point where it
was when it died. The worst case scenario is that there's a few transactions
that don't get logged, which means that a few emails get dropped. If you had
a stock SMTP server that died, you could be looking at the same situation.
As far as backing up the actual mailboxes, there's no way to get around the
fact that it'll take long enough to finish that stuff will be inaccurate by
the time it's finished. If you ever have to restore the mailboxes from tape
without restoring the database, it'd be wise to have an application that
builds a list of the messages that are on disk that the database doesn't
know about.
Thanks,
Matt
Hah. Unlink the directory, and do a background fsck every few hours? 
The trouble with the above format is that you're ignoring any locality
that exists in the filesystem. For example, in Berkeley FFS, files in
a given directory are allocated in the same cylinder group (or at least
it is attempted..)
Which, under heavy heavy load could actually give a slight performance
boost on a non-filled FFS.
I believe there was a paper covering this locality for web caches.
Ah, yes:
"Reducing the Disk I/O of Web Proxy Server Caches"
- Carlos Maltzahn and Kathy J Richardson
Compaq Computer Corporation, Network Systems Laboratory
- Dirk Grunwald
University of Colorado
.. some (not all) of the concepts included there are relevant here.
Other filesystems will have different allocation/layout policies,
and additions such as "hinting" which can substantially speed up
mail accesses.
But, this is off topic, and I digress. 
Adrian
There's the isp-emailservers list (at isp-emailservers.com,
even), but the clues are few and far between. It'd be nice
if some actual content (as opposed to "please help me I've
never used the Internet before and now I'm an ISP" questions)
could show up there, though.
> No, no, don't do that. Given the scale of something like this, I'd expect
> you'd be running on something like Oracle that supports the concept of "hot
> backups". The table spaces are put into a quiesced state, and all writes are
> done to memory and to recovery logs. Once the backup is finished, you take
> it out of hot backup and it then writes all the pending transactions to the
> database files. That way, the database files are stable, and you also back up
> the recovery logs to something with real-time access (like another nfs
> server). In the event you have a catastrophic database failure, you recover
> from tape (or if you have the space, you have a copy of the dbf files
> elsewhere), and run all the transaction logs - it takes about 5 minutes per
> hour of transactions. Then your database is brought up to the point where it
> was when it died. The worst case scenario is that there's a few transactions
> that don't get logged, which means that a few emails get dropped. If you had
> a stock smtp server that died, you could be looking at the same situation.
.. and then you have to make sure that you periodically garbage collect
your local store, lest you end up with a whole bunch of files which are
unreferenced and just take up space. 
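That garbage collection pass can be sketched as a walk over the message store that deletes anything the database no longer references. Obtaining the referenced set from the real database is assumed here; the caller passes it in:

```python
# Sketch of garbage-collecting a pointer-database message store:
# delete any on-disk file whose filename the database doesn't reference.
# How `referenced` is built (a DB query) is assumed, not shown.
import os

def collect_garbage(store_root, referenced):
    """Remove files under store_root not named in `referenced`;
    return the list of filenames removed."""
    removed = []
    for dirpath, _dirs, files in os.walk(store_root):
        for name in files:
            if name not in referenced:
                os.unlink(os.path.join(dirpath, name))
                removed.append(name)
    return removed
```

In production you'd also want a grace period (skip files younger than a few minutes) so a message being delivered isn't collected before its pointer row commits.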
> As far as backing up the actual mailboxes, there's no way to get around the
> fact that it'll take long enough to finish that stuff will be inaccurate by
> the time its finished. If you ever have to restore the mailboxes from tape
> without restoring the database, it'd be wise to have an application that
> builds a list of the messages that are on disk the database doesn't know
> about.
At least two commercial filesystems support "snapshots" - AdvFS and
WAFL (NetApp). I don't remember if XFS supports snapshots.
Oh, FreeBSD's FFS has got snapshot capabilities, but it's not yet useful
in a "real world" scenario.
Adrian
Earthlink have a little paper here:
http://www.earthlink.com/about/papers/mailarch.html
and I half remember seeing a couple of other papers somewhere else (try
LISA archives under usenix.org).
The system we use is based on a few ideas from this paper. Once you start
splitting between multiple servers it's pretty easy to get something
that'll scale to over a million mailboxes.
What about using clustered servers with a SAN? I think this is also possible.
For example, Legato has a cluster product which can also support SAN.
Is there any security consideration against using NAS, which is based on NFS?
regards,
> What about using clustered servers with SAN, I think this is also possible.
> For example Legato has a cluster product which can also support SAN.
Because SANs become a pain when you want to implement shared storage
(ie, one central mailspool mounted by multiple systems). Certainly,
it's doable, but you have to have software running on all the systems
to deal with concurrent access. I've yet to find a version of
this software that runs on Linux (or any other Open Sourceish OS),
so it's not even a consideration for me.
> Is there any security consideration not to use NAS which is based on NFS ?
Well, you just need to be careful. NFS security is reasonably well
understood.
Mike
Nope, I don't. See page 4 of their white paper:
http://www.missioncriticallinux.com/technology/cluster/kimberlite.pdf
Quote:
Although both systems can access shared disk storage, to ensure data
integrity, only one cluster system can run a service and access service
data at one time. To prevent a service from running on multiple systems
and corrupting data, each cluster system is remotely connected to the
other cluster system's power switch through a serial line. This remote
connection enables each cluster system to completely disable the other
cluster system by cycling its power. Once a cluster system has been
disabled, its services can be safely restarted on the other cluster
system.
They go to great lengths to ensure that no more than one system is accessing
a single piece of data at the same time. Lots of companies do this (maybe
not with this amount of paranoia, but it's still done). The hard part is
to allow both systems to have access to the same data at the same time.
Write operations on the same file are prevented by distributed lock managers.
This is what Veritas Cluster and Sun Cluster do. It seems to be relatively
difficult, otherwise I'm sure we'd already see plenty of options.
I looked into the various options (about six months ago) and arrived
at the conclusion that this was best done via NAS boxes. They handle
locks pretty well, allowing concurrent access to the same file (best
implemented with write locks, allowing read-only access). This
allows the option of placing a bank of otherwise identical SMTP/IMAP
servers behind a load-balancing switch attached to the NAS. The
end users only ever see one hostname, but their request would be
handled by the least loaded box. Need more capacity for more users?
Add more servers. Need more storage space? Max out the capacity of
the NAS, then add more NAS boxes. They're just mount points on the
host.
Mike