Performance of mail
Summary
- description of mail operations
- mbox
- MH
- maildir
- POP3
- IMAP4rev1
- NNTP
- additional reading
Introduction
This document describes the performance of access to different mailbox
formats and to remote mailboxes over different kinds of network
protocols. It also takes cache handling into account.
Description of mail operations
read the mailbox
This is needed when the mail user agent displays a list of the
messages. It generally needs more information than the message number
or identifier: it has to fetch the subject, the sender, the date and
some other information.
read a message
This is needed when the mail user agent has to display the message to
the user.
adding a message
This is needed by the delivery agent. The message is received from
the network or created by the user, then added to a mailbox.
delete a message
This is needed by the user to delete a given message from their mailbox.
modify a message
This can be needed when the user wants to remove attachments or add some
headers.
mbox
The historical Unix format.
mbox is described here
(extracted from http://www.qmail.org/man/man5/mbox.html).
read the mailbox
- open the file for reading,
- lock it for reading,
- split the file into several messages,
- parse the headers of each message,
- unlock the file,
- and close it.
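The steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name list_mbox is hypothetical, fcntl.flock stands in for whatever locking scheme the system really uses (dot-locking or fcntl locks are also common), and a "From " line is treated as a separator only at the start of the file or after a blank line.

```python
import fcntl
from email.parser import BytesHeaderParser


def list_mbox(path):
    """Return a list of (offset, headers) pairs, one per message (hypothetical helper)."""
    summaries = []
    with open(path, "rb") as f:
        fcntl.flock(f, fcntl.LOCK_SH)  # assumption: flock-style shared lock for reading
        try:
            starts = []        # byte offset of every "From " separator line
            offset = 0
            prev_blank = True  # separators are at file start or follow a blank line
            for line in f:
                if prev_blank and line.startswith(b"From "):
                    starts.append(offset)
                prev_blank = line in (b"\n", b"\r\n")
                offset += len(line)
            # Each message runs from its separator to the next one (or to end of file).
            for start, end in zip(starts, starts[1:] + [offset]):
                f.seek(start)
                raw = f.read(end - start)
                # Drop the "From " separator line, keep the message itself.
                body = raw.split(b"\n", 1)[1] if b"\n" in raw else b""
                summaries.append((start, BytesHeaderParser().parsebytes(body)))
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
    return summaries
```

The returned offsets are only valid as long as no other application modifies the file.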
read a message
- open the file for reading,
- lock it for reading,
- get the position of the message in the file,
- read the content of the message,
- unlock the file,
- then close it.
adding a message
- open the file for writing,
- lock it for writing,
- append the message at the end of the mailbox.
You have to quote the lines of the message that match ">*From "
(zero or more ">" followed by "From ") by prepending a ">" to them.
- unlock the file,
- then close it.
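A minimal Python sketch of the append operation, including the quoting of "From " lines, could look like this. The names quote_from_lines and append_to_mbox, the default sender and the use of fcntl.flock for locking are assumptions made for the example, not part of any mbox specification.

```python
import fcntl
import re
import time

# "From " preceded by zero or more ">" at the start of a line.
_FROM_LINE = re.compile(rb"^(>*From )", re.MULTILINE)


def quote_from_lines(message):
    """Prepend ">" to every line that starts with ">*From "."""
    return _FROM_LINE.sub(rb">\1", message)


def append_to_mbox(path, message, sender="MAILER-DAEMON"):
    """Append one message (bytes) to the mbox file at `path` (hypothetical helper)."""
    separator = "From %s %s\n" % (sender, time.asctime(time.gmtime()))
    body = quote_from_lines(message)
    if not body.endswith(b"\n"):
        body += b"\n"
    with open(path, "ab") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # assumption: flock-style exclusive lock
        try:
            # Separator line, quoted message, then a blank line before the next message.
            f.write(separator.encode("ascii") + body + b"\n")
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

Only the appended bytes are written, so the cost of adding a message does not depend on the size of the mailbox.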
delete a message
- open the file for writing,
- lock it for writing,
- get the position of the message in the file,
- read the content of the file after the message,
- write this content at the position where the deleted message began,
- truncate the file to its new length,
- unlock it,
- close the file.
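The cost of this operation is easier to see in code. The sketch below assumes the start and end offsets of the message are already known; delete_range is a hypothetical name and fcntl.flock is an assumed locking mechanism. It also truncates the file once the tail has been moved, since the file becomes shorter.

```python
import fcntl


def delete_range(path, start, end):
    """Remove bytes [start, end) from the file, i.e. one message of an mbox."""
    with open(path, "r+b") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # assumption: flock-style exclusive lock
        try:
            f.seek(end)
            tail = f.read()   # everything after the deleted message
            f.seek(start)
            f.write(tail)     # slide the tail over the deleted message
            f.truncate()      # cut the leftover bytes at the current position
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

Deleting the first message of a large mailbox therefore rewrites almost the whole file.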
modify a message
- open the file for writing,
- lock it for writing,
- get the position of the message in the file,
- read the content of the remaining messages after the given message,
- write the new content of the message at the found position,
- append the content of the remaining messages,
- unlock the file,
- and close it.
performance
- To recognize the messages easily, you can assign them
a unique identifier.
- If you choose to make this unique identifier persistent, you have
to write it in the headers of the message. That
means that the whole mailbox has to be rewritten at least once
to insert these unique identifiers. When an external application
modifies the mailbox, we have to check whether new messages have been
added and assign them unique identifiers.
- If you choose not to make them persistent, each time a message
is modified by an external application, you have to
reassign new unique identifiers.
- Applications that make the identifiers persistent generally
don't share them with others: they are generally stored in
application-specific headers.
- To delete several messages, you have to mark them as
deleted and batch the real deletion of the messages.
This is what we can call the "mark as deleted, then expunge" model.
- To get performance when parsing the headers of messages, you can cache
the parsed headers.
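A header cache keyed by the unique identifier can be very small. The following Python sketch is one possible shape for it; the class name HeaderCache, its methods and the parse_count counter are invented for this example, and the cache lives only in memory.

```python
from email.parser import BytesHeaderParser


class HeaderCache:
    """In-memory cache of parsed headers, keyed by the message's unique identifier."""

    def __init__(self):
        self._headers = {}
        self.parse_count = 0  # how many times the parser actually ran

    def get(self, uid, read_raw_headers):
        """Return parsed headers for `uid`; call `read_raw_headers()` only on a miss."""
        if uid not in self._headers:
            self._headers[uid] = BytesHeaderParser().parsebytes(read_raw_headers())
            self.parse_count += 1
        return self._headers[uid]

    def forget(self, uid):
        """Drop an entry, e.g. after the message was modified or expunged."""
        self._headers.pop(uid, None)
```

The callback is only invoked on a cache miss, so a second listing of the same mailbox does not touch the message data again.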
concurrent access
You need to take care of accesses to the mailbox by other
applications. Locks have to be applied and, if you keep the mailbox
open, you have to check for modifications of the file and reread it.
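One common way to check for modifications is to compare the size and modification time of the file with the values seen at the last read. The sketch below assumes that this pair is a good enough signature; MailboxWatcher is a hypothetical name.

```python
import os


class MailboxWatcher:
    """Remember the size and mtime of a mailbox file to detect external changes."""

    def __init__(self, path):
        self.path = path
        self.signature = self._stat()

    def _stat(self):
        st = os.stat(self.path)
        return (st.st_size, st.st_mtime_ns)

    def changed(self):
        """Return True (and remember the new state) if the file was modified."""
        current = self._stat()
        if current != self.signature:
            self.signature = current
            return True
        return False
```

When changed() returns True, the offsets and cached data derived from the previous read should be considered stale.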
advantages
- mbox is the historical, widely compatible format.
- mbox reduces system calls since there is only one file.
For example, for the reading step, you can map the file into memory and
access it through memory.
disadvantages
- Every mail client can parse the file in a different way, since
it is the responsibility of the client to split the file
into messages. The work can be done wrong: mutt sometimes
separates messages incorrectly, and c-client requires the
'From ' line to include the date.
- The whole mailbox has to be locked while reading or writing.
- In the worst case, we have to rewrite the whole mailbox each time we want to
delete a message.
- Neither message positions nor message sizes are persistent, since any mail
client can add its own headers. We cannot learn the real size of a message
other than by reading its entire content.
MH
One file per mail.
MH is described here
(extracted from http://www.fnal.gov/docs/products/mh/mh-mail.html).
read the mailbox
- open the directory,
- get a list of the files of the directory,
This operation depends on the performance of the filesystem
implementation of the system you are using. For example, ext2 (in
Linux 2.4) or UFS (in FreeBSD 4.x) will result in poor
performance on large directories. ReiserFS (in Linux 2.4) will give better
performance.
- parse the headers of each message.
- close the directory.
read the message
- open the file for reading,
- read the content of the file,
- then close it.
adding a message
- open the directory,
- atomically create a file whose name corresponds
to a message number that does not exist yet. Generally,
maximum message number + 1 is chosen. That requires getting
the list of all the files of the directory.
This operation depends on the performance of the filesystem
implementation of the system you are using.
- close the directory.
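The atomic creation can be sketched in Python with the O_CREAT and O_EXCL flags, which make the open fail if the file already exists. The function name mh_add_message is hypothetical, and the sketch writes directly into the final file, so a concurrent reader could briefly see a partially written message.

```python
import os


def mh_add_message(folder, message):
    """Store `message` (bytes) under the next free number in an MH folder; return the number."""
    numbers = [int(name) for name in os.listdir(folder)
               if name.isascii() and name.isdigit()]
    number = max(numbers, default=0) + 1
    while True:
        path = os.path.join(folder, str(number))
        try:
            # O_EXCL makes the creation fail if another process took this number first.
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
        except FileExistsError:
            number += 1  # lost the race: retry with the next number
            continue
        with os.fdopen(fd, "wb") as f:
            f.write(message)
        return number
```

The directory listing is what makes this operation depend on the filesystem: it is needed even though only one file is created.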
delete a message
- delete the file.
modify a message
- open the file for writing,
- write the new content of the file,
- then close it.
performance
- As some MH applications can renumber all the messages, we
cannot trust that a message still corresponds to its
number, because we have no lock on the mailbox. We can write a unique
identifier in the file or use some heuristic such as the file size or
the timestamp.
- Some applications assume that all message numbers are
persistent. That's the case for sylpheed.
- To get performance when parsing the headers of messages, you can cache
the parsed headers.
concurrent access
- Concurrent access can occur when trying to read a message.
You have to close and reopen the message if you notice that it is
not the proper message. The same can occur for modification or
deletion.
- To create a message, we have to allocate a unique message number
that corresponds to a filename. Without a lock on the directory, we
need an atomic file creation, retried with a new number until it
succeeds.
advantages
- Messages are separated by file boundaries.
disadvantages
- We cannot trust message numbers.
- Multiple system calls (one per message) have to be done to list
the messages.
maildir
Theoretically, no locking.
Maildir is described here
(extracted from http://www.qmail.org/man/man5/maildir.html)
and here
(extracted from http://cr.yp.to/proto/maildir.html).
read the mailbox
- open the directory,
- get a list of the files of the cur/ directory,
- get a list of the files of the new/ directory,
These operations will depend on the performance of the implementation of
the filesystem on the system you are using.
- parse the headers of each message.
- close the directory.
read the message
- given the unique identifier of the message, find the
corresponding filename in the new/ or cur/ directory.
You have to get a list of the files in cur/ and new/
and find the one that corresponds to the identifier.
This operation depends on the performance of the filesystem
implementation of the system you are using.
- open the file for reading,
- read the content of the file,
- then close it.
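The lookup step can be sketched as follows. It assumes that the unique identifier is the part of the filename before the ":2,flags" suffix described in the maildir documents; maildir_find is a hypothetical name.

```python
import os


def maildir_find(maildir, uid):
    """Return the path of the message whose unique name is `uid`, or None."""
    for sub in ("cur", "new"):
        directory = os.path.join(maildir, sub)
        for name in os.listdir(directory):
            # The unique part is everything before the ":2,<flags>" suffix.
            if name.split(":", 1)[0] == uid:
                return os.path.join(directory, name)
    return None
```

Because the flags are part of the filename, the full path cannot be computed from the identifier alone, which is why both directories have to be listed.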
adding a message
- generate a unique filename as defined by the maildir specification,
- write the file in tmp/,
- when the write is finished, move the file to new/.
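A minimal Python sketch of this delivery sequence is shown below. The exact layout of the unique name (time, process id, a per-process counter, host name) is an assumption made for illustration and is simpler than what the maildir documents recommend; maildir_deliver is a hypothetical name.

```python
import os
import socket
import time

_counter = 0  # distinguishes several deliveries made by the same process


def maildir_deliver(maildir, message):
    """Write `message` (bytes) into tmp/, then move it into new/. Return the unique name."""
    global _counter
    _counter += 1
    unique = "%d.P%dQ%d.%s" % (time.time(), os.getpid(), _counter, socket.gethostname())
    tmp_path = os.path.join(maildir, "tmp", unique)
    new_path = os.path.join(maildir, "new", unique)
    with open(tmp_path, "xb") as f:  # "x": fail rather than overwrite an existing file
        f.write(message)
        f.flush()
        os.fsync(f.fileno())         # make sure the data is on disk before publishing it
    os.rename(tmp_path, new_path)    # atomic as long as tmp/ and new/ share a filesystem
    return unique
```

Readers only look at new/ and cur/, so they never see a partially written message, and no lock is needed.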
delete a message
- given the unique identifier of the message, find the
corresponding filename in the new/ or cur/ directory.
You have to get a list of the files in cur/ and new/
and find the one that corresponds to the identifier.
This operation depends on the performance of the filesystem
implementation of the system you are using.
- delete the file.
modify a message
- given the unique identifier of the message, find the
corresponding filename in the new/ or cur/ directory.
You have to get a list of the files in cur/ and new/
and find the one that corresponds to the identifier.
This operation depends on the performance of the filesystem
implementation of the system you are using.
- open the file for writing,
- write the new content of the file,
- then close it.
performance
- Filenames of the Maildir message files can change when the
read/new/deleted flags change. That makes it necessary to reread the
directory that contains the messages whenever it changes.
This occurs when there is concurrent access to the mailbox.
- To get performance when parsing the headers of messages, you can cache
the parsed headers.
concurrent access
- The message can be moved or renamed by other mail user agents.
That makes it necessary to reread the whole directory each time it
occurs, to rebuild the mapping from unique identifiers to files.
This occurs when a mail user agent changes the message flags.
- Adding a message does not conflict with concurrent access.
advantages
- Messages are separated by file boundaries.
- No operation needs locks.
disadvantages
- We have to read the directory content frequently.
- Multiple system calls (one per message) have to be done to list
the messages.
POP3
Simple network retrieval.
POP3 is described here
(extracted from www.ietf.org/rfc/rfc1939.txt).
read the mailbox
- The message numbers have to be listed through the LIST
command.
- Then, the TOP command has to be used to get the headers of each
message and show the minimal information about the messages.
One TOP command has to be sent for each message, which results in
high latency. The TOP command is also not available on every server since
it is optional.
- Parse the headers of each message.
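The round trips are visible in a short Python sketch. It assumes that conn behaves like a connected and authenticated poplib.POP3 object from the Python standard library, whose list() and top() methods return a (response, lines, octets) tuple; pop3_list_headers is a hypothetical name.

```python
from email.parser import BytesHeaderParser


def pop3_list_headers(conn):
    """Return {message number: parsed headers} using one TOP command per message.

    `conn` is expected to behave like a connected, authenticated poplib.POP3 object.
    """
    _, listing, _ = conn.list()            # lines such as b"1 1200"
    summaries = {}
    for entry in listing:
        number = int(entry.split()[0])
        _, lines, _ = conn.top(number, 0)  # headers plus 0 lines of the body
        summaries[number] = BytesHeaderParser().parsebytes(b"\r\n".join(lines))
    return summaries
```

For a mailbox of N messages, this sends N + 1 commands, each of which waits for the answer of the server.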
read the message
- We just have to send one command (RETR) to get the entire message.
adding a message
- It can be done only by the delivery agent on the server.
delete a message
- We have to give the number of the message to delete.
modify a message
- It can be done only server-side, but it is not recommended since
most mail user agents use their cache when accessing these messages and
assume that the messages will not change.
performance
- Messages are retrieved entirely even if we don't need the
attachment parts (IMAP lets us retrieve only the parts we are interested
in).
- To get the message list and the basic information, we have to send one
command per message, which results in high latency.
- To avoid retrieving the messages each time you want to access them, you
can store them the first time you retrieve them.
You have to assign them a unique identifier to recognize the
messages. You can get the unique identifiers of the messages with a
UIDL request.
- To get performance when parsing the headers of messages, you can cache
the parsed headers.
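The UIDL-based cache can be sketched like this. The store is a plain dictionary here, which is an assumption for the example (a real client would keep it on disk); conn is expected to behave like a poplib.POP3 object, and pop3_fetch_new is a hypothetical name.

```python
def pop3_fetch_new(conn, store):
    """Download only messages whose UIDL is not yet in `store` (a dict uid -> bytes).

    `conn` is expected to behave like a connected, authenticated poplib.POP3 object.
    Returns the list of unique identifiers that were fetched by this call.
    """
    _, listing, _ = conn.uidl()  # lines such as b"1 whqtswO00WBw418f9t5JxYwZ"
    fetched = []
    for entry in listing:
        number, uid = entry.decode("ascii").split(" ", 1)
        if uid not in store:
            _, lines, _ = conn.retr(int(number))
            store[uid] = b"\r\n".join(lines) + b"\r\n"
            fetched.append(uid)
    return fetched
```

On later sessions, only one UIDL command is needed to learn that nothing has to be downloaded again.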
concurrent access
- The server will give the same view to all concurrent connections.
advantages
- This is a simple protocol.
disadvantages
- We cannot add or modify messages.
- There is only one mailbox.
- The whole message has to be retrieved even on a slow connection.
IMAP4rev1
Everything is server-side.
IMAP4rev1 is described here
(extracted from www.ietf.org/rfc/rfc3501.txt).
read the mailbox
- You can get the message list with the FETCH command by using the ENVELOPE
keyword (that retrieves the basic information of the message such
as the From header, recipients, subject, date and some other fields), the
number of the messages and, optionally, their unique identifier.
All this can be done in one request.
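As an illustration, the single request could be issued from Python as below. The sketch assumes that conn behaves like an imaplib.IMAP4 object with a mailbox already selected; imap_list_envelopes is a hypothetical name, and the response is returned unparsed because decoding the ENVELOPE structure is a task of its own.

```python
def imap_list_envelopes(conn, first=1, last="*"):
    """Ask for UID and ENVELOPE of messages first..last in a single FETCH.

    `conn` is expected to behave like an imaplib.IMAP4 object with a mailbox selected.
    """
    message_set = "%s:%s" % (first, last)
    status, data = conn.fetch(message_set, "(UID ENVELOPE)")
    if status != "OK":
        raise RuntimeError("FETCH failed: %r" % (data,))
    return data  # raw, unparsed response items
```

Passing a narrower range, for example imap_list_envelopes(conn, 1, 20), retrieves only the part of the list that has to be displayed.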
read the message
- You just have to send a command to get the entire message.
- Alternatively, you can request the MIME structure of the message and
fetch, with a second request, only the parts you are interested in.
adding a message
- You have to call the APPEND command to add an entire message.
delete a message
- We have to give the number of the message to delete.
modify a message
- You can delete the original message and add the modified message.
performance
- You can cache the message or MIME parts of the message.
To do so, you have to get the unique identifier of the message.
- The message headers are already parsed by the IMAP server, but you have
to parse the IMAP protocol.
concurrent access
- On some servers, you cannot access the same mailbox with two
connections.
advantages
- Your mailboxes are stored on the server. That means that you
can access your messages from everywhere.
- There are multiple folders.
- You can retrieve only the MIME parts you are interested in.
- You don't even have to retrieve the whole list of messages.
You can retrieve only the part of the list you want to display to the
user.
disadvantages
- The protocol is complex to parse.
- You have to open one connection per folder if you want to access
several folders simultaneously.
NNTP
News retrieval protocol.
NNTP is described here
(extracted from www.ietf.org/rfc/rfc977.txt)
and some recommended extensions here
(extracted from www.ietf.org/rfc/rfc2980.txt).
read the mailbox
- You can get the message list with the XOVER command (that retrieves
the basic information of the message such as the From header, subject,
date and some other fields) and the unique identifier of the message.
All this can be done in one request. But this command is optional
and may not be present on all servers. If it is not present, the GROUP
command gives the range of message numbers, then you have to
send a HEAD request for each message.
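Each line of an XOVER response is a list of tab-separated fields, so the client-side parsing is light. The sketch below assumes the commonly used field order (article number, subject, author, date, message identifier, references, byte count, line count); a server can be configured differently, and the names Overview and parse_xover_line are invented for this example.

```python
from collections import namedtuple

Overview = namedtuple(
    "Overview", "number subject sender date message_id references size lines"
)


def parse_xover_line(line):
    """Split one tab-separated XOVER response line into an Overview tuple."""
    fields = line.rstrip("\r\n").split("\t")
    fields += [""] * (8 - len(fields))  # tolerate missing trailing fields
    number, subject, sender, date, msgid, refs, size, lines = fields[:8]
    return Overview(int(number), subject, sender, date, msgid, refs,
                    int(size or 0), int(lines or 0))
```

Any extra fields after the line count, such as an Xref header, are ignored by this sketch.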
read the message
- You just have to send a command to get the entire message.
adding a message
- You have to call the POST command to add an entire message.
delete a message
- You cannot delete a message.
modify a message
- You cannot modify a message.
performance
- You can cache the messages.
- Some of the message headers are already parsed by the server and can
be requested through the XOVER command.
concurrent access
- You can do anything concurrently.
advantages
- The protocol is simple.
- You can get the list of messages and their basic information in two
requests.
- There are multiple folders.
- You don't even have to retrieve the whole list of messages.
You can retrieve only the part of the list you want to display to the
user.
disadvantages
- This is for news.
- The whole message has to be retrieved even on a slow connection.
additional reading
- mailbox formats, by Mark Crispin.
- IMAP vs POP, by Mark Crispin.
- Jamie Zawinski's website; he was the author of the mail
component of Netscape.
- benchmarks, by Timo Sirainen (author of Dovecot).
- comparison of mbox and maildir, by Sam Varshavchik (author of
Courier-IMAP).
- JavaMail, a well-thought-out mail API for Java.
DINH V. Hoà - Mon Aug 20 23:00:52 2007