Performance of mail



Summary

>> description of mail operations
>> mbox
>> MH
>> maildir
>> POP3
>> IMAP4rev1
>> NNTP
>> additional reading





Introduction

This document describes the performance of access to different mailbox formats and to remote mailboxes over different kinds of network protocols. It also takes cache handling into account.

Description of mail operations

read the mailbox

This is needed when the mail user agent displays a list of the messages. It generally needs more information than the message number or identifier: it has to fetch the subject, the sender, the date and some other information.

read a message

This is needed when the mail user agent has to display the message to the user.

adding a message

This is needed by the delivery agent. The message will be received on the network or created by the user and added to a mailbox.

delete a message

This is needed by the user to delete a given message from their mailbox.

modify a message

This can be needed when the user wants to remove attachments or add some headers.





mbox

The historical Unix format.

mbox is described here (extracted from http://www.qmail.org/man/man5/mbox.html).

read the mailbox

  • open the file for reading,
  • lock it for reading,
  • split the file into several messages,
  • parse the headers of each message,
  • unlock the file,
  • and close it.
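
The splitting step above can be sketched as follows. This is a minimal illustration, not the behaviour of any particular mail client: the function name split_mbox is hypothetical, locking is left out, and a "From " line is only treated as a separator when it follows a blank line or starts the file, which is one common convention among several.

```python
def split_mbox(data: bytes) -> list[bytes]:
    """Split raw mbox content into messages, each starting with its "From " line."""
    messages = []
    current = []
    previous_blank = True  # the start of the file counts as a boundary
    for line in data.splitlines(keepends=True):
        if line.startswith(b"From ") and previous_blank:
            # A separator line: close the message being collected, start a new one.
            if current:
                messages.append(b"".join(current))
            current = [line]
        elif current:
            current.append(line)
        previous_blank = line in (b"\n", b"\r\n")
    if current:
        messages.append(b"".join(current))
    return messages
```

The headers of each returned message still have to be parsed afterwards, which is the costly part on a large mailbox.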

read a message

  • open the file for reading,
  • lock it for reading,
  • get the position of the message in the file,
  • read the content of the message,
  • unlock the file,
  • then close it.

adding a message

  • open the file for writing,
  • lock it for writing,
  • append the message at the end of the mailbox. Lines of the message that begin with zero or more ">" followed by "From " have to be quoted by prepending a ">",
  • unlock the file,
  • then close it.
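
The quoting of "From " lines can be sketched like this. The helper name quote_from_lines is hypothetical, and the sketch assumes the message uses "\n" line endings and the quoting rule of the qmail mbox description, where a ">" is prepended to any line made of zero or more ">" followed by "From ".

```python
import re

# Matches "From ", ">From ", ">>From ", ... at the start of any line.
_FROM_LINE = re.compile(rb"^>*From ", re.MULTILINE)


def quote_from_lines(message: bytes) -> bytes:
    """Prepend ">" to every line that could be mistaken for a separator."""
    return _FROM_LINE.sub(lambda match: b">" + match.group(0), message)
```

Because already quoted lines receive one more ">", a reader can undo the quoting by removing exactly one ">" from each such line.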

delete a message

  • open the file for writing,
  • lock it for writing,
  • get the position of the message in the file,
  • read the content of the file after the message,
  • write this content at the position where the deleted message began, then truncate the file to its new length,
  • unlock it,
  • close the file.
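
As a rough sketch of why deletion is expensive, the function below removes a byte range from a file by moving everything that follows it. The name delete_range and its parameters are hypothetical, locking is omitted, and the tail is read into memory in one piece, which a real implementation would probably avoid for large mailboxes.

```python
def delete_range(path: str, start: int, end: int) -> None:
    """Remove bytes [start, end) from the file by shifting the tail forward."""
    with open(path, "r+b") as mailbox:
        mailbox.seek(end)
        tail = mailbox.read()   # content of the file after the deleted message
        mailbox.seek(start)
        mailbox.write(tail)     # overwrite the deleted message with the tail
        mailbox.truncate()      # cut the file at the current position
```

The cost is proportional to the amount of data after the deleted message, so deleting the first message of a large mailbox rewrites almost the whole file.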

modify a message

  • open the file for writing,
  • lock it for writing,
  • get the position of the message in the file,
  • read the content of remaining messages after the given message,
  • write the new content of the message at the found position,
  • append the content of the remaining messages after it, and truncate the file if it became shorter,
  • unlock the file,
  • and close it.

performance

  • To recognize messages easily, you can assign them a unique identifier.
    • If you choose to make this unique identifier persistent, you have to write it in the headers of the message. That means that the whole mailbox has to be rewritten at least once to insert these unique identifiers. Whenever an external application modifies the mailbox, we have to check whether new messages have been added and assign them unique identifiers.
    • If you choose not to keep them persistent, you have to reassign unique identifiers each time an external application modifies a message.
    • Applications that make the identifiers persistent generally don't share them with other applications: they are generally stored in application-specific headers.
  • To delete several messages, you can mark them as deleted and batch the real deletion of the messages.
  • That's what we can call the "mark as deleted, then expunge" model.
  • To speed up the parsing of message headers, you can cache the parsed headers.

concurrent access

    You need to take care of access to the mailbox by other applications. Locks have to be applied and, if you keep the mailbox open, you have to check for modifications of the file and reread it when it has changed.

advantages

  • mbox is the historical compatible format
  • mbox reduces system calls since there is only one file. For example, for the reading step, you can map the file into memory and access it through memory.

disadvantages

  • Every mail client can parse the file in a different way, since it is the responsibility of the client to split the file into messages, and this work can be done wrong. mutt sometimes splits messages incorrectly. c-client requires the 'From ' line to include the date.
  • The whole mailbox has to be locked while reading or writing.
  • In the worst case, we have to rewrite the whole mailbox each time we want to delete a message.
  • Message positions are not persistent, and neither are message sizes, since any mail client can add its own headers. We cannot determine the real size of a message other than by reading its entire content.





MH

One file per mail.

MH is described here (extracted from http://www.fnal.gov/docs/products/mh/mh-mail.html).

read the mailbox

  • open the directory,
  • get a list of the files of the directory. This operation depends on the performance of the filesystem implementation on the system you are using. For example, ext2 (in Linux 2.4) or UFS (in FreeBSD 4.x) will result in poor performance; ReiserFS (in Linux 2.4) will give better performance.
  • parse the headers of each message,
  • close the directory.

read a message

  • open the file for reading,
  • read the content of the file,
  • then close it.

adding a message

  • open the directory,
  • atomically create a file whose name corresponds to a message number that does not exist yet. Generally, the maximum message number + 1 is chosen, which requires getting the list of all the files of the directory. This operation depends on the performance of the filesystem implementation on the system you are using.
  • close the directory.

delete a message

  • delete the file.

modify a message

  • open the file for writing,
  • write the new content of the file,
  • then close it.

performance

  • As some MH applications can renumber all the messages, and as there is no lock on the mailbox, we cannot trust that a given number still corresponds to the same message. We can write a unique identifier in the file or use some heuristic such as the file size or the timestamp.
  • Some applications assume that all message numbers are persistent. That's the case for Sylpheed.
  • To speed up the parsing of message headers, you can cache the parsed headers.
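
A header cache combined with the size and timestamp heuristic could look like the sketch below. The class HeaderCache and its attributes are hypothetical, the cache lives only in memory, and the choice of From, Subject and Date as the cached fields is an assumption made for the example.

```python
import os
from email.parser import BytesHeaderParser


class HeaderCache:
    """Cache parsed headers per file, revalidated by file size and mtime."""

    def __init__(self):
        self._entries = {}  # path -> ((size, mtime_ns), summary)
        self.misses = 0     # number of times a file really had to be parsed

    def summary(self, path):
        st = os.stat(path)
        stamp = (st.st_size, st.st_mtime_ns)
        cached = self._entries.get(path)
        if cached is not None and cached[0] == stamp:
            return cached[1]  # file looks unchanged: reuse the parsed headers
        self.misses += 1
        with open(path, "rb") as message_file:
            headers = BytesHeaderParser().parse(message_file)
        summary = {name: headers.get(name) for name in ("From", "Subject", "Date")}
        self._entries[path] = (stamp, summary)
        return summary
```

A stat() call per message is still needed to validate the entries, but the message files are not opened and parsed again as long as their size and modification time are unchanged. The heuristic can be fooled by a renumbering that happens to preserve both values.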

concurrent access

  • Concurrent access can interfere when trying to read a message. You have to close and reopen the message if you notice that it is not the expected one. The same can happen for modification or deletion.
  • To create a message, we have to allocate a unique message number that corresponds to a filename. Without a lock on the directory, we need an atomic file creation, retried with a new number until it succeeds.
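
The atomic creation with retry can be sketched as follows, assuming a POSIX-like filesystem where O_CREAT combined with O_EXCL fails if the file already exists. The function name add_message is hypothetical, and files whose names are not plain numbers are simply ignored.

```python
import os


def add_message(folder: str, content: bytes) -> int:
    """Store content in an MH folder and return the message number used."""
    numbers = [int(name) for name in os.listdir(folder)
               if name.isascii() and name.isdigit()]
    number = max(numbers, default=0) + 1
    while True:
        path = os.path.join(folder, str(number))
        try:
            # O_EXCL makes the creation fail if another process took the number.
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
        except FileExistsError:
            number += 1  # number already used: retry with the next one
            continue
        with os.fdopen(fd, "wb") as message_file:
            message_file.write(content)
        return number
```

The directory listing is only a hint for the starting number; correctness comes from the exclusive creation, not from the listing.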

advantages

  • messages are separated by file boundaries.

disadvantages

  • We cannot trust message numbers.
  • Multiple system calls (one per message) have to be done to list the messages.





maildir

Theoretically, no locking.

Maildir is described here (extracted from http://www.qmail.org/man/man5/maildir.html) and here (extracted from http://cr.yp.to/proto/maildir.html).

read the mailbox

  • open the directory,
  • get a list of the files of the cur/ directory,
  • get a list of the files of the new/ directory. These operations depend on the performance of the filesystem implementation on the system you are using.
  • parse the headers of each message,
  • close the directory.

read a message

  • given the unique identifier of the message, find the corresponding filename in the new/ or cur/ directory. You have to list the files in cur/ and new/ and find the one that corresponds to the identifier. This operation depends on the performance of the filesystem implementation on the system you are using.
  • open the file for reading,
  • read the content of the file,
  • then close it.

adding a message

  • generate a unique filename as defined by the maildir specification,
  • write the file in tmp/,
  • when the write is finished, move the file to new/.
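
A delivery along these lines can be sketched as below. The function name deliver is hypothetical, and the filename is a simplified form of the time, process and host scheme suggested in the maildir description; a real delivery agent may add more elements to guarantee uniqueness.

```python
import itertools
import os
import socket
import time

_deliveries = itertools.count()  # distinguishes deliveries made by this process


def deliver(maildir: str, content: bytes) -> str:
    """Write content into tmp/, then move it to new/; return the filename."""
    host = socket.gethostname().replace("/", "\\057").replace(":", "\\072")
    unique = f"{int(time.time())}.P{os.getpid()}Q{next(_deliveries)}.{host}"
    tmp_path = os.path.join(maildir, "tmp", unique)
    new_path = os.path.join(maildir, "new", unique)
    with open(tmp_path, "xb") as message_file:  # "x": fail if the name exists
        message_file.write(content)
        message_file.flush()
        os.fsync(message_file.fileno())
    os.rename(tmp_path, new_path)  # the message appears in new/ in one step
    return unique
```

Readers only look at new/ and cur/, so they never see a partially written message, which is why no lock is needed for delivery.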

delete a message

  • given the unique identifier of the message, find the corresponding filename in the new/ or cur/ directory. You have to list the files in cur/ and new/ and find the one that corresponds to the identifier. This operation depends on the performance of the filesystem implementation on the system you are using.
  • delete the file.

modify a message

  • given the unique identifier of the message, find the corresponding filename in the new/ or cur/ directory. You have to list the files in cur/ and new/ and find the one that corresponds to the identifier. This operation depends on the performance of the filesystem implementation on the system you are using.
  • open the file for writing,
  • write the new content of the file,
  • then close it.

performance

  • The filenames of Maildir messages change when the read/new/deleted flags change. That makes it necessary to reread the directory that contains the messages whenever it has changed, which happens when there is concurrent access to the mailbox.
  • To speed up the parsing of message headers, you can cache the parsed headers.

concurrent access

  • Messages can be moved or renamed by other mail user agents. That makes it necessary to reread the whole directory each time this happens, in order to rebuild the mapping from unique identifiers to files. This happens whenever a mail user agent changes the message flags.
  • Adding a message does not conflict with concurrent access, since each delivery uses its own unique filename.
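
Rebuilding the mapping from unique identifiers to files can be sketched like this. The names split_name and scan are hypothetical, and the sketch assumes the usual ":2," separator between the unique part of the filename and the flags.

```python
import os


def split_name(filename: str) -> tuple[str, str]:
    """Return (unique identifier, flags) for a maildir filename."""
    unique, separator, flags = filename.partition(":2,")
    return unique, (flags if separator else "")


def scan(maildir: str) -> dict:
    """Map each unique identifier to (relative path, flags)."""
    mapping = {}
    for sub in ("new", "cur"):
        for name in os.listdir(os.path.join(maildir, sub)):
            unique, flags = split_name(name)
            mapping[unique] = (os.path.join(sub, name), flags)
    return mapping
```

When opening a file fails because it has been renamed in the meantime, calling scan again gives the new filename for the same identifier.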

advantages

  • messages are separated by file boundaries.
  • no operation needs locks.

disadvantages

  • We have to read the directory content frequently.
  • Multiple system calls (one per message) have to be done to list the messages.





POP3

Simple network retrieval.

POP3 is described here (extracted from www.ietf.org/rfc/rfc1939.txt).

read the mailbox

  • the message numbers have to be listed with the LIST command,
  • then the TOP command has to be used to get the headers of each message and show the minimal information about the messages. One TOP command has to be sent for each message, which results in high latency. The TOP command is also optional, so it is not available on every server.
  • parse the headers of each message.

read a message

  • We just have to send a command to get the entire message.

adding a message

  • it will be done only by the delivery agent on the server.

delete a message

  • We have to give the number of the message to delete.

modify a message

  • this can only be done server-side, but it is not recommended, since most mail user agents rely on their cache when accessing these messages and assume that a message will not change.

performance

  • Messages are retrieved entirely even if we don't need the attachment parts (IMAP lets us retrieve only the parts we are interested in).
  • To get the message list and the basic information, we have to send one command per message, which results in high latency.
  • To avoid retrieving the messages each time you want to access them, you can store them the first time you retrieve them. You have to assign them a unique identifier to recognize them. You can get the unique identifiers of the messages with a UIDL request.
  • To speed up the parsing of message headers, you can cache the parsed headers.
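
The local store keyed by UIDL identifiers can be sketched as below. The function name sync is hypothetical, the store is a plain dictionary for the sake of the example, and conn is assumed to behave like an authenticated poplib.POP3 object from the Python standard library.

```python
def sync(conn, store: dict) -> dict:
    """Download only the messages whose unique identifier is not stored yet."""
    _, listing, _ = conn.uidl()  # one request lists "number uid" for all messages
    for entry in listing:
        number, uid = entry.decode("ascii").split(" ", 1)
        if uid not in store:
            _, lines, _ = conn.retr(int(number))  # one request per new message
            store[uid] = b"\r\n".join(lines)
    return store
```

On later sessions, only the UIDL request and the retrieval of new messages go over the network; everything else is served from the store.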

concurrent access

  • The server will give the same view to all concurrent connections.

advantages

  • This is a simple protocol.

disadvantages

  • We cannot add or modify messages.
  • There is only one mailbox.
  • The whole message has to be retrieved even on a slow connection.





IMAP4rev1

Everything is server-side.

IMAP4rev1 is described here (extracted from www.ietf.org/rfc/rfc3501.txt).

read the mailbox

  • you can get the message list with the FETCH command by using the ENVELOPE keyword (which retrieves the basic information of the message such as the From header, recipients, subject, date and some other fields), along with the numbers of the messages and, optionally, their unique identifiers. All this can be done in one request.
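
With the imaplib module of the Python standard library, such a request could be sketched as follows. The function name fetch_summaries is hypothetical, and conn is assumed to be an imaplib.IMAP4-like object that is already logged in with a mailbox selected. The returned data is the raw server response, which still has to be parsed.

```python
def fetch_summaries(conn, first: int, last: int):
    """Request UID and ENVELOPE for messages first..last in a single FETCH."""
    status, data = conn.fetch(f"{first}:{last}", "(UID ENVELOPE)")
    if status != "OK":
        raise RuntimeError(f"FETCH failed: {status}")
    return data
```

Passing a narrow range such as 1 to 50 retrieves only the part of the list that is about to be displayed.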

read a message

  • You just have to send a command to get the entire message.
  • Alternatively, you can request the MIME structure of the message and fetch with a second request only the parts you are interested in.

adding a message

  • You have to use the APPEND command to add an entire message.

delete a message

  • We have to set the \Deleted flag on the message with the STORE command, giving the number of the message, then send the EXPUNGE command.
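
Deletion in IMAP4rev1 is usually done in two steps: the \Deleted flag is set with STORE, then EXPUNGE removes the flagged messages. A sketch with the imaplib module of the Python standard library follows; the function name delete_messages is hypothetical and conn is assumed to be an imaplib.IMAP4-like object with a mailbox selected.

```python
def delete_messages(conn, message_set: str) -> None:
    """Flag the given messages as deleted, then expunge the mailbox."""
    status, _ = conn.store(message_set, "+FLAGS", "\\Deleted")
    if status != "OK":
        raise RuntimeError(f"STORE failed: {status}")
    # EXPUNGE removes every message flagged \Deleted in the selected mailbox,
    # including messages flagged earlier or by another client.
    status, _ = conn.expunge()
    if status != "OK":
        raise RuntimeError(f"EXPUNGE failed: {status}")
```

Several messages can be flagged with one STORE (for example "3,7:9"), so a batch deletion costs two requests.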

modify a message

  • You can delete the original message and add the modified message.

performance

  • You can cache the messages or the MIME parts of the messages. To do so, you have to get the unique identifiers of the messages.
  • The message headers are already parsed by the IMAP server but you have to parse the IMAP protocol.

concurrent access

  • On some servers, you cannot access the same mailbox with two connections.

advantages

  • Your mailboxes are stored on the server. That means that you can access your messages from anywhere.
  • There are multiple folders.
  • You can retrieve only MIME parts you are interested in.
  • You don't even have to retrieve the whole list of messages. You can retrieve only the part of the list you want to display to the user.

disadvantages

  • The protocol is complex to parse.
  • You have to open one connection per folder if you want to access several folders simultaneously.





NNTP

News retrieval protocol.

NNTP is described here (extracted from www.ietf.org/rfc/rfc977.txt) and some recommended extensions here (extracted from www.ietf.org/rfc/rfc2980.txt).

read the mailbox

  • you can get the message list with the XOVER command (which retrieves the basic information of the message such as the From header, subject, date and some other fields) along with the unique identifier of the message. All this can be done in one request. But this command is optional and may not be present on all servers. If it is not present, the GROUP command gives the range of message numbers, then you have to send a HEAD request for each message.
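
Each line of an XOVER response is a tab-separated record, which makes it cheap to parse. The sketch below assumes the server uses the default field order described in RFC 2980 (article number, subject, author, date, message identifier, references, byte count, line count); the names FIELDS and parse_overview are hypothetical.

```python
FIELDS = ("number", "subject", "from", "date",
          "message_id", "references", "bytes", "lines")


def parse_overview(line: str) -> dict:
    """Turn one tab-separated XOVER line into a dictionary of strings."""
    values = line.rstrip("\r\n").split("\t")
    return dict(zip(FIELDS, values))
```

Servers may append extra fields after the standard ones; this sketch ignores them.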

read a message

  • You just have to send a command to get the entire message.

adding a message

  • You have to use the POST command to add an entire message.

delete a message

  • You cannot delete a message.

modify a message

  • You cannot modify a message.

performance

  • You can cache the messages.
  • Some of the message headers are already parsed by the server and can be requested through the XOVER command.

concurrent access

  • You can do anything concurrently.

advantages

  • The protocol is simple.
  • You can get the list of messages and their basic information in two requests.
  • There are multiple folders.
  • You don't even have to retrieve the whole list of messages. You can retrieve only the part of the list you want to display to the user.

disadvantages

  • This is for news.
  • The whole message has to be retrieved even on a slow connection.





additional reading

>> mailbox formats by Mark Crispin
>> IMAP vs POP by Mark Crispin.
>> Jamie Zawinski's website - he was the author of the mail component of Netscape.
>> benchmarks by Timo Sirainen (author of Dovecot).
>> comparison of mbox and maildir by Sam Varshavchik (author of Courier-IMAP).
>> JavaMail - a well-thought-out mail API for Java.

DINH V. Hoà - Mon Aug 20 23:00:52 2007