In a recent blog entry, tumgreyspf considered harmful, the author documents a pretty bad time that was had with tumgreyspf. I'd like to make some corrections to that post and provide some tips for folks who wish to use tumgreyspf.
First of all, let me say that we've been using tumgreyspf for well over a year now on 3 mail servers which get high volumes of spam and non-spam messages, and haven't seen any of the problems described in the blog post mentioned above. I will admit that tumgreyspf is not for everyone though. tumgreyspf does also have some significant benefits that I'll go into later.
Up front, let me say that the blog post is absolutely correct that tumgreyspf may have problems on file-systems which do not deal well with large numbers of directory entries, such as ext2. If your system has many thousands of users, or you do not configure the system to reject unknown local recipients before accepting the message, you will absolutely have problems with tumgreyspf.
This is because ext2 and similar file-systems based on the Berkeley Fast File System (FFS) store directory entries as a sequential list with no ordering. This means that in order to do pretty much any operation on a directory, the entire directory may have to be scanned looking for the matching entry. That is not a problem for 10 or 100 entries; a sequential scan of those may actually be faster than more sophisticated data-structures. However, for tens of thousands of entries your system will spend vast amounts of time scanning those megabytes of data for entries.
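To put rough numbers on that, here is a minimal Python sketch that counts the comparisons a sequential scan needs. It is only a model of an unordered directory, not ext2 code, and the function names are made up for illustration.

```python
def linear_lookup(entries, name):
    """Scan an unordered list of names; return (found, comparisons made)."""
    comparisons = 0
    for entry in entries:
        comparisons += 1
        if entry == name:
            return True, comparisons
    return False, comparisons


def average_comparisons(n):
    """Average comparisons needed to find each of n entries once."""
    entries = [f"entry{i}" for i in range(n)]
    total = sum(linear_lookup(entries, name)[1] for name in entries)
    return total / n


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, "entries:", average_comparisons(n), "comparisons on average")
```

The average grows in step with the number of entries (about half of them per lookup), and a lookup for a name that is not there has to touch every entry. A hashed or tree-based directory index avoids that growth.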
This is a problem with anything that uses directory entries on the file-system. You can usually tell when this is happening because the system responds slowly and if you watch the “vmstat” output you will see most of the CPU time being spent in “system time”.
There are a few ways to deal with this. The first is that you have to remember to set up the daily cron job for tumgreyspf so that it ages out old entries. If you don't do this, the data will grow and become a problem. However, a better solution is to make sure you configure Postfix to reject invalid recipients before they reach tumgreyspf. We regularly get dictionary attacks on a few of our domains, as well as a regular spam volume that has reached as high as 20,000 spams per day (for a company of 5 people). Finally, you can also use a file-system which is immune to this problem; our mail server runs XFS on the partition that has the tumgreyspf data on it.
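For a feel of what "aging out" means, here is a rough Python sketch of a cleanup pass over a directory tree. It is not the clean-up script that ships with tumgreyspf; the function name, the 30-day figure, and the path in the usage note are assumptions for illustration.

```python
import os
import time


def remove_stale_entries(root, max_age_days, now=None):
    """Delete files under root not modified for max_age_days; return the count."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = 0
    # Walk bottom-up so directories can be removed once their files are gone.
    for dirpath, _dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_mtime < cutoff:
                    os.unlink(path)
                    removed += 1
            except FileNotFoundError:
                pass  # another process removed the file first
        if dirpath != root:
            try:
                os.rmdir(dirpath)  # only succeeds if the directory is empty
            except OSError:
                pass
    return removed
```

Run daily from cron against a hypothetical data directory, for example `remove_stale_entries("/path/to/greylist-data", 30)`, something like this keeps the number of directory entries bounded by recent traffic rather than by everything ever seen.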
Another claim made by the post above is that tumgreyspf launches a new process for every incoming message. This is true only if you explicitly configure Postfix to do so. tumgreyspf is implemented as an external policy filter for Postfix, and Postfix manages starting enough tumgreyspf processes so that multiple message checks can be done in parallel to handle the incoming e-mail volume. Since most messages spend a relatively short period of their life doing the tumgreyspf lookup, it's rare that there are very many instances running at once.
Postfix does an extremely good job of this, much like the Apache web server does of having spare servers around to handle additional incoming volume. For example, our e-mail server currently has 2 copies of tumgreyspf running, despite having a steady stream of incoming spam attempts.
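The reason one process can serve many messages is the shape of the Postfix policy delegation protocol: each request is a run of `name=value` lines ended by an empty line, and each reply is an `action=...` line followed by an empty line. The Python sketch below shows that loop. It is not tumgreyspf's code, and `read_request`, `serve`, and the `decide` callback are names invented for this example.

```python
def read_request(stream):
    """Read one policy request; return a dict of attributes, or None at EOF."""
    attrs = {}
    while True:
        line = stream.readline()
        if line == "":
            return None  # connection closed before a complete request
        line = line.rstrip("\r\n")
        if line == "":
            return attrs  # an empty line ends the request
        name, _, value = line.partition("=")
        attrs[name] = value


def serve(instream, outstream, decide):
    """Answer requests until EOF; return how many were handled."""
    handled = 0
    while True:
        attrs = read_request(instream)
        if attrs is None:
            return handled
        # The reply is one action line followed by an empty line.
        outstream.write(f"action={decide(attrs)}\n\n")
        outstream.flush()
        handled += 1
```

A real policy server would be called as `serve(sys.stdin, sys.stdout, decide)`, with `decide` returning an access action such as `dunno` or `defer_if_permit` plus a message. The process only exits when Postfix closes the connection, so the start-up cost is paid once for many messages.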
However, in the case where you have excessively large directories using a file-system that can't handle them, each check may take tens or hundreds of seconds to process. In that case, because it takes a long time to do the checks, there would probably be new tumgreyspf instances started for many incoming connections.
Oh, and a correction to the original blog entry. The Unix system load does not indicate CPU utilization. A load of 1 does not indicate 100% CPU usage – if that were the case what would the system load of 6 the blog author quotes mean? What about the time my system had a load of over 100? Was my CPU at 10,000% load? ;-) No, Unix load means the number of processes which are runnable but waiting for resources. This can be CPU, but is just as often disc or network resources. For example “dd if=/dev/hda of=/dev/null” and “wget URL” both would contribute 1 to the system load, while consuming almost no CPU time. Also, “thrashing” where your system is swapping hard because you don't have enough memory also tends to lead to CPU starvation but high system load.
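If you want to look at the number yourself, Python exposes the load averages through `os.getloadavg()` on Unix-like systems. The small sketch below reports the figure as a count of tasks rather than a percentage; the helper names are made up for this example.

```python
import os


def load_summary(load, cpus):
    """Describe a load average as a count of tasks, not a CPU percentage."""
    return f"{load:.2f} tasks running or waiting on average, across {cpus} CPU(s)"


def current_load_summary():
    """Summarize the 1-minute load average of this machine (Unix-like only)."""
    one_minute, _five, _fifteen = os.getloadavg()
    return load_summary(one_minute, os.cpu_count() or 1)


if __name__ == "__main__":
    print(current_load_summary())
```

To see where the time is actually going, compare this with the CPU columns of `vmstat`: a high load next to a mostly idle or I/O-waiting CPU points at disc or network, not at the processor.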
In the case of the failed installation, the system load was so high almost certainly because of the file-system problems I mentioned above, leading to tumgreyspf blocking while the kernel tried to deal with inefficiently stored large directories.
The blog post also chastises me for using the file-system instead of a database. To be honest, I was originally going to design it to use a PostgreSQL database back-end for storing the data. However, Kevin (who wanted to implement greylisting as well) pushed back, saying he wanted something easier to set up, like storing it in the file-system. So, I designed a back-end for Kevin using files, and it's worked so well that I haven't gotten around to writing a PostgreSQL back-end.
In the time since, we have used other greylist programs for various of our clients (mostly those not running Postfix), and have found them to be much less robust than tumgreyspf. The biggest problems come with concurrency and corruption.
Simple databases like SQLite and gdbm are easy to build programs around, but they rely on locking to ensure that only one program is accessing the database at any one time. In other words, if two messages come in at the same time, one of them may have to wait around until the first is done with the database. Using heavier-weight databases like PostgreSQL tends to avoid this problem, at the cost of being harder to set up and maintain. For example, if you don't regularly “vacuum” or “optimize” the table, it will probably also grow significantly over time.
Because it uses the file-system for its back-end storage, tumgreyspf can deal with many concurrent incoming messages reading and writing to the back-end storage, even on multi-processor systems, with little if any contention. The OS kernel takes care of that and is designed to perform very well while doing so.
On the corruption issue, we have seen database-based solutions that have exactly the same problem that the blog author had with tumgreyspf: rejecting incoming messages. We have had clients experience corruption of that database on numerous occasions. On a hard crash of the system, file-systems seem to come up in a better state than databases, and when they don't come up you tend to notice immediately, because the system won't boot until you deal with it.
In short, I have the following recommendations for tumgreyspf users:
While the original blog entry author was pretty abrasive about tumgreyspf, I do appreciate the feedback. I'll be releasing a new version of tumgreyspf shortly that includes a bit more documentation discussing some of these matters. For one, I know that I didn't mention the problems related to running under the ext2 file-system. That should be included in version 1.11 of tumgreyspf.