Benchmarks, Round 2

Ever since my last round of benchmarks, I have been asked to run new benchmarks. There have been new versions of the operating systems, people have asked me to benchmark Solaris, OS X and Windows, too, and last but not least, there is other stuff to benchmark, too.

The previous benchmark mainly focused on networking, scheduling and memory management, but the best networking in the world does not help you if your file system sucks. I could not run real filesystem benchmarks last time, because I had all OSes installed in parallel on different partitions of the same hard disk. Hard disks have different access speeds depending on where on the hard disk your files are. The benchmark would have been unfair under those conditions, so I didn't do it. This time, I have a wonderful new dual Pentium D 3.2 GHz Dell monster with two gigs of memory to benchmark on. I don't have real enterprise level storage available, so I will be using the internal hard disk (which is a Western Digital WD2500JS disk with 232 GB).

Another point of improvement is to use gigabit ethernet. Last time I only had 100 MBit ethernet. This time I have a Broadcom BCM5751 in the Dell monster, a direct cable link between the Dell box and my notebook, using a Broadcom BCM5705. My notebook runs Linux 2.6.16 all the time.

The previous benchmarks consisted solely of synthetic benchmarks. These benchmark one specific thing, and are thus very good if you want to know whether you should optimize this specific thing you are benchmarking, but people keep complaining about how in the real world, these kinds of benchmarks don't matter, because your bottleneck will be something completely different. I disagree completely with that, but to appease those naysayers, I have prepared a real world benchmark, too. During my work over the last few years, I consulted for a company doing online car sales, and they had some special web servers serving only static images. I captured the directory layout, the file names and sizes, and about half an hour worth of download hits from them, and I wrote a small benchmark tool that can download a list of URLs given in a file from a certain IP.

Detailed benchmark

So my real-world benchmark will be:
  1. On a fresh install of the OS (using the full disc), switch the file system into the fastest and most dangerous mode. This usually means mounting in noatime mode, and on BSD, in async mode.
  2. untar the enormous tar.gz file with millions of images (half normal size, half thumbnails). The tarball is about 21 GB, and it contains about 3 million images. Not all of the files are in the same directory, but there are some directories with a LOT of files in them.
  3. untar another big tar file (300 MB) with more images on top of the old one so we don't just have sequential updating in the mix. This second tar has the file names in a "random" order (I sniffed the URLs requested in a series of live HTTP requests and tarred the files in that order; the real life site I got the data from actually copied the images in that order to the disk, so this is also a playback of a real-life application).
  4. start gatling without logging
  5. replay the requests, 100 in parallel
  6. prime buffer cache by copying 2 GB from /dev/zero to a file
  7. start 32 instances of gatling, without logging
  8. replay the requests, 100 in parallel
  9. delete all the files
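
The "replay the requests, 100 in parallel" steps can be sketched roughly like this in Python. This is not the actual benchmark tool, only a minimal illustration of the idea; the names fetch and replay are made up here, and unlike the real tool it opens a new connection per request instead of using keep-alive.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Talk to the server directly; an empty ProxyHandler ignores any proxy
# that might be configured in the environment.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def fetch(url, timeout=30):
    """Download one URL completely and return the number of body bytes."""
    with _OPENER.open(url, timeout=timeout) as response:
        return len(response.read())


def replay(urls, parallel=100):
    """Fetch all URLs in the list with at most `parallel` requests in flight.

    Returns (requests_per_second, total_body_bytes).
    """
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        total_bytes = sum(pool.map(fetch, urls))
    elapsed = max(time.monotonic() - start, 1e-9)  # avoid division by zero
    return len(urls) / elapsed, total_bytes
```

With a captured URL list in a file (say urls.txt, a hypothetical name), replay(open("urls.txt").read().split()) would return the requests per second for that run.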

Criticism

What if an OS does not turn on DMA by default?

If I notice that, I'll try turning on DMA. Basically, I think if an OS does not turn on DMA by default, it deserves the bad benchmark scores it gets. Let's see if it comes to this.

Why don't you benchmark PHPbb on Apache with a MySQL backend?
Why don't you benchmark J2EE on Tomcat with a Postgres backend?

Those benchmarks involve SO many aspects of the OS at the same time, that the results are pretty much useless. My target audience with these benchmarks is not so much the end user who wants to know what platform to use for his horrible bloatware monster setup. My target audience is the kernel hackers of each of the operating systems, and I want my benchmarks to be razor sharp enough so they can see what they need to optimize right away.

Besides, I hate PHP and Java and SQL databases :-)

Don't you just benchmark the drivers?

Yes.

If you can see a way to do benchmarks like these without benchmarking the drivers (and without me spending weeks rerunning the same benchmark on different hardware configurations), please tell me!

On the other hand, I'm using the most typical off-the-shelf hardware I could get. So if one OS has a bad driver for THIS hardware, that is a really bad sign for that OS. Draw your own conclusions.

What if tar A is more efficient than tar B?

I'm using my own tar implementation to tar and untar here, so we really benchmark the kernel, not the tar.

Why use gatling?

gatling is my own web server, and I basically implemented all the tricks I could find for each OS. gatling does have drawbacks, too. For instance, gatling is single threaded. The Dell monster box has a dual core CPU, so gatling does not fully make use of the machine. Also, gatling uses blocking OS calls to open files (because there is no non-blocking file open syscall on Unix). That means if the file system takes a long time to open files, gatling will block during that time.

To work around that, I hacked gatling so it would fork in the beginning, right after opening the sockets, so there would be more than one gatling process available to answer queries. We shall see how useful that option is.
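
To illustrate the fork-after-listen idea (this is not gatling's code, just a minimal Python sketch under simplifying assumptions): the listening socket is created once, then several worker processes are forked, and each of them calls accept() on the same socket. The names worker_loop and prefork and the canned response are made up for the example, and the workers here use blocking accept, which gatling does not.

```python
import os
import socket

RESPONSE = b"HTTP/1.0 200 OK\r\nContent-Length: 6\r\n\r\nhello\n"


def worker_loop(listener):
    """Accept and answer connections until the process is killed."""
    while True:
        try:
            conn, _addr = listener.accept()
        except OSError:
            continue  # e.g. another worker took the connection first
        try:
            with conn:
                conn.recv(65536)        # read (and ignore) the request
                conn.sendall(RESPONSE)
        except OSError:
            pass  # client went away; keep serving


def prefork(listener, workers=32):
    """Fork `workers` children that all accept on the same socket."""
    pids = []
    for _ in range(workers):
        pid = os.fork()
        if pid == 0:            # child: serve forever, never return
            try:
                worker_loop(listener)
            finally:
                os._exit(0)
        pids.append(pid)        # parent: remember the child
    return pids
```

While one worker is stuck in a slow open(), the kernel can hand new connections to any of the other workers blocked in accept().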

I'm also running fnord in one benchmark for this reason. fnord is a trivial web server that is run from tcpserver, so one fnord process will be running per open http connection.

Why not use a real setup, with SCSI/Fibrechannel/solid state disk?

Because I don't have access to hardware like that. Please contact me if you have such hardware and would like to give it to me for benchmarking!

Why didn't you compile custom kernels for each platform?

Most people use custom distro kernels these days, so that is the fair thing to do for benchmarking. I do have a shell script to increase some limits on a system, on ports, open files, processes, stuff like that. It's called "prep" and it's part of the gatling CVS. If you think I missed some crucial thing to do for optimization, please contact me.

Why didn't you benchmark XYZ?

I didn't have it :-)

Diary

These benchmarks were conducted by Ilja van Sprundel and me (Felix). Most importantly, that means I have a witness this time :-)

NetBSD 3.0 x86

The first OS I installed was NetBSD 3.0 x86. The installer got in my way when it didn't try to newfs the new partition I told it to create, and then it told me that fsck failed. That was a little surprising, but I went back and told it to newfs, and then it was all fine.

When NetBSD booted, I copied libowfat and gatling to the machine and tried to build them, but the build failed. It used to work in previous versions of NetBSD, so I was quite surprised about this. It turns out that the CMSG_* macros use the BSD types (u_char in this case), but sys/types.h only defines them if you have _NETBSD_SOURCE defined. WTF?! Anyway, so libowfat now defines that, and gatling builds on NetBSD 3.

When untarring the tarball, I saw that gzip had 1% CPU, but tar had 60% CPU. Something is wrong with the BSD tar, if you ask me. So I decided to use my own tar (from embutils). Unfortunately, that tar did not compile on NetBSD 3, because NetBSD does not have ftw in their libc. Huh!? There is an open ticket from 2002 in their bug tracking system to add ftw, but it was closed with "we should use someone else's implementation for this", without anyone ever doing it. NetBSD should really implement ftw. I ended up using the implementation from my diet libc. With my tar, the extraction took only 25% CPU in tar. I don't know what is wrong with NetBSD's tar, but someone should look at it.

The first data from NetBSD: download + untar took 3505.33 seconds real time, which is about one hour. Untarring the other tar file on top of it, the 300 MB one (after buffer cache priming), took 28 minutes, 28 seconds, which is about half an hour.
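
To compare the extraction times across systems, it can help to turn them into seconds and an average write rate. A small Python sketch; the 21 GB figure is only the approximate tarball size given above, so the resulting rate is approximate too, and the helper names are made up.

```python
def seconds(hours=0, minutes=0, secs=0.0):
    """Convert an h/m/s timing into seconds."""
    return hours * 3600 + minutes * 60 + secs


def mb_per_second(gigabytes, elapsed_seconds):
    """Average rate in MB/sec for `gigabytes` of data (1 GB = 1024 MB)."""
    return gigabytes * 1024 / elapsed_seconds


# NetBSD 3.0 x86, first tarball: about 21 GB in 3505.33 seconds
netbsd_rate = mb_per_second(21, 3505.33)       # roughly 6.1 MB/sec

# Second tarball: 28 minutes, 28 seconds
second_untar = seconds(minutes=28, secs=28)    # 1708 seconds
```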

We were shocked at how poorly NetBSD performed. Well, I was. Ilja doesn't regard NetBSD very highly, so he was more amused than shocked. But having to wait for an hour to extract 21 gigs of images, that is pretty much unacceptable.

The next surprise came when I wanted to run the benchmark locally, because NetBSD lowered the open files limit to 64!! WTF?!? (Ilja theorized that they lowered it to mitigate fd_set overflows). Anyway, I upped the limit, and started the benchmark.
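
For illustration, raising the per-process open files limit from inside a program looks roughly like this in Python. This is not the "prep" script mentioned earlier, just a sketch of the same kind of adjustment; the function name and the default of 1024 are made up.

```python
import resource


def raise_open_files_limit(wanted=1024):
    """Raise the soft RLIMIT_NOFILE towards `wanted`, capped by the hard limit.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to do
    if hard == resource.RLIM_INFINITY:
        target = wanted
    else:
        target = min(wanted, hard)  # an unprivileged process cannot exceed hard
    if target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

With a soft limit of 64, a benchmark client running 100 connections in parallel runs out of file descriptors before it even gets going.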

NetBSD scores 169 requests per second on the first benchmark.

The second benchmark is running gatling in the forked mode, i.e. creating the server sockets, then forking 32 times, so we have 32 gatling processes, all listening on the same sockets. The idea is that when one gatling is blocking on open(), some other can answer the incoming connections. There is some slight loss because more than one gatling can try accepting the same connection, and NetBSD showed some signs of race conditions there, because it returned interesting error messages to accept(), like ENOENT (no such file or directory) and EADDRINUSE (address already in use). So, NetBSD does have some fine-grained networking after all :-) At least it didn't panic as Ilja had hoped. NetBSD scores 97 requests per second on the second benchmark. I would have expected this one to be faster. Strange.

Deleting the files took 349 seconds.

NetBSD 3.0 AMD64

I tried getting the 64-bit version of NetBSD 3.0 on the disk by telling NetBSD to upgrade the existing installation. That went well until I tried to boot into the upgraded installation, at which point the loader said it couldn't find /boot (?!?) and aborted. So I did a full reinstall of NetBSD 3.0 on the disk. Let's see how this goes.

Uh, not so well. The AMD64 build sets the network card to 100BaseTX mode, and even if I use ifconfig to override it, it will still stay in 100BaseTX mode. Also, the boot messages didn't look so good: for example, it complained about libm.so, and ldconfig gave the error message you get when you try to run a 32-bit NetBSD 3.0 x86 binary on 64-bit NetBSD 3.0 AMD64 (which does not work. No, really!). So I reinstalled NetBSD 3.0 AMD64 from scratch, and the freshly installed NetBSD 3.0 froze solid during the kernel initialization in the boot process. The last line is

boot device: <unknown>
root device:
I gave up on NetBSD AMD64 at this point. Later, after installing a few other operating systems on the box, I came back to NetBSD AMD64, this time staying with the default UFS1, and this time it worked. Go figure.

In the NetBSD AMD64 graphs, you can see quite nicely what I like to call a "pump and dump" throughput, where first the throughput is high, while everything goes into the buffer cache, and then there is a drought period where the OS is busy dumping the buffer cache to disk, and the throughput goes down dramatically :-) This same behavior is typical for Linux, too. First of all, this is a good sign, because it means the buffer cache is really used fully. On the other hand, the potential for improvement is evident, because it would be great if the buffer cache could be used better while the OS is dumping it to disk.

Fascinatingly, the AMD64 install performed much better than the x86 install. It took 35 minutes, 28 seconds. On NetBSD, by the way, since we had a Dell with a USB keyboard presumably, the "Please press any key to reboot" prompt after halt does not actually reboot if you press a key. Anyway, untarring the other tarball took 12 minutes, 19 seconds.

In the HTTP benchmark, NetBSD 3.0 AMD64 scored 41 requests per second. Running the HTTP benchmark again (now everything in the benchmark should be in the buffer cache, so this benchmarks more than the file system) got about 7500 requests per second. Running the second HTTP benchmark (32 instances of gatling) on a cold cache scored 41 requests per second. On a warm cache, I measured 6800 requests per second. The kernel messages indicate that NetBSD found the second CPU, but did not enable it.

These numbers are much more in line with the numbers measured on the other operating systems, so I reinstalled the x86 version of NetBSD. It only detects one CPU, btw. On the HTTP benchmark it scored 41 requests per second on cold cache, 6700 rps on warm cache. Running 32 gatling instances at the same time scored 41 requests per second on cold cache, and 5950 rps on warm cache. This is almost exactly how much the 64-bit version scored, so why are the results so different from the first run of NetBSD 3.0 x86? Two things are different. First, the first benchmark was using UFSv2, the second one used UFSv1. UFSv2 appears to be much slower than UFSv1, and it may even be the reason why the first attempt at installing the 64-bit version failed. Second, I flushed the cache differently between benchmark phases. In the first attempt, I used dd to create a 2 GB file on disk. Maybe that was not sufficient to clear the buffer cache on NetBSD, and that's why the requests per second were higher in the first attempt. In the second attempt I actually rebooted the machines between benchmark phases.
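
The dd step from the first attempt amounts to writing a large file of zeros in the hope of pushing everything else out of the buffer cache. A Python sketch of the same idea (the function name and chunk size are made up; as noted above, whether writing 2 GB actually evicts the cached data depends on the OS):

```python
import os


def write_zeros(path, total_bytes, chunk_size=1 << 20):
    """Write `total_bytes` zero bytes to `path`, like dd from /dev/zero."""
    chunk = bytes(chunk_size)          # one reusable block of zeros
    written = 0
    with open(path, "wb") as out:
        while written < total_bytes:
            n = min(chunk_size, total_bytes - written)
            out.write(chunk[:n])
            written += n
        out.flush()
        os.fsync(out.fileno())         # make sure the data really hits the disk
    return written
```

For the 2 GB used in the benchmark this would be write_zeros("/some/file", 2 * 1024**3), with the path being a placeholder.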

A note about the warm cache values: if all the files are in memory, then the whole benchmark run ends very quickly, around one second typically. The benchmark downloads 10000 URLs from a list of captured requests. That's why it's normal to have a variation of several hundred requests per second. Also, please note that the benchmark is running in keep-alive mode with 10 requests per TCP connection, so it "only" has 1000 actual TCP incoming connections.
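
The arithmetic behind this note, as a small Python sketch. The 0.1 second delay is a made-up figure to show how sensitive such a short run is; the other numbers are from the text.

```python
import math


def connections_needed(requests, requests_per_connection):
    """Number of TCP connections for a keep-alive run."""
    return math.ceil(requests / requests_per_connection)


def run_seconds(requests, rps):
    """How long a run of `requests` takes at `rps` requests per second."""
    return requests / rps


def rps_after_delay(requests, rps, extra_seconds):
    """Requests per second if the same run takes `extra_seconds` longer."""
    return requests / (requests / rps + extra_seconds)


conns = connections_needed(10000, 10)           # 1000 connections
duration = run_seconds(10000, 7500)             # about 1.33 seconds
slower = rps_after_delay(10000, 7500, 0.1)      # about 6977 rps
```

So a hiccup of a tenth of a second anywhere in the run already costs more than 500 requests per second in the final score.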

Downloading only one 11k file over and over again, 50000 fetches all in all, 50 parallel connections, with keep-alive and 10 requests per connection, NetBSD 3.0 gave me 57.3 MB/sec with just one process, and 57.1 MB/sec with 4 processes. With IPv6, I got 53.3 MB/sec with one process, and 57.1 MB/sec with four.

On my home AMD64 box, NetBSD installed quickly and easily, but then failed to boot (said it couldn't open /boot).

FreeBSD 6.1 x86

The second OS was FreeBSD 6.1 x86. Ilja installed it for me, and he chose the developer install, i.e. with gcc, but without X. FreeBSD has directory hashes, so we expected it to perform much better than NetBSD in these tests.

Untarring the tarball took 35 minutes and 34 seconds. Unfortunately, FreeBSD's time utility didn't actually output the time (WTF?!), but I could deduce the time from the traffic log I took.

Untarring the second tarball (the 300 MB one) took 17 minutes and 54 seconds.

In the first benchmark, FreeBSD scored 57 requests per second. That is pretty bad. NetBSD is three times faster!

In the second benchmark, I got the same error messages as with NetBSD, ENOENT and EADDRINUSE. I would have expected EAGAIN if I try to accept when someone else already accepted all the outstanding connections. In the end, FreeBSD had 52 requests per second.

The delete took 7 minutes, 40 seconds.

Interestingly, FreeBSD didn't actually let us switch to async mounts. I did mount -u -o async,noatime /usr, that worked, but mount then still reported the file system to be in soft-updates mode.

FreeBSD 6.1 AMD64

The FreeBSD AMD64 install CD is apparently busted. The install aborted with
Write failure on transfer!  (Wrote -1 bytes of 1425408)
This was on the default layout on a 250 GB disk. WTF?!

So I asked Ilja to install FreeBSD 6.1 again, this time with auto settings. He did, and got the same error. He tried again, this time with the auto partitioning scheme but with more space for /, and this time it worked. So apparently the FreeBSD auto partitioning scheme is broken :-)

When running the benchmark on / by mistake, which does not have soft updates, I got about five times the throughput I got when running it on /usr, which does have soft updates. For some reason I don't quite understand, enabling async mode on the file system does not disable soft updates, and this really kills performance on this FreeBSD benchmark. So, a toast to the brainiacs who always told you soft updates is practically free from a performance point of view.

Downloading only one 11k file over and over again, 50000 fetches all in all, 50 parallel connections, with keep-alive and 10 requests per connection, FreeBSD 6.1 gave me 56.6 MB/sec with just one process, and 60.2 MB/sec with 4 processes. With IPv6, I got 53.9 MB/sec with one process, and 60.3 MB/sec with four.

On my home box, FreeBSD 6.1 AMD64 with soft-updates took 48:41 to extract the files, and 39:20 to extract the second tarball on top of it. The first gatling bench scored 46 requests per second on cold cache, and 6800 requests per second on warm cache. The second gatling bench (32 instances) scored 43 requests per second on cold cache, and 6700 requests per second on warm cache. Deleting the files took 7:35.

Without soft-updates, the same benchmarks took 37:01, 30:12, 46 and 6700 rps for one instance, 43 and 6800 rps for 32 instances, and deleting took 7:47.

OpenBSD 3.9

We were hoping for a kernel panic during the benchmarks, and the OpenBSD 3.9 kernel must have sensed that, because it froze solid during installation. I didn't even get the first prompt after the kernel init. The last line was
rd0: fixed, 3800 blocks
So I tried the AMD64 version. Same thing. So I tried an old OpenBSD 3.7 ISO. Same thing. So I changed to a different hardware for OpenBSD. This will skew the benchmark results, but I don't have a choice, do I?

So, OpenBSD 3.9 finally comes with a gcc 3! Hooray! The fallback box we installed OpenBSD on was another Dell, but a uniprocessor one. It has one Pentium 4 with 3.6 GHz, 2 Gigs of RAM, and a single IDE disk, the same Broadcom ethernet as the other box, and the hard disk is a 150 GB ST3160828AS. This puts OpenBSD at a disadvantage (because there is only one CPU) but since OpenBSD still has very rudimentary SMP support (one big kernel lock), this might well be an advantage after all, since the CPU has 400 more MHz.

Imagine our shock when OpenBSD 3.9 outperformed NetBSD 3.0 and came very close to FreeBSD 6.1 in the first benchmark! OpenBSD 3.9 took 34 minutes and 9 seconds. The second benchmark was much more modest at 40 minutes and 34 seconds, so apparently OpenBSD has some special optimization in place to make tests like our first benchmark run fast. :-)

OpenBSD scored 37 requests per second in the single task gatling benchmark, and 33 requests per second in the 32 task gatling benchmark, and 28 requests per second in the fnord benchmark.

Another fun fact: the OpenBSD ftp client is not 64-bit clean. I downloaded the large tarball with the images with it, and the progress bar and ETA were wrong. Hilariously, when it had downloaded the 21 GB, it said "Read short file." :-)
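
I have not looked at the ftp source, but the classic way to get this kind of symptom is to keep a file size in a 32-bit integer somewhere. A small Python sketch of what that does to a 21 GB size (the exact byte count of the tarball is not given here, so 21 * 2**30 is a stand-in, and the helper names are made up):

```python
def as_uint32(n):
    """Keep only the low 32 bits, like storing n in an unsigned 32-bit int."""
    return n & 0xFFFFFFFF


def as_int32(n):
    """Interpret the low 32 bits of n as a signed 32-bit int."""
    n &= 0xFFFFFFFF
    return n - (1 << 32) if n >= (1 << 31) else n


size = 21 * 2**30             # hypothetical: exactly 21 GB
truncated = as_uint32(size)   # 2**30, i.e. only 1 GB is left
```

A client that trusts the truncated value computes the wrong progress and ETA, and gets confused when the real amount of data does not match what it expected.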

Deleting the files took 4 minutes, 48 seconds. Strangely, serving the 21 GB file from OpenBSD was immensely slow. gatling was at 92% CPU utilization and top showed 92% load as system. Meanwhile, OpenBSD only had 3 MB/sec throughput at this point. Apparently, you can saturate a 3.6 GHz Pentium 4 with two GB RAM running OpenBSD by having someone download a single large file from it, and it won't even saturate a 100 MBit link. Unbelievable.

We are giving up on OpenBSD at this point.

OpenBSD i386 hung during boot on my home dual core AMD64 box, OpenBSD amd64 worked, but the installation from CD was unbelievably slow (about 300 KB/sec) and then the Intel NIC was detected but failed (the watchdog constantly reset the em0 interface, and ping didn't go through).

Linux 2.6.16

reiser4

For ungzipping and untarring the reference dataset on Linux, we chose the reiser4 file system, which is optimized for this use case. Extracting the tarball took 18 minutes and 1 second. Extracting the second tarball over it took 13 minutes, 50 seconds (plus 52 seconds sync). reiser4 scored 142 requests per second for the single threaded gatling (143 requests per second for the 64-bit binary), and 140 requests per second for the 32-instance gatling.

During the unpacking of the second tarball we got this fun message:

reiser4[pdflush(14954)]: commit_current_atom (fs/reiser4/txnmgr.c:1024)[nikita-3176]:
WARNING: Flushing like mad: 16384
I love it when my computer talks dirty to me :-)

The good impression reiser4 made so far was obliterated in the delete benchmark. Deleting all the files took 1 hour, 8 minutes, and 11 seconds.

On my Athlon 64, I got 16 minutes, 5 seconds and 9 minutes, 5 seconds, 57 requests per second for both the single and 32 instance gatling runs.

ext2

I also tested ext2 (I figured journaling might slow things down). I turned on dir_hash on the directory.

Extracting the tarball took 49 minutes, 9 seconds. Extracting the second tarball took 16 minutes, 15 seconds. Linux scored 34 requests per second in the single tasked gatling; interestingly, when I preloaded the directories (not the files) in the buffer cache using find (which took 4 minutes and 20 seconds), and then ran the gatling benchmark, it scored 69 requests per second. Unfortunately, in this benchmark, the find took almost as long as the gatling benchmark. So there is massive gain by preloading the directories in the filesystem, but it does not amortize itself in this benchmark. In the real world, things may look different, though.
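
The find-based preload simply reads every directory once so the directory blocks end up in the buffer cache. A Python sketch of the same walk, plus the amortization arithmetic. The function names are made up, and the 10000 requests are an assumption taken from the URL count mentioned in the warm cache note; the other numbers are from the text.

```python
import os


def preload_directories(root):
    """Read every directory below `root` once; return (directories, entries).

    Only directory contents are read, the files themselves are not opened.
    """
    directories = entries = 0
    for _path, subdirs, files in os.walk(root):
        directories += 1
        entries += len(subdirs) + len(files)
    return directories, entries


def total_seconds(requests, rps, preload_seconds=0):
    """Benchmark time plus optional preload time."""
    return preload_seconds + requests / rps


plain = total_seconds(10000, 34)                            # about 294 seconds
preloaded = total_seconds(10000, 69, preload_seconds=260)   # about 405 seconds
```

Under that assumption the preloaded run is faster once it starts, but including the 4 minutes and 20 seconds for the find it takes longer overall, which is what "does not amortize itself" means here.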

In the 32-instances-gatling benchmark, Linux with ext2 scored 31 requests per second. Deleting the files on ext2 took 7 minutes, 10 seconds.

I re-ran this benchmark on my Athlon 64 box, and got even worse timings: 55 minutes, 24 seconds for part one, and 17 minutes, 42 seconds for part two. Downloading scored 37 rps for the single tasked gatling, and 37 rps for the 32-task gatling. Deleting the files took 7 minutes, 23 seconds.

ext3

Measured on my Athlon. Extracting the tarball took 26 min, 46 sec, extracting the second tarball on top of it took 17 min, 57 sec. In the single tasked gatling, Linux scored 43 requests per second, and 44 requests per second with 32 instances of gatling running. On a hot cache, Linux got 20000 requests per second with the 32-instance gatling, and 11500 requests per second with only a single instance. Deleting the files took 20 min, 09 sec.

XFS

XFS is another file system with a reputation of high performance. I mounted with noatime, couldn't find a way to turn off logging. Extracting the files took 45 minutes. Extracting the second tarball on top of it took 13 minutes and 33 seconds (plus 25 seconds sync). The single process gatling scored 34 requests per second, the 32-instances gatling scored 32 requests per second. Deleting all the files took 43 minutes, 25 seconds.

JFS

JFS does not have a reputation for high performance, but I benchmarked it anyway. I mounted the JFS with -o nointegrity because the mount man page implies that that would increase performance.

Extraction took 1 hour, 24 minutes, 39 seconds, and 31 minutes, 6 seconds for the second. The gatling benchmarks scored 30 rps, and 33 rps for the 32-instance test. Deleting took 30 minutes, 4 seconds.

reiserfs

reiserfs (3.6 with r5) took 33 minutes, 7 seconds for the first tarball, and 25 minutes, 38 seconds for the second. The gatling bench scored 38 rps, and 36 rps with 32 instances. Deleting took 33 minutes, 45 seconds.

IP stack performance

Downloading only one 11k file over and over again, 50000 fetches all in all, 50 parallel connections, with keep-alive and 10 requests per connection, Linux gave me up to 84 MB/sec with just one process, and 84 MB/sec with 4 processes. With IPv6, I had 70 MB/sec with one process, and 80 MB/sec with four.

Solaris

We got ourselves two Opensolaris distributions, Schillix 0.5.3 and Belenix 0.4.3a, and burned those to CDs. Schillix booted but does not come with a disk installation mode (at least we didn't find one), and bash didn't find gcc either. It turned out that gcc is in the perfectly obvious path /.cdrom/opt/gcc-3.4.3/bin/gcc! But OF COURSE! I should have known from the start!1!! So where is make? Old Solaris victims will know /usr/ccs/bin, so I put it in the path, too. Turns out, there is a make, an ld, a strip, but no as. NO AS!!! WTF!? In the gcc directory, there is an as, but it is called gas, and gcc doesn't know about it. We also didn't know how to partition a disk, so we booted from a regular Solaris 10 CD, rebooted during the installation (but after it had created the file system), and then mounted that from Schillix.

Even worse, there were no man pages on the system. NO MAN PAGES! WTF?! OK, granted, the Solaris man pages are pretty much worthless anyway, but I had to look up the Solaris ifconfig and route syntax, and it took me a few minutes to remember that you have to actually say "ifconfig bge0 up" after you configured an IP for it to come up. Grrrreat engineering, Sun!

OK, we had gcc, we had as and ld, the machine even had a wget. I used wget to download the files, and it turned out that the wget was not 64-bit clean and actually aborted the download after it got as many bytes as it (erroneously) expects after truncating the file size down to 32 bits. I am pretty underwhelmed by Schillix. Oh, and one for the road: when Sun make was too broken to compile libowfat, I tried smake. It core dumped. Yes, you read that correctly. smake, one of the flagship products of Jörg Schilling, the maker of Schillix, dumped core on me. HAHAHA

For some reason, my tar hung on Solaris. I have no idea why. truss showed that it wasn't in a syscall at that time. So I used star instead. Turns out that star can't do "star xzf -", it will say "Can only compress files." ROTFL! OK, so I used "|gzip -dc|star xf -" instead. What the hell.

Schillix took 1 hour, 21 minutes, and 29 seconds for the initial test (apparently, Solaris doesn't have an async mode, but at least I turned off logging), and 43 minutes and 47 seconds for untarring the second tarball over the first one. This squarely places Solaris at the bottom of the food pile.

Running the HTTP benchmark against Solaris instantly froze the machine. There was no panic or anything, but gatling apparently deadlocked something in the kernel, because the process hung and I could not interrupt it with ^C, ^\ or even ^Z. So I logged in via ssh and kill -9'd the process -- it stayed. So Solaris has fine grained locking all right :-)

We rebooted Solaris to delete all the files, to have something very strange happen: the rm returned immediately, below 1 second, and the directories were actually gone, but the disk space wasn't. Later, the disk space went, too. I have no idea what happened there in detail, but I can't score a delete that takes less time than a find on the data. Maybe it was a fluke and the unclean shutdown has damaged our filesystem or so.

I later tried gatling again on Belenix, to see if maybe some Schillix changes to the kernel were at fault, but Belenix showed the same kernel hang.

To get some throughput measurement at all from OpenSolaris, I manually built a libowfat/gatling combo that does not try to use sendfile but mmap+write.
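
For readers who have not seen the mmap+write fallback: instead of asking the kernel to copy the file to the socket with sendfile, the file is mapped into memory and the mapping is written to the socket like any other buffer. A minimal Python sketch of the idea (this is not the libowfat code, and send_file_mmap is a made-up name):

```python
import mmap
import os
import socket


def send_file_mmap(sock, path):
    """Send the whole file at `path` over `sock` using mmap + write.

    Returns the number of bytes sent.
    """
    size = os.path.getsize(path)
    if size == 0:
        return 0  # an empty file cannot be mapped
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapping:
            sock.sendall(mapping)  # write the mapped pages to the socket
    return size
```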

After the abysmal performance of UFS, I wanted to know whether ZFS would perform better. Sun has been making a lot of wind around ZFS, and as it turns out, rightfully so. OpenSolaris took 26 minutes, 42 seconds to extract the first tarball, then 15 minutes, 53 seconds to extract the second on top of the first one. The single-process gatling scored 34 requests per second, the 32-process gatling scored 35 requests per second. After the cache was primed, the single process gatling scored 6000 requests per second, while the 32-process gatling scored 9000 requests per second.

Downloading only one 11k file over and over again, 50000 fetches all in all, 50 parallel connections, with keep-alive and 10 requests per connection, OpenSolaris gave me 77.8 MB/sec with just one process, and 80 MB/sec with 4 processes. I could not get my OpenSolaris to do IPv6. Maybe Belenix does not have IPv6 support in the kernel? Maybe I'd have to do something special for it? I can't tell because neither Schillix nor Belenix come with man pages.

Dragonfly BSD 1.4.4

Dragonfly detected the Gigabit chipset as "Broadcom BCM5751 Gigabit Ethernet" but with a PHY that only has Fast Ethernet. So I only got 100 MBit Ethernet out of it. That makes it impossible to test Dragonfly BSD fairly.

On my home AMD64 box, I tried Dragonfly 1.6, and it detected the NIC, but did not get ping out.

Windows

Vista (build 5472.5)

I tried to use Vista build 5463, which was the latest at the time of testing. So I installed the AMD64 build, which promptly blue screened after the login prompt but before actually letting me do anything. So I installed the x86 version instead, same build. It did not blue screen on me, so I used it. The first attempt had really bad throughput, along the lines of 200 KB/sec. So I looked for culprits (I used the default build they gave me, and installed nothing else on it), and ended up shutting down the search service, Windows Defender and the eTrust Antivirus services. Throughput was still pretty abominable, but at least it grew to about 600 KB/sec. I did not find a way to turn off journaling in NTFS, and doing this with FAT to get better benchmark numbers is just wrong. FAT does not even pretend to have ways to do good directory indexing. I don't know what NTFS actually does with directories, but I heard it is tree based. We shall see.

Disclosure: running the benchmark at all on Windows gave me massive heartburn and indigestion. I first ported my small tar to Windows natively, and used a dl.exe I had ported to winsock, and I was planning to do something like this:

C:\> dl -O- http://10.0.0.5/bilder.tar.gz | gzip -d | tar xf -
I don't know whether it is a general shortcoming of Windows or Vista, but when I did this, read() returned 0 in tar, so tar reported premature end of file and terminated, gzip returned broken pipe, and dl terminated, too. I tried switching from read to ReadFile in tar, but that did not work either. So I ended up cheating in Windows' favor by writing a tar.exe that has a built-in gunzip and can also do HTTP downloads. I call it dltar.exe. Just to be extra sure this is understood: I did not use the POSIX emulation of Windows, these are all native binaries. In the end I even moved from the POSIX functions in msvcrt (open, read, write) to CreateFile and ReadFile, just to make sure no one can piss on these results and say I discriminated against Windows by using slow legacy emulation APIs. Also, please note that untarring takes more system calls on Unix than on Windows, because there is no chown and no chmod on Windows.

I tried porting gatling to Windows, but failed to finish on the weekends I had, so in the end I gave up on porting the framework and wrote a new mini benchmark-only webserver only for Windows, using the native APIs that are said to have the highest performance, in particular AcceptEx and TransmitFile, using overlapped I/O and I/O Completion Ports. To mimic the gatling benchmarks with one process and 32 processes, I optionally made this web server multithreaded.

It took Vista 1 hour, 57 minutes to untar the tarball, although the environment was even skewed to its advantage, because I had to use the one-process download-gunzip-untar program on Vista to work around the broken pipes. I don't expect this to make much of a difference, but technically Vista had an advantage but was still much slower than everyone else. Untarring the second tarball over the first one took 24 minutes, 54 seconds.

Downloading only one 11k file over and over again, 50000 fetches all in all, 50 parallel connections, with keep-alive and 10 requests per connection, Vista gave me up to 30 MB/sec with just one process, and 32 MB/sec with 4 threads. IIS 7 on Vista did 59 MB/sec. Interestingly, after having installed IIS, the benchmark results against my server went up, too, to 42 MB/sec in single threaded mode and still 32 MB/sec with 4 threads.

Windows 2003 Server

Running the benchmark on Windows 2003 Enterprise Server, I got 46 MB/sec with one process, and 69 MB/sec with 4 threads (with 32 threads, it's 63 MB/sec). However, I also got a significant number of "connection reset by peer" errors. It is apparently normal to get a few of these against most peers, and my benchmark app retries in that case. For most other OSes, I got up to a dozen resets; for Windows 2003, I got over 8000 (for just 50000 requests over 5000 connections, that is quite a lot). Since my small server.exe did not produce an abnormal number of TCP resets on Vista, I'm blaming this on the older IP stack. I'll try to get my hands on an optimized Vista build. Using IIS 6.0, I got 83 MB/sec throughput, which is practically the limit of what the gigabit ethernet in my notebook can handle incoming. It is a little below the Linux mark, but only marginally. It looks like you need to cheat and put the web server in the kernel to get Linux-like performance out of Windows. :-)
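The retry-on-reset behaviour mentioned above can be sketched roughly like this in Python. This is not the actual benchmark tool; fetch_with_retry, the fetch callback and the max_retries limit are all hypothetical and only illustrate how resets can be counted while the request is retried.

```python
def fetch_with_retry(fetch, url, max_retries=5):
    """Call fetch(url), retrying when the peer resets the connection.

    Returns (result, resets), where resets is the number of
    "connection reset by peer" errors seen before the request succeeded.
    """
    resets = 0
    while True:
        try:
            return fetch(url), resets
        except ConnectionResetError:
            resets += 1
            # Give up once the (hypothetical) retry budget is exhausted.
            if resets > max_retries:
                raise
```

Summing the second element of the return value over all requests gives a reset count comparable to the numbers quoted above.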

The first half of the benchmarks on Windows 2003 got much better values than on Vista, too. Downloading and extracting the big tarball took 1 hour, 30 minutes, 12 seconds. Downloading and extracting the second tarball on top of the first one took 23 minutes, 32 seconds. Downloading the files via HTTP got me a ton of errors. It turned out that the errors went away once I limited the number of parallel connections from 100 to 20. Very strange. With 20 connections, I got 24 requests per second. Running the same benchmark again on the warm cache yielded 6700 requests per second. The same benchmark with the 32-thread server got 32 requests per second on a cold cache, and 7700 requests per second with a primed cache. IIS scored 30 requests per second on a cold cache, and 161 requests per second on a warm cache. Yes, you read that right. 161 requests per second. Apparently installing IIS renders the Windows file system cache ineffective.

Deleting the files took 2 hours, 25 minutes, 13 seconds.

Longhorn Server (build 5484)

Downloading and extracting the files took 1 hour, 53 minutes, 47 seconds; downloading and extracting the second tarball on top of the first one took 20 minutes, 5 seconds. My web server scored 22 requests per second in single-process mode on a cold cache, and 2200 requests per second on a warm cache. With 32 threads, my web server scored 33 requests per second on a cold cache, and 2996 requests per second on a warm cache. Decreasing the number of concurrent connections to 20 increased the throughput to 3500 requests per second. For some reason, the first run on a cold cache with 32 threads had an excessive number of connection reset by peer errors (70000 resets on 10000 requests). Rerunning that benchmark with 32 threads and only 20 concurrent connections got rid of the reset connections, but scored the identical value of 33 requests per second. IIS scored 31 requests per second on a cold cache, and 220 requests per second on a warm cache (well, not so warm after all). I have no idea why IIS is so slow on the warm cache, but during the benchmark the disk light was on all the time. I re-ran the benchmark on the same warm cache, and got 223 requests per second this time.

Downloading a single 11k file over and over again (50000 fetches in all, 50 parallel connections, keep-alive with 10 requests per connection), Longhorn Server gave me 70 MB/sec with just one process, and 38 MB/sec with 4 threads. IIS 7 did 63.6 MB/sec.

Lessons learned

When downloading 10000 files over gigabit ethernet from the other host, with those files already in the buffer cache, I got about 9000 requests per second from FreeBSD 6.1, and 9700 from Linux 2.6 on reiser4, but 18700 from Linux 2.6 on reiser4 with the multiple-instances model of gatling.

These requests per second are measured using HTTP with keep-alive, so it's just a little over 1000 actual incoming HTTP connections per second.

So, one lesson to be learned is that running more than one gatling actually doubled performance, but only if all the files are already in the buffer cache. I expected the opposite. In fact, I implemented the multi-process model in gatling exactly so the kernel could have more than one outstanding read request and take advantage of the I/O elevator. That did not happen. I wonder how and when the I/O elevator can speed things up at all, or if it is only for write requests. This warrants further research.

It is very worrying to see almost every contestant blow up in our face in one way or another. We were astonished at the bad quality of the current releases of everyone except Linux. We were surprised that reiser4 didn't blow up in our face, which we expected after previous results with reiserfs version 3 and judging from the warnings on the reiser4 web page.