The previous benchmark mainly focused on networking, scheduling and memory management, but the best networking in the world does not help you if your file system sucks. I could not run real filesystem benchmarks last time, because I had all OSes installed in parallel on different partitions of the same hard disk. Hard disks have different access speeds depending on where on the disk your files are, so the benchmark would have been unfair under those conditions, and I didn't do it. This time, I have a wonderful new dual Pentium D 3.2 GHz Dell monster with two gigs of memory to benchmark on. I don't have real enterprise level storage available, so I will be using the internal hard disk (a Western Digital WD2500JS with 232 GB).
Another point of improvement is to use gigabit ethernet. Last time I only had 100 MBit ethernet. This time I have a Broadcom BCM5751 in the Dell monster and a direct cable link between the Dell box and my notebook, which has a Broadcom BCM5705. My notebook runs Linux 2.6.16 all the time.
The previous benchmarks consisted solely of synthetic benchmarks. These benchmark one specific thing, and are thus very good if you want to know whether you should optimize that specific thing, but people keep complaining about how in the real world these kinds of benchmarks don't matter, because your bottleneck will be something completely different. I disagree completely with that, but to appease those naysayers, I have prepared a real world benchmark, too. During my work in the last few years, I consulted for a company doing online car sales, and they had some special web servers serving only static images. I captured the directory layout, the file names and sizes, and about half an hour's worth of download hits from them, and I wrote a small benchmark tool that downloads a list of URLs given in a file from a certain IP.
Besides, I hate PHP and Java and SQL databases :-)
If you can see a way to do benchmarks like these without benchmarking the drivers (and without me spending weeks rerunning the same benchmark on different hardware configurations), please tell me!
On the other hand, I'm using the most typical off-the-shelf hardware I could get. So if one OS has a bad driver for THIS hardware, that is a really bad sign for that OS. Draw your own conclusions.
To work around that, I hacked gatling so it would fork in the beginning, right after opening the sockets, so there would be more than one gatling process available to answer queries. We shall see how useful that option is.
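For illustration, here is a minimal sketch of that pattern (not the actual gatling code, which listens on several sockets and multiplexes with poll): bind and listen first, then fork, and let every process block in its own accept loop on the shared socket.

  /* minimal sketch of the fork-after-listen pattern; not the actual gatling code */
  #include <sys/socket.h>
  #include <netinet/in.h>
  #include <arpa/inet.h>
  #include <string.h>
  #include <unistd.h>

  int main(void) {
    int s = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in sa;
    memset(&sa, 0, sizeof sa);
    sa.sin_family = AF_INET;
    sa.sin_port = htons(8000);          /* hypothetical port */
    bind(s, (struct sockaddr*)&sa, sizeof sa);
    listen(s, 128);
    for (int i = 0; i < 31; ++i)        /* 32 processes in total, counting the parent */
      if (fork() == 0) break;           /* each child falls through to the accept loop */
    for (;;) {
      int c = accept(s, 0, 0);          /* all processes sleep in accept() on the same socket */
      if (c == -1) continue;
      /* ... serve the HTTP connection here, then ... */
      close(c);
    }
  }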
I'm also running fnord in one benchmark for this reason. fnord is a trivial web server that is run from tcpserver, so one fnord process will be running per open HTTP connection.
When NetBSD booted, I copied libowfat and gatling to the machine and tried to build them, but the build failed. It used to work in previous versions of NetBSD, so I was quite surprised about this. It turns out that the CMSG_* macros use the BSD types (u_char in this case), but sys/types.h only defines them if you have _NETBSD_SOURCE defined. WTF?! Anyway, so libowfat now defines that, and gatling builds on NetBSD 3.
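For reference, the workaround boils down to something like this (libowfat's actual header logic is a bit more involved):

  /* on NetBSD 3, u_char and friends are only visible with _NETBSD_SOURCE,
     and the CMSG_* macros in sys/socket.h use them */
  #define _NETBSD_SOURCE
  #include <sys/types.h>
  #include <sys/socket.h>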
When untarring the tarball, I saw that gzip had 1% CPU, but tar had 60% CPU. Something is wrong with the BSD tar, if you ask me. So I decided to use my own tar (from embutils). Unfortunately, that tar did not compile on NetBSD 3, because NetBSD does not have ftw in their libc. Huh!? There is an open ticket from 2002 in their bug tracking system to add ftw, but it was closed with "we should use someone else's implementation for this", without anyone ever doing it. NetBSD should really implement ftw. I ended up using the implementation from my diet libc. With my tar, the extraction took only 25% CPU in tar. I don't know what is wrong with NetBSD's tar, but someone should look at it.
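For those who don't know it: ftw is the POSIX function for walking a directory tree, and using it is about as simple as it gets. A minimal example (not the embutils code, just to show the API NetBSD is missing):

  #include <ftw.h>
  #include <sys/stat.h>
  #include <stdio.h>

  /* gets called once for every object below the start directory */
  static int visit(const char* path, const struct stat* st, int flag) {
    if (flag == FTW_F)
      printf("%s: %lld bytes\n", path, (long long)st->st_size);
    return 0;      /* returning non-zero would stop the walk */
  }

  int main(int argc, char* argv[]) {
    return ftw(argc > 1 ? argv[1] : ".", visit, 20) == -1;
  }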
The first data from NetBSD: download + untar took 3505.33 seconds real time, which is about one hour. Untarring the other tar file on top of it, the 300 MB one (after buffer cache priming), took 28 minutes, 28 seconds, which is about half an hour.
We were shocked at how poorly NetBSD performed. Well, I was. Ilja doesn't regard NetBSD very highly, so he was more amused than shocked. But having to wait for an hour to extract 21 gigs of images, that is pretty much unacceptable.
The next surprise came when I wanted to run the benchmark locally, because NetBSD lowered the open files limit to 64!! WTF?!? (Ilja theorized that they lowered it to mitigate fd_set overflows). Anyway, I upped the limit, and started the benchmark.
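For the record, a server that wants to bump its own soft open-files limit up to the hard limit does the usual setrlimit dance, roughly like this (sketch):

  #include <sys/resource.h>

  /* raise the soft open-files limit to the hard limit; only root can raise the hard limit */
  static int raise_nofile(void) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) == -1) return -1;
    rl.rlim_cur = rl.rlim_max;
    return setrlimit(RLIMIT_NOFILE, &rl);
  }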
NetBSD scores 169 requests per second on the first benchmark.
The second benchmark is running gatling in the forked mode, i.e. creating the server sockets, then forking 32 times, so we have 32 gatling processes, all listening on the same sockets. The idea is that when one gatling is blocking on open(), another one can answer the incoming connections. There is some slight loss because more than one gatling can try to accept the same connection, and NetBSD showed some signs of race conditions there, because it returned interesting errors from accept(), like ENOENT (no such file or directory) and EADDRINUSE (address already in use). So, NetBSD does have some fine-grained networking after all :-) At least it didn't panic as Ilja had hoped. NetBSD scores 97 requests per second on the second benchmark. I would have expected this one to be faster. Strange.
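The practical consequence is that an accept loop shared between processes has to shrug off errno values it has no business seeing. A defensive sketch (not the actual gatling code):

  /* sketch: when many processes share one listening socket, treat unexpected
     accept() errors as transient instead of dying; ENOENT and EADDRINUSE were
     the surprises NetBSD handed out here */
  #include <sys/socket.h>
  #include <errno.h>

  static int accept_robust(int s) {
    for (;;) {
      int c = accept(s, 0, 0);
      if (c != -1) return c;
      if (errno == EINTR || errno == ECONNABORTED) continue;  /* normal noise */
      continue;   /* anything else: log it and go around again */
    }
  }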
Deleting the files took 349 seconds.
Uh, not so well. The AMD64 build sets the network card to 100BaseTX mode, and even if I use ifconfig to override that, it will stay in 100BaseTX mode. Also, the boot messages didn't look so good: for example, it complained about libm.so, and ldconfig gave the error message you get when you try to run a 32-bit NetBSD 3.0 x86 binary on 64-bit NetBSD 3.0 AMD64 (which does not work. No, really!). So I reinstalled NetBSD 3.0 AMD64 from scratch, and the freshly installed NetBSD 3.0 froze solid during kernel initialization in the boot process. The last line is

boot device: <unknown> root device:

I gave up on NetBSD AMD64 at this point. Later, after installing a few other operating systems on the box, I came back to NetBSD AMD64, this time staying with the default UFS1, and this time it worked. Go figure.
In the NetBSD AMD64 graphs, you can see quite nicely what I like to call a "pump and dump" throughput, where first the throughput is high, while everything goes into the buffer cache, and then there is a drought period where the OS is busy dumping the buffer cache to disk, and the throughput goes down dramatically :-) This same behavior is typical for Linux, too. First of all, this is a good sign, because it means the buffer cache is really used fully. On the other hand, the potential for improvement is evident, because it would be great if the buffer cache could be used better while the OS is dumping it to disk.
Fascinatingly, the AMD64 install performed much better than the x86 install. It took 35 minutes, 28 seconds. On NetBSD, by the way, the "Please press any key to reboot" prompt after halt does not actually reboot when you press a key, presumably because the Dell has a USB keyboard. Anyway, untarring the other tarball took 12 minutes, 19 seconds.
In the HTTP benchmark, NetBSD 3.0 AMD64 scored 41 requests per second. Running the HTTP benchmark again (now everything in the benchmark should be in the buffer cache, so this benchmarks more than the file system) got about 7500 requests per second. Running the second HTTP benchmark (32 instances of gatling) on a cold cache scored 41 requests per second. On a warm cache, I measured 6800 requests per second. The kernel messages indicate that NetBSD found the second CPU, but did not enable it.
These numbers are much more in line with the numbers measured on the other operating systems, so I reinstalled the x86 version of NetBSD. It only detects one CPU, btw. On the HTTP benchmark it scored 41 requests per second on cold cache, 6700 rps on warm cache. Running 32 gatling instances at the same time scored 41 requests per second on cold cache, and 5950 rps on warm cache. This is almost exactly what the 64-bit version scored, so why are the results so different from the first run of NetBSD 3.0 x86? Two things are different. First, the first benchmark was using UFSv2, the second one used UFSv1. UFSv2 appears to be much slower than UFSv1, and it may even be the reason why the first attempt at installing the 64-bit version failed. Second, I flushed the cache differently between benchmark phases. In the first attempt, I used dd to create a 2 GB file on disk. Maybe that was not sufficient to clear the buffer cache on NetBSD, and that's why the requests per second were higher in the first attempt. In the second attempt I actually rebooted the machine between benchmark phases.
A note about the warm cache values: if all the files are in memory, then the whole benchmark run ends very quickly, around one second typically. The benchmark downloads 10000 URLs from a list of captured requests. That's why it's normal to have a variation of several hundred requests per second. Also, please note that the benchmark is running in keep-alive mode with 10 requests per TCP connection, so it "only" has 1000 actual TCP incoming connections.
Downloading only one 11k file over and over again, 50,000 fetches in all, 50 parallel connections, with keep-alive and 10 requests per connection, NetBSD 3.0 gave me 57.3 MB/sec with just one process, and 57.1 MB/sec with 4 processes. With IPv6, I got 53.3 MB/sec with one process, and 57.1 MB/sec with four.
On my home AMD64 box, NetBSD installed quickly and easily, but then failed to boot (said it couldn't open /boot).
Untarring the tarball took 35 minutes and 34 seconds. Unfortunately, FreeBSD's time utility didn't actually output the time (WTF?!), but I could deduce the time from the traffic log I took.
Untarring the second tarball (the 300 MB one) took 17 minutes and 54 seconds.
In the first benchmark, FreeBSD scored 57 requests per second. That is pretty bad. NetBSD is three times faster!
In the second benchmark, I got the same error messages as with NetBSD, ENOENT and EADDRINUSE. I would have expected EAGAIN when trying to accept after someone else has already accepted all the outstanding connections. In the end, FreeBSD had 52 requests per second.
The delete took 7 minutes, 40 seconds.
Interestingly, FreeBSD didn't actually let us switch to async mounts. I did mount -u -o async,noatime /usr and the command succeeded, but mount still reported the file system to be in soft-updates mode.
Write failure on transfer! (Wrote -1 bytes of 1425408)

This was on the default layout on a 250 GB disk. WTF?!
So I asked Ilja to install FreeBSD 6.1 again, this time with auto settings. He did, and got the same error. He tried again, this time with the auto partitioning scheme but with more space for /, and this time it worked. So apparently the FreeBSD auto partitioning scheme is broken :-)
When running the benchmark on / by mistake, which does not have soft updates, I got about five times the throughput I got when running it on /usr, which does have soft updates. For some reason I don't quite understand, enabling async mode on the file system does not disable soft updates, and this really kills performance in this FreeBSD benchmark. So, a toast to the brainiacs who always told you soft updates are practically free from a performance point of view.
Downloading only one 11k file over and over again, 50,000 fetches in all, 50 parallel connections, with keep-alive and 10 requests per connection, FreeBSD 6.1 gave me 56.6 MB/sec with just one process, and 60.2 MB/sec with 4 processes. With IPv6, I got 53.9 MB/sec with one process, and 60.3 MB/sec with four.
On my home box, FreeBSD 6.1 AMD64 with soft-updates took 48:41 to extract the files, and 39:20 to extract the second tarball on top of it. The first gatling bench scored 46 requests per second on cold cache, and 6800 requests per second on warm cache. The second gatling bench (32 instances) scored 43 requests per second on cold cache, and 6700 requests per second on warm cache. Deleting the files took 7:35.
Without soft-updates, the same benchmarks took 37:01, 30:12, 46 and 6700 rps for one instance, 43 and 6800 rps for 32 instances, and deleting took 7:47.
rd0: fixed, 3800 blocks

So I tried the AMD64 version. Same thing. So I tried an old OpenBSD 3.7 ISO. Same thing. So I switched to different hardware for OpenBSD. This will skew the benchmark results, but I don't have a choice, do I?
So, OpenBSD 3.9 finally comes with a gcc 3! Hooray! The fallback box we installed OpenBSD on was another Dell, but a uniprocessor one. It has a 3.6 GHz Pentium 4, 2 gigs of RAM, the same Broadcom ethernet as the other box, and a single IDE disk, a 150 GB ST3160828AS. This puts OpenBSD at a disadvantage (because there is only one CPU), but since OpenBSD still has very rudimentary SMP support (one big kernel lock), it might well be an advantage after all, since this CPU has 400 MHz more.
Imagine our shock when OpenBSD 3.9 outperformed NetBSD 3.0 and came very close to FreeBSD 6.1 in the first benchmark! OpenBSD 3.9 took 34 minutes and 9 seconds. The second benchmark was much more modest at 40 minutes and 34 seconds, so apparently OpenBSD has some special optimization in place to make tests like our first benchmark run fast. :-)
OpenBSD scored 37 requests per second in the single task gatling benchmark, and 33 requests per second in the 32 task gatling benchmark, and 28 requests per second in the fnord benchmark.
Another fun fact: the OpenBSD ftp client is not 64-bit clean. I downloaded the large tarball with the images with it, and the progress bar and ETA were wrong. Hilariously, when it had downloaded the 21 GB, it said "Read short file." :-)
Deleting the files took 4 minutes, 48 seconds. Strangely, serving the 21 GB file from OpenBSD was immensely slow. gatling was at 92% CPU utilization and top showed 92% load as system. Meanwhile, OpenBSD only managed 3 MB/sec of throughput at this point. Apparently, you can saturate a 3.6 GHz Pentium 4 with two GB of RAM running OpenBSD by having someone download a single large file from it, and it won't even saturate a 100 MBit link. Unbelievable.
We are giving up on OpenBSD at this point.
OpenBSD i386 hung during boot on my home dual core AMD64 box, OpenBSD amd64 worked, but the installation from CD was unbelievably slow (about 300 KB/sec) and then the Intel NIC was detected but failed (the watchdog constantly reset the em0 interface, and ping didn't go through).
During the unpacking of the second tarball we got this fun message:
reiser4[pdflush(14954)]: commit_current_atom (fs/reiser4/txnmgr.c:1024)[nikita-3176]: WARNING: Flushing like mad: 16384

I love it when my computer talks dirty to me :-)
Whatever good impression reiser4 had made so far, it obliterated in the delete benchmark. Deleting all the files took 1 hour, 8 minutes, and 11 seconds.
On my Athlon 64, I got 16 minutes, 5 seconds for the first tarball and 9 minutes, 5 seconds for the second, and 57 requests per second for both the single-instance and the 32-instance gatling runs.
Extracting the tarball took 49 minutes, 9 seconds. Extracting the second tarball took 16 minutes, 15 seconds. Linux scored 34 requests per second in the single-task gatling run; interestingly, when I preloaded the directories (not the files) into the buffer cache using find (which took 4 minutes and 20 seconds) and then ran the gatling benchmark, it scored 69 requests per second. Unfortunately, the find took almost as long as the gatling benchmark itself. So there is a massive gain from preloading the directory metadata, but it does not amortize itself in this benchmark. In the real world, things may look different, though.
In the 32-instances-gatling benchmark, Linux with ext2 scored 31 requests per second. Deleting the files on ext2 took 7 minutes, 10 seconds.
I re-ran this benchmark on my Athlon 64 box, and got even worse timings: 55 minutes, 24 seconds for part one, and 17 minutes, 42 seconds for part two. Downloading scored 37 rps for the single tasked gatling, and 37 rps for the 32-task gatling. Deleting the files took 7 minutes, 23 seconds.
Extraction of the first tarball took 1 hour, 24 minutes, 39 seconds, and the second took 31 minutes, 6 seconds. The gatling benchmarks scored 30 rps, and 33 rps for the 32-instance test. Deleting took 30 minutes, 4 seconds.
Even worse, there were no man pages on the system. NO MAN PAGES! WTF?! OK, granted, the Solaris man pages are pretty much worthless anyway, but I had to look up the Solaris ifconfig and route syntax, and it took me a few minutes to remember that you have to actually say "ifconfig bge0 up" after you configured an IP for it to come up. Grrrreat engineering, Sun!
OK, we had gcc, we had as and ld, the machine even had a wget. I used wget to download the files, and it turned out that the wget was not 64-bit clean: it aborted the download after it got as many bytes as it (erroneously) expected after truncating the file size down to 32 bits. I am pretty underwhelmed by Schillix. Oh, and one for the road: when Sun make was too broken to compile libowfat, I tried smake. It core dumped. Yes, you read that correctly. smake, one of the flagship products of Jörg Schilling, the maker of Schillix, dumped core on me. HAHAHA
For some reason, my tar hung on Solaris. I have no idea why. truss showed that it wasn't in a syscall at that time. So I used star instead. Turns out that star can't do "star xzf -", it will say "Can only compress files." ROTFL! OK, so I used "|gzip -dc|star xf -" instead. What the hell.
Schillix took 1 hour, 21 minutes, and 29 seconds for the initial test (apparently, Solaris doesn't have an async mode, but at least I turned off logging), and 43 minutes and 47 seconds for untarring the second tarball over the first one. This squarely places Solaris at the bottom of the food pile.
Running the HTTP benchmark against Solaris instantly froze the machine. There was no panic or anything, but gatling apparently deadlocked something in the kernel, because the process hung and I could not interrupt it with ^C, ^\ or even ^Z. So I logged in via ssh and kill -9'd the process -- it stayed. So Solaris has fine-grained locking all right :-)
We rebooted Solaris to delete all the files, only to have something very strange happen: the rm returned immediately, in under 1 second, and the directories were actually gone, but the disk space wasn't. Later, the disk space went away, too. I have no idea what happened there in detail, but I can't score a delete that takes less time than a find on the data. Maybe it was a fluke and the unclean shutdown had damaged our filesystem or something.
I later tried gatling again on Belenix, to see if maybe some Schillix changes to the kernel were at fault, but Belenix showed the same kernel hang.
To get some throughput measurement at all from OpenSolaris, I manually built a libowfat/gatling combo that does not use sendfile but mmap+write instead.
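The fallback is nothing fancy; a minimal sketch of the mmap+write path (not the actual gatling code) looks like this:

  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <fcntl.h>
  #include <unistd.h>

  /* send a whole file over a socket without sendfile(): map it and write() it out.
     Sketch only; real code would loop over partial writes and cap the mapping size. */
  static int send_file_mmap(int sock, const char* name) {
    int fd = open(name, O_RDONLY);
    if (fd == -1) return -1;
    struct stat st;
    fstat(fd, &st);
    void* p = mmap(0, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED) return -1;
    ssize_t r = write(sock, p, st.st_size);
    munmap(p, st.st_size);
    return r == st.st_size ? 0 : -1;
  }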
After the abysmal performance of UFS, I wanted to know whether ZFS would perform better. Sun has been making a lot of wind about ZFS, and as it turns out, rightfully so. OpenSolaris took 26 minutes, 42 seconds to extract the first tarball, then 15 minutes, 53 seconds to extract the second on top of the first one. The single-process gatling scored 34 requests per second, the 32-process gatling scored 35 requests per second. After the cache was primed, the single-process gatling scored 6000 requests per second, while the 32-process gatling scored 9000 requests per second.
Downloading only one 11k file over and over again, 50,000 fetches in all, 50 parallel connections, with keep-alive and 10 requests per connection, OpenSolaris gave me 77.8 MB/sec with just one process, and 80 MB/sec with 4 processes. I could not get my OpenSolaris to do IPv6. Maybe Belenix does not have IPv6 support in the kernel? Maybe I'd have to do something special for it? I can't tell, because neither Schillix nor Belenix come with man pages.
On my home AMD64 box, I tried DragonFly 1.6; it detected the NIC, but ping didn't go through.
Disclosure: running the benchmark on Windows at all gave me massive heartburn and indigestion. I first ported my small tar to Windows natively, and used a dl.exe I had ported to winsock, and I was planning to do something like this:
C:\> dl -O- http://10.0.0.5/bilder.tar.gz | gzip -d | tar xf -

I don't know whether it is a general shortcoming of Windows or just Vista, but when I did this, read() returned 0 in tar, so tar reported a premature end of file and terminated, gzip reported a broken pipe, and dl terminated, too. I tried switching from read to ReadFile in tar, but that did not work either. So I ended up cheating in Windows' favor by writing a tar.exe that has a built-in gunzip and can also do HTTP downloads. I call it dltar.exe. Just to be extra sure this is understood: I did not use the POSIX emulation of Windows, these are all native binaries. In the end I even moved from the POSIX functions in msvcrt (open, read, write) to CreateFile and ReadFile, just to make sure no one can piss on these results and say I discriminated against Windows by using slow legacy emulation APIs. Also, please note that untarring takes more system calls on Unix than on Windows, because there is no chown and no chmod on Windows.
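Roughly, the native read path looks like this (a simplified sketch, not the actual dltar.exe code; the hypothetical slurp() only counts bytes where the real program feeds them into the gunzip/untar stage):

  /* sketch: reading a file with the native Win32 calls instead of the msvcrt
     POSIX-style wrappers */
  #include <windows.h>

  static long long slurp(const char* name) {
    HANDLE h = CreateFileA(name, GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL);
    if (h == INVALID_HANDLE_VALUE) return -1;
    char buf[65536];
    DWORD got;
    long long total = 0;
    while (ReadFile(h, buf, sizeof buf, &got, NULL) && got > 0)
      total += got;   /* a real program would process buf here */
    CloseHandle(h);
    return total;
  }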
I tried porting gatling to Windows, but failed to finish on the weekends I had, so in the end I gave up on porting the framework and wrote a new mini benchmark-only webserver only for Windows, using the native APIs that are said to have the highest performance, in particular AcceptEx and TransmitFile, using overlapped I/O and I/O Completion Ports. To mimic the gatling benchmarks with one process and 32 processes, I optionally made this web server multithreaded.
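To give an idea of the hot path: serving a static file on Windows boils down to a TransmitFile call. Here is a simplified, blocking sketch with a hypothetical serve_file() helper; the actual benchmark server wraps this in AcceptEx, overlapped I/O and an I/O completion port (link with ws2_32 and mswsock):

  /* simplified sketch: push one file out over an accepted socket with TransmitFile */
  #include <winsock2.h>
  #include <mswsock.h>
  #include <windows.h>

  static int serve_file(SOCKET client, const char* name) {
    HANDLE h = CreateFileA(name, GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE) return -1;
    /* ... the HTTP response header would go out via send() first ... */
    BOOL ok = TransmitFile(client, h, 0 /* whole file */, 0, NULL, NULL, 0);
    CloseHandle(h);
    return ok ? 0 : -1;
  }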
It took Vista 1 hour, 57 minutes to untar the tarball, even though the environment was skewed in its favor, because I had to use the one-process download-gunzip-untar program on Vista to work around the broken pipes. I don't expect this to make much of a difference, but technically Vista had an advantage and was still much slower than everyone else. Untarring the second tarball over the first one took 24 minutes, 54 seconds.
Downloading only one 11k file over and over again, 50,000 fetches in all, 50 parallel connections, with keep-alive and 10 requests per connection, Vista gave me up to 30 MB/sec with just one process, and 32 MB/sec with 4 threads. IIS 7 on Vista did 59 MB/sec. Interestingly, after I had installed IIS, the benchmark results against my server went up, too: 42 MB/sec in single-threaded mode, and still 32 MB/sec with 4 threads.
The first half of the benchmarks on Windows 2003 got much better values than Vista, too. Downloading and extracting the big tarball took 1 hour, 30 minutes, and 12 seconds. Downloading and extracting the second tarball on top of the other one took 23 minutes, and 32 seconds. Downloading the files via HTTP got me a ton of errors. It turned out that the errors went away once I limited the number of parallel connections from 100 to 20. Very strange. With 20 connections, the single-threaded web server got 24 requests per second. Running the same benchmark again on the warm cache yielded 6700 requests per second. The same benchmark against the 32-thread server got 32 requests per second on cold cache, and 7700 requests per second with primed cache. IIS scored 30 requests per second on cold cache, and 161 requests per second on warm cache. Yes, you read that right. 161 requests per second. Apparently installing IIS renders the Windows file system cache ineffective.
Deleting the files took 2 hours, 25 minutes, 13 seconds.
Downloading only one 11k file over and over again, 50,000 fetches in all, 50 parallel connections, with keep-alive and 10 requests per connection, Longhorn Server gave me 70 MB/sec with just one process, and 38 MB/sec with 4 threads. IIS 7 did 63.6 MB/sec.
These requests per second are measured using HTTP with keep-alive, so it's just a little over 1000 actual incoming HTTP connections per second.
So, one lesson to be learned is that running more than one gatling actually doubled performance, but only if all the files are already in the buffer cache. I expected the opposite. In fact, I implemented the multi-process model in gatling exactly so the kernel could have more than one outstanding read request and take advantage of the I/O elevator. That did not happen. I wonder how and when the I/O elevator can speed things up at all, or if it is only for write requests. This warrants further research.
It is very worrying to see almost every contestant blow up in our face in one way or another. We were astonished at the bad quality of the current releases of everyone except Linux. We were surprised that reiser4 didn't blow up in our face, which we had expected after previous results with reiserfs version 3 and judging from the warnings on the reiser4 web page.