Here are the gory details about the 3n+1 benchmark I did with Berkeley DB. I’ll try to make this entertaining, but it’s ultimately about running the same program over and over with slightly different parameters each time. My blog post will have a more interesting (I hope!) discussion.
The benchmark I’m looking at was defined by K.S. Bhaskar here. Bhaskar’s results are here. The benchmark computes a recursive mathematical function for a range of integers 1 through N. Results for each value are stored in the database. In computing a new value, the database is consulted for other values.
It turns out that the function, at least in the ranges I tried, causes a mix of about 40% writes, 60% reads. There are no deletes, so the database is always growing.
Here’s my configuration:
Model Name: MacBook Pro
Model Identifier: MacBookPro3,1
Processor Name: Intel Core 2 Duo
Processor Speed: 2.2 GHz
Number Of Processors: 1
Total Number Of Cores: 2
L2 Cache: 4 MB
Memory: 4 GB 667 MHz DDR SDRAM
Bus Speed: 800 MHz
Disk: 250GB Hitachi 5400 RPM
Compiler: i686-apple-darwin10-g++-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5664)
File system: Journaled HFS+
Database: Berkeley DB 5.1.19, built: '--enable-cxx --enable-shared --enable-static --enable-tcl --enable-test --enable-sql'
As far as I can tell, this is very close to the system used by Bhaskar, except:
- he uses Linux/JFS; I'm using OS X/HFS+
- he built his software 64-bit; I'm using 32-bit
- his disk is faster: 7200 RPM vs. my 5400 RPM
- his disk is smaller: 200GB vs. my 250GB
Differences like these can mean a lot, so while I'm using Bhaskar's results as a goal, I don't know that the two sets of numbers can be directly compared. Maybe someone with more time on their hands can do a head-to-head comparison.
I’ve coded using Bhaskar’s definition of the ‘enhanced’ benchmark. That is, instead of storing raw integers, we store numbers ‘written out’, using a different language for each numeral. So 409 is encoded as the string “quattro zero ஒன்பது”. I like this enhancement, as it makes the record sizes more in the range of 20-40 bytes, rather than a fixed 4 or maybe 8 bytes. That’s closer to record sizes I’ve seen in many applications. It does make the keys longer as well – the keys are generally longer than their associated data, which is unlike typical apps.
The source code is on github here. It’s in a ‘mostly C dialect’ of C++. That is, I started writing C, and found some features of C++ useful. As it’s adapted from some other benchmarks I’ve written along the way, it’s not beautiful code, but it should be straightforward to read.
For motivation, I looked at Bhaskar's best result shown with N=1000000, 183 seconds, to see if I could beat that. He uses 5000 Global buffers (roughly equivalent to the BDB cache), which I believe comes to 20M, and 2000 Journal buffers (roughly the BDB log buffer), about 8M. I found that running with a different mix, 27M cache and under 1M log buffer, makes better use of that same 28M for BDB.
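For reference, here's roughly how that split maps onto BDB's C API. This is a sketch, not bench3n's actual setup code; the environment home directory and the flag set are my assumptions, and error checking is omitted (each call returns a nonzero error code on failure):

```c
#include <db.h>

DB_ENV *dbenv;

db_env_create(&dbenv, 0);
/* -c 27000: a 27M cache (gbytes=0, bytes=27M, one cache region). */
dbenv->set_cachesize(dbenv, 0, 27 * 1024 * 1024, 1);
/* -logbufsize 1000: roughly a 1M in-memory log buffer. */
dbenv->set_lg_bsize(dbenv, 1000 * 1024);
/* Transactional environment; "benchdir" is a made-up home directory. */
dbenv->open(dbenv, "benchdir",
    DB_CREATE | DB_INIT_LOCK | DB_INIT_LOG | DB_INIT_MPOOL |
    DB_INIT_TXN | DB_THREAD, 0);
```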
Here’s my baseline. With these parameters, I don’t call DB_ENV->set_cachesize, so I get the default cache (about 256K). Run the benchmark to populate the first one million integers. Single threaded, no bells and whistles.
./bench3n -c 0 -n 1000000
nputs=2168610 (254.47 puts/second)
ngets=3168608 (371.82 gets/second)
ops=5337218 (626.29 ops/second)
Yeah, that's really bad: at 626 ops/second, the full run takes about 142 minutes. I only ran this configuration once. All the other results from here on out were run 5 times in succession; I threw out the best and worst times and averaged the remaining three.
This next one is the same configuration, but running only to 100000 (one tenth as many integers as before).
./bench3n -c 0 -n 100000
nputs=217211 (348.65 puts/second)
ngets=317209 (509.16 gets/second)
ops=534420 (857.81 ops/second)
That's not too far from linear. The number of ops/sec has improved, but not dramatically. That indicates that after a certain point, we're paying a somewhat steady proportion of I/O hits for each value calculated.
The next case sets the cache to 20 megabytes, and we're back to one million values to compute. All following runs use one million input values. By the way, the number I've been watching is operations per second. The benchmark has a pretty consistent ratio of reads to writes, so puts/sec, gets/sec, and ops/sec all move proportionally. The more relevant number is actually elapsed time in seconds, but bigger numbers are more fun.
./bench3n -c 20000 -n 1000000
nputs=2168610 (7943.62 puts/second)
ngets=3168608 (11606.62 gets/second)
ops=5337218 (19550.24 ops/second)
As expected, cache is vitally important in this benchmark: 20M gives us roughly a 30x improvement over 256K. Now let's allocate a total of 28M, almost all of it for cache, since cache is usually the critical resource for BDB.
./bench3n -c 27000 -logbufsize 1000 -n 1000000
nputs=2168610 (9993.59 puts/second)
ngets=3168608 (14601.88 gets/second)
ops=5337218 (24595.47 ops/second)
Even better, but not nearly good enough to reach our goal.
At this point, I started looking more closely at the GT.M program published by Bhaskar, and discovered that it does not use strict durability for transactions; it uses the equivalent of DB_TXN_NOSYNC (see the blog entry for discussion). So I'll run with DB_TXN_NOSYNC as well. I also found that, somewhat paradoxically, a smaller log buffer size can improve performance.
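In BDB terms, relaxed durability is a single environment flag, set before the workload runs; a sketch:

```c
/* Commits return without forcing the log to disk.  A crash can lose
 * the most recently committed transactions, but the database itself
 * stays consistent and recoverable. */
dbenv->set_flags(dbenv, DB_TXN_NOSYNC, 1);
```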
./bench3n -c 27000 -nosynctxn -logbufsize 100 -n 1000000
nputs=2168610 (10843.05 puts/second)
ngets=3168608 (15843.04 gets/second)
ops=5337218 (26686.09 ops/second)
Now, with 2 threads…
./bench3n -c 27000 -t 2 -nosynctxn -logbufsize 100 -n 1000000
nputs=2168679 (12986.10 puts/second)
ngets=3168677 (18974.11 gets/second)
ops=5337356 (31960.21 ops/second)
Okay, that's faster than 183 seconds: 5337356 operations at 31960 ops/second works out to about 167 seconds.
Let’s try 3 threads…
./bench3n -c 27000 -t 3 -nosynctxn -logbufsize 100 -n 1000000
nputs=2168832 (12184.44 puts/second)
ngets=3168830 (17802.41 gets/second)
ops=5337662 (29986.86 ops/second)
Wow… that’s a little slower. We’ll see later that partitioning the data set can make 3 threads run faster, but I didn’t think of that before trying some other things first.
Here's something a little wacky. If we sort items in the database differently (first by the number of 'digits' in our spelled-out key), we tend to get more locality in our access pattern. Specifying a different (and slower!) set_bt_compare function goes against some conventional wisdom, but the locality win can be big.
./bench3n -c 27000 -t 3 -nosynctxn -logbufsize 100 -sortbylength -n 1000000
nputs=2168946 (13471.71 puts/second)
ngets=3168944 (19682.88 gets/second)
ops=5337890 (33154.59 ops/second)
Yes, locality rules – it basically makes better use of our still woefully undersized cache.
Now let’s try partitioning the data set. 4 pretty equally sized partitions should help.
./bench3n -c 27000 -t 3 -nosynctxn -logbufsize 100 -sortbylength -partition 4 -n 1000000
nputs=2168903 (13555.64 puts/second)
ngets=3168901 (19805.63 gets/second)
ops=5337804 (33361.27 ops/second)
That helps a little; we're on the right track. Let's try 8:
./bench3n -c 27000 -t 3 -nosynctxn -logbufsize 100 -sortbylength -partition 8 -n 1000000
nputs=2168945 (13814.93 puts/second)
ngets=3168943 (20184.35 gets/second)
ops=5337888 (33999.28 ops/second)
That's a bit better. What I haven't shown you here is that I did some analysis with db_stat and determined there was some contention on the meta page, which partitioning relieves. (Here's my blog post about partitioning).
Alright. Let's see how fast we can really go. Many applications run with plenty of cache memory these days, so let's do the same. The database file ends up under 300M, so let's give it a cache big enough to hold the whole thing in memory.
./bench3n -c 300000 -t 3 -nosynctxn -logbufsize 100 -sortbylength -partition 8 -n 1000000
nputs=2168990 (27112.37 puts/second)
ngets=3168988 (39612.35 gets/second)
ops=5337978 (66724.72 ops/second)
We almost doubled the speed. And now that the database fits entirely in memory, our custom sort function no longer buys us any locality, so let's drop it.
./bench3n -c 300000 -t 3 -nosynctxn -logbufsize 100 -partition 8 -n 1000000
nputs=2168897 (29710.91 puts/second)
ngets=3168895 (43409.52 gets/second)
ops=5337792 (73120.43 ops/second)
Let’s see if increasing log buffer size helps.
./bench3n -c 300000 -t 3 -nosynctxn -logbufsize 800 -partition 8 -n 1000000
time=72 <-- BEST
nputs=2168951 (30124.31 puts/second)
ngets=3168949 (44013.18 gets/second)
ops=5337900 (74137.50 ops/second)
A bit (it turns out this will be the best legitimate run). How about an even bigger log buffer?
./bench3n -c 300000 -t 3 -nosynctxn -logbufsize 1600 -partition 8 -n 1000000
nputs=2168820 (30122.50 puts/second)
ngets=3168818 (44011.36 gets/second)
ops=5337638 (74133.86 ops/second)
Nope, 1600K doesn't help; 800K was big enough.
Let's try DB_LOG_DSYNC, which uses a different log syncing policy: the log file is opened with O_DSYNC rather than being explicitly flushed.
./bench3n -c 300000 -t 3 -nosynctxn -logbufsize 800 -logdsync -partition 8 -n 1000000
nputs=2168859 (38050.15 puts/second)
ngets=3168857 (55593.98 gets/second)
ops=5337716 (93644.14 ops/second) TAINTED
And fiddling with the log buffer size, this one’s the best:
./bench3n -c 300000 -t 3 -nosynctxn -logbufsize 600 -logdsync -partition 8 -n 1000000
nputs=2168916 (40165.11 puts/second)
ngets=3168914 (58683.59 gets/second)
ops=5337830 (98848.70 ops/second) TAINTED
Let’s try fewer threads:
./bench3n -c 300000 -t 2 -nosynctxn -logbufsize 600 -logdsync -partition 8 -n 1000000
nputs=2168695 (38726.69 puts/second)
ngets=3168693 (56583.80 gets/second)
ops=5337388 (95310.50 ops/second) TAINTED
And more threads:
./bench3n -c 300000 -t 4 -nosynctxn -logbufsize 600 -logdsync -partition 8 -n 1000000
nputs=2169033 (38732.73 puts/second)
ngets=3169031 (56589.83 gets/second)
ops=5338064 (95322.57 ops/second) TAINTED
Okay, these were a little too good to be true. A little research reveals that O_DSYNC is broken on Mac OS X (see this discussion). Given that, I don't think we can count the last four results for DB; we might even end up with an unrecoverable database. I've marked them here as TAINTED. But I wanted to show them because they demonstrate that a fair amount of our latency is spent waiting for fsync(). If we ran with the log in memory, and durability achieved via replication, we might see results this good.
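For what it's worth, the in-memory-log variant would be configured something like this sketch against the C API (the buffer size here is an arbitrary illustration; it has to be large enough to hold all the log records for active transactions):

```c
/* Keep the transaction log entirely in memory: no log files, no
 * fsync().  Durability then has to come from somewhere else, such
 * as replication.  Must be configured before the environment opens. */
dbenv->log_set_config(dbenv, DB_LOG_IN_MEMORY, 1);
dbenv->set_lg_bsize(dbenv, 64 * 1024 * 1024);
```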
Just to complete our runs, we'll tweak our best legitimate result to see if fewer threads help:
./bench3n -c 300000 -t 2 -nosynctxn -logbufsize 800 -partition 8 -n 1000000
nputs=2168686 (29708.02 puts/second)
ngets=3168684 (43406.63 gets/second)
ops=5337370 (73114.65 ops/second)
Or if more threads:
./bench3n -c 300000 -t 4 -nosynctxn -logbufsize 800 -partition 8 -n 1000000
nputs=2169016 (29311.02 puts/second)
ngets=3169014 (42824.51 gets/second)
ops=5338030 (72135.54 ops/second)
So it appears that on this machine, with basically unlimited cache, 72 seconds is the best we can do. That's 74137.50 operations per second.