3n+1

Here are the gory details about the 3n+1 benchmark I did with Berkeley DB.  I’ll try to make this entertaining, but it’s ultimately about running the same program over and over with slightly different parameters each time.  My blog post will have a more interesting (I hope!) discussion.

The benchmark I’m looking at was defined by K.S. Bhaskar here.  Bhaskar’s results are here.  The benchmark computes a recursive mathematical function for a range of integers 1 through N.  Results for each value are stored in the database.  In computing a new value, the database is consulted for other values.

It turns out that the function, at least in the ranges I tried, causes a mix of about 40% writes, 60% reads.  There are no deletes, so the database is always growing.

Here’s my configuration:

Model Name: MacBook Pro
Model Identifier: MacBookPro3,1
Processor Name: Intel Core 2 Duo
Processor Speed: 2.2 GHz
Number Of Processors: 1
Total Number Of Cores: 2
L2 Cache: 4 MB
Memory: 4 GB 667 MHz DDR SDRAM
Bus Speed: 800 MHz
Disk: 250GB Hitachi 5400 RPM
Compiler: i686-apple-darwin10-g++-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5664)
File system: Journaled HFS+
Database: Berkeley DB 5.1.19, built: '--enable-cxx --enable-shared --enable-static --enable-tcl --enable-test --enable-sql'

As far as I can tell, this is very close to the system used by Bhaskar, except:

  • he uses Linux/JFS, I’m using OSX/HFS+.
  • he has built his software with 64 bit, I am using 32 bit
  • his disk is faster – rotational speed 7200rpm vs my 5400rpm
  • his disk is smaller 200GB – vs my 250GB

Differences can mean a lot, so while I’m using Bhaskar’s results as a goal, I don’t know that they can be directly compared.  Maybe someone with more time on their hands can do a head to head.

I’ve coded using Bhaskar’s definition of the ‘enhanced’ benchmark.  That is, instead of storing raw integers, we store numbers ‘written out’, using a different language for each numeral.  So 409 is encoded as the string “quattro zero ஒன்பது”.  I like this enhancement, as it makes the record sizes more in the range of 20-40 bytes, rather than a fixed 4 or maybe 8 bytes.  That’s closer to record sizes I’ve seen in many applications.  It does make the keys longer as well – the keys are generally longer than their associated data, which is unlike typical apps.

The source code is on github here. It’s in a ‘mostly C dialect’ of C++.  That is, I started writing C, and found some features of C++ useful.  As it’s adapted from some other benchmarks I’ve written along the way, it’s not beautiful code, but it should be straightforward to read.

For motivation, I looked at Bhaskar’s best result shown with N=1000000, 183 seconds, to see if I could beat that.  He uses 5000 Global buffers (=~ BDB cache), which I believe is 20M, and 2000 Journal buffers ( =~ BDB log buffer size), that’s 8M.  I found that running with a different mix: 27M cache and <1M log buffer, makes better use of that same 28M for BDB.

The Results

Here’s my baseline. With these parameters, I don’t call DB_ENV->set_cachesize, so I get the default cache (about 256K). Run the benchmark to populate the first one million integers. Single threaded, no bells and whistles.

./bench3n -c 0 -n 1000000
N=1000000
result=525
time=8522
nputs=2168610 (254.47 puts/second)
ngets=3168608 (371.82 gets/second)
ops=5337218 (626.29 ops/second)

Yeah, that’s really bad. I only ran with the previous config once. All the other results from here on out were run 5 times in succession. I threw out the best and worst times, and averaged the other three.

This next one is the same configuration, but I’m running only to 100000 (one tenth the integers as previously).

../bench3n -c 0 -n 100000
N=100000
result=351
time=623
nputs=217211 (348.65 puts/second)
ngets=317209 (509.16 gets/second)
ops=534420 (857.81 ops/second)

That’s not too far from linear. The number of ops/sec has improved, but not dramatically. That indicates that after a certain point, we’re getting a somewhat steady proportion I/O hits for each value calculated.

The next case is setting the cache to 20 megabytes — and we’re back to one million values to compute. All following runs use one million input values.  By the way, the number I’ve been watching is operations per second. The benchmark has a pretty consistent ratio of reads to writes, so all of: puts/sec, gets/sec, ops/sec move proportionally.  The more relevant number is actually time (number of elapsed seconds). But bigger numbers are more fun.

../bench3n -c 20000 -n 1000000
N=1000000
result=525
time=273
nputs=2168610 (7943.62 puts/second)
ngets=3168608 (11606.62 gets/second)
ops=5337218 (19550.24 ops/second)

As expected, cache is vitally important in this benchmark, and 20M gives us something like a 30x improvement over 256K.  Now let’s allocate a total of 28M – almost all for cache, since that’s usually the critical resource for BDB.

./bench3n -c 27000 -logbufsize 1000 -n 1000000
N=1000000
result=525
time=217
nputs=2168610 (9993.59 puts/second)
ngets=3168608 (14601.88 gets/second)
ops=5337218 (24595.47 ops/second)

Even better, but not nearly good enough to reach our goal.

At this point, I started looking more closely at the GT.M program published by Bhaskar, and discovered that the program is not using strict durability for transactions, but are using the equivalent of DB_TXN_NOSYNC. (See the blog entry for discussion). So I’ll run with DB_TXN_NOSYNC as well. And I find that having a smaller log buffer size can also improve performance (somewhat paradoxically).

./bench3n -c 27000 -nosynctxn -logbufsize 100 -n 1000000
N=1000000
result=525
time=200
nputs=2168610 (10843.05 puts/second)
ngets=3168608 (15843.04 gets/second)
ops=5337218 (26686.09 ops/second)

Now, with 2 threads…

./bench3n -c 27000 -t 2 -nosynctxn -logbufsize 100 -n 1000000
N=1000000
result=525
time=167
nputs=2168679 (12986.10 puts/second)
ngets=3168677 (18974.11 gets/second)
ops=5337356 (31960.21 ops/second)

Okay, that’s faster than 183 seconds.

Let’s try 3 threads…

./bench3n -c 27000 -t 3 -nosynctxn -logbufsize 100 -n 1000000
N=1000000
result=525
time=178
nputs=2168832 (12184.44 puts/second)
ngets=3168830 (17802.41 gets/second)
ops=5337662 (29986.86 ops/second)

Wow… that’s a little slower. We’ll see later that partitioning the data set can make 3 threads run faster, but I didn’t think of that before trying some other things first.

Here’s something a little wacky. If we use a different way of sorting items in the database (first by the number of ‘digits’ in our spelled out key) we’ll tend to get more locality in our access pattern. Specifying a different (and slower!) set_bt_compare function goes against some conventional wisdom, but the locality win can be big.

./bench3n -c 27000 -t 3 -nosynctxn -logbufsize 100 -sortbylength -n 1000000
N=1000000
result=525
time=161
nputs=2168946 (13471.71 puts/second)
ngets=3168944 (19682.88 gets/second)
ops=5337890 (33154.59 ops/second)

Yes, locality rules – it basically makes better use of our still woefully undersized cache.

Now let’s try partitioning the data set. 4 pretty equally sized partitions should help.

./bench3n -c 27000 -t 3 -nosynctxn -logbufsize 100 -sortbylength -partition 4 -n 1000000
N=1000000
result=525
time=160
nputs=2168903 (13555.64 puts/second)
ngets=3168901 (19805.63 gets/second)
ops=5337804 (33361.27 ops/second)

Helps a little, we’re on the right track. Let’s try 8:

./bench3n -c 27000 -t 3 -nosynctxn -logbufsize 100 -sortbylength -partition 8 -n 1000000
N=1000000
result=525
time=157
nputs=2168945 (13814.93 puts/second)
ngets=3168943 (20184.35 gets/second)
ops=5337888 (33999.28 ops/second)

That’s a bit better. What I haven’t shown you here is that I did some analysis with db_stats to determine that there was some contention on the meta page, which is relieved by partitioning.  (Here’s my blog post about partitioning).

Alright. Let’s see how fast we can really go. Many applications run with plenty of cache memory these days, so let’s do the same.  The size of the database file is under 300M, so let’s make sure it will fit all in memory.

./bench3n -c 300000 -t 3 -nosynctxn -logbufsize 100 -sortbylength -partition 8 -n 1000000
N=1000000
result=525
time=80
nputs=2168990 (27112.37 puts/second)
ngets=3168988 (39612.35 gets/second)
ops=5337978 (66724.72 ops/second)

We almost doubled the speed. Now, since the database is totally in memory, we’re not getting any locality benefit from our custom sort function.

./bench3n -c 300000 -t 3 -nosynctxn -logbufsize 100 -partition 8 -n 1000000
N=1000000
result=525
time=73
nputs=2168897 (29710.91 puts/second)
ngets=3168895 (43409.52 gets/second)
ops=5337792 (73120.43 ops/second)

Let’s see if increasing log buffer size helps.

./bench3n -c 300000 -t 3 -nosynctxn -logbufsize 800 -partition 8 -n 1000000
N=1000000
result=525
time=72 <-- BEST
nputs=2168951 (30124.31 puts/second)
ngets=3168949 (44013.18 gets/second)
ops=5337900 (74137.50 ops/second)

A bit (it turns out that will be the best run). How about even more log buffer?

./bench3n -c 300000 -t 3 -nosynctxn -logbufsize 1600 -partition 8 -n 1000000
N=1000000
result=525
time=72
nputs=2168820 (30122.50 puts/second)
ngets=3168818 (44011.36 gets/second)
ops=5337638 (74133.86 ops/second)

Nope, 1600K doesn’t help, 800K was big enough.

Let’s try with DB_DSYNC, this uses a different syncing policy.

./bench3n -c 300000 -t 3 -nosynctxn -logbufsize 800 -logdsync -partition 8 -n 1000000
N=1000000
result=525
time=57
nputs=2168859 (38050.15 puts/second)
ngets=3168857 (55593.98 gets/second)
ops=5337716 (93644.14 ops/second) TAINTED

And fiddling with the log buffer size, this one’s the best:

./bench3n -c 300000 -t 3 -nosynctxn -logbufsize 600 -logdsync -partition 8 -n 1000000
N=1000000
result=525
time=54
nputs=2168916 (40165.11 puts/second)
ngets=3168914 (58683.59 gets/second)
ops=5337830 (98848.70 ops/second) TAINTED

Let’s try fewer threads:

./bench3n -c 300000 -t 2 -nosynctxn -logbufsize 600 -logdsync -partition 8 -n 1000000
N=1000000
result=525
time=56
nputs=2168695 (38726.69 puts/second)
ngets=3168693 (56583.80 gets/second)
ops=5337388 (95310.50 ops/second) TAINTED

And more threads:

./bench3n -c 300000 -t 4 -nosynctxn -logbufsize 600 -logdsync -partition 8 -n 1000000
N=1000000
result=525
time=56
nputs=2169033 (38732.73 puts/second)
ngets=3169031 (56589.83 gets/second)
ops=5338064 (95322.57 ops/second) TAINTED

Okay, these were a little too good to be true.  A little research reveals that O_DSYNC is broken on Mac OS X (see this discussion). For DB, I think we can’t really count the last four results, given the discussions, I think we may end up with an unrecoverable database. I’ve marked them here as TAINTED. But I wanted to show them because they do show that a fair amount of our latency is due to waiting for fsync(). If we ran with the log in memory, and durability achieved via replication, we may see results this good.

Just to complete our runs, we’ll try to tweak out best results to see if fewer threads helps:

./bench3n -c 300000 -t 2 -nosynctxn -logbufsize 800 -partition 8 -n 1000000
N=1000000
result=525
time=73
nputs=2168686 (29708.02 puts/second)
ngets=3168684 (43406.63 gets/second)
ops=5337370 (73114.65 ops/second)

Or if more threads:

./bench3n -c 300000 -t 4 -nosynctxn -logbufsize 800 -partition 8 -n 1000000
N=1000000
result=525
time=74
nputs=2169016 (29311.02 puts/second)
ngets=3169014 (42824.51 gets/second)
ops=5338030 (72135.54 ops/second)

So, it appears on this machine, 72 seconds is the best we can do, with basically unlimited cache. That’s 74137.50 operations per second.

Advertisements

4 Responses to 3n+1

  1. Pingback: Revving up a benchmark: from 626 to 74000 operations per second | libdb

  2. Martin says:

    Hi Don,

    finally I could find a good blog about performance of BDB. Thanks a lot for that 🙂

    I looked at your code and I think there is a small error in line 474:
    else if (args->txn == NOSYNC)

    Shouldn’t you check here if txn is WRITENOSYNC?

    Best regards,
    Martin

    • ddanderson says:

      Thanks, Martin! Nice catch, that explains why I didn’t see *any* improvement with DB_TXN_WRITE_NOSYNC over the default transaction sync policy. I’m making several changes to the code in response to feedback, this change will be in. I’ll post again when I’m finished.

  3. musze says:

    Hi Don,

    Thanks for providing details of this benchmark, and for linking to Bhaskar’s site. I enjoy reading your blog.

    I was very intrigued by your 3n+1 post, so I took both your benchmark and Bhaskar’s GT.M routine and ported both to InterSystems Caché (basically, I created Caché Object Script Routines). I just wanted to share my results with you and the rest of the readers of your blog.

    I’ve provided all the gory details of my test and configuration on the intersystems-cache-googlegroup – you can read all about it here: https://groups.google.com/group/intersystems-public-cache/browse_thread/thread/5b3d9b75236d7f83?hl=en.

    In summary, I found the following:
    1.Running your test, my results for Caché on a 2.66 GHz Intel i7 Mac with 8GB memory and 256MB of database cache took 37 sec for 1-1000000
    numbers range running 3 jobs. This appears to be roughly twice as fast as your test with BDB.
    2. Running Bhaskar’s tests (on the same hardware), I found:
    – For 4 worker processes: my test completed roughly 22% faster with ~25% more updates/sec and ~27% more reads/sec.
    – For 8 worker processes: my test completed roughly 36% faster with ~34% more updates/sec and ~6% more reads/sec.

    Let me know if you have specific questions about my tests – happy to provide details.

    Thanks,

    Vik

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s