I started looking at the 3n+1 benchmark described here to see how well Berkeley DB (BDB) fared against GT.M (an implementation of the language formerly known as MUMPS). I built my benchmark, then tweaked and tuned. Along the way I discovered some interesting things.
My first run netted 626 operations per second. In BDB technical lingo, this is known as… slow. Okay, starting with 626 is a little contrived. Who actually runs Berkeley DB with the default cache setting? But I wanted to prove a point: you really need to do some tuning of any application to take full advantage of BDB. But seriously, two orders of magnitude? That’s a pretty hefty speedup.
All the gory details are here; these are the highlights. The benchmark measures how fast you can compute results for the 3n+1 function. The function is defined recursively, so computed results are stashed in a database or key/value store and reused by later calculations. This statement of the benchmark requires numbers to be stored as blank-separated words drawn from various natural languages; for example, 409 is “quattro zero ஒன்பது”. This verifies that different character sets can be stored as keys and values in the database.
My initial naive coding, following the defined rules, got the expected dismal result. I immediately increased the cache size and got a hefty improvement. There were still plenty of options for improvement, but as I was looking at the results Bhaskar reported for GT.M, I started to wonder if we were playing by the same rules.
I spent a little time reviewing the published source of the GT.M (also called M) program. Reading M is a bit of a challenge. Then I discovered something interesting in the GT.M programming manual I found online here. If you search for JNLWAIT in that manual, you’ll see that ‘Normally, GT.M performs journal buffer writes asynchronously while the process continues execution’, and under JNLFLUSH this line: ‘Normally GT.M writes journal buffers when it fills the journal buffer pool or when some period of time passes with no journal activity.’ That doesn’t sound like full ACID semantics to me (the D in ACID means durable transactions), and I don’t see anything to indicate the M program uses JNLWAIT or JNLFLUSH. The statement of the benchmark doesn’t actually come right out and say ACID is required, but the discussion of a legal replication solution seems to imply it. Yeah, it’s a little confusing. But it looks like the M program has suspended the requirement for immediate durability of each transaction.
And you know, that may make some sense. For most problems, you want to go as fast as you can, with all the tools at your disposal. Rules and conventional wisdom should be questioned. There are plenty of applications out there that don’t need durability at every transaction boundary. So to match the GT.M program, I added the equivalent option, DB_TXN_NOSYNC, to my benchmark.
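In the C API that’s a call to `dbenv->set_flags(dbenv, DB_TXN_NOSYNC, 1)` before opening the environment; it can also be set without recompiling, via a DB_CONFIG file in the environment home directory. A minimal sketch:

```
# DB_CONFIG, placed in the BDB environment home directory.
# Skip the synchronous log flush at each commit.  Committed
# transactions may be lost if the system crashes, but the
# database itself stays consistent.
set_flags DB_TXN_NOSYNC
```

With this flag, commits only write the log buffer in memory; the log reaches disk when the buffer fills or is explicitly flushed, which is roughly the behavior the GT.M manual describes for its journal buffers.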
Another oddity. As best I can read the M program, the worker threads start running before the start time is collected. A bit like a stopwatch being started after the runners hear the starting gun. Oops. Maybe I misunderstand the program, but in any case, I didn’t replicate that.
Speaking of replication, I didn’t take advantage of BDB’s replication either. Everything’s running on my trusty Apple MacBook Pro laptop (a few years old).
The results reported for M use 20M of ‘global buffers’ (equivalent to the BDB cache) and 8M of ‘journal buffers’ (equivalent to the BDB log buffer). I happen to know that cache is a lot more important than log buffers in BDB, so I used the same 28M total in my configuration, but distributed it differently: 27M of cache and less than 1M of log buffer. (It turns out that in some cases, smaller log buffers can give better results.)
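That split can also be expressed in the DB_CONFIG file; the exact byte counts below are my rounding of 27M and 800K, not the benchmark’s literal values:

```
# DB_CONFIG: 27MB cache in one region, ~800KB log buffer.
# set_cachesize takes gigabytes, bytes, and the number of
# cache regions; set_lg_bsize takes the log buffer size in bytes.
set_cachesize 0 28311552 1
set_lg_bsize 819200
```

The same settings are available programmatically as `dbenv->set_cachesize()` and `dbenv->set_lg_bsize()`.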
I used 3 threads, as that seemed to be the sweet spot for BDB in my setup. With that I got results a bit better than those published for the M program, on pretty close to the same hardware. Mine’s written in C++ (but mostly the C subset), and it is a bit long – I exposed all the various options I played with as command line options for easy testing. It’s here if you want to look.
After that, I decided to crack the test wide open by making the cache large enough that the entire database fits in memory. A lot of apps I see run like this. I also found it helpful to partition the data. The final result was 30124 transactional puts per second and 44013 gets per second, for 74137 operations per second in total. The total runtime of the program was 72 seconds, down from 8522 seconds for my first run.
It was interesting to see what was helpful at various points. With a very small cache, partitioning was counterproductive: it inflated the data size in the cache a bit, so even though it reduced contention, it made the cache less effective, which mattered more. Changing the btree comparison function to get better locality helped. Using trickle had mixed results.
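In BDB a custom comparator is installed with `DB->set_bt_compare()`. To show the kind of locality change I mean (this is a guess at the idea, written against plain strings rather than the benchmark’s actual comparator or BDB’s DBT types): with the default bytewise ordering, the word-keys for numerically adjacent numbers scatter across the btree; ordering by word count first groups numbers of the same magnitude together.

```c
#include <string.h>

/* Count blank-separated words in a key ("quattro zero nove" -> 3). */
static int word_count(const char *s)
{
    int n = (*s != '\0');
    for (; *s != '\0'; s++)
        if (*s == ' ')
            n++;
    return n;
}

/* Order keys by word count first, then bytewise.  All one-word
 * numbers sort before all two-word numbers, and so on, so keys
 * touched together tend to land on nearby btree pages. */
static int locality_cmp(const char *a, const char *b)
{
    int wa = word_count(a), wb = word_count(b);
    if (wa != wb)
        return wa - wb;
    return strcmp(a, b);
}
```

Better locality means hot keys cluster on fewer pages, which stretches a small cache further; once the whole database is in cache, the clustering no longer buys anything, matching what I saw.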
When the cache held all the data, partitioning was helpful; changing the btree compare was not, and neither was trickle. At no point in my testing did a log buffer size over 800K make much of a difference. But then, the transactions weren’t very large at all.
What configuration options should you choose? As you can see, it all depends on your situation. Which leads to the last point.
Your mileage will vary. Your app will run slower, or faster, depending on lots of things that only you know: your data layout (in this benchmark, keys and values are both typically a couple dozen bytes); your hardware and OS (BDB runs on everything from mobile phones to mainframes); the other processing your application does; your data access pattern; your memory usage.
And only you know which rules you can bend, and which rules you can break.