Revving up a benchmark: from 626 to 74000 operations per second

[Note: K.S.Baskar has responded to my post here. I have more to say on this here].

I started looking at the 3n+1 benchmark described here to see how well BDB fared against GT.M (the language formally known as MUMPS). I built my benchmark and then tweaked and tuned. Along the way I discovered some interesting things.

My first run netted 626 operations per second. In BDB technical lingo, this is known as… slow. Okay, starting with 626 is a little contrived. Who actually runs with the default cache setting with Berkeley DB? But I wanted to prove a point. You really need to do some tuning of any application to begin to take full advantage of BDB. But seriously, two orders of magnitude? That’s a pretty hefty speedup.

All the gory details are here. Here are the highlights. The benchmark measures how fast you can compute results for the 3n+1 function. It’s a function that is defined recursively, so that computed results are stashed in a database or key/value store and are used by successive calculations. This statement of the benchmark requires numbers to be stored as blank separated words drawn from various natural languages. For example, 409 is “quattro zero ஒன்பது”. This tests that different character sets can be stored as keys and values in the database.

My initial naive coding, following the defined rules, got an expected dismal result. I immediately increased the amount of cache size and got a hefty improvement. There were still plenty of options for improvement, but as I was looking at the results reported by the Bhaskar for GT.M, I started to wonder if we were playing by the same rules.

I spent a little time reviewing the published the source of the GT.M (also called M) program. Reading M is a bit of a challenge. Then I discovered something interesting in the GT.M programming manual I found online here. If you search for JNLWAIT in this manual you see that ‘Normally, GT.M performs journal buffer writes asynchronously while the process continues execution’, and under JNLFLUSH this line: ‘Normally GT.M writes journal buffers when it fills the journal buffer pool or when some period of time passes with no journal activity.’ That doesn’t sound like full ACID semantics to me (ACID implies durable transactions), and I don’t see anything to indicate the M program uses JNLWAIT or JNLFLUSH. The statement for the benchmark doesn’t actually come right out and say ACID is required, but the discussion of a legal replication solution seems to imply it. Yeah, it’s a little confusing. But It looks like M program has suspended the requirement for immediate durability of each transaction.

And you know, that may make some sense. For most problems, you want to go as fast as you can, with all the tools at your disposal. Rules or conventional wisdom should be questioned. There’s a ton of applications out there that don’t need durability at every transaction boundary. So to match the GT.M program, I decided to add the equivalent option DB_TXN_NOSYNC to my benchmark.

Another oddity. To the best of my reading of the M program, it looks like the worker threads start running before the start time is collected. A bit like stopwatch being started after the runners hear the starting gun. Oops. Maybe I misunderstand the program, but in any case, I didn’t replicate that.

Speaking of replication, I didn’t take advantage of BDB’s replication either. Everything’s running on my trusty laptop Apple MacBook Pro (a few years old).

The results reported for M uses 20M of ‘global buffers’ (equivalent to BDB cache) and 8M of ‘journal buffers’ (equivalent to BDB log buffer). I happen to know that cache is a lot more important than log buffers in BDB, so I used the same 28M in my configuration, but distributed them differently: 27M of cache and less then 1M of log buffer size. (It turns out that in some cases, smaller log buffers can give better results).

I used 3 threads, as that seemed to be the sweet spot for BDB in my setup. So I got some results, a bit better than those published for the M program, on pretty close to the same hardware. Mine’s written in C++ (but mostly the C subset), and it is a bit long – I put all the various options I played with as command line options for easy testing. It’s here if you want to look.

After that, I decided to crack open the test completely — making the cache large enough that the entire database is in memory. There’s a lot of apps I see that run like this. I found it helpful to partition the data. The final result had 30124 transactional puts per second, 44013 gets per second, yielding 74137 operations per second. The total runtime of the program was 72 seconds, down from 8522 seconds for my first run.

It was interesting to see what was helpful at various points. With a very small cache, partitioning was counterproductive – it inflated the data size a bit in the cache, so even while helping contention, it made the cache less effective, which was more important. Also, changing the btree comparison function to get better locality helped. Using trickle had mixed results.

When the cache held all the data, partitioning was helpful, changing the btree compare was not, trickle was not. At no point in my testing did a log buffer size over 800K make much of a difference. But then the transactions weren’t very large at all.

What configuration options should you choose? As you see, it all depends on your situation. Which leads to the last point.

Your mileage will vary. Your app will run slower, or faster, depending on lots of things that only you know: Your data layout – this benchmark has both keys and data typically a couple dozen bytes. Your hardware and OS – BDB runs on everything from mobile phones to mainframes. The other processing that your application does. Data access pattern. Memory usage.

And only you know which rules you can bend, and which rules you can break.

About ddanderson

Berkeley DB, Java, C, C , C# consultant and jazz trumpeter

View all posts by ddanderson →

This entry was posted in Uncategorized. Bookmark the permalink.

In GT.M transaction processing (anything code executed between a tstart and a tcommit) has ACID properties. In the benchmark, I use it in the following places: - Updating statistics just before a worker process shuts down - the number of reads, the number of updates, the highest number encountered, etc. So, this is once per process. - Whenever a new longest 3n+1 sequence is found by a worker process. So, on a run from 1 through 1,000,000, this will happen many times. I don't know how many transaction commits are actually performed, but it is probably in the hundreds.

12 Responses to Revving up a benchmark: from 626 to 74000 operations per second

dsegleau says:

January 31, 2011 at 5:43 am

Don,

Great blog entry. Thanks. I’ll be interested to see what you think of the dynamic memory allocation feature that we’re including in the next release. It should make out-of-the-box configuring of the cache an easier proposition.

Regards,

Dave

- ddanderson says:
  
  January 31, 2011 at 12:24 pm
  
  Thanks, Dave. Very much looking forward to the dynamic allocation feature. Probably wouldn’t make a big difference on this particular benchmark, but for apps that struggle with getting #lockers, #lockobjects, etc. just right, it will win big.
  
K.S. Bhaskar says:

January 31, 2011 at 10:13 pm

Don, I’m thrilled that you decided to implement a solution to the 3n+1 benchmark problem. I started to post comments, and then decided that I was going to be too verbose for a response. So, I posted them as comments on my blog: http://ksbhaskar.blogspot.com/2011/01/comments-in-response-to-don-andersons.html

If you have any questions, comments or concerns, please do ask, either publicly, or offline – e-mail me at ks dot bhaskar at fisglobal dot com.

Regards
— Bhaskar (yes, that’s my last name, but that’s what I’m called)

- ddanderson says:
  
  February 1, 2011 at 2:16 pm
  
  Bhaskar, I’m pleased to meet you. Thanks so much for your detailed reply. My biggest concern was about the transactional requirements of the benchmark, that seems clearer now. I admit that the race condition for starting time is a small nit. It’s very difficult to get an accurate start time, I like to be conservative on that one.
  
  On my very last run chronicled at https://libdb.wordpress.com/3n1/ , I did rerun my best result with 4 threads, it took 74 seconds, instead of 72 for 3 threads (running with N=1,000,000). So, yes, there's a some penalty for having more contention with more threads.
  
Rob says:

February 1, 2011 at 7:24 am

More related discussion is happening here too: https://groups.google.com/group/intersystems-public-cache/browse_thread/thread/5b3d9b75236d7f83?hl=en

- ddanderson says:
  
  February 1, 2011 at 2:01 pm
  
  Thanks, Rob. (Anyone interested in following the link needs to join the google group – it’s for discussing the Cache product. They’ve run some 3n+1 benchmarks of their own).
  
Dan Weinreb says:

February 1, 2011 at 4:54 pm

I left an extensive comment at Bhaskar’s blog, which I’ll repeat here:

First: I do not know what you mean by “recoverable”. If the transactions are not durable, then you cannot recover them. Recoverable in the database literature means that you can recover the state as it was after all transactions that committeed, in the face of any “stop” including total system crash. The log file must be persisted when a (write) transaction commits. Omitting this can speed things up a lot; real durability isn’t cheap if you have to actually write the log to rotating magnetic storage. (Less so if you can use FLASH, or you disk has a battery-backed-up RAM buffer, or you’re copying the data to main memory in two servers with independent failure modes, or something like that.)

Is this really a database benchmark? You use the word “NoSQL” in the name. But since this is just a computation that ends promptly, getting the job done does not need durability anyway. It just runs to completion. That is, unless you want the application to complete even if a server crashes; but then measuring the execution time would not make sense. I don’t even see why there’s any need for a log at all.

Second: This benchmark appears to be operating on key/value pairs that are very small, containing small character strings (naming integers) rather than, say, images. Don Anderson says that the record sizes are typically 20-40 bytes. Some of the “NoSQL” systems are clearly designed for much larger K/V pairs. I am pretty sure that Riak (with BitCask) is like this, as well as Google’s BigTable and the clones thereof (HBase in Hadoop, and Cassandra).

Third: Many of the NoSQL systems are specifically designed to use replication, to the point where using them without replication is simply not done, and therefore not optimized for. This might make a difference.

Fourth: About tuning, this is a problem that always arises in benchmark comparisons. I have been involved in many such comparisons in my career, especially the “OO7” benchmark for object-oriented database systems. “Apples to apples” gets to be very hard to define fairly. This isn’t your fault; it’s an inherent problem in comparative benchmarks.

It’s a good idea to have well-specified benchmarks, and this is a good start!

- K.S. Bhaskar says:
  
  February 2, 2011 at 6:55 pm
  
  I responded to Dan Weinreb’s comments with another post (http://ksbhaskar.blogspot.com/2011/02/responses-to-dan-weinrebs-comments-of.html) because it was too long for a reply.
  
  — Bhaskar
  
ddanderson says:

February 1, 2011 at 5:27 pm
In private email, on Feb 1, 2011, at 10:30 AM, Bhaskar, K.S wrote:
```
  In GT.M transaction processing (anything code executed between
  a tstart and a tcommit) has ACID properties.  In the benchmark,
  I use it in the following places:

    - Updating statistics just before a worker process shuts down -
       the number of reads, the number of updates, the highest
       number encountered, etc.  So, this is once per process.
    - Whenever a new longest 3n+1 sequence is found by a worker
       process.  So, on a run from 1 through 1,000,000, this will
       happen many times.  I don't know how many transaction
       commits are actually performed, but it is probably in the hundreds.
```
(end of quote)

Based on this, I misunderstood this sentence in the benchmark description: ‘restarting the program with the same input i and j uses the partial results in the database prior to the crash’. My initial solution (before turning on DB_NOSYNC) defined ‘partial results’ much more broadly – to include all values computed for any N up to the point of failure. At the same time, I understand now that my program is not doing the same amount of work as Bhaskar’s, I am not storing or consulting the single ‘result’ value in my database. As time permits, I’ll need to fix and rerun. I don’t think it will be a huge hit against the elapsed time. I’ll have more to say in my next post.
Pingback: I broke the rules on the 3n+1 benchmark… again | libdb
Pingback: I broke the rules on the 3n+1 benchmark… again | libdb
K.S. Bhaskar says:

February 13, 2011 at 4:51 am

I did some refinement of performance with GT.M that I posted here: http://ksbhaskar.blogspot.com/2011/02/from-44-seconds-to-27-seconds-simple.html

Regards
— Bhaskar

	ddanderson on No Threads Attached
	chang on No Threads Attached
	ddanderson on No Threads Attached
	chang on No Threads Attached
	Schema evolution, or… on Cleaning up your zoo