This post follows up on last week’s post of the 3n+1 benchmark I wrote using Berkeley DB. There are plenty of reasons for this. Lots of meaty comments and discussion from K.S.Bhaskar and Dan Weinreb. Rob Tweed has pointed us to a discussion where some folks followed up on this and tried Intersystems Caché with the same benchmark. Martin found an error in my program. The discussion, in comments on my blog, but also on email and other blogs, has been lively and interest has been high. Thank you all!
The primary movitivator to revisit this was that I broke the rules on the 3n+1 benchmark, in a way I didn’t understand last week. K.S.Bhaskar was good enough to enlighten me on my questions about what needs to be persisted and when in the benchmark. In my defense, I have to say that I was reading two different versions of the benchmark statement at various times. And neither one is free of ambiguity. Dan Weinreb has doubts that the current statement requires any persistance at all.
When I coded my solution, I knew it did not obey the strict rules of the benchmark, especially regarding the format of inputs and outputs. That’s one reason I never ‘officially’ submitted my results. But I thought I was doing the same amount of work between the start and finish line. In reading Baskar’s response, I realized two important things. First, that the intermediate results known as ‘cycles’ need not be transactional. Second, that the final result (maximum length cycle so far) needs to be transactional at each step that a new maximum is found.
The first point basically cleared up the issues I raised with the use of JNLFLUSH in the M benchmark program and does indeed make it equivalent to the use of DB_TXN_NOSYNC in my benchmark. The second point was that the final maximum-length-cycle result needed to be persisted in a transactional way. My first published code didn’t even store this result in the database, I kept it in per-thread 4-byte variables, and chose the maximum at the end so I could report the right result. So I definitely didn’t play by the rules last week.
To get out of the penalty box, I corrected the benchmark to make the final results transactional and reran it. For every value computed, the program transactionally reads the current maximum and compares it and if a new maximum is found, stores it. This uses a database that contains exactly one value. Depending on how you read the rules, an optimization might allow each thread to keep (in local storage) the previous maximum that thread knew about and not even consult the result database if the value had not exceeded it. I did not include that optimization, but I note it in case we are trying to define the benchmark more rigorously.
There was a several second bump in run times when I first coded the change this in the most straightforward way. Here’s some pseudocode from that run:
begin transaction // RMW holds an exclusive lock get old_value from result database using DB_RMW if any errors abort txn and break if (new_value > old_value) then put new_value to result database if any errors abort txn and break end if commit transaction
Actually that whole thing is wrapped in a loop in order to deal with potential for deadlock. It’s in the published code in the function update_result().
The problem with this code is that an exclusive lock is held on the data from the point of the first DB->get (using DB_RMW) until the commit. The DB_RMW (read-modify-lock) flag is useful because it ensures that a classic deadlock scenario with two threads wanting to update simultanously is avoided. But the exclusive lock increases the surface of contention.
One way out is to remove the DB_RMW flag and let the outer loop catch the inevitable deadlocks and retry. I didn’t consciously consider this until now because I saw another approach. Since the value is strictly increasing, it can be checked first with a naked DB->get, compared and only if the value needs to be changed do we do the ‘check and set’ completely within a transaction. Is that legal by the rules? I don’t know. I do know that this modification ran the benchmark at 72 seconds using ‘maximum’ cache for both 3 and 4 threads. That’s the same as the previous result I reported.
At one point, I also made a version that stored the maximum result using the string version of the numerals, where 525 is “пять deux пять”. This is not part of the benchmark statement. This one got 86 seconds with 4 threads and maximum cache. A full 20% slower – I think that really shows the CPU strain of coding/decoding these numbers. Anyone doing serialization/deserialization of objects into a database knows what I’m talking about.
Sadly, the current version of source that I put on github runs a little more slowly. After realizing that I had been looking at two versions of the problem statement, I looked more closely at the newer one and again misinterpreted one part to mean that I needed to do more – transactionally store the maximum value that was visited. Looking at the submitted M program, I don’t think that this needs to be kept transactionally. All this changing and rerunning becomes rather painful because I don’t have a separate test machine. Each benchmark run requires me to shutdown browsers, email, IMs and leave my laptop alone during the various runs. It’s not fun for me, and I expect it’s not fun to read about.
So here’s something a little more entertaining to think about. In various runs, I occasionally saw that adding a trickle thread was useful. (Is an extra thread even legal? Grrrrr.) Trickle helped when there was extra CPU to go around, and when cache was scarce. It hurt when CPU was scarce. When it worked, small trickles, done frequently, did the trick. But change the input parameters or other configuration, and trickle’s overhead hurt more than it helped. It got me thinking about coding up a better trickle thread – one that uses various available metrics: dirty pages forced from the cache from the memory stats, the percentage of dirty pages, current CPU usage, and the effectiveness of previous trickle calls. Surely we could make an adaptable trickle thread that’s useful for typical scenarios?
And while we’re on the subject of trickle, I wonder why BDB’s pseudo-LRU algorithm for choosing a buffer to evict in the cache to use doesn’t even consider whether a buffer is dirty? If it does, I don’t see it – take a look at mp/mp_alloc.c . If BDB knew there was a trickle running, it seems like in the main thread it would want to choose old clean pages to evict rather than slightly older dirty pages. Maybe another reason to have some tighter coordination by having a built-in default trickle thread available.
I think Berkeley DB Java Edition has the right strategy for utility threads like this. Rather than have everyone roll their own, create a reasonable default thread that knows all the important metrics to look at and adapts accordingly. Add a few configuration knobs to deal with extreme cases. If you don’t like it, you can drop in your own solution. But at least you have something reasonable to start with. Is it time for DB core to pick up on this? Maybe it’s a community effort?
Speaking of Java, it would certainly be instructive to revisit the benchmark with solution coded using Java and Berkeley DB. I think we’d learn a lot and and we could get JE in the picture too. Someday. Oh yeah, someday I should also fix up the current benchmark. Which brings us full circle.
The original points of my previous post stand. With Berkeley DB, you really have to do some tuning. I’m pretty certain that had I coded the right solution from the start, I would have still seen a 100x speedup. I’m pretty certain that your mileage, and the techniques you’ll need to employ for your app, will vary. Lastly, I’m pretty certain that I can’t be very certain about benchmarks.