Hot backup with smart updates

You’re set up for speed. You have all your many gigabytes of data in memory. You have your log files on an SSD, or maybe you’re using replication to stream your updates to your a co-located backup. Your backup is on a separate power supply, but it doesn’t much matter because you have UPS. Now you’re set for speed and reliability. Next, you partition your data set and remove dependencies so you can put each partition on a separate machine, each with its own backup. Woohoo! Speed, reliability and scalability!

Slow down partner. What happens when your data center goes out? Huh?! Yeah, we’re talking about natural disasters here. Loss of power or connectivity for hours, days or beyond. Maybe even destruction of the physical media. Now there’s a phrase that strikes fear into the hearts of system designers and administrators alike.

Okay, well, let’s look at our replication strategy again. We’ll have an offsite replication server or servers, and we’ll ship the replication traffic to them. Nice, but it can get expensive. That speed we were mentioning means we’re generating lots of megabytes of log data per second.

If we can relax the synchronicity requirements, we might consider hot backup over the network. That would be implemented by using the db_hotbackup utility to local storage followed up by a network transfer of the backed up environment to our remote server. With the basic way of doing hot backup, we transfer the whole database followed by the changed log files. If our databases are huge (and they often are), this is not a better solution, since even a daily transfer of terabytes of data might add a large expense. And it can be slow.

The db_hotbackup utility does have that nifty -u option to update a hot backup. But that requires that all the log files since the last backup be present, and effectively shipped as part of the backup. Not really much better, transfer-wise, than replication.

Here’s another thought. What if we had a network-aware hot backup utility that worked a little like a smart rsync? That is, it compares databases on the local and remote machine, and copies just the changed blocks (like rsync’s –inplace option). After that’s done, copy all the log files since the beginning of the sync.

The advantage is evident when I have a large database that has a lot of updates, and many of the updates are localized. Think of a single record that may be updated 100 times between the time of two different backups. Then, instead of copying 100 log records pertaining to that record, I’m only copying the record itself.

Back in the dark ages, when there was no hot backup utility, every BDB user wrote their own. Can we write our own db_netbackup that goes back to these roots, and uses the rsync idea? Maybe. I think db_hotbackup has some extra abilities to coordinate with other administrative functions, so that, for example, needed log files are not auto-removed during a hot backup. So we’d need to consider this when we’re messing with hot backup. Possibly the best approach would be to clone db_hotbackup, and have it call an external rsync process at the appropriate point to copy files.

Too my knowledge, nobody has done this. Am I the only one that sees the need?

That’s it. Speed, reliability and scalability, now with inexpensive disaster proofing.

Posted in Berkeley DB | Leave a comment

Schema evolution, or joke driven development?

In a previous post we talked about a fast and loose way to clean up your zoo — that is, how to evolve the structs you store with BDB. I stated that that fast and loose Perl script couldn’t be used transactionally. Baskhar’s comment ‘What if the zoo needs to keep operational even while eliminating peanuts?’ directs this post. And if you’re using the offline-upgrade route anyway, Alexandr Ciornii offered some improvements to the perl script. See, I do read those comments!

Somehow this question reminds me of the old joke about three envelopes. A new manager is appointed to a position and the on the way out, the old manager hands her three envelopes. The outgoing manager says, “in times of crisis, open these one at a time.” Well, after a short time, the new manager finds herself in hot water and she opens the first envelope. It says, “Blame your predecessor.” She does that, and things cool off for a while. But when the going gets tough again, she opens the second envelope. “Reorganize.” She promptly shuffles the organization structure and somehow that makes things better. Sometime later, she is faced with yet another crisis. She doesn’t have any choice but to open envelope #3. It says, “Prepare three envelopes.”

Now, suppose you were new to the zoo project, and you’ve been told that the zoo needs to stay up and running. If you’re lucky, you received those three envelopes from your predecessor.

Here’s our original struct:

    struct zoo_supplies {
       int n_bananas;
       int n_peanuts;
       int n_bamboo_stalks;

and where we want to get to:

    struct downsized_zoo_supplies {
       int n_bananas;
       int n_bamboo_stalks;

Envelope #1 tells us to blame our predecessor. Indeed, here’s what he could have done – put a version number in every struct to be stored. It should be first so it will never change position:

    struct zoo_supplies_versioned {
       int version_num;
       int n_bananas;
       int n_peanuts;
       int n_bamboo_stalks;

version_num would be zeroed initially. Then, to downsize our struct, we’d have this:

    struct downsized_zoo_supplies_versioned {
       int version_num;
       int n_bananas;
       int n_bamboo_stalks;

with version_num always being 1. When you read the struct from the database, look at version_num first so you know which one to cast it to. If you’re using C++/Java, you might want to inherit from a common class containing version_num.

This is all elementary stuff, our predecessor really missed the boat!

While blaming your predecessor might feel good, it didn’t solve the problem. The heat’s still on. So let’s go to envelope #2, and reorganize. All your data is starting in this format, which you’ve renamed:

    struct zoo_supplies_version_0 {
       int n_bananas;
       int n_peanuts;
       int n_bamboo_stalks;

Before we get to the final version, let’s introduce this one:

    struct zoo_supplies_version_1 {
       int version_num;
       int n_bananas;
       int n_peanuts;
       int n_bamboo_stalks;

Every new insert or update in the database uses zoo_supplies_version_1 (with version_num set to 1). When you get a record from the database, you can’t use the version_num field yet. Rather, you look at the size returned from DB, if you get sizeof(zoo_supplies_version_0), then cast it to that struct, otherwise cast to zoo_supplies_version_1.

This approach alters the database a little at a time, but we really need a push to get it all done. How about a little background utility that marches through the database to convert a record at a time. We’ll want to turn on the DB_READ_COMMITTED flag for its cursor to make sure it’s not holding on to any locks it doesn’t need.

Once we’ve confirmed that the background utility has done its march and every record is version 1, then we can finally make the real mod we’re seeking:

    struct zoo_supplies_version_2 {
       int version_num;
       int n_bananas;
       // sorry, no more peanuts
       int n_bamboo_stalks;

At this point, we can reliably use the version_num field, and cast as appropriate.

But there’s an implicit problem here with adding a version_num field at all. Many BDB databases have a small record size – the 12 bytes in our toy example is not often too far off the mark. Adding 4 bytes to a small record can result in a proportionally large increase in the overall database size. If you’re memory tight and cache bound, your runtime performance may suffer in even greater proportion.

Time to open envelope #3? Not just yet, there’s a couple other solutions that just might have a better punchline. See you next post.

Posted in Uncategorized | Leave a comment

Keys to a performance win

You might think that you don’t have much of a choice in selecting keys for a database. Sometimes you have more choices than you think. And those choices can make a whale of a difference when it comes to performance.

Let’s start with a simple example using an online shop. Sure, you say, another online store example. Yeah, but this store deals exclusively with ultimate frisbee supplies! Or dog booties. Maybe lepidopterist hats? Now that I’ve roped in a few random google surfers, let’s get started :-).

Our order numbers are plain old integers, and we want to store the orders in a BDB database. Assuming we’re coding this in C/C++ we might start with a simple model like this:


struct OrderKey {
    int orderNumber;
struct Order {
    int orderItemList;    // a key into another data store
    int accountNumber;
    char shippingAddress[100];


The data for an order’s going to be a lot more complex than this, but we’re focusing on the key here. What could be simpler, just an integer, right?

If you use this key/value arrangement, things will function just fine. But when you start looking at what’s happening performance-wise, you might be surprised. Since we generate our order keys sequentially, we want our key/value pairs to appear sequentially in the Btree. But that won’t necessarily happen. If your processor is a ‘LittleEndian’ one, like that x86 that dominates commodity hardware, the order number appears with the least significant byte first. BDB processes keys as sequences of bytes — it has no clue that those bytes made up a meaningful integer. So order number 256 (0x00000100) is stored as these bytes: 00 01 00 00.  With its least significant ’00’ byte, it appears before order number 1 (and 2 and 3…) in the database. 257 appears between 1 and 2. To drive home the point, here’s the first chunk of keys you’d see in your database after storing order numbers 1 through 1000 as integer keys.


db bytes           interpreted as 'int' order number
00 01 00 00        256
00 02 00 00        512
00 03 00 00        768
01 00 00 00          1
01 01 00 00        257
01 02 00 00        513
01 03 00 00        769
02 00 00 00          2
02 01 00 00        258
02 02 00 00        514
02 03 00 00        770
03 00 00 00          3
03 01 00 00        259
03 02 00 00        515
03 03 00 00        771
04 00 00 00          4

The gap between order numbers 1 and 2, and between 2 and 3, etc., gets wider and wider as we add more and more orders.

Why should we care?  We care because if our access pattern is that the most recently allocated order numbers are accessed most recently, and those orders are scattered all over the btree, well, we just might be stressing the BDB cache. That is, we’d need more, more, more cache as the database gets bigger, bigger, bigger. Even if my application’s data set is entirely in memory, I still may be missing out on other cache benefits at the hardware level. We’re pretty much losing all the Locality of Reference inherent in our access pattern.

Okay, if scattered data is the disease, let’s look at the cures. DB->set_bt_compare() does the job, but with caveats. The bt_compare function is a custom function that allows you to do the comparison any way you want. Here’s the sort of compare function we’d want to have with this sort of integer key:

int my_compare(DB *db, const DBT *dbt1, const DBT *dbt2) {
    // I know what my keys look like!
    int key1 = *(int *)dbt1->data;
    int key2 = *(int *)dbt2->data;
    if (key1 < key2)
        return -1;
    else if (key1 > key2)
        return 1;
        return 0;

There are two downsides to using this approach. First, the btree compare function is called a lot. It’s got to be really fast or else your performance will suffer. Even though the above code is pretty tight, it’s hard to get faster than BDB’s default memcmp. The more important issue is that introducing a btree compare function creates a maintenance headache for you. If you ever want to use the db_load utility, you’ll need to modify the sources to know about your keys. db_verify will no longer be able to check key ordering without source modification. Who knows what other future tools won’t be available to you. Since db_dump and db_load are useful for various reasons (data transfer, upgrades, ultimate db salvage) that should be enough for you.

So the better approach is to fix the key. Here’s one way:


struct OrderKey {
    char bytesForOrderNumber[4];  //   Store MSB of int first.

You can marshall the bytes yourself or call intrinsic byte swap routines (try __builtin_bswap32 for GCC, and _byteswap_ulong for VC++). If you’re using C++, you could make a nifty class (let’s call it BdbOrderedInt) that hides the marshalling and enforces the ordering. Our example suddenly becomes much more readable:


struct OrderKey {
    BdbOrderedInt orderNumber;


Oh what the heck, here’s an implementation of such a class – partially tested, which should be pretty close for demonstration purposes.

Another way that makes the actual data readable, but is less space efficient, would be to store the bytes as a string: “0000000123”.  For good or bad, this goes a little down the path of string typing your data as opposed to strong typing.

Both of these approaches will get us to the locality we are looking for.
Those of you that use the Java API are probably yawning. That API has a little bit different model of how you present keys to BDB that does the marshaling for you (for example, using IntegerBinding). They even go a little bit further, as negative values for signed quantities are ordered before zero and positive values. As for other BDB languages: C# – I don’t see any marshaling in the API; PHP, Perl, Python, I frankly don’t know if this is an issue.

There’s more to say about key selection, that will wait for another post.


Posted in Uncategorized | 1 Comment

Cleaning up your zoo

Let’s look at something that’s über-practical this week. Some of you will define this version of practical as hackery, so you can hold your nose and slyly slip this into your bag of tricks. You’ll know when you need it.

Suppose your program stores values with the following structure in a database:

struct {
int n_bananas;
int n_peanuts;
int n_bamboo_stalks;

and due to downsizing at the zoo, you’d like to change your values to this:

struct {
int n_bananas;
int n_bamboo_stalks;

Did I say this was a contrived example? Regardless, if you’ve ever reorged your data structure, you’ll need to think about any data that you already have stored. How would one change millions of records like the above (or even just one)?

One option would be to change ‘int n_peanuts’ to ‘int reserved’, and forget about it. But that’s not so good – the old data will still be there for existing records, so you really can’t easily reuse that reserved field for anything.

You could write a program that reads from one format and produces another. And that’s the solution most of us would do (and no reason to hold your nose).

Or you could leverage the fact that db_dump and db_load use a scrutable text format. Perl has some modules that know about Berkeley DB, but here’s a Perl solution that uses the db_dump and db_load in your path to do the database part, and leaving really just the transformation part. Note that this is much the same as the option just described (a program that reads from one format and produces another), except that the program is here, written for you, and is trivial to modify. And it has the virtue of being in a neutral language – the Java folks won’t complain that it’s C, and the C/C++ folks won’t complain that it’s Java. I guess anyone could complain that it’s Perl…


# Copyright (c) 2011
#	Donald D. Anderson.  All rights reserved.
# Redistribution and use in source and binary forms are permitted.
# This software is provided 'as is' and any express or
# implied warranties, including, but not limited to, the implied
# warranties of merchantability, fitness for a particular purpose, or
# non-infringement, are disclaimed.
# Usage:  bdb_convert_data fromfile tofile
# Converts data in a Berkeley DB database.
# Can be customized to remove or add bytes to each record,
# and even add new records.
# Warning: this should normally be used on a quiet system (or best, a
# system that has no live BDB processes.  There is nothing
# transactional about this script!
# Warning: should not be used if you've changed your btree compare function
# (since db_load will not do the right thing).
# See discussion on

die "Usage: $0 fromfile tofile" if ($#ARGV != 1);
$from = $ARGV[0];
$to = $ARGV[1];
die "$0: $from: does not exist" if (! -f $from);
die "$0: $to: exists, will not overwrite" if (-f $to);
open IN, "db_dump $from|" || die "$0: cannot run db_dump";
unlink "$to.TMP";
open OUT, "|db_load $to.TMP" || die "$0: cannot run db_load";

# convert_key and convert_value are called with $_ set to
# a data line from the db_dump output.  Each data line starts
# with a single space, then there are hex digits, a pair of hex
# digits for each 8 bit char.  E.g. ' 65696e7320d0bfd1' is 8 bytes.

# This convert_key passes through the key without modification
sub convert_key() {
    #print "key: $line";
    print OUT "$line";

# This convert_value 'removes' the second 4 bytes, for demonstration
# !! **** modify this as necessary **** !!
sub convert_value() {

    # !! **** here's the custom part **** !!
    if (length($_) > 17) {
      $line = substr($_,0,9) . substr($_,17);

    #print "dat: $line";
    print OUT "$line";

$iskey = 1;

# The dbdump format contains some header info, that starts
# in the first column.  Those lines are copied directly.
# The data appears with a single space in the first column,
# followed by a bunch of hex numbers.  Lines of data alternate
# between keys and values.
while () {
    if (/^ /) {
        if ($iskey) {
        } else {
        # alternate lines
        $iskey = ! $iskey;
    else {
        print OUT $_;
close (IN);
close (OUT);
rename "$to.TMP", "$to";


The same code is nicely formatted and colorized here.

Easy enough to customize.  If you needed to delete records, you could do it. If you needed to add records, you could do it.

Heed the warnings in the script.  First, it should be used on a quiet system. A minimum requirement is that the file being converted is closed by all. There is nothing transactional about this script!

Secondly, if you’ve changed your btree compare function, or duplicate sorting function, this script is not for you. Back to making a custom program.

Finally, if your data structure is not dead simple – if you can’t easily discern byte positions, etc. then by all means, write a proper conversion program in your choice of language.

And it might be nice to test it before trying it out on your production system. We all want to keep our zoo clean, but even hackers have standards.

Posted in Uncategorized | 5 Comments

I broke the rules on the 3n+1 benchmark… again

This post follows up on last week’s post of the 3n+1 benchmark I wrote using Berkeley DB. There are plenty of reasons for this. Lots of meaty comments and discussion from K.S.Bhaskar and Dan Weinreb. Rob Tweed has pointed us to a discussion where some folks followed up on this and tried Intersystems Caché with the same benchmark. Martin found an error in my program. The discussion, in comments on my blog, but also on email and other blogs, has been lively and interest has been high. Thank you all!

The primary movitivator to revisit this was that I broke the rules on the 3n+1 benchmark, in a way I didn’t understand last week. K.S.Bhaskar was good enough to enlighten me on my questions about what needs to be persisted and when in the benchmark. In my defense, I have to say that I was reading two different versions of the benchmark statement at various times. And neither one is free of ambiguity. Dan Weinreb has doubts that the current statement requires any persistance at all.

When I coded my solution, I knew it did not obey the strict rules of the benchmark, especially regarding the format of inputs and outputs. That’s one reason I never ‘officially’ submitted my results. But I thought I was doing the same amount of work between the start and finish line. In reading Baskar’s response, I realized two important things. First, that the intermediate results known as ‘cycles’ need not be transactional. Second, that the final result (maximum length cycle so far) needs to be transactional at each step that a new maximum is found.

The first point basically cleared up the issues I raised with the use of JNLFLUSH in the M benchmark program and does indeed make it equivalent to the use of DB_TXN_NOSYNC in my benchmark. The second point was that the final maximum-length-cycle result needed to be persisted in a transactional way. My first published code didn’t even store this result in the database, I kept it in per-thread 4-byte variables, and chose the maximum at the end so I could report the right result. So I definitely didn’t play by the rules last week.

To get out of the penalty box, I corrected the benchmark to make the final results transactional and reran it. For every value computed, the program transactionally reads the current maximum and compares it and if a new maximum is found, stores it. This uses a database that contains exactly one value. Depending on how you read the rules, an optimization might allow each thread to keep (in local storage) the previous maximum that thread knew about and not even consult the result database if the value had not exceeded it. I did not include that optimization, but I note it in case we are trying to define the benchmark more rigorously.

There was a several second bump in run times when I first coded the change this in the most straightforward way. Here’s some pseudocode from that run:

begin transaction

// RMW holds an exclusive lock
get old_value from result database using DB_RMW
if any errors abort txn and break

if (new_value > old_value) then
  put new_value to result database
  if any errors abort txn and break
end if

commit transaction

Actually that whole thing is wrapped in a loop in order to deal with potential for deadlock. It’s in the published code in the function update_result().

The problem with this code is that an exclusive lock is held on the data from the point of the first DB->get (using DB_RMW) until the commit. The DB_RMW (read-modify-lock) flag is useful because it ensures that a classic deadlock scenario with two threads wanting to update simultanously is avoided. But the exclusive lock increases the surface of contention.

One way out is to remove the DB_RMW flag and let the outer loop catch the inevitable deadlocks and retry. I didn’t consciously consider this until now because I saw another approach. Since the value is strictly increasing, it can be checked first with a naked DB->get, compared and only if the value needs to be changed do we do the ‘check and set’ completely within a transaction. Is that legal by the rules? I don’t know. I do know that this modification ran the benchmark at 72 seconds using ‘maximum’ cache for both 3 and 4 threads.  That’s the same as the previous result I reported.

At one point, I also made a version that stored the maximum result using the string version of the numerals, where 525 is “пять deux пять”. This is not part of the benchmark statement. This one got 86 seconds with 4 threads and maximum cache. A full 20% slower – I think that really shows the CPU strain of coding/decoding these numbers. Anyone doing serialization/deserialization of objects into a database knows what I’m talking about.

Sadly, the current version of source that I put on github runs a little more slowly. After realizing that I had been looking at two versions of the problem statement, I looked more closely at the newer one and again misinterpreted one part to mean that I needed to do more – transactionally store the maximum value that was visited. Looking at the submitted M program, I don’t think that this needs to be kept transactionally. All this changing and rerunning becomes rather painful because I don’t have a separate test machine. Each benchmark run requires me to shutdown browsers, email, IMs and leave my laptop alone during the various runs. It’s not fun for me, and I expect it’s not fun to read about.

So here’s something a little more entertaining to think about. In various runs, I occasionally saw that adding a trickle thread was useful. (Is an extra thread even legal? Grrrrr.) Trickle helped when there was extra CPU to go around, and when cache was scarce. It hurt when CPU was scarce. When it worked, small trickles, done frequently, did the trick. But change the input parameters or other configuration, and trickle’s overhead hurt more than it helped. It got me thinking about coding up a better trickle thread – one that uses various available metrics: dirty pages forced from the cache from the memory stats, the percentage of dirty pages, current CPU usage, and the effectiveness of previous trickle calls. Surely we could make an adaptable trickle thread that’s useful for typical scenarios?

And while we’re on the subject of trickle, I wonder why BDB’s pseudo-LRU algorithm for choosing a buffer to evict in the cache to use doesn’t even consider whether a buffer is dirty?  If it does, I don’t see it – take a look at mp/mp_alloc.c . If BDB knew there was a trickle running, it seems like in the main thread it would want to choose old clean pages to evict rather than slightly older dirty pages.  Maybe another reason to have some tighter coordination by having a built-in default trickle thread available.

I think Berkeley DB Java Edition has the right strategy for utility threads like this. Rather than have everyone roll their own, create a reasonable default thread that knows all the important metrics to look at and adapts accordingly. Add a few configuration knobs to deal with extreme cases. If you don’t like it, you can drop in your own solution. But at least you have something reasonable to start with. Is it time for DB core to pick up on this? Maybe it’s a community effort?

Speaking of Java, it would certainly be instructive to revisit the benchmark with solution coded using Java and Berkeley DB. I think we’d learn a lot and and we could get JE in the picture too. Someday. Oh yeah, someday I should also fix up the current benchmark. Which brings us full circle.

The original points of my previous post stand. With Berkeley DB, you really have to do some tuning. I’m pretty certain that had I coded the right solution from the start, I would have still seen a 100x speedup. I’m pretty certain that your mileage, and the techniques you’ll need to employ for your app, will vary. Lastly, I’m pretty certain that I can’t be very certain about benchmarks.

Posted in Uncategorized | 1 Comment

Revving up a benchmark: from 626 to 74000 operations per second

[Note: K.S.Baskar has responded to my post here. I have more to say on this here].

I started looking at the 3n+1 benchmark described here to see how well BDB fared against GT.M (the language formally known as MUMPS). I built my benchmark and then tweaked and tuned. Along the way I discovered some interesting things.

My first run netted 626 operations per second. In BDB technical lingo, this is known as… slow.  Okay, starting with 626 is a little contrived. Who actually runs with the default cache setting with Berkeley DB? But I wanted to prove a point. You really need to do some tuning of any application to begin to take full advantage of BDB. But seriously, two orders of magnitude?  That’s a pretty hefty speedup.

All the gory details are here. Here are the highlights. The benchmark measures how fast you can compute results for the 3n+1 function. It’s a function that is defined recursively, so that computed results are stashed in a database or key/value store and are used by successive calculations. This statement of the benchmark requires numbers to be stored as blank separated words drawn from various natural languages. For example, 409 is “quattro zero ஒன்பது”. This tests that different character sets can be stored as keys and values in the database.

My initial naive coding, following the defined rules, got an expected dismal result. I immediately increased the amount of cache size and got a hefty improvement. There were still plenty of options for improvement, but as I was looking at the results reported by the Bhaskar for GT.M, I started to wonder if we were playing by the same rules.

I spent a little time reviewing the published the source of the GT.M (also called M) program. Reading M is a bit of a challenge. Then I discovered something interesting in the GT.M programming manual I found online here.  If you search for JNLWAIT in this manual you see that ‘Normally, GT.M performs journal buffer writes asynchronously while the process continues execution’, and under JNLFLUSH this line: ‘Normally GT.M writes journal buffers when it fills the journal buffer pool or when some period of time passes with no journal activity.’ That doesn’t sound like full ACID semantics to me (ACID implies durable transactions), and I don’t see anything to indicate the M program uses JNLWAIT or JNLFLUSH.  The statement for the benchmark doesn’t actually come right out and say ACID is required, but the discussion of a legal replication solution seems to imply it.  Yeah, it’s a little confusing.  But It looks like M program has suspended the requirement for immediate durability of each transaction.

And you know, that may make some sense. For most problems, you want to go as fast as you can, with all the tools at your disposal. Rules or conventional wisdom should be questioned.  There’s a ton of applications out there that don’t need durability at every transaction boundary. So to match the GT.M program, I decided to add the equivalent option DB_TXN_NOSYNC to my benchmark.

Another oddity. To the best of my reading of the M program, it looks like the worker threads start running before the start time is collected. A bit like stopwatch being started after the runners hear the starting gun. Oops. Maybe I misunderstand the program, but in any case, I didn’t replicate that.

Speaking of replication, I didn’t take advantage of BDB’s replication either. Everything’s running on my trusty laptop Apple MacBook Pro (a few years old).

The results reported for M uses 20M of ‘global buffers’ (equivalent to BDB cache) and 8M of ‘journal buffers’ (equivalent to BDB log buffer). I happen to know that cache is a lot more important than log buffers in BDB, so I used the same 28M in my configuration, but distributed them differently: 27M of cache and less then 1M of log buffer size.  (It turns out that in some cases, smaller log buffers can give better results).

I used 3 threads, as that seemed to be the sweet spot for BDB in my setup. So I got some results, a bit better than those published for the M program, on pretty close to the same hardware. Mine’s written in C++ (but mostly the C subset), and it is a bit long – I put all the various options I played with as command line options for easy testing.  It’s here if you want to look.

After that, I decided to crack open the test completely — making the cache large enough that the entire database is in memory. There’s a lot of apps I see that run like this. I found it helpful to partition the data. The final result had 30124 transactional puts per second, 44013 gets per second, yielding 74137 operations per second.  The total runtime of the program was 72 seconds, down from 8522 seconds for my first run.

It was interesting to see what was helpful at various points. With a very small cache, partitioning was counterproductive – it inflated the data size a bit in the cache, so even while helping contention, it made the cache less effective, which was more important. Also, changing the btree comparison function to get better locality helped. Using trickle had mixed results.

When the cache held all the data, partitioning was helpful, changing the btree compare was not, trickle was not. At no point in my testing did a log buffer size over 800K make much of a difference. But then the transactions weren’t very large at all.

What configuration options should you choose?  As you see, it all depends on your situation. Which leads to the last point.

Your mileage will vary. Your app will run slower, or faster, depending on lots of things that only you know:  Your data layout – this benchmark has both keys and data typically a couple dozen bytes. Your hardware and OS – BDB runs on everything from mobile phones to mainframes. The other processing that your application does. Data access pattern. Memory usage.

And only you know which rules you can bend, and which rules you can break.

Posted in Uncategorized | 12 Comments

Doing the splits

Back to the land of ‘what-if’. Last time we talked about prefetch and what it might buy us. Today we’re talking about yet another BDB non-feature: presplit.

The concept is motivated by a use case where we are pumping lots of data into a Btree database. We’re not limited by disk speed: perhaps the database fits in memory, logging is either in memory (possibly backed by replication) or we are non-transactional. At this blistering high speed, we see a new blip on the radar: page splits.

When a Btree is growing, pages inevitably fill up. When a new key data pair cannot fit into the leaf page it sorts to, the page must be split into two. A new page is allocated, and we copy half of the leaf page data is to the new page. Also the parent internal page is updated to have the first key on the new page. That’s three pages being modified. A lot of bookkeeping and shuffling is involved here, disproportionate to the amount of bookkeeping normally needed to add a key/value pair.

In keeping with the theme of pushing tasks into separate threads, we might envision a way to anticipate the need for a split. An extra check could happen during a DB->put, and would trigger the presplit when a page is 90% filled, or perhaps when one additional key/data pair of the same size of the current one would trigger a split. Whatever the trigger condition, information about what page needed to be split could be recorded – perhaps in a queue database, to be consumed by another thread whose sole job it is to do the splits and keep that activity out of the main thread.

The benefit is that by the time another record is added, the page is already split. The same amount of work is done (and it’s all CPU), but it’s done in advance, and in another thread. When the next DB->put rolls around, it can be fast, so latency is reduced. It’s a little like hotel laundry service – I put a dirty shirt in a bag on my door and in parallel with my busy day, the dirty shirt is being cleaned and made ready for me when I need it.

When CPU is maxed out, this is not really a smart optimization: it’s true that you’re doing work in advance, but since it’s all CPU, you’ve just reshuffled the ordering that you’re doing things. Throughput and latency might get slightly worse. Using the laundry analogy, when the CPU is maxed out it’s like the hotel staff is off busy doing other stuff, and the laundry is self serve. Putting my clothes in a bag and hanging it on the door so I can take it to the laundry myself just adds to the overhead.  And cleaning my shirt as soon as it’s dirty vs. when it’s needed may also be wasteful – I might be checking out of this hotel before I need that shirt!

On the other hand, CPU horsepower continues to grow. Just think about the steady increase of cores on almost every platform, even phones. So the max-out argument may become less important over time, at least for apps that need record splitting low latency.

Like last time, I have to say that I don’t know exactly if the benefit would be substantial enough. Although it’s fun to think about, I’m not completely convinced that the extra costs of having a presplit thread will be justified by the returns. Studying presplit and prefetch does seem like a great research assignment for an enterprising grad student. Code tinkering, measurements, publications, press releases, notoriety and the chance to give back to the open source community. Cool stuff all.

We’ve probably spent enough time in ‘what-if’ for a while. I promise next post we’ll get back firmly grounded in reality.

Posted in Uncategorized | Leave a comment

Playing Fetch

We’re going to visit the land of ‘what-if’ today and talk about a non-feature of BDB.  This continues the thread of speculative optimizations that I wrote about last week.

In some operating systems, older ones especially, there is a concept of read ahead.  I think current systems rely on the firmware of disk drives to provide a similar benefit.  In the classic read ahead approach, if you are reading a file sequentially, and your process asks for disk block 1 and then block 2, perhaps the OS should also asynchronously fetch block 3.  This is a speculative optimization.  There’s a pretty good chance that you’ll be rewarded by a block already in the OS cache by the time the process gets around to needing block 3.  The payoff is pretty reasonable – if you get a cache hit, or even if the block is on the way in to memory, you’ll reduce the latency of your request.  The downside is that if you don’t need the block, you’ll be doing extra work, which may reduce throughput on a loaded system.

Let’s suppose we’re using a cursor to walk through a BDB database, and the database doesn’t fit into cache.  As we visit leaf pages, what sort of prefetching optimizations can we expect?  The answer is not good.

BDB itself does not do any prefetching (see No Threads Attached).  But what if the OS itself does prefetching of disk blocks, will it happen there?  Generally, no, because sequential blocks in an underlying BDB file do not appear in key order.  They appear, more or less in the order they were allocated.  To see what I mean, look at a snippet of output from ‘db_dump -d h <my_favorite.db>’:

page 100: btree leaf: LSN [7][4706625]: level 1
prev: 3653 next: 2538 entries: 88 offset: 1260
page 101: btree leaf: LSN [7][7887623]: level 1
prev: 2377 next: 2792 entries: 98 offset: 1024
page 102: btree leaf: LSN [7][8387470]: level 1
prev: 3439 next: 3245 entries: 110 offset: 864
page 103: btree leaf: LSN [7][9687633]: level 1
prev: 3457 next: 2390 entries: 98 offset: 484
page 104: btree leaf: LSN [7][10442462]: level 1
prev: 3513 next: 5518 entries: 66 offset: 2108
page 105: btree leaf: LSN [7][9323239]: level 1
prev: 4033 next: 5439 entries: 64 offset: 2144
page 106: btree leaf: LSN [7][10313588]: level 1
prev: 2999 next: 4010 entries: 78 offset: 1480
page 107: btree leaf: LSN [7][4749567]: level 1
prev: 4333 next: 5087 entries: 74 offset: 1728
page 108: btree leaf: LSN [7][7223464]: level 1
prev: 4262 next: 2832 entries: 120 offset: 524
page 109: btree leaf: LSN [7][8078083]: level 1
prev: 2502 next: 3897 entries: 74 offset: 1896

The ordering (defined by ‘prev’ and ‘next’ links) is pretty mixed up.  As your cursor moves through the data, disk blocks are going to be requested rather randomly.  If your database is readonly, you can take advantage of this trick to get things in proper order. Otherwise, any OS prefetching is not going to help — unless the OS brings the entire file into memory.

Here’s another use case to think about.  You have a primary database and one or more secondary databases that represent indexed keys on your data.  This might allow you, for example, to find all Order records in a particular zip code, or all Orders that were shipped in a range of dates.  Secondary databases are typically small and can be quickly navigated, but when you start pulling the primary records out, they’ll be referenced in a jumbled order.  Using the reloading trick won’t really help.

In both of these cases, we have a primary database whose blocks are accessed in a scattered way, but in both of these cases, BDB has full knowledge about what block will be needed next.

If you’re on Linux, you might have a trick available.  You could snag the file descriptor from the primary’s DB object and use the Linux ‘readahead’ system call.  Given what we know about the scattered placement of blocks, it probably makes sense to read the entire file, and that only makes sense if the file is not too large in proportion to the available memory.

If that trick doesn’t make sense, you still may get some benefit from the disk cache built into the hardware.  The disk controller can slurp in a big hunk of contiguous data at once into its internal cache.  If the OS has done its best to allocate the file contiguously or at closely enough, you’ll get a boost — the read request for file blocks may be satisfied without waiting for the physical disk.

There’s a lot of ifs in the preceding paragraphs, which is another way to say that prefetching may sometimes happen, but it’s largely out of the control of the developer. BDB is the layer where the knowledge about the next block is, so prefetching would make the most sense to do in BDB.  It would be real nice to have something like a DB_PREFETCH flag on DB->open, coupled with some way (an API call? an event callback?) to get prefetching done in another thread.  In general terms, I’m thinking of how the memp_trickle API helps solves the issue of double writes in write-heavy apps.

Before we get too far into armchair API design, the best next step would be some proof that this would actually help real applications.  Does your app do cursor scans of entire (huge) databases, or use secondary tables as indices to scan through portions of a huge primary?  (huge == cannot fit in BDB, OS or disk cache memory).  Obviously, Oracle has not yet prioritized these use cases in their product.  Should they?

Posted in Uncategorized | 2 Comments

When trickle doesn’t work

In a past column, I’ve mentioned memp_trickle as a way to get beyond the double I/O problem. And it often works well for this, but there are times when it doesn’t.

Trickle is a sort of optimization that I would call speculative. These sorts of optimizations attempt to predict the future. We do work now, in a separate thread, because in the future the fruits of our work will be useful. In trickle’s case, we do writes from the cache now, in a separate thread, because in the future clean cache pages will eliminate one of our I/Os in the main thread, decreasing latency.

But gazing into the crystal ball of the future can give a hazy picture. One obvious case is that we might not benefit from the clean cache page, ever. Our program may simply stop, or have no more database requests. Generally we’re not particularly worried about that — BDB systems typically run forever, we’ll eventually get more traffic, updates, orders, etc.

Our second hazy case is that we may not need more clean cache pages. If our entire working set of accessed pages fits into the BDB cache, then we’ll be accessing the same pages over and over. No new pages needed. Trickle done on this sort of system will create extra I/O traffic. Consider a single leaf page in this scenario. It’s updated, perhaps once a second, but never written to disk, at least not until a checkpoint. Every update, we get for free, as far as I/O goes. Another way to look at these free updates is that the update per write ratio is way up. Add in a trickle thread, and it may be written more often (update per write goes down). That’s unneeded I/O.

Unneeded I/O yes, but this may not be a big problem. Remember in this scenario our entire working set fits into the BDB cache. Our main thread is not doing any I/O anyway. While trickle adds more I/O, but nobody is waiting on those spinning disks. If we were paying attention to our BDB stats, we’d see that we didn’t have a double I/O problem to begin with.

There’s another hazy case that’s a little more subtle. Even though our data accesses may not be entirely in cache, and we do see double I/Os, we may see trickle be counter-productive. This can happen if we’ve totally saturated our I/O. The extra burden of trickle adds to the I/O queue, and any I/O request will take longer. Trickle may still be helpful if our cache hit rate is low enough that we don’t have many free updates and we’ll really need a high proportion of pages that trickle creates.

Trickle’s bread and butter scenario is when there is a mix of get and put traffic (get benefits the most from trickles effects, puts are needed to create dirty pages that give trickle something to do), when I/O is not overwhelmed, when the system is not entirely in cache. On the butter side down, we see trickle not performing when we don’t have some of those conditions satisfied. There’s lots of in between when it’s not so clear, you just have to try it, fiddle with the frequency and percentage, and see.

I’ll have more to say about other sorts of speculative optimizations in later posts. For now, let’s just say that like other forms of speculation, this one has no guarantees. There is never any substitute for testing on your own system.

Posted in Uncategorized | 2 Comments

Readonly DB Reloaded

If you have a ‘readonly’ Btree database in BDB, you might benefit by this small trick that has multiple benefits. The trick is to reload the database. You’ll get a compact file with blocks appearing in order.

Perhaps you’re using BDB as a cached front end to your slower SQL database, and you dump an important view and import it into BDB. Maybe you’ve written a custom importer program. You’ve got this down to a process, that runs, say, hourly. Here’s the thing – if you happened to import the data in the same order that it will appear in BDB (i.e. the key is sorted ascending) you’ll get some great optimizations.

When a page is filled up and is being split, and BDB recognizes that the new item is at the end of the tree, then the new item is placed on its own page. That means that data inserted in order will ‘leave behind’ leaf pages that are almost completely filled. BDB also recognizes the opposite case, when a key is inserted at the beginning of the database, and makes the uneven split in the other direction.

Here are the benefits to such a compaction. More key/data pairs per page means fewer pages.  Your cache is more effective since you can squeeze more key/data pairs into memory at a time. Viewed from another angle, less of your BDB cache is effectively wasted space. (Nobody wants to say it this way but a fill factor of 72% is 28% wasted space). More keys on a page means fewer internal pages in the level above. There is a cascading effect – the btree may be shallower in a compact database than one uncompacted. Fewer levels means faster access.

One more benefit to reloading is that pages will be allocated in the underlying file in order. Sequential (cursor) scans through the database are going to appear as accesses to sequential file pages. If the OS performs readahead, your IO will be already done when you advance to the next block. Your OS may not have readahead, but your file system may try to group sequential disk blocks closely on the physical disk.  If the disk’s cache can accommodate it, sequential read requests may be satisfied in advance there. The takeaway is that even if BDB’s cache is not large enough to hold your nicely ordered file, you will probably get better read performance during cursor scans due to an optimization or cache at some other level.

If you’ve inserted all your key/data pairs in key-order then you’ve already getting these great benefits. But if you were unable to that, here’s the quick and easy way to get there. Remember, we’re talking about a readonly database, so the right time to do this is right after creating the db file and before your application opens it.

$ db_dump x.db | db_load new.x.db
$ mv new.x.db x.db

That’s it. I’ve written this in UNIX shell-ese, but it works similarly on other systems. One could easily write a C/C++/Java/C#/etc. function to do this for embedded installations.

And speaking of a function to do this, there is already the DB->compact function, which can be used on a running system to get many of the same effects we were touting for compacted data. DB->compact won’t try to optimize the page ordering. But that’s okay, it’s really solving a harder problem since it can be used on databases that are not readonly.

If your database is not strictly readonly, there’s a slight downside to a fully compacted database. Since all the pages are filled to the brim with key/data pairs, a new entry, any new entry, will split a page. You’re pretty much guaranteed that your first put and some subsequent ones will be more expensive. There’s a cost to splitting pages that needs to be weighed against the benefits you’ll get. Fortunately, DB->compact has an input fill factor; with an access pattern with higher proportion of scattered writes, you may want to lower the fill factor.

There you have it. Two lines of shell code, and gobs of verbiage to beat it to death. Good thing I don’t get paid by lines of code.

Posted in Uncategorized | 3 Comments