Readonly DB Reloaded

If you have a ‘readonly’ Btree database in BDB, you might benefit by this small trick that has multiple benefits. The trick is to reload the database. You’ll get a compact file with blocks appearing in order.

Perhaps you’re using BDB as a cached front end to your slower SQL database, and you dump an important view and import it into BDB. Maybe you’ve written a custom importer program. You’ve got this down to a process, that runs, say, hourly. Here’s the thing – if you happened to import the data in the same order that it will appear in BDB (i.e. the key is sorted ascending) you’ll get some great optimizations.

When a page is filled up and is being split, and BDB recognizes that the new item is at the end of the tree, then the new item is placed on its own page. That means that data inserted in order will ‘leave behind’ leaf pages that are almost completely filled. BDB also recognizes the opposite case, when a key is inserted at the beginning of the database, and makes the uneven split in the other direction.

Here are the benefits to such a compaction. More key/data pairs per page means fewer pages.  Your cache is more effective since you can squeeze more key/data pairs into memory at a time. Viewed from another angle, less of your BDB cache is effectively wasted space. (Nobody wants to say it this way but a fill factor of 72% is 28% wasted space). More keys on a page means fewer internal pages in the level above. There is a cascading effect – the btree may be shallower in a compact database than one uncompacted. Fewer levels means faster access.

One more benefit to reloading is that pages will be allocated in the underlying file in order. Sequential (cursor) scans through the database are going to appear as accesses to sequential file pages. If the OS performs readahead, your IO will be already done when you advance to the next block. Your OS may not have readahead, but your file system may try to group sequential disk blocks closely on the physical disk.  If the disk’s cache can accommodate it, sequential read requests may be satisfied in advance there. The takeaway is that even if BDB’s cache is not large enough to hold your nicely ordered file, you will probably get better read performance during cursor scans due to an optimization or cache at some other level.

If you’ve inserted all your key/data pairs in key-order then you’ve already getting these great benefits. But if you were unable to that, here’s the quick and easy way to get there. Remember, we’re talking about a readonly database, so the right time to do this is right after creating the db file and before your application opens it.

$ db_dump x.db | db_load new.x.db
$ mv new.x.db x.db

That’s it. I’ve written this in UNIX shell-ese, but it works similarly on other systems. One could easily write a C/C++/Java/C#/etc. function to do this for embedded installations.

And speaking of a function to do this, there is already the DB->compact function, which can be used on a running system to get many of the same effects we were touting for compacted data. DB->compact won’t try to optimize the page ordering. But that’s okay, it’s really solving a harder problem since it can be used on databases that are not readonly.

If your database is not strictly readonly, there’s a slight downside to a fully compacted database. Since all the pages are filled to the brim with key/data pairs, a new entry, any new entry, will split a page. You’re pretty much guaranteed that your first put and some subsequent ones will be more expensive. There’s a cost to splitting pages that needs to be weighed against the benefits you’ll get. Fortunately, DB->compact has an input fill factor; with an access pattern with higher proportion of scattered writes, you may want to lower the fill factor.

There you have it. Two lines of shell code, and gobs of verbiage to beat it to death. Good thing I don’t get paid by lines of code.

Advertisements

About ddanderson

Berkeley DB, Java, C, C , C# consultant and jazz trumpeter
This entry was posted in Uncategorized. Bookmark the permalink.

3 Responses to Readonly DB Reloaded

  1. Pingback: Playing Fetch | libdb

  2. John Zhang says:

    Very good information!
    I have a couple of questions along what you said.
    1. Will db_dump | db_load generate a db with the orginal format (eg Btree to Btree)? 2. We use an envirnment, and an index table saved in a BTree with a customer compare function for the keys. Is there any way we can do an offline compact and put the new db back into the env? If not,
    3. What would be a good way to implement the compact so that it will not impact the DB performance too much (we do insert/update/delete all the time)? Our key is a timestamp, I tried to specify a start and stop range, plus compact_pages, and wrap the compact call within a loop to cover different start-stop until the entire key range is examined. What I noticed was that at the beginning of the loop, compact runs just fine, but later, the compact would fail complaining about not enough locking object/space. It seems to me that when both start/stop and compact_pages are specified, compact would ignore start/stop. If that’s the case then compact would always scan the Btree from the begining, as the loop progresses, the scaned range will be larger and larger. Maybe just use start/stop and not setting compact_pages ?

    • ddanderson says:

      1. Yes, the format (also known as the ‘access method’) is preserved – you can see it in the output from db_dump.
      2. If you have a custom compare function, you need to add it to db_load to make this work! This is a reason to avoid custom compare function (but rather reorder the key) if possible. Look in db_load.c for a comment or #if 0 that marks the spot where you need to add it.
      3. I was not aware of problems when specifying both start/stop and number of pages when using DB->compact. Probably should be reported to Oracle. I guess you have to use start/stop.

      If you do insert/update/delete all the time, maybe focus compaction on parts of the DB that are least likely to change (e.g. timestamp < one hour ago).

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s