Playing Fetch

We’re going to visit the land of ‘what-if’ today and talk about a non-feature of BDB.  This continues the thread of speculative optimizations that I wrote about last week.

In some operating systems, older ones especially, there is a concept of read ahead.  I think current systems rely on the firmware of disk drives to provide a similar benefit.  In the classic read ahead approach, if you are reading a file sequentially, and your process asks for disk block 1 and then block 2, perhaps the OS should also asynchronously fetch block 3.  This is a speculative optimization.  There’s a pretty good chance that you’ll be rewarded by a block already in the OS cache by the time the process gets around to needing block 3.  The payoff is pretty reasonable – if you get a cache hit, or even if the block is on the way in to memory, you’ll reduce the latency of your request.  The downside is that if you don’t need the block, you’ll be doing extra work, which may reduce throughput on a loaded system.

Let’s suppose we’re using a cursor to walk through a BDB database, and the database doesn’t fit into cache.  As we visit leaf pages, what sort of prefetching optimizations can we expect?  The answer is not good.

BDB itself does not do any prefetching (see No Threads Attached).  But what if the OS itself does prefetching of disk blocks, will it happen there?  Generally, no, because sequential blocks in an underlying BDB file do not appear in key order.  They appear, more or less in the order they were allocated.  To see what I mean, look at a snippet of output from ‘db_dump -d h <my_favorite.db>’:

page 100: btree leaf: LSN [7][4706625]: level 1
prev: 3653 next: 2538 entries: 88 offset: 1260
page 101: btree leaf: LSN [7][7887623]: level 1
prev: 2377 next: 2792 entries: 98 offset: 1024
page 102: btree leaf: LSN [7][8387470]: level 1
prev: 3439 next: 3245 entries: 110 offset: 864
page 103: btree leaf: LSN [7][9687633]: level 1
prev: 3457 next: 2390 entries: 98 offset: 484
page 104: btree leaf: LSN [7][10442462]: level 1
prev: 3513 next: 5518 entries: 66 offset: 2108
page 105: btree leaf: LSN [7][9323239]: level 1
prev: 4033 next: 5439 entries: 64 offset: 2144
page 106: btree leaf: LSN [7][10313588]: level 1
prev: 2999 next: 4010 entries: 78 offset: 1480
page 107: btree leaf: LSN [7][4749567]: level 1
prev: 4333 next: 5087 entries: 74 offset: 1728
page 108: btree leaf: LSN [7][7223464]: level 1
prev: 4262 next: 2832 entries: 120 offset: 524
page 109: btree leaf: LSN [7][8078083]: level 1
prev: 2502 next: 3897 entries: 74 offset: 1896

The ordering (defined by ‘prev’ and ‘next’ links) is pretty mixed up.  As your cursor moves through the data, disk blocks are going to be requested rather randomly.  If your database is readonly, you can take advantage of this trick to get things in proper order. Otherwise, any OS prefetching is not going to help — unless the OS brings the entire file into memory.

Here’s another use case to think about.  You have a primary database and one or more secondary databases that represent indexed keys on your data.  This might allow you, for example, to find all Order records in a particular zip code, or all Orders that were shipped in a range of dates.  Secondary databases are typically small and can be quickly navigated, but when you start pulling the primary records out, they’ll be referenced in a jumbled order.  Using the reloading trick won’t really help.

In both of these cases, we have a primary database whose blocks are accessed in a scattered way, but in both of these cases, BDB has full knowledge about what block will be needed next.

If you’re on Linux, you might have a trick available.  You could snag the file descriptor from the primary’s DB object and use the Linux ‘readahead’ system call.  Given what we know about the scattered placement of blocks, it probably makes sense to read the entire file, and that only makes sense if the file is not too large in proportion to the available memory.

If that trick doesn’t make sense, you still may get some benefit from the disk cache built into the hardware.  The disk controller can slurp in a big hunk of contiguous data at once into its internal cache.  If the OS has done its best to allocate the file contiguously or at closely enough, you’ll get a boost — the read request for file blocks may be satisfied without waiting for the physical disk.

There’s a lot of ifs in the preceding paragraphs, which is another way to say that prefetching may sometimes happen, but it’s largely out of the control of the developer. BDB is the layer where the knowledge about the next block is, so prefetching would make the most sense to do in BDB.  It would be real nice to have something like a DB_PREFETCH flag on DB->open, coupled with some way (an API call? an event callback?) to get prefetching done in another thread.  In general terms, I’m thinking of how the memp_trickle API helps solves the issue of double writes in write-heavy apps.

Before we get too far into armchair API design, the best next step would be some proof that this would actually help real applications.  Does your app do cursor scans of entire (huge) databases, or use secondary tables as indices to scan through portions of a huge primary?  (huge == cannot fit in BDB, OS or disk cache memory).  Obviously, Oracle has not yet prioritized these use cases in their product.  Should they?

Posted in Uncategorized | 2 Comments

When trickle doesn’t work

In a past column, I’ve mentioned memp_trickle as a way to get beyond the double I/O problem. And it often works well for this, but there are times when it doesn’t.

Trickle is a sort of optimization that I would call speculative. These sorts of optimizations attempt to predict the future. We do work now, in a separate thread, because in the future the fruits of our work will be useful. In trickle’s case, we do writes from the cache now, in a separate thread, because in the future clean cache pages will eliminate one of our I/Os in the main thread, decreasing latency.

But gazing into the crystal ball of the future can give a hazy picture. One obvious case is that we might not benefit from the clean cache page, ever. Our program may simply stop, or have no more database requests. Generally we’re not particularly worried about that — BDB systems typically run forever, we’ll eventually get more traffic, updates, orders, etc.

Our second hazy case is that we may not need more clean cache pages. If our entire working set of accessed pages fits into the BDB cache, then we’ll be accessing the same pages over and over. No new pages needed. Trickle done on this sort of system will create extra I/O traffic. Consider a single leaf page in this scenario. It’s updated, perhaps once a second, but never written to disk, at least not until a checkpoint. Every update, we get for free, as far as I/O goes. Another way to look at these free updates is that the update per write ratio is way up. Add in a trickle thread, and it may be written more often (update per write goes down). That’s unneeded I/O.

Unneeded I/O yes, but this may not be a big problem. Remember in this scenario our entire working set fits into the BDB cache. Our main thread is not doing any I/O anyway. While trickle adds more I/O, but nobody is waiting on those spinning disks. If we were paying attention to our BDB stats, we’d see that we didn’t have a double I/O problem to begin with.

There’s another hazy case that’s a little more subtle. Even though our data accesses may not be entirely in cache, and we do see double I/Os, we may see trickle be counter-productive. This can happen if we’ve totally saturated our I/O. The extra burden of trickle adds to the I/O queue, and any I/O request will take longer. Trickle may still be helpful if our cache hit rate is low enough that we don’t have many free updates and we’ll really need a high proportion of pages that trickle creates.

Trickle’s bread and butter scenario is when there is a mix of get and put traffic (get benefits the most from trickles effects, puts are needed to create dirty pages that give trickle something to do), when I/O is not overwhelmed, when the system is not entirely in cache. On the butter side down, we see trickle not performing when we don’t have some of those conditions satisfied. There’s lots of in between when it’s not so clear, you just have to try it, fiddle with the frequency and percentage, and see.

I’ll have more to say about other sorts of speculative optimizations in later posts. For now, let’s just say that like other forms of speculation, this one has no guarantees. There is never any substitute for testing on your own system.

Posted in Uncategorized | 2 Comments

Readonly DB Reloaded

If you have a ‘readonly’ Btree database in BDB, you might benefit by this small trick that has multiple benefits. The trick is to reload the database. You’ll get a compact file with blocks appearing in order.

Perhaps you’re using BDB as a cached front end to your slower SQL database, and you dump an important view and import it into BDB. Maybe you’ve written a custom importer program. You’ve got this down to a process, that runs, say, hourly. Here’s the thing – if you happened to import the data in the same order that it will appear in BDB (i.e. the key is sorted ascending) you’ll get some great optimizations.

When a page is filled up and is being split, and BDB recognizes that the new item is at the end of the tree, then the new item is placed on its own page. That means that data inserted in order will ‘leave behind’ leaf pages that are almost completely filled. BDB also recognizes the opposite case, when a key is inserted at the beginning of the database, and makes the uneven split in the other direction.

Here are the benefits to such a compaction. More key/data pairs per page means fewer pages.  Your cache is more effective since you can squeeze more key/data pairs into memory at a time. Viewed from another angle, less of your BDB cache is effectively wasted space. (Nobody wants to say it this way but a fill factor of 72% is 28% wasted space). More keys on a page means fewer internal pages in the level above. There is a cascading effect – the btree may be shallower in a compact database than one uncompacted. Fewer levels means faster access.

One more benefit to reloading is that pages will be allocated in the underlying file in order. Sequential (cursor) scans through the database are going to appear as accesses to sequential file pages. If the OS performs readahead, your IO will be already done when you advance to the next block. Your OS may not have readahead, but your file system may try to group sequential disk blocks closely on the physical disk.  If the disk’s cache can accommodate it, sequential read requests may be satisfied in advance there. The takeaway is that even if BDB’s cache is not large enough to hold your nicely ordered file, you will probably get better read performance during cursor scans due to an optimization or cache at some other level.

If you’ve inserted all your key/data pairs in key-order then you’ve already getting these great benefits. But if you were unable to that, here’s the quick and easy way to get there. Remember, we’re talking about a readonly database, so the right time to do this is right after creating the db file and before your application opens it.

$ db_dump x.db | db_load new.x.db
$ mv new.x.db x.db

That’s it. I’ve written this in UNIX shell-ese, but it works similarly on other systems. One could easily write a C/C++/Java/C#/etc. function to do this for embedded installations.

And speaking of a function to do this, there is already the DB->compact function, which can be used on a running system to get many of the same effects we were touting for compacted data. DB->compact won’t try to optimize the page ordering. But that’s okay, it’s really solving a harder problem since it can be used on databases that are not readonly.

If your database is not strictly readonly, there’s a slight downside to a fully compacted database. Since all the pages are filled to the brim with key/data pairs, a new entry, any new entry, will split a page. You’re pretty much guaranteed that your first put and some subsequent ones will be more expensive. There’s a cost to splitting pages that needs to be weighed against the benefits you’ll get. Fortunately, DB->compact has an input fill factor; with an access pattern with higher proportion of scattered writes, you may want to lower the fill factor.

There you have it. Two lines of shell code, and gobs of verbiage to beat it to death. Good thing I don’t get paid by lines of code.

Posted in Uncategorized | 3 Comments

Divide and Conquer

Berkeley DB is often used in extreme cases. One extreme case is an application that is almost write-only. That is, data is being collected as fast as possible, and shoveled into a BDB file or files. In some applications, there may be some simple queries directly on this data, in others the data may be digested at a later time. In any event, the game is all in how many records can you stuff into a data file per second.

If the application is fully transactional, the log may govern how fast the process can proceed. Full transactional semantics on a single box pretty much require one to do synchronized writes to the log, and that just takes time. Getting beyond this either asks for replication in the solution or some relaxation on your requirements for transactions (DB_TXN_NOSYNC and DB_TXN_WRITE_NOSYNC are your good buddies).

Once you choose your way to dodge the logging logjam, we can get some real throughput. However, when you start adding more threads to the problem, you’ll soon discover that there may be another bottleneck. One symptom is that multiple threads can start spending a good fraction of their time waiting for access to page zero (the ‘meta’ page). If you break at random times in a debugger, you might see often they are in the middle of a call to bam_split().  The function name ‘bam_split’ is a shorthand for btree access method (bam) page split. Yeah, I’m assuming btree here, but this can happen in hashes too.

It’s the nature of the problem.  Lots of new records.  Data fills leaf pages.  Pages split. New pages need to be allocated.  The single free list holding the pages is on the meta page.  The leaf pages, at least one internal node, and the meta page all have write locks. When lots of threads do this at the same time, the common element is the meta page. Contention.

Here’s one way around the problem: divide up your data. Say your key is an IP address. Imagine putting all IP addresses that are even into one DB file, and all the odd ones into another one. Whenever you want to look for an IP address, you know which table (odd or even) to look at. Now, let’s generalize that so that instead of odd/even, we can use the IP number modulo 4 (to get four database files). Or modulo 8, etc. To get a better distribution, or handle arbitrary key types, we can hash the key and take a modulo.

Sure, you can do all that yourself. But if you’re using BDB 4.8 or greater, you get this for free with the DB->set_partition method. And here’s a really cool thing: your cursors on a partitioned database will just work. You can still walk your ordered Btree, blissfully unaware that your data is coming a little bit from here, a little bit from there…

What does this buy you? Well, if you split into eight partitions, you get eight BDB files. That’s eight meta pages. If you’ve got four threads, the odds start getting much better than any particular page split will not be blocked against a page split in another thread. Less contention means greater throughput, lower latency, mom, apple pie, your team wins.

Do you have a favorite extreme case?  Let me know about it.

Posted in Uncategorized | 1 Comment

Taste your data

I saw Temple Grandin speak recently and I heard a number of nuggets of wisdom from someone who truly thinks different. If you don’t know the name, Dr. Grandin is a very high functioning autistic person who has revolutionized the livestock industry by understanding at a sensory level what cattle are feeling and how that influences behavior. The remarkable thing is that her autism gave her a new viewpoint that has been valuable for her and the industry.

One of her points was that many of us are top-down thinkers. We approach problems from the top, breaking it down bit by bit. Why is my program slow? Is it the framework? The data store? The network? Obviously, this sort of analysis is important and usually the right way to start (let’s select a great algorithm before implementing, please).  But when you’re not making progress at the tuning side of things and need another angle, it can be helpful at looking at things from the other end.

Temple is a sensory person.  The five senses affect her directly.  She has no choice but to start with these details and process bottom up. In our domain, that would be starting with the data. The raw bits.


So if you are trying to improve database performance, it can often be helpful to move beyond the code and look right at what data you are storing. In Berkeley DB, you can do this using db_dump.  db_dump is typically used to dump and restore databases, but why not use it as a raw visualization tool?

db_dump details

In the sample above (inspired by, but not copied from, real customer data), the first line with numbers: 09174000 is a key, followed by the data. Then another key, data, etc. I’ve shown just a couple key/data pairs just to get the creative juices flowing — you’ll probably want to pipe your db_dump through your favorite pager and let a lot of entries fill up your screen.

…to get ideas

Once you start immersing yourself in the actual data you’ll probably get lots of ideas. When I see data like in the sample above, I have to ask why there are all those zeroes in the same place for each data entry. Is there a way to condense or compress the data?

Why is this relevant? Because if you can get your data smaller, one of the nicest effects is in cache efficiency – with data that’s half its original size you’ll be able to fit twice as many entries in your currently sized cache. Another possible effect is reducing the number of overflow pages, which can have multiple good effects. (I’ll have to get into that in another post).

You might also sniff out some other peculiarities. Hmmm, the keys all appear to have zeros at the end, rather than leading zeros. And they aren’t correctly sorted. Did we forget to ensure the most significant byte goes first? That sort of error can cost a lot in the locality of accesses, which can again lead to caching inefficiencies.

Obviously, your data is going to look different, and you’re going to notice different things. But you’ll never notice unless you look.

Get down to the details – smell it, taste it. And then fix it.

Posted in Uncategorized | Leave a comment

Choose your battles

Here’s an interesting set of posts from Shaneal Manek, the developer behind (born as  He first makes the point that LISP is a fast language for internet server development, and with some small optimization hints, can be 4x faster than Java, and 15x faster than Python.  I think that qualifies him as a free thinker.

For most projects, I’ve long been a proponent of choosing a language based on how quickly and accurately developers can get their work done and not so much on the basis of how fast the language is.  Yeah, think about rewriting for speed in another language is a fine thing to think about in version 3 or 4 well after you have something that works and delivers the goods.  While someone in a lower level language is still debugging their intermittent crashes (the long tail of development), you’ll have time to both refine your feature set and tweak the parts of the application that are the slowest.

Which brings us to post number two.  He argues that if you’re going to think about optimizing and scaling, think about the huge chunk of time your application is waiting for databases or disks.

Because of the huge disparity in CPU and disk speed, many applications are disk or I/O bound. If rewriting their application in C makes your code run twice as fast, it may only buy you an absolute performance boost of 10%. On the other hand, if you could make your database go twice as fast, it might buy you a net 30% performance boost – which is huge (and still far easier to do than rewriting your app in C).

He makes his case with supporting benchmarks.  Yes, your mileage will vary.

Speaking of which, if your app has a server-based SQL backend and you are looking for a speed boost, I suggest you ask yourself these questions:

  • What is the proportion of time your app spends in database calls?
  • Out of that, for the heaviest used table (or even single query), what is the proportion of time spent waiting on that?
  • What would it take to rewrite just that table or single query access with a No-SQL solution?
  • What would it gain?

One of the user comments to Shaneal’s post says the comparison given is apples to oranges: MySQL is network based, and BDB is an in-process solution.  He suggests comparing MySQL with a network based key-value pair database.  The point is a good one, you don’t really know if the impressive speedups were due to BDB being a key-value vs. SQL, or whether it was due to the speed of having everything work within the same process.  I suspect the reason is some of both.

Database types in a 2 by 2 matrix

If you’re in the upper left corner, you can fully explore the matrix by trying the lower left first: SQLite (the BDB distribution now includes a version of SQLite that is backed by BDB).  For the upper right, you could try a network key value store, like memcachedb. Core BDB belongs in the lower right category.

In any event I think the main point is clear.  When looking for a speedup, understand what is taking time and figure out what the alternatives will buy you.  Beware these words — “I see this is taking a good chunk of our time, but unfortunately we can’t do anything about that”. Because sometimes you can.

Posted in Uncategorized | 1 Comment

No Threads Attached

Berkeley DB, at least the core API, is mostly threadless.  If you are using any of the APIs that have been around for more than five years, there’ll be nothing happening, threadwise, in the background.

I think a lot of folks miss this point.  You can certainly use multiple threads with BDB.  Just turn on that DB_THREAD flag and all your handles are free threaded.  But does this imply that BDB is also firing off background threads?  No indeed.

An example.  Let’s talk about the cache again – one of my favorite topics!  The cache may seem a little magical, but in fact its actions are synchronous with what is happening in your API calls. If the operation you are doing needs to get a page from a DB file, then the cache is involved (you can’t bypass the cache).  If the page isn’t in cache, then the least recently used page in the cache is evicted.  If that page is dirty, then it is written to the backing file. Ouch.  Only when that completes is the cache page freed up so that the page you need now is read into the cache.  Double ouch.  That all happens when you do a DB→get.  Yup, it might be up to two I/O calls, just for a single page access.  If your fetch gets multiple pages, as it would in a BTree, you’re potentially doing more I/O.  And to the point of this article, none of this happens in the background.  Finally, when the API call returns, no trailing operations, like async writes, happen in the background.

Now that might sound grim for a high performance embedded data store, but we’re talking about potential I/O, we can get rid of a lot of that.  First, you’ve sized your cache right, right?  So the vast majority of accesses to get pages are going to be found in the cache.  And when navigating a Btree, certain pages like the internal pages are going to be hit a lot.  It all depends on your access pattern and cache size of course, but these internal pages tend to be highly used, and rarely evicted.  So this mostly gets rid of one ouch.

If you aren’t lucky enough to have your database file completely in cache and your app isn’t readonly, you have a nice little function memp_trickle at your disposal.  That function walks through pages in your cache, making sure a certain percentage of them (you pick the number) are clean.  It doesn’t evict pages, but pages that are dirty (as a result of previous DB→put or Db→del calls) will get written out to the backing DB file. Getting back to our example, if your DB→get call does need a page in the cache, it will much more likely find a clean one.  And evicting a clean page implies just some bookkeeping, and no I/O. Putting the trickle call in its own thread, doing

while (true) { env->memp_trickle(pct); sleep(10); }

offloads a large percentage of those I/Os from a synchronous DB→get call to a background activity.  Goodbye double ouch.

A philosophy. This all is in keeping with the core BDB philosophy of keeping the programmer in control.  You can control every aspect of the trickle activity: whether to do it, what percentage to clean, how often to run, even what threading library you want to use.  In fact, trickle can happen in its own process, rather than a separate thread.

Other core features, like checkpointing and deadlock detection [1] are likewise ‘synchronous’ method calls that are often run as background threads or processes.  Calling these APIs doesn’t imply any threads being spun off.

Moving beyond the core API, more and more threads pop up.  For starters, the replication manager API (not the replication base) does use threads to manage replication activities.  And there’s definitely knowledge of threads in the Java API (see TransactionRunner) and Core DB (see failchk).  Berkeley DB Java Edition (JE) is a cool pure java version of Berkeley DB.  Its philosophy is quite a bit different, and a lot of the basic housekeeping like eviction and cleaning happen in background threads by default. You still have lots of control over configuration.

Finally, there’s this thing called SQLite that comes as part of BDB nowadays.  It’s the popular full blown embedded SQL engine built integrated with a BDB backend.  Talk about a change in philosophy.  I’ll be blogging about that in the future once I have a chance to put it through its paces.  Background threads in that?  I don’t see them.

BDB has come a long way.  But if you’re doing work with the good ol’ core API, and even with some of the new stuff, you have to keep this threadless model in mind.  Just remember to spin off your own trickle thread so you can dial up the performance.


[1] Deadlock detection also may be handled synchronously with almost no penalty, that’s a story for another day.

Posted in Uncategorized | 10 Comments