We’re going to visit the land of ‘what-if’ today and talk about a non-feature of BDB. This continues the thread of speculative optimizations that I wrote about last week.
In some operating systems, older ones especially, there is a concept of read-ahead. I think current systems rely on the firmware of disk drives to provide a similar benefit. In the classic read-ahead approach, if you are reading a file sequentially and your process asks for disk block 1 and then block 2, perhaps the OS should also asynchronously fetch block 3. This is a speculative optimization. There’s a pretty good chance you’ll be rewarded with a block already in the OS cache by the time the process gets around to needing block 3. The payoff is pretty reasonable: if you get a cache hit, or even if the block is on its way into memory, you’ll reduce the latency of your request. The downside is that if you don’t need the block, you’ve done extra work, which may reduce throughput on a loaded system.
Let’s suppose we’re using a cursor to walk through a BDB database, and the database doesn’t fit into cache. As we visit leaf pages, what sort of prefetching optimizations can we expect? The answer is not good.
BDB itself does not do any prefetching (see No Threads Attached). But if the OS itself prefetches disk blocks, will that help here? Generally, no, because sequential blocks in an underlying BDB file do not appear in key order. They appear, more or less, in the order they were allocated. To see what I mean, look at a snippet of output from ‘db_dump -d h <my_favorite.db>’:
page 100: btree leaf: LSN : level 1
prev: 3653 next: 2538 entries: 88 offset: 1260
page 101: btree leaf: LSN : level 1
prev: 2377 next: 2792 entries: 98 offset: 1024
page 102: btree leaf: LSN : level 1
prev: 3439 next: 3245 entries: 110 offset: 864
page 103: btree leaf: LSN : level 1
prev: 3457 next: 2390 entries: 98 offset: 484
page 104: btree leaf: LSN : level 1
prev: 3513 next: 5518 entries: 66 offset: 2108
page 105: btree leaf: LSN : level 1
prev: 4033 next: 5439 entries: 64 offset: 2144
page 106: btree leaf: LSN : level 1
prev: 2999 next: 4010 entries: 78 offset: 1480
page 107: btree leaf: LSN : level 1
prev: 4333 next: 5087 entries: 74 offset: 1728
page 108: btree leaf: LSN : level 1
prev: 4262 next: 2832 entries: 120 offset: 524
page 109: btree leaf: LSN : level 1
prev: 2502 next: 3897 entries: 74 offset: 1896
The ordering (defined by the ‘prev’ and ‘next’ links) is pretty mixed up. As your cursor moves through the data, disk blocks are going to be requested more or less randomly. If your database is read-only, you can take advantage of this trick to get things into proper order. Otherwise, any OS prefetching is not going to help unless the OS brings the entire file into memory.
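One form of that trick is a plain dump and reload: loading into a fresh database writes leaf pages out in key order, so a sequential cursor scan maps to (mostly) sequential file blocks. A sketch, assuming the db_dump and db_load utilities that ship with BDB are on your PATH, and using a placeholder file name:

```
# Rewrite a read-only database so its leaf pages land in key order.
db_dump my_favorite.db | db_load my_favorite.sorted.db

# Swap the reloaded file in -- only safe while nothing has the db open.
mv my_favorite.sorted.db my_favorite.db
```

After this, the ‘next’ links in the db_dump -d output should mostly point to the following page number, which is exactly what an OS read-ahead heuristic can exploit.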
Here’s another use case to think about. You have a primary database and one or more secondary databases that represent indexed keys on your data. This might allow you, for example, to find all Order records in a particular zip code, or all Orders that were shipped in a range of dates. Secondary databases are typically small and can be quickly navigated, but when you start pulling the primary records out, they’ll be referenced in a jumbled order. Using the reloading trick won’t really help.
In both of these cases, the blocks of the primary database are accessed in a scattered way, and yet in both cases BDB knows exactly which block will be needed next.
If you’re on Linux, you might have a trick available. You could snag the file descriptor from the primary’s DB handle (via the DB->fd method) and use the Linux ‘readahead’ system call. Given what we know about the scattered placement of blocks, it probably makes sense to read the entire file, and that only makes sense if the file is not too large in proportion to the available memory.
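Here’s a minimal sketch of that trick, assuming Linux and a descriptor already open for reading. With BDB you’d obtain the descriptor via dbp->fd(dbp, &fd) rather than open(); the helper name prefetch_whole_file is my own invention, not a BDB call.

```c
/* Sketch: nudge the kernel into pulling an entire file into the page
 * cache.  readahead(2) is Linux-specific, hence _GNU_SOURCE.  With BDB,
 * get the descriptor from the primary via dbp->fd(dbp, &fd). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 on success, -1 on failure. */
int prefetch_whole_file(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;

    /* readahead() returns once the read requests are queued; it does
     * not wait for the data to arrive, so the cursor scan can start
     * immediately while the kernel fills the cache behind it. */
    return (int)readahead(fd, 0, (size_t)st.st_size);
}
```

Note that readahead() is advisory and asynchronous: a success return means the kernel accepted the request, not that the data is resident yet.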
If that trick doesn’t make sense, you may still get some benefit from the disk cache built into the hardware. The disk controller can slurp a big hunk of contiguous data into its internal cache at once. If the OS has done its best to allocate the file contiguously, or at least closely enough, you’ll get a boost: read requests for file blocks may be satisfied without waiting for the physical disk.
There are a lot of ifs in the preceding paragraphs, which is another way of saying that prefetching may sometimes happen, but it’s largely out of the developer’s control. BDB is the layer that knows which block is needed next, so BDB is where prefetching would make the most sense. It would be really nice to have something like a DB_PREFETCH flag on DB->open, coupled with some way (an API call? an event callback?) to get the prefetching done in another thread. In general terms, I’m thinking of how the memp_trickle API helps solve the issue of double writes in write-heavy apps.
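To make the armchair design a little more concrete, here is one purely hypothetical shape such an interface could take. Nothing below exists in BDB: the DB_PREFETCH flag, the prefetch method, and the nap helper are all invented for illustration.

```
/* HYPOTHETICAL -- no such flag or method exists in BDB today. */
ret = dbp->open(dbp, NULL, "orders.db", NULL, DB_BTREE, DB_PREFETCH, 0644);

/* A dedicated thread, in the spirit of memp_trickle, keeps the cache
 * primed with the pages that open cursors will need next. */
while (running) {
    dbp->prefetch(dbp, 32 /* max pages per call */, &nfetched);
    if (nfetched == 0)
        nap();  /* nothing pending; back off briefly */
}
```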
Before we get too far into armchair API design, the best next step would be some proof that this would actually help real applications. Does your app do cursor scans of entire (huge) databases, or use secondary tables as indices to scan through portions of a huge primary? (Huge == cannot fit in the BDB, OS, or disk caches.) Obviously, Oracle has not yet prioritized these use cases in their product. Should they?