Berkeley DB is often used in extreme cases. One extreme case is an application that is almost write-only. That is, data is being collected as fast as possible, and shoveled into a BDB file or files. In some applications, there may be some simple queries directly on this data, in others the data may be digested at a later time. In any event, the game is all in how many records can you stuff into a data file per second.
If the application is fully transactional, the log may govern how fast the process can proceed. Full transactional semantics on a single box pretty much require one to do synchronized writes to the log, and that just takes time. Getting beyond this either asks for replication in the solution or some relaxation on your requirements for transactions (DB_TXN_NOSYNC and DB_TXN_WRITE_NOSYNC are your good buddies).
Once you choose your way to dodge the logging logjam, we can get some real throughput. However, when you start adding more threads to the problem, you’ll soon discover that there may be another bottleneck. One symptom is that multiple threads can start spending a good fraction of their time waiting for access to page zero (the ‘meta’ page). If you break at random times in a debugger, you might see often they are in the middle of a call to bam_split(). The function name ‘bam_split’ is a shorthand for btree access method (bam) page split. Yeah, I’m assuming btree here, but this can happen in hashes too.
It’s the nature of the problem. Lots of new records. Data fills leaf pages. Pages split. New pages need to be allocated. The single free list holding the pages is on the meta page. The leaf pages, at least one internal node, and the meta page all have write locks. When lots of threads do this at the same time, the common element is the meta page. Contention.
Here’s one way around the problem: divide up your data. Say your key is an IP address. Imagine putting all IP addresses that are even into one DB file, and all the odd ones into another one. Whenever you want to look for an IP address, you know which table (odd or even) to look at. Now, let’s generalize that so that instead of odd/even, we can use the IP number modulo 4 (to get four database files). Or modulo 8, etc. To get a better distribution, or handle arbitrary key types, we can hash the key and take a modulo.
Sure, you can do all that yourself. But if you’re using BDB 4.8 or greater, you get this for free with the DB->set_partition method. And here’s a really cool thing: your cursors on a partitioned database will just work. You can still walk your ordered Btree, blissfully unaware that your data is coming a little bit from here, a little bit from there…
What does this buy you? Well, if you split into eight partitions, you get eight BDB files. That’s eight meta pages. If you’ve got four threads, the odds start getting much better than any particular page split will not be blocked against a page split in another thread. Less contention means greater throughput, lower latency, mom, apple pie, your team wins.
Do you have a favorite extreme case? Let me know about it.