No Threads Attached

Berkeley DB, at least the core API, is mostly threadless.  If you are using any of the APIs that have been around for more than five years, there’ll be nothing happening, threadwise, in the background.

I think a lot of folks miss this point.  You can certainly use multiple threads with BDB.  Just turn on that DB_THREAD flag and all your handles are free threaded.  But does this imply that BDB is also firing off background threads?  No indeed.

An example.  Let’s talk about the cache again – one of my favorite topics!  The cache may seem a little magical, but in fact its actions are synchronous with what is happening in your API calls. If the operation you are doing needs to get a page from a DB file, then the cache is involved (you can’t bypass the cache).  If the page isn’t in cache and the cache is full, then the least recently used page in the cache is evicted.  If that page is dirty, then it is written to the backing file. Ouch.  Only when that completes is the cache page freed up so that the page you need now is read into the cache.  Double ouch.  That all happens when you do a DB->get.  Yup, it might be up to two I/O calls, just for a single page access.  If your fetch gets multiple pages, as it would in a Btree, you’re potentially doing more I/O.  And to the point of this article, none of this happens in the background.  Finally, when the API call returns, no trailing operations, like async writes, happen in the background.
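To make the "ouch" and "double ouch" concrete, here is a minimal toy model of that miss path. It is only a sketch: `toy_cache`, `toy_cache_access` and the four-slot size are names and numbers made up for illustration, the reads and writes are just counters, and the real BDB cache is far more sophisticated than this strict LRU. The point is only that every I/O is charged to the call that triggered it.

```c
#include <stddef.h>

#define CACHE_SLOTS 4      /* hypothetical, tiny on purpose */
#define NO_PAGE (-1)

/* One cache slot: the page it holds, its dirty bit, and its last use. */
struct slot {
    int page;                  /* NO_PAGE when the slot is empty */
    int dirty;
    unsigned long last_used;   /* logical clock value, 0 = never used */
};

struct toy_cache {
    struct slot slots[CACHE_SLOTS];
    unsigned long clock;       /* logical time, drives the LRU choice */
    unsigned long reads;       /* simulated reads from the backing file */
    unsigned long writes;      /* simulated writes to the backing file */
};

void toy_cache_init(struct toy_cache *c)
{
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        c->slots[i].page = NO_PAGE;
        c->slots[i].dirty = 0;
        c->slots[i].last_used = 0;
    }
    c->clock = 0;
    c->reads = 0;
    c->writes = 0;
}

/*
 * Access one page, optionally dirtying it (as a put would).
 * Returns the number of simulated I/Os this call had to do itself:
 * 0 on a hit, 1 on a miss with a clean or empty victim,
 * 2 on a miss whose victim is dirty.
 */
int toy_cache_access(struct toy_cache *c, int page, int make_dirty)
{
    struct slot *victim = &c->slots[0];
    int io = 0;

    c->clock++;
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        struct slot *s = &c->slots[i];
        if (s->page == page) {             /* hit: no I/O at all */
            s->last_used = c->clock;
            if (make_dirty)
                s->dirty = 1;
            return 0;
        }
        if (s->last_used < victim->last_used)
            victim = s;                    /* remember least recently used */
    }

    if (victim->page != NO_PAGE && victim->dirty) {
        c->writes++;                       /* ouch: flush the dirty victim */
        io++;
    }
    c->reads++;                            /* then read the wanted page */
    io++;

    victim->page = page;
    victim->dirty = make_dirty ? 1 : 0;
    victim->last_used = c->clock;
    return io;
}
```

Fill the four slots, dirty the oldest page, then ask for a fifth page: that one call returns 2, and nothing else happens before or after it.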

Now that might sound grim for a high performance embedded data store, but we’re talking about potential I/O, and we can get rid of a lot of that.  First, you’ve sized your cache right, right?  So the vast majority of accesses to get pages are going to be found in the cache.  And when navigating a Btree, certain pages like the internal pages are going to be hit a lot.  It all depends on your access pattern and cache size of course, but these internal pages tend to be highly used, and rarely evicted.  So this mostly gets rid of one ouch.
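A back-of-the-envelope way to see how much a good hit ratio buys you: in the simple picture described here, a miss costs one read, plus one write when the evicted page happens to be dirty. The function below is just that arithmetic. The name and both parameters are my own for illustration, and it assumes every miss has to evict something (a full cache).

```c
/*
 * Expected backing-file I/Os per page access in the simple model:
 * each miss costs one read, plus one write if the victim is dirty.
 * hit_ratio and dirty_victim_ratio are fractions between 0.0 and 1.0.
 */
double expected_io_per_access(double hit_ratio, double dirty_victim_ratio)
{
    double miss_ratio = 1.0 - hit_ratio;

    return miss_ratio * (1.0 + dirty_victim_ratio);
}
```

With a 99% hit ratio and half of the victims dirty, that works out to about 0.015 I/Os per page access. Getting the dirty fraction down is what the rest of this article is about.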

If you aren’t lucky enough to have your database file completely in cache and your app isn’t read-only, you have a nice little function memp_trickle at your disposal.  That function walks through pages in your cache, making sure a certain percentage of them (you pick the number) are clean.  It doesn’t evict pages, but pages that are dirty (as a result of previous DB->put or DB->del calls) will get written out to the backing DB file. Getting back to our example, if your DB->get call does need a page in the cache, it will much more likely find a clean one.  And evicting a clean page implies just some bookkeeping, and no I/O. Putting the trickle call in its own thread, doing

while (true) { int nwrote; env->memp_trickle(pct, &nwrote); sleep(10); }

offloads a large percentage of those I/Os from a synchronous DB->get call to a background activity.  Goodbye double ouch.
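If it helps to picture what a trickle pass does, here is a toy stand-in. This is not the BDB implementation: `toy_trickle` is a hypothetical name, the "cache" is just an array of dirty flags, and the order in which pages get cleaned is an assumption of the sketch. Like the real call, it takes a percentage and reports how many pages it wrote.

```c
#include <stddef.h>

/*
 * Toy trickle pass over n cache pages, where dirty[i] != 0 means page i
 * is dirty. Cleans pages (a simulated write each) until at least pct
 * percent of the pages are clean. Nothing is evicted.
 * Returns the number of pages written.
 */
int toy_trickle(int *dirty, size_t n, int pct)
{
    size_t clean = 0;
    size_t need;
    int wrote = 0;

    if (n == 0 || pct <= 0)
        return 0;
    if (pct > 100)
        pct = 100;

    for (size_t i = 0; i < n; i++)
        if (!dirty[i])
            clean++;

    /* Number of clean pages required, rounded up. */
    need = (n * (size_t)pct + 99) / 100;

    for (size_t i = 0; i < n && clean < need; i++) {
        if (dirty[i]) {
            dirty[i] = 0;      /* simulated write to the backing file */
            clean++;
            wrote++;
        }
    }
    return wrote;
}
```

With ten pages of which eight are dirty, asking for 50% clean writes three pages, and an immediate second call writes nothing. Those three writes are the ones a later DB->get no longer has to wait for.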

A philosophy. This all is in keeping with the core BDB philosophy of keeping the programmer in control.  You can control every aspect of the trickle activity: whether to do it, what percentage to clean, how often to run, even what threading library you want to use.  In fact, trickle can happen in its own process, rather than a separate thread.

Other core features, like checkpointing and deadlock detection [1] are likewise ‘synchronous’ method calls that are often run as background threads or processes.  Calling these APIs doesn’t imply any threads being spun off.
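One way to picture "synchronous calls that you schedule yourself" is a tiny housekeeping table driven from whatever thread or process you choose. Everything here is hypothetical scaffolding of mine, not a BDB API: in a real program the `run` callbacks would wrap your checkpoint, deadlock detection or trickle calls, and `count_call` is only a stand-in so the sketch has something to run.

```c
#include <stddef.h>

/* A housekeeping job is an ordinary synchronous function call. */
typedef int (*housekeeping_fn)(void *arg);

struct housekeeping_task {
    housekeeping_fn run;       /* e.g. a wrapper around a checkpoint call */
    void *arg;
    unsigned interval_s;       /* how often you decided to run it */
    unsigned long next_due;    /* time in seconds when it is next due */
};

/* Stand-in task for the sketch: counts how many times it was called. */
int count_call(void *arg)
{
    ++*(int *)arg;
    return 0;
}

/*
 * Run every task that is due at time `now` (in seconds) and reschedule it.
 * All the work happens in the calling thread; nothing is spun off.
 * Returns how many tasks ran.
 */
int run_due_tasks(struct housekeeping_task *tasks, size_t n, unsigned long now)
{
    int ran = 0;

    for (size_t i = 0; i < n; i++) {
        if (now >= tasks[i].next_due) {
            tasks[i].run(tasks[i].arg);
            tasks[i].next_due = now + tasks[i].interval_s;
            ran++;
        }
    }
    return ran;
}
```

Call `run_due_tasks` from a loop in a thread of your own, or from a separate process, and you have your "background" housekeeping, with every interval under your control.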

Moving beyond the core API, more and more threads pop up.  For starters, the replication manager API (not the replication base) does use threads to manage replication activities.  And there’s definitely knowledge of threads in the Java API (see TransactionRunner) and Core DB (see failchk).  Berkeley DB Java Edition (JE) is a cool pure Java version of Berkeley DB.  Its philosophy is quite a bit different, and a lot of the basic housekeeping like eviction and cleaning happens in background threads by default. You still have lots of control over configuration.

Finally, there’s this thing called SQLite that comes as part of BDB nowadays.  It’s the popular full-blown embedded SQL engine, integrated with a BDB backend.  Talk about a change in philosophy.  I’ll be blogging about that in the future once I have a chance to put it through its paces.  Background threads in that?  I don’t see them.

BDB has come a long way.  But if you’re doing work with the good ol’ core API, and even with some of the new stuff, you have to keep this threadless model in mind.  Just remember to spin off your own trickle thread so you can dial up the performance.


[1] Deadlock detection also may be handled synchronously with almost no penalty, but that’s a story for another day.


About ddanderson

Berkeley DB, Java, C, C++, C# consultant and jazz trumpeter

10 Responses to No Threads Attached

  1. dimitre says:

    1. If you do checkpoint do you still need to do memp_trickle? According to the documentation you shouldn’t, but I have seen an app taking a long time to flush things to disk on exit even though it is checkpointing every 30 secs.

    2. When you have one thread do checkpoint and another thread do puts, BDB will need some way to synchronize access. What is the granularity of that and how can it be controlled/tuned?

  2. ddanderson says:

    Great questions. Checkpoint and memp_trickle interact to be sure. Checkpoint will, as a ‘side-effect’, flush the cache (trickle 100%!). But these two APIs are meant to address two different needs and are usually used on different time scales. Even if you checkpoint every 5 minutes, your cache may fill up with dirty pages in the meantime. It depends on your cache size and your access pattern. Checkpointing at a high frequency as a substitute for trickle is usually wasteful, as you’re pushing out lots of pages, many of which may be ‘redirtied’ soon.

    The best plan is usually to start with a checkpoint thread – pick a frequency that makes your restart recovery time tolerable (since that’s the main purpose of checkpointing). Then, if your stats show your cache is forcing out dirty pages, it’s time to add a trickle thread.

    For point two, there are certainly mutexes in BDB that internally synchronize access to internal BDB data structures. The granularity varies; for example, there is often a mutex on entire lists that is held as long as the list is being modified, and a mutex on each item in a list that is held as long as the individual item is being manipulated or examined. Other large data structures (like the log region, the mpool region, etc.) typically have mutexes.

    There’s not much you can do to tune this directly – each mutex is held for only as long as needed to modify or examine an item, at which point the mutex is released. The length of holding a mutex is never longer than an API call, and it’s often just for a few lines of code. But that’s not always the case, and there can certainly be bottlenecks. db_stat will show you some of them, but gives you no clear indication of how to address them. Changes to code or parameters may affect your use of mutexes indirectly. For example, doing checkpoint less often will cause less overall locking of mutexes, especially in mpool data structures, which could improve overall throughput a bit. Perhaps more than a bit if those mutexes were part of a bottleneck.

    Mutexes are distinct from ‘locks’, which have the granularity of a single page (except for DB_QUEUE): each lock is associated with a single page in a DB file. These page locks are held during your API call when your program is examining or changing data on that page. These can be held beyond a single API call, for example, during a transaction or a cursor. You can control that a lot in how you code your use of transactions and cursors.

    • dimitre says:

      The reason we do a 30 sec checkpoint is that we have 8 partitions (each a separate BDB instance) and we want the application to start and stop quickly. We also do not want to “risk” too much data as we do a few thousand updates per second with TXN_NOSYNC. But I’ll look in db_stat and experiment with memp_trickle in between to see if that improves I/O.

      For the checkpoint mutexes, I have occasionally seen db_hotbackup “stuck” in txn_checkpoint() waiting on some mutex. At the same time the normal app had all its threads that need BDB access also “stuck” on some mutex. I got this info through the stack trace of Process Explorer (from Sysinternals). Since it was a production environment I couldn’t attach a debugger and actually see if the threads were making very slow progress or if they were truly deadlocked. Are there any rules of thumb when using db_hotbackup? Is it a good idea to do txn_checkpoint() from 2 threads/processes? Anything else that could be causing this “stuck” state?

      • ddanderson says:

        Probably doing two checkpoints will be competing for the same resources, so not a good idea if you can avoid it.

        There are two other solutions for ‘not losing too much data’. One is making the log buffer small enough so you effectively define how much data you can afford to lose. Then you don’t have to constantly checkpoint for that reason, but I understand if you need to checkpoint for fast recovery. Another solution is to use replication – you won’t need super fast recovery because you’ll have a hot spare, and you won’t lose any data either.

        If you see something stuck on a mutex – really stuck for a long time, that shouldn’t happen (I think?!), whether we’re talking about checkpoint or any other situation. A deadlock, or lockup, based on acquiring a page lock can happen, but that lockup would be between threads doing normal API calls (put/get/del). I don’t think txn_checkpoint would participate in holding locks. So without seeing stack traces, I honestly don’t know what the trouble would be. Start by grabbing stack traces, prove to yourself that it’s really stuck (not making progress with I/O), with a small test case if you can, and report it to the Oracle forum.

  3. Pingback: When trickle doesn’t work | libdb

  4. Pingback: Playing Fetch | libdb

  5. chang says:

    Right, I did a benchmark to understand the impact of different page size and cache size combinations. What surprised me is that increasing the cache size doesn’t improve *put* performance for any page size.

    I believe this is due to the threadless philosophy: when the cache is full, we have to wait for the I/O to complete.

    • ddanderson says:

      Yes. If your app is mostly/all puts, then having larger and larger cache just defers longer and longer when the I/O will happen. In fact, if your cache is larger than your collective database sizes, I/O won’t happen at all — until a checkpoint.

      • chang says:

        How do I use memp_trickle correctly and safely in the background thread?
        #1 In my benchmark, I only have a DB handle, i.e. I call db_create and supply a NULL DB_ENV pointer. Can I directly use the dbenv member of the DB handle, even though it is commented as private?

        #2 Do I need to use the DB_THREAD flag when creating the DB handle?


      • ddanderson says:

        You will need to have an environment to run trickle. If all you have is a DB handle, you can get the environment via DB->get_env(). Yes, you’ll need the DB_THREAD flag.
