Divide and Conquer

Berkeley DB is often used in extreme cases. One extreme case is an application that is almost write-only. That is, data is being collected as fast as possible, and shoveled into a BDB file or files. In some applications there may be some simple queries directly on this data; in others the data may be digested at a later time. In any event, the game is all about how many records you can stuff into a data file per second.

If the application is fully transactional, the log may govern how fast the process can proceed. Full transactional semantics on a single box pretty much require synchronous writes to the log, and that just takes time. Getting beyond this calls for either replication in the solution or some relaxation of your transactional requirements (DB_TXN_NOSYNC and DB_TXN_WRITE_NOSYNC are your good buddies).

Once you choose your way to dodge the logging logjam, you can get some real throughput. However, when you start adding more threads to the problem, you’ll soon discover that there may be another bottleneck. One symptom is that multiple threads start spending a good fraction of their time waiting for access to page zero (the ‘meta’ page). If you break at random times in a debugger, you might often see that they are in the middle of a call to bam_split().  The function name ‘bam_split’ is shorthand for a btree access method (bam) page split. Yeah, I’m assuming btree here, but this can happen with hash databases too.

It’s the nature of the problem.  Lots of new records.  Data fills leaf pages.  Pages split. New pages need to be allocated.  The single free list holding the pages is on the meta page.  The leaf pages, at least one internal node, and the meta page all have write locks. When lots of threads do this at the same time, the common element is the meta page. Contention.

Here’s one way around the problem: divide up your data. Say your key is an IP address. Imagine putting all IP addresses that are even into one DB file, and all the odd ones into another one. Whenever you want to look for an IP address, you know which table (odd or even) to look at. Now, let’s generalize that so that instead of odd/even, we can use the IP number modulo 4 (to get four database files). Or modulo 8, etc. To get a better distribution, or handle arbitrary key types, we can hash the key and take a modulo.

Sure, you can do all that yourself. But if you’re using BDB 4.8 or greater, you get this for free with the DB->set_partition method. And here’s a really cool thing: your cursors on a partitioned database will just work. You can still walk your ordered Btree, blissfully unaware that your data is coming a little bit from here, a little bit from there…
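Under the hood, a partition callback boils down to “hash the key, take a modulo”. Here’s a minimal sketch of that logic as a standalone function of my own invention (the actual DB->set_partition registration and its DBT-based callback signature are omitted):

```c
#include <stddef.h>
#include <stdint.h>

/* Map a key to one of nparts partitions using an FNV-1a hash.
 * Hypothetical helper, not part of the BDB API -- but the same
 * idea would live inside a DB->set_partition callback. */
uint32_t partition_of(const void *key, size_t len, uint32_t nparts)
{
    const unsigned char *p = key;
    uint32_t h = 2166136261u;      /* FNV-1a offset basis */
    while (len-- > 0) {
        h ^= *p++;
        h *= 16777619u;            /* FNV-1a prime */
    }
    return h % nparts;
}
```

Every lookup hashes the key the same way, so reads land on the right partition without the application doing anything special.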

What does this buy you? Well, if you split into eight partitions, you get eight BDB files. That’s eight meta pages. If you’ve got four threads, the odds get much better that any particular page split will not be blocked by a page split in another thread. Less contention means greater throughput, lower latency, mom, apple pie, your team wins.

Do you have a favorite extreme case?  Let me know about it.


Taste your data

I saw Temple Grandin speak recently and I heard a number of nuggets of wisdom from someone who truly thinks different. If you don’t know the name, Dr. Grandin is a very high functioning autistic person who has revolutionized the livestock industry by understanding at a sensory level what cattle are feeling and how that influences behavior. The remarkable thing is that her autism gave her a new viewpoint that has been valuable for her and the industry.

One of her points was that many of us are top-down thinkers. We approach problems from the top, breaking them down bit by bit. Why is my program slow? Is it the framework? The data store? The network? Obviously, this sort of analysis is important and usually the right way to start (let’s select a great algorithm before implementing, please).  But when you’re not making progress on the tuning side of things and need another angle, it can be helpful to look at things from the other end.

Temple is a sensory person.  The five senses affect her directly.  She has no choice but to start with these details and process bottom up. In our domain, that would be starting with the data. The raw bits.


So if you are trying to improve database performance, it can often be helpful to move beyond the code and look right at what data you are storing. In Berkeley DB, you can do this using db_dump.  db_dump is typically used to dump and restore databases, but why not use it as a raw visualization tool?
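db_dump’s default output is a short header followed by alternating key and data lines in hex. The sample below is invented, but shaped like the data discussed here:

```
VERSION=3
format=bytevalue
type=btree
db_pagesize=8192
HEADER=END
 09174000
 0000000000334a6d000000000000001c
 0a620300
 0000000000334a91000000000000004b
 09170200
 0000000000334b02000000000000003e
DATA=END
```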

db_dump details

In the sample above (inspired by, but not copied from, real customer data), the first line with numbers, 09174000, is a key, followed by the data. Then another key, data, etc. I’ve shown just a couple of key/data pairs to get the creative juices flowing — you’ll probably want to pipe your db_dump through your favorite pager and let a lot of entries fill up your screen.

…to get ideas

Once you start immersing yourself in the actual data you’ll probably get lots of ideas. When I see data like in the sample above, I have to ask why there are all those zeroes in the same place for each data entry. Is there a way to condense or compress the data?

Why is this relevant? Because if you can get your data smaller, one of the nicest effects is in cache efficiency – with data that’s half its original size you’ll be able to fit twice as many entries in your currently sized cache. Another possible effect is reducing the number of overflow pages, which can have multiple good effects. (I’ll have to get into that in another post).

You might also sniff out some other peculiarities. Hmmm, the keys all appear to have zeros at the end, rather than leading zeros. And they aren’t correctly sorted. Did we forget to ensure the most significant byte goes first? That sort of error can cost a lot in the locality of accesses, which can again lead to caching inefficiencies.

Obviously, your data is going to look different, and you’re going to notice different things. But you’ll never notice unless you look.

Get down to the details – smell it, taste it. And then fix it.


Choose your battles

Here’s an interesting set of posts from Shaneal Manek, the developer behind signpost.com (born as postabon.com).  He first makes the point that LISP is a fast language for internet server development, and with some small optimization hints, can be 4x faster than Java, and 15x faster than Python.  I think that qualifies him as a free thinker.

For most projects, I’ve long been a proponent of choosing a language based on how quickly and accurately developers can get their work done, and not so much on the basis of how fast the language is.  Yeah, rewriting for speed in another language is a fine thing to think about in version 3 or 4, well after you have something that works and delivers the goods.  While someone in a lower level language is still debugging their intermittent crashes (the long tail of development), you’ll have time to both refine your feature set and tweak the parts of the application that are the slowest.

Which brings us to post number two.  He argues that if you’re going to think about optimizing and scaling, think about the huge chunk of time your application is waiting for databases or disks.

Because of the huge disparity between CPU and disk speed, many applications are disk or I/O bound. If rewriting your application in C makes your code run twice as fast, it may only buy you an absolute performance boost of 10%. On the other hand, if you could make your database go twice as fast, it might buy you a net 30% performance boost – which is huge (and still far easier to do than rewriting your app in C).
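The arithmetic behind those figures is Amdahl’s law: speeding up a fraction p of the run time by a factor s speeds the whole program up by 1 / ((1 - p) + p/s). A one-line sketch (the 18% and 46% fractions below are my own, picked to reproduce the 10% and 30% numbers):

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction p of run time
 * is made s times faster. */
double amdahl(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}
```

If roughly 18% of the time is CPU-bound application code, doubling its speed gives amdahl(0.18, 2.0), about 1.10; if roughly 46% is database time, doubling that gives amdahl(0.46, 2.0), about 1.30.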

He makes his case with supporting benchmarks.  Yes, your mileage will vary.

Speaking of which, if your app has a server-based SQL backend and you are looking for a speed boost, I suggest you ask yourself these questions:

  • What is the proportion of time your app spends in database calls?
  • Out of that, for the heaviest used table (or even single query), what is the proportion of time spent waiting on that?
  • What would it take to rewrite just that table or single query access with a No-SQL solution?
  • What would it gain?

One of the user comments on Shaneal’s post says the comparison given is apples to oranges: MySQL is network based, and BDB is an in-process solution.  He suggests comparing MySQL with a network based key-value store.  The point is a good one: you don’t really know whether the impressive speedups were due to BDB being key-value rather than SQL, or due to everything running within the same process.  I suspect the reason is some of both.

Database types in a 2 by 2 matrix:

                     SQL              key-value
   network/server    MySQL            memcachedb
   in-process        SQLite (on BDB)  core BDB

If you’re in the upper left corner, you can fully explore the matrix by trying the lower left first: SQLite (the BDB distribution now includes a version of SQLite that is backed by BDB).  For the upper right, you could try a network key value store, like memcachedb. Core BDB belongs in the lower right category.

In any event I think the main point is clear.  When looking for a speedup, understand what is taking time and figure out what the alternatives will buy you.  Beware these words — “I see this is taking a good chunk of our time, but unfortunately we can’t do anything about that”. Because sometimes you can.


No Threads Attached

Berkeley DB, at least the core API, is mostly threadless.  If you are using any of the APIs that have been around for more than five years, there’ll be nothing happening, threadwise, in the background.

I think a lot of folks miss this point.  You can certainly use multiple threads with BDB.  Just turn on that DB_THREAD flag and all your handles are free threaded.  But does this imply that BDB is also firing off background threads?  No indeed.

An example.  Let’s talk about the cache again – one of my favorite topics!  The cache may seem a little magical, but in fact its actions are synchronous with what is happening in your API calls. If the operation you are doing needs to get a page from a DB file, then the cache is involved (you can’t bypass the cache).  If the page isn’t in cache, then the least recently used page in the cache is evicted.  If that page is dirty, then it is written to the backing file. Ouch.  Only when that completes is the cache page freed up so that the page you need now is read into the cache.  Double ouch.  That all happens when you do a DB→get.  Yup, it might be up to two I/O calls, just for a single page access.  If your fetch gets multiple pages, as it would in a BTree, you’re potentially doing more I/O.  And to the point of this article, none of this happens in the background.  Finally, when the API call returns, no trailing operations, like async writes, happen in the background.
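To make the cost concrete, here’s a toy model of that synchronous path (my sketch, not BDB code), with a single cache slot standing in for the least recently used victim:

```c
#include <assert.h>

/* Toy model of the synchronous fetch path: all the work happens
 * inside the caller's DB->get, nothing runs in the background. */
typedef struct { int page; int dirty; } slot_t;

int reads = 0, writes = 0;   /* I/O counters */

void fetch(slot_t *lru, int page)
{
    if (lru->page == page)
        return;              /* cache hit: no I/O at all */
    if (lru->dirty) {
        writes++;            /* ouch: write the dirty victim first */
        lru->dirty = 0;
    }
    reads++;                 /* double ouch: then read the new page */
    lru->page = page;
}
```

Fetching a new page while a dirty page occupies the slot costs a write plus a read, the two ouches; fetching the same page again is free.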

Now that might sound grim for a high performance embedded data store, but we’re talking about potential I/O, we can get rid of a lot of that.  First, you’ve sized your cache right, right?  So the vast majority of accesses to get pages are going to be found in the cache.  And when navigating a Btree, certain pages like the internal pages are going to be hit a lot.  It all depends on your access pattern and cache size of course, but these internal pages tend to be highly used, and rarely evicted.  So this mostly gets rid of one ouch.

If you aren’t lucky enough to have your database file completely in cache and your app isn’t readonly, you have a nice little function memp_trickle at your disposal.  That function walks through pages in your cache, making sure a certain percentage of them (you pick the number) are clean.  It doesn’t evict pages, but pages that are dirty (as a result of previous DB→put or Db→del calls) will get written out to the backing DB file. Getting back to our example, if your DB→get call does need a page in the cache, it will much more likely find a clean one.  And evicting a clean page implies just some bookkeeping, and no I/O. Putting the trickle call in its own thread, doing

int nwrote;
while (1) { env->memp_trickle(env, pct, &nwrote); sleep(10); }

offloads a large percentage of those I/Os from a synchronous DB→get call to a background activity.  Goodbye double ouch.

A philosophy. This all is in keeping with the core BDB philosophy of keeping the programmer in control.  You can control every aspect of the trickle activity: whether to do it, what percentage to clean, how often to run, even what threading library you want to use.  In fact, trickle can happen in its own process, rather than a separate thread.

Other core features, like checkpointing and deadlock detection [1] are likewise ‘synchronous’ method calls that are often run as background threads or processes.  Calling these APIs doesn’t imply any threads being spun off.

Moving beyond the core API, more and more threads pop up.  For starters, the replication manager API (not the replication base) does use threads to manage replication activities.  And there’s definitely knowledge of threads in the Java API (see TransactionRunner) and Core DB (see failchk).  Berkeley DB Java Edition (JE) is a cool pure java version of Berkeley DB.  Its philosophy is quite a bit different, and a lot of the basic housekeeping like eviction and cleaning happen in background threads by default. You still have lots of control over configuration.

Finally, there’s this thing called SQLite that comes as part of BDB nowadays.  It’s the popular full-blown embedded SQL engine, integrated with a BDB backend.  Talk about a change in philosophy.  I’ll be blogging about that in the future once I have a chance to put it through its paces.  Background threads in that?  I don’t see them.

BDB has come a long way.  But if you’re doing work with the good ol’ core API, and even with some of the new stuff, you have to keep this threadless model in mind.  Just remember to spin off your own trickle thread so you can dial up the performance.


[1] Deadlock detection may also be handled synchronously with almost no penalty, but that’s a story for another day.


Sizing cache

When I first started working on performance with BDB core, I remember learning that the cache size tuning parameter was the biggest dial you could turn to boost performance for an app.  I envisioned this huge dial, and the temptation was to turn it all the way, up to eleven! But there are some subtleties involved here.  Picking that cache size number is a bit like Goldilocks and the Three Bears.  You don’t want it too small, you don’t want it too large, you want it just right. Fortunately, there’s usually some leeway, though like many things in BDB, it can depend a lot on your own situation.

Not too large. First of all, you need to consider how much memory you can afford for the BDB application.  Perhaps your scenario is to have a pretty much dedicated box for your BDB work.  In that case, you’ll want to pick a number that is somewhat less than physical memory. Even on a dedicated box, the OS needs memory and there are other processes (cron, shells) that may come and go.  For a dedicated box with more than a gig of memory, maybe around 75% of physical memory is a good place to start.  Out of that 75%, you’ll need to subtract out the data size of your application (your JVM heap size if you’re in Java) – what you end up with can be used for BDB cache.  That 75% figure gives us a little safety margin.  If you pick a number that is too large, some of your BDB cache may get paged out, and that will lead to very bad performance (and db_stat won’t clue you in to that [1]).

Not too small. If you don’t have a dedicated system for your BDB installation, another alternative is to work from the bottom up.  By default, BDB environments have a cache size of approximately 256K [2].  That’s tiny, but almost guaranteed to work without giving errors about running out of space on all but the most memory starved systems.  It turns out that for most folks, this is really small.

How tiny is 256K?  Let’s suppose you have a database of a million items; each item has a 10 byte key and 100 byte data.  You have 8K disk blocks.  On the face of it, each key/data pair occupies 110 bytes, and with a million of those, that’s 110 megabytes.  There’s some extra overhead, and some free space in each BDB database, so let’s round the total size to 150 megabytes (we can talk about that guesstimate another time).  Another quick way to tell the size of your database is to take a look at the size of your .db file.  Okay, as a first guess, we can say that only one out of every 600 pages in our 150M database fits in that 256K cache (600 * 256K = ~150M).  Yipes!  Doesn’t that mean a cache hit rate of less than 0.2%? Hold on a second, it depends greatly on the locality of your references.  If you have a pretty localized access pattern, perhaps hitting the same keys many times, or maybe hitting nearby keys in the Btree, you’ll actually do a lot better than that.  A LOT.  You can look at your cache hit rate in BDB by running db_stat -m.

Well let’s suppose your actual cache hit rate is much better, about 70%. Now that sounds pretty good, doesn’t it?  Not really.  Each of those misses might take 1000 times (or more) as long to process as each hit.  To quote the proverb: disk slow, memory fast.  You’re spending at least 99% of your time waiting for disk.  If you could cut your misses in half (bringing your cache hit rate to 85%) you would about double the speed of the database part of your application.
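The model behind those numbers is simple enough to write down. Assuming a miss costs 1000 times a hit (my stand-in figure for the disk/memory gap):

```c
#include <assert.h>

/* Average cost per access, measured in units of one cache hit,
 * when a miss costs miss_penalty hits. */
double avg_cost(double hit_rate, double miss_penalty)
{
    return hit_rate + (1.0 - hit_rate) * miss_penalty;
}
```

avg_cost(0.70, 1000) is about 300.7, nearly all of it disk time, while avg_cost(0.85, 1000) is about 150.9: halving the misses roughly halves the database time.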

So how do you improve the cache hit rate?  Well the easiest way is to pull that lever called the cache size.  Like we said, 256K is very small.  How about a megabyte instead? How about 10 megabytes?  Again, it depends on how much locality your application has, but generally speaking, you’ll see steady performance increases up to some plateau. As your databases get larger and larger that plateau may get higher and higher. Sometimes the plateau goes beyond the memory you have available – so you size as large as you can. But since you’re competing with other memory users on your system, bear in mind the Not too large guideline above.

Remember, all this is rough estimates and rules of thumb.  Your mileage will vary. The best way to know is to experiment with different cache sizes with your application on your system, with some real data loads.  Having an adequate set of tests with an adequate quantity and quality of data is pretty important to finding out how far to turn that dial.


[1] The system utility iostat will help you diagnose this. Also you might even hear your disk churn.
[2] For the default cache size, see the Berkeley DB source code, function __memp_env_create() in /mp/mp_method.c .


What took me so long

I’ve been meaning to do this for a long time.  Every time I visit a client.  Every time I discover something interesting in the API.  Every time I hear news about BDB.

You see, for years, I’ve been working with folks to improve their Berkeley DB experience.  To make their system go faster.  To handle more data.  Sometimes just to understand.  And with each interaction, I understand a little more, too.  I’ve found that similar themes come up again and again.  Cache isn’t big enough.  Let’s partition the data set.  How about we find some locality in your access patterns.  These are things I could blog about.

Many of these clients have given me hard problems, and I’ve learned a lot from that.  In this blog, we’re going to try to discover, dissect, and distill some of these lessons.  With your remarks and feedback, I’m hoping we’ll learn a little bit more.  Together.

