Doing the splits

Back to the land of ‘what-if’. Last time we talked about prefetch and what it might buy us. Today we’re talking about yet another BDB non-feature: presplit.

The concept is motivated by a use case where we are pumping lots of data into a Btree database. We’re not limited by disk speed: perhaps the database fits in memory, logging is either in memory (possibly backed by replication) or we are non-transactional. At this blistering high speed, we see a new blip on the radar: page splits.

When a Btree is growing, pages inevitably fill up. When a new key data pair cannot fit into the leaf page it sorts to, the page must be split into two. A new page is allocated, and we copy half of the leaf page data is to the new page. Also the parent internal page is updated to have the first key on the new page. That’s three pages being modified. A lot of bookkeeping and shuffling is involved here, disproportionate to the amount of bookkeeping normally needed to add a key/value pair.

In keeping with the theme of pushing tasks into separate threads, we might envision a way to anticipate the need for a split. An extra check could happen during a DB->put, and would trigger the presplit when a page is 90% filled, or perhaps when one additional key/data pair of the same size of the current one would trigger a split. Whatever the trigger condition, information about what page needed to be split could be recorded – perhaps in a queue database, to be consumed by another thread whose sole job it is to do the splits and keep that activity out of the main thread.

The benefit is that by the time another record is added, the page is already split. The same amount of work is done (and it’s all CPU), but it’s done in advance, and in another thread. When the next DB->put rolls around, it can be fast, so latency is reduced. It’s a little like hotel laundry service – I put a dirty shirt in a bag on my door and in parallel with my busy day, the dirty shirt is being cleaned and made ready for me when I need it.

When CPU is maxed out, this is not really a smart optimization: it’s true that you’re doing work in advance, but since it’s all CPU, you’ve just reshuffled the ordering that you’re doing things. Throughput and latency might get slightly worse. Using the laundry analogy, when the CPU is maxed out it’s like the hotel staff is off busy doing other stuff, and the laundry is self serve. Putting my clothes in a bag and hanging it on the door so I can take it to the laundry myself just adds to the overhead.  And cleaning my shirt as soon as it’s dirty vs. when it’s needed may also be wasteful – I might be checking out of this hotel before I need that shirt!

On the other hand, CPU horsepower continues to grow. Just think about the steady increase of cores on almost every platform, even phones. So the max-out argument may become less important over time, at least for apps that need record splitting low latency.

Like last time, I have to say that I don’t know exactly if the benefit would be substantial enough. Although it’s fun to think about, I’m not completely convinced that the extra costs of having a presplit thread will be justified by the returns. Studying presplit and prefetch does seem like a great research assignment for an enterprising grad student. Code tinkering, measurements, publications, press releases, notoriety and the chance to give back to the open source community. Cool stuff all.

We’ve probably spent enough time in ‘what-if’ for a while. I promise next post we’ll get back firmly grounded in reality.


About ddanderson

Berkeley DB, Java, C, C , C# consultant and jazz trumpeter
This entry was posted in Uncategorized. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s