Taste your data

I saw Temple Grandin speak recently and heard a number of nuggets of wisdom from someone who truly thinks differently. If you don’t know the name, Dr. Grandin is a high-functioning autistic person who has revolutionized the livestock industry by understanding, at a sensory level, what cattle are feeling and how that influences their behavior. The remarkable thing is that her autism gave her a new viewpoint that has been valuable both for her and for the industry.

One of her points was that many of us are top-down thinkers. We approach problems from the top, breaking them down bit by bit. Why is my program slow? Is it the framework? The data store? The network? Obviously, this sort of analysis is important and usually the right way to start (let’s select a great algorithm before implementing, please). But when you’re not making progress on the tuning side of things and need another angle, it can help to look at things from the other end.

Temple is a sensory person.  The five senses affect her directly.  She has no choice but to start with these details and process bottom up. In our domain, that would be starting with the data. The raw bits.

Visualize…

So if you are trying to improve database performance, it can often be helpful to move beyond the code and look directly at the data you are storing. In Berkeley DB, you can do this using db_dump. db_dump is typically used (along with db_load) to dump and restore databases, but why not use it as a raw visualization tool?

db_dump details
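If you have never looked at one, a dump reads roughly like the sketch below. The header lines and the alternating hex-encoded key and data lines are what db_dump actually emits; the file name and the values here are made up purely for illustration:

    $ db_dump mydb.db
    VERSION=3
    format=bytevalue
    type=btree
    db_pagesize=4096
    HEADER=END
     09174000
     4a6f686e20536d6974680000000000000000000000000000000000002d000000
     0a023000
     4d617279204a6f6e65730000000000000000000000000000000000001e000000
    DATA=END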

In the sample above (inspired by, but not copied from, real customer data), the first line of numbers, 09174000, is a key, followed by its data on the next line. Then another key, more data, and so on. I’ve shown just a couple of key/data pairs to get the creative juices flowing; you’ll probably want to pipe db_dump through your favorite pager and let a lot of entries fill up your screen.

…to get ideas

Once you start immersing yourself in the actual data you’ll probably get lots of ideas. When I see data like in the sample above, I have to ask why there are all those zeroes in the same place for each data entry. Is there a way to condense or compress the data?

Why is this relevant? Because if you can make your data smaller, one of the nicest payoffs is cache efficiency: with data at half its original size, you can fit twice as many entries in the same size cache. Shrinking records can also reduce the number of overflow pages, which brings several benefits of its own (I’ll have to get into that in another post).
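As one illustration of the idea, suppose those zeroes turn out to be padding in a fixed-width record. A simple approach is to trim the padding before storing and restore it on the way back out. Here is a rough sketch in C against the Berkeley DB API; RECORD_SIZE and the helper names are hypothetical, not taken from any real schema:

    /* Sketch: strip trailing zero padding before storing, restore it on read.
     * RECORD_SIZE and the record layout are hypothetical. */
    #include <string.h>
    #include <db.h>

    #define RECORD_SIZE 128

    int put_trimmed(DB *db, DBT *key, const char rec[RECORD_SIZE])
    {
        DBT data;
        u_int32_t len = RECORD_SIZE;

        while (len > 0 && rec[len - 1] == '\0')
            len--;                          /* drop the zero padding */

        memset(&data, 0, sizeof(data));
        data.data = (void *)rec;
        data.size = len;
        return db->put(db, NULL, key, &data, 0);
    }

    int get_padded(DB *db, DBT *key, char rec[RECORD_SIZE])
    {
        DBT data;
        int ret;

        memset(&data, 0, sizeof(data));
        if ((ret = db->get(db, NULL, key, &data, 0)) != 0)
            return ret;
        memset(rec, 0, RECORD_SIZE);        /* restore the padding */
        memcpy(rec, data.data,
            data.size < RECORD_SIZE ? data.size : RECORD_SIZE);
        return 0;
    }

You pay a few cycles of copying per access, but every cache page can now hold more records.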

You might also sniff out some other peculiarities. Hmmm, the keys all appear to have zeros at the end, rather than leading zeros. And they aren’t correctly sorted. Did we forget to ensure the most significant byte goes first? That sort of error can cost a lot in the locality of accesses, which can again lead to caching inefficiencies.
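If the keys really are little-endian integers, the fix can be as small as storing them in big-endian (network) byte order, so the btree’s default byte-by-byte comparison agrees with numeric order. A sketch, again with a hypothetical helper name:

    /* Sketch: store an integer key most-significant-byte first so the
     * btree's default lexicographic comparison matches numeric order.
     * put_by_id is a hypothetical helper. */
    #include <arpa/inet.h>                  /* htonl */
    #include <stdint.h>
    #include <string.h>
    #include <db.h>

    int put_by_id(DB *db, uint32_t id, DBT *data)
    {
        uint32_t be_id = htonl(id);         /* big-endian byte order */
        DBT key;

        memset(&key, 0, sizeof(key));
        key.data = &be_id;
        key.size = sizeof(be_id);
        return db->put(db, NULL, &key, data, 0);
    }

With keys stored this way, records with nearby ids land on nearby pages, which is where the locality (and caching) win comes from.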

Obviously, your data is going to look different, and you’re going to notice different things. But you’ll never notice unless you look.

Get down to the details – smell it, taste it. And then fix it.

