Why are column oriented databases so much faster than row oriented databases?

  Terence Siganakis        2012-01-29 04:27:05       3,276        0    

I have been playing around with Hybrid Word Aligned Bitmaps for a few weeks now, and they turn out to be a rather remarkable data structure.  I believe that they are utilized extensively in modern column oriented databases such as Vertica and MonetDB.

Essentially HWABs are a data structure that allows you to represent a sparse bitmap (series of 0's and 1's) really efficiently in memory.  The key trick here is the use of run length encoding to compress the bitmap into fewer bits while still allowing for lightening fast operations.  

They key operation from my perspective is the "AND" operation.  "Why is that useful?" I hear you ask.  Well, imagine we have the following query:

SELECT Animal, Colour, COUNT(*) 
FROM AnimalDetails 
WHERE City='Melbourne'
GROUP BY Animal, Colour

On a table that looks like:

Animal, Colour, City
Dog, White, Melbourne
Cat, Black, Sydney
Cat, White, Melbourne
Dog, Black, Melbourne

In order for me to process the query, I need first to calculate the WHERE condition, which is to say find all the rows that satisfy the condition "City='Melbourne'".  

Then, given those rows (a list of row ids) I need to group the results to Animal and then by Colour.

Imagine for a moment that our data for "Animal", "Colour" and "City" are all stored as simple arrays, with the array index 0 for all arrays creating row 0, so Animal[0] = Dog, Colour[0] = White, City[0] = Melbourne, forming the first row.

That would work well, but its a bit clunky since I have to visit every row in the table to satisfy the query.  What I really want to do is to index the data, so I change the structure of the data, such that each Array is now a map using the distinct values as keys and a list of the row_ids that the value is found in.  Then my "Animal" column can be represented as:

Animal["Dog"] = {1, 4}
Animal["Cat"] = {2, 3}

Indicating that for the column "Animal" the value "Dog" can be found at rows 1, 4.  This is repeated for all the other columns.

This means that satisfying my WHERE condition means I only need to seek to the "Melbourne" value in my City structure, and read off the list of row_ids {1, 3, 4}.  Awesome!

It then also means that for my group by conditions I can do the same thing, but a little differently.  Since I have a list of all the distinct values, I can just walk through the list of distinct values and collect the list of row ids, then Intersect it with the result of the WHERE condition.

So to find the Distinct values of "Animal" that satisfy the condition I end up with:

Seek for "Dog" in Animals, return {1, 4} then INTERSECT {1, 3, 4} = {1, 4}
Seek for "Cat" in Animals, return {2, 3} INSERSECT {1, 3, 4} = {3}

I can then take those results and intersect them with the results of "Colour":

Dogs
"White" {1, 2} INSERSECT {1, 4} = {1} (White Dog)
"Black" {3, 4} INSERSECT {1, 4} = {4} (Black Dog)

Cats
"White" {1, 3} INTERSECT {3} = {3} (White Cat)
"Black" {2, 4} INSERSECT {3} = {} (No Black Cats)

The problem here is all these INTERSECT operations are really expensive, especially when the conditions end up producing a lot of rows.  

On my test data set (a million rows), a similar query to this took about 4000ms, with the vast majority of time being taken up in calculating the intersections.

Now, if you structure the data a little bit differently you can represent the values as bitmaps.

Animal["Dog"] = 1001
Animal["Cat"] = 0110

Here, "Dog" can be found at row id 1 and 4.  Therefore we set the value of the bit at rows 1 and 4 to 1, and 0 elsewhere.

Now my first index seek to satisfy my WHERE Condition is just a matter of looking up the Bitmap for "City = Melbourne", which I see is 1011.

I can then find my bitmaps for "Animal" and do the AND operation on the bitmaps rather than the intersection, giving me:

"Dog" 1001 AND 1011 = 1001 (rows {1,4})
"Cat" 0110 AND 1011 = 0010 (rows {3})

In my example, running the query using Bitmaps now takes in the order of 40ms, or about 100 times faster.

My next post will probably go into a bit more detail about how Hybrid Word Aligned Bitmaps actually work.  But if you are curious have a look at: 

https://sdm.lbl.gov/fastbit/

Or a pretty good .Net implementation:

http://softwarebotanysun.codeplex.com/

 

Cheers!

Source:http://siganakis.com/using-bitmap-indexes-in-query-processing

DATABASE  COLUMN ORIENTED  SPEED ANALYSIS  VERTICA 

       

  RELATED


  0 COMMENT


No comment for this article.



  RANDOM FUN

A day of an intern