Why a Column-Oriented Database

There are multiple types of databases: transactional databases, document databases, and column-oriented databases, among others. To pick the right type of database for a project, one has to be aware of the advantages and trade-offs that each type has.

For example, the main focus of transactional databases is on enforcing the rules that govern data consistency after every update, but they trade away performance of data retrieval (familiar with the inner-join maze?). This performance loss can be mitigated by improving the hardware the database runs on, or by using very smart schemas, but until one puts a column database and a transactional one side by side, at the same hardware and software level, one does not realize what “performance” means.

By the same hardware level I mean running in the same hardware and network environment. It has always amazed me how some vendors claim huge performance, but when it is time to do an installation, the hardware requirements to reach that performance are... considerable.

By the same software level I mean the same OS, patches, security context, etc., as well as the same type of database. Are both service oriented? Or are both file oriented? And how many files are we talking about? For some reason dBase comes to mind (1 file per table, plus 1 file per index, plus 1 file per memo field… aarrgghhh!!). So, for example, it wouldn’t be fair to compare SQLite with SQL Server, because they are tuned for very different usage scenarios.

Another example is document databases, which trade data consistency across documents for simplicity of storage. Some products can manage consistency rules; some other products do not.

Then there are column databases, where the focus is on the performance of aggregations.

In a column database the data is stored in columns, not in rows, nor as key-value pairs. This kind of database is not concerned with data changes, because the most common scenario is that the data will be loaded once, and then multiple reports and aggregations will be extracted from it.

Since data classification and aggregation are the priority, it is convenient to correctly identify which columns will be used to filter the data that will be aggregated. If we think in terms of a star schema, the former are dimensions, which have a well-known set of possible values, and the latter are measures, which can be manipulated with math formulas.

By pivoting the data storage, some interesting properties appear. Given the following table:

Row  State  Population
0    CT     1000
1    NY     1500
2    CT     700
3    NY     300
4    NY     2000
5    CT     3500
6    NY     900

If the data is stored row by row, it would look on disk the same as displayed above. Or it can be stored by column, which will make it look like this:

State  Rows
CT     0, 2, 5
NY     1, 3, 4, 6

Row  Population
0    1000
1    1500
2    700
3    300
4    2000
5    3500
6    900

What is interesting is that the State dimension has two contexts in which to aggregate the population, namely CT and NY. Also, each context has an array of row numbers, so for each context we only need to store whether a particular row belongs to it or not. Yes or no. True or false. It can be stored as a bit array. So instead of storing a full integer per row ID, and since the number of rows is less than 8, each of the two arrays can be managed inside a single byte (8 bits).
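
To make this concrete, here is a minimal Python sketch (illustrative only; a real engine would work on packed machine words rather than Python integers) that builds one bitmap per state from the table above:

```python
# One bitmap per dimension value: bit i is set when row i belongs
# to that value. With only 7 rows, each bitmap fits in one byte.
states = ["CT", "NY", "CT", "NY", "NY", "CT", "NY"]

bitmaps = {}
for row, state in enumerate(states):
    bitmaps[state] = bitmaps.get(state, 0) | (1 << row)

# Bit order reads right to left: bit 0 is row 0.
print(f"CT -> {bitmaps['CT']:08b}")  # 00100101 (rows 0, 2, 5)
print(f"NY -> {bitmaps['NY']:08b}")  # 01011010 (rows 1, 3, 4, 6)
```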

Once the data is stored as bit arrays, intersections and unions become really fast, because instead of traversing indices (or worse, the dreaded “table scan” in a SQL plan), they become the bit operations ‘and’ and ‘or’. Most processors can, in one instruction, compute the intersection of many rows at once.
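
Continuing the sketch above (same hypothetical bitmaps dictionary), a filtered aggregation becomes a walk over the set bits, and combining two filters is a single bitwise operation:

```python
populations = [1000, 1500, 700, 300, 2000, 3500, 900]

def sum_where(bitmap, measures):
    """Sum the measure for every row whose bit is set in the bitmap."""
    total, row = 0, 0
    while bitmap:
        if bitmap & 1:
            total += measures[row]
        bitmap >>= 1
        row += 1
    return total

print(sum_where(bitmaps["CT"], populations))  # 5200 (1000 + 700 + 3500)

union = bitmaps["CT"] | bitmaps["NY"]         # rows in either state: 01111111
intersection = bitmaps["CT"] & bitmaps["NY"]  # no row is in both:    00000000
```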

For all the benefits this approach presents, there are some trade-offs. Inserting new data becomes more cumbersome, and when a dimension has a high cardinality (lots of different values) the bitmap index becomes full of holes. There are ways to alleviate this last problem; for example, keeping the bit array in a compressed format that still allows manipulation without expanding it is the way to go.

The idea of using bit arrays is not new; I have found papers from the late 60s on this topic. On the patent side, some compression algorithms are patented, but there are some patent-free implementations; EWAH comes to mind.
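
As a rough illustration of how a compressed bitmap can be manipulated without expanding it, here is a toy run-length scheme (my own simplification for exposition; EWAH itself works on aligned machine words and is considerably more sophisticated) that intersects two encoded bitmaps run by run:

```python
def to_runs(bitmap, n_bits):
    """Run-length encode a bitmap as a list of (bit, length) pairs."""
    runs = []
    for i in range(n_bits):
        bit = (bitmap >> i) & 1
        if runs and runs[-1][0] == bit:
            runs[-1] = (bit, runs[-1][1] + 1)
        else:
            runs.append((bit, 1))
    return runs

def and_runs(a, b):
    """Intersect two run-length encoded bitmaps of equal bit length,
    consuming runs in parallel instead of expanding to raw bits."""
    out, ai, bi = [], 0, 0
    abit, alen = a[0]
    bbit, blen = b[0]
    while True:
        step = min(alen, blen)              # overlap of the current runs
        bit = abit & bbit
        if out and out[-1][0] == bit:
            out[-1] = (bit, out[-1][1] + step)
        else:
            out.append((bit, step))
        alen, blen = alen - step, blen - step
        if alen == 0:
            ai += 1
            if ai == len(a):
                return out                  # equal lengths: both end together
            abit, alen = a[ai]
        if blen == 0:
            bi += 1
            if bi == len(b):
                return out
            bbit, blen = b[bi]

ct = to_runs(0b00100101, 7)  # CT bitmap from the example above
ny = to_runs(0b01011010, 7)  # NY bitmap
print(and_runs(ct, ny))      # [(0, 7)]: no row belongs to both states
```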

The performance gains of using a column database outweigh the trade-offs for data reporting, analysis, and mining scenarios.

This is the first part of a series of articles in which I will explain the advantages and complexities of column databases. Stay tuned.