What are the things we miss when designing a Data Warehouse?

In most cases, when we design and implement a Data Warehouse, we give the priority to requirements of business users and by doing so, we tend to miss some important things.Personally , I believe that there are two main mistakes we do when designing a data warehouse. First one is waiting till users report about performance issue to worry about performance of the DW . Second one is trying to make the DW optimized for read performance as much as possible and ignore write performance. Since these are kind of common mistakes we all do, I thought of writing a post about it.

It is very common that we ignore performance of the DW until we witness some slowness of the DW.  However, sometimes it is difficult to do performance optimizations when data volume is high and when fact tables have millions of records. Apart from that, those performance issue fixes could be time consuming and hence end user might not be able to access the system for certain period of time. It could be hours or days based on the fixes and data volume. When tables have millions of records, it take time to create/rebuild indexes, apply partitioning  and tables might get locked for that period of time. Therefore, consider best practices related to performance when you design the DW it self.

Don’t stick into default configurations

When it comes to MS SQL server, default database configuration covers most of the database requirements. However, when it comes to a DW, it is not a normal database. It servers different purpose than a normal database and therefore, stick into the default configurations might not be ideal. Personally, I believe that configurations like Fill Factor, File Groups, Log File Growth Size and Recovery Mode ( and there are much more)  must be changed according to your requirements when it comes to a DW.

For an example , Most people are not aware of what is fill factor and most cases they stick into the default fill factor of the SQL server. By default, Fill Factor is zero (or 100), which means the page file can be fully filled. But it is recommended by experts to keep fill factor between 80% – 90% to minimize page split.

Don’t only think about read performance

“I need my data warehouse to return query result as soon as possible and therefore it should be fully optimized for reading”.  This is something we always hear and It is kind of a valid statement too. However, when doing so people totally forget about write performance of the DW. Very recently, I saw the IT team of an organization struggle to  load data to their DW within a certain time frame because Insert/Update operations take huge time to complete. When the data volume of a DW become bigger and bigger, this can happen to most DW solutions unless it is carefully designed and maintained.

Don’t misusing Indexes

Most people believe that the more indexes you create in your table, better the performance is. Unfortunately , that is not true.  When creating clustered index and non-clustered indexes, there are some guidelines to be followed to obtain maximum benefit out of it. As a practice, Some people create non-clustered indexes whenever a column is used in a JOIN or a WHERE statement. I have seen large tables where VARCHAR and NVARCHAR  data type columns have been used as indexes , which is not recommended at all. Sometimes we create “Covering Indexes” whenever respective columns are used inside queries. Remember that having high number of indexes does not guarantee query performance . On the contrary, it degrade write performance of the DW.Discussing about how the indexes works and what are the best practices in creating indexes is a vast area and would require 10-20 blog post like this. What I want to emphasize in this blog post is be careful when you use indexes, if not it will create more issues than it solves.

Don’t ignoring Partitioning

It can be observed that, in most cases, partitioning has not been considered as a part of DW design and development. People believe that partitioning is only required when they experience performance issues. However, it becomes difficult to implement a partitioning mechanism when your DW grows bigger and bigger. When you don’t have a partitioning mechanism, it’s not only create performance issues when reading data, it causes issues when writing data to DW. Hence, make a dynamic partitioning mechanism as a part of your DW solution.

Keep a Backup Strategy in head

Implementing a back up strategy is the last thing we worry when we implement a DW solution. Even though it is the common approach, this might cause us problems if we have not thought about it in design stage it self. Decide your backup approach beginning and  keep it in mind when doing the development. You might have to take transaction log backups or DB backups. If your DW contains Terabytes of Data, you might have to consider backing up file groups rather that whole DW.

These are the mistakes comes to my mind right now. I will update this post whenever something else occur to me. Thanks for reading and comment if you have anything to add to this post.  Cheers !!

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s