How to fit an elephant into a fridge? (Only for Power BI Users)

Well, how many of you have faced memory exceptions when loading data into a Power BI model? How many of you have worried that a Power BI file might exceed the 1 GB limit when publishing to the service? How many of you have faced problems due to time-consuming data refreshes? A few days ago, I faced all of those problems, and this post is about how I tried to overcome them as much as possible.


What makes your PBIX file big, and what makes it throw memory exceptions? It all comes down to data and calculated columns. Power BI uses a columnar, in-memory database, so the tabular engine tries to compress data as much as possible. It achieves the best compression when a column has a low number of unique values (low cardinality); when the number of unique values is high (high cardinality), the column takes more space both in memory and on disk. Let us see what we can do to use this precious memory as effectively as possible.

1. Remove all unwanted columns within Power Query Editor

First things first. When you load data into your Power BI report, try to filter out all unwanted columns. Most source tables and datasets have lots of fields coming from source systems, and many of those fields might not be meaningful for any analysis. You can either exclude those columns when you write your select query, or use the Power Query Editor to remove them before you load the data into the model.

Remember that hidden columns still take space in your model. Rather than hiding columns within Power BI, consider removing them in the Power Query Editor.
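
If you go the select-query route, the simplest habit is to name only the columns the report needs instead of using SELECT *. A minimal sketch, with hypothetical table and column names:

```sql
-- Name only the columns the report actually needs instead of SELECT *
-- (table and column names here are hypothetical).
SELECT
    OrderID,
    OrderDate,
    CustomerID,
    SalesAmount
FROM Sales.Orders;
```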

2. Remove unwanted rows

This may seem obvious, but how often do we actually filter out unwanted rows when loading data into a model? Analyze your data, identify invalid data scenarios, historical data that is not meaningful to business users, etc., and filter those records out in your query or in the Power Query Editor.
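
For instance, the filter can be pushed into the source query itself. A sketch with hypothetical names and business rules (here, only completed orders from the last three years):

```sql
-- Keep only the rows that matter for the analysis
-- (names and business rules here are hypothetical).
SELECT
    OrderID,
    OrderDate,
    CustomerID,
    SalesAmount
FROM Sales.Orders
WHERE OrderStatus = 'Completed'                                 -- drop invalid/cancelled records
  AND OrderDate >= DATEADD(YEAR, -3, CAST(GETDATE() AS date));  -- drop history nobody reports on
```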

3. Avoid high cardinality columns

A single high-cardinality column in one of your biggest tables can take 75-80% of the total space of the model. Keep an eye on text fields such as email addresses, usernames and comments within large tables, as those are the main culprits inside your model. Try to remove them within the Power Query Editor.

Another option is to break a high-cardinality column into two low-cardinality columns. For example, if you have a column with invoice numbers like “INOP:76857”, you can break this into two columns, one with “INOP:” and the other with “76857”, as in the sketch below.
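
A rough select-query version of that split, assuming (hypothetically) that every invoice number follows the same “prefix:number” pattern:

```sql
-- Split 'INOP:76857' into a low-cardinality prefix and a numeric sequence
-- (table/column names are hypothetical; assumes ':' is always present).
SELECT
    LEFT(InvoiceNumber, CHARINDEX(':', InvoiceNumber))            AS InvoicePrefix,   -- 'INOP:'
    CAST(SUBSTRING(InvoiceNumber,
                   CHARINDEX(':', InvoiceNumber) + 1, 20) AS int) AS InvoiceSequence  -- 76857
FROM Sales.Invoices;
```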

4. Avoid large table to large table joins inside the model

Let’s say you have two large tables and you need to join them within the model. As an example, if you have an Order Header table and an Order Detail table and both have millions of records, these two tables take space within your model separately. Rather than bringing both tables into the model and then creating a relationship, you can join the two tables within your select query and bring the result in as one dataset. This way you save the space needed for the primary and foreign keys, which are obviously high-cardinality columns.
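
A minimal sketch of that source-side join, with hypothetical OrderHeader and OrderDetail tables:

```sql
-- Flatten header and detail into one result set at the source,
-- so the high-cardinality key columns never have to be loaded into the model
-- (table/column names are hypothetical).
SELECT
    h.OrderDate,
    h.CustomerID,
    d.ProductID,
    d.Quantity,
    d.LineAmount
FROM Sales.OrderHeader AS h
JOIN Sales.OrderDetail AS d
    ON d.OrderID = h.OrderID;
```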

5. Date-Time/Time fields

Datetime fields, as well as time fields, can easily be high-cardinality columns within a large table. To overcome that, you can break a datetime column into two columns: one for the date part and one for the time part.

Additionally, if you can compromise on the precision of that data, you can round the time field to the nearest 15 minutes or 1 hour based on your analytical needs. That way you change a high-cardinality column into a low-cardinality one.
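
One way to do both in the source query (hypothetical column names; this version floors each time to the start of its 15-minute bucket rather than rounding to the nearest one):

```sql
-- Split a datetime into a date column and a time column,
-- and reduce the time to 15-minute buckets to cut its cardinality
-- (column names are hypothetical).
SELECT
    CAST(OrderTimestamp AS date)                              AS OrderDate,
    CAST(DATEADD(MINUTE,
                 DATEDIFF(MINUTE, 0, OrderTimestamp) / 15 * 15,
                 0) AS time(0))                               AS OrderTime15Min
FROM Sales.Orders;
```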

6. Remove/hash unique identifiers

Unique identifiers in source systems always cause issues for me. I personally don’t know why people use a unique identifier as the primary key of a table. When you have a large fact table with a GUID field, it can take up 70-80% of the total memory used by that table. If possible, exclude it before loading the data.

Another thing you can try is to hash unique identifiers into integers. You can use a function like BINARY_CHECKSUM for this conversion. However, it must be a deterministic hash function to avoid any data issues.
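
A sketch of that idea in SQL Server, hashing a hypothetical GUID column; note that checksum-style functions can collide, so verify uniqueness on your own data before relying on the result as a key:

```sql
-- Replace a 16-byte GUID key with a 4-byte integer surrogate.
-- BINARY_CHECKSUM is deterministic for the same input, but collisions are
-- possible, so check uniqueness before using the result as a key
-- (table/column names are hypothetical).
SELECT
    BINARY_CHECKSUM(CustomerGuid) AS CustomerHashKey,
    OrderDate,
    SalesAmount
FROM Sales.Orders;
```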

7. Avoid calculated columns

It is not only data: calculated columns also take a lot of space in your model. If you have multiple calculated columns, some of which may only act as intermediate steps to create another column, consider removing those columns. Consider replacing a calculated column with a measure whenever possible; you can use iterating DAX functions, such as SUMX, to get row context inside the measure.

8. Aggregations and Composite Model

If there is a possibility, consider using Power BI aggregations to bring only summarized data into the model and keep the detail data in the source system. That way you can keep your model small without compromising the analysis you want to perform in the report.
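
For example, the in-memory aggregation table could be fed by a summary query like the (hypothetical) one below, while the detail table stays in the source via DirectQuery:

```sql
-- Pre-aggregated rows for the in-memory aggregation table;
-- the row-level detail stays behind in the source system
-- (table/column names are hypothetical).
SELECT
    CAST(OrderTimestamp AS date) AS OrderDate,
    ProductID,
    SUM(SalesAmount)             AS TotalSales,
    COUNT(*)                     AS OrderCount
FROM Sales.Orders
GROUP BY
    CAST(OrderTimestamp AS date),
    ProductID;
```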

9. Incremental refresh

Now that the incremental refresh feature is available for Pro users, you can stop fully refreshing large transaction tables in Power BI. It reduces refresh time and avoids putting unwanted strain on source systems.

10. Use Vertipaq Analyzer/DAX Studio to understand your data model

Vertipaq Analyzer and DAX Studio allow you to analyze your data model and give you lots of details, such as data size, data type and cardinality, for each field in a table. With that information you can understand which tables/columns use more memory in the model and act based on that.


Vertipaq Analyzer: https://www.sqlbi.com/tools/vertipaq-analyzer/

DAX Studio: https://daxstudio.org/

Well, that is it for this post. If you have any other ideas, please comment so that others can benefit from them. Thank you for reading, and cheers!
