Reducing the Number of Metadata Queries Sent from Power BI to Synapse

Last week I faced an interesting yet very challenging problem. One of the clients I was working with was facing a serious problem with Power BI, and I was asked to provide some assistance to troubleshoot it. The client uses Azure Synapse as their data warehousing solution and Power BI as the reporting and analytical solution. Given their data volume and the complexity of the processing in Synapse, DWU300 had been more than enough for them for quite a long period. However, all of a sudden, business users and the data engineering team started to complain that they frequently couldn't connect to Synapse: Synapse was rejecting incoming connections to the data warehouse because it had already exceeded its limit of 512 concurrent open connections (sessions). Knowing that they didn't have that many users or complex reports, the client wanted to know what was happening with Synapse.

Luckily for me, the client already knew which dataset was generating so many sessions within Synapse. This dataset had only 42 queries: 40 against Synapse and 2 against SharePoint. The first thing I did was investigate what happens in Synapse when that dataset gets refreshed. I used the SQL query below to identify the number of requests and sessions generated by Power BI.

select r.*, s.*
from [sys].[dm_pdw_exec_requests] r
inner join [sys].[dm_pdw_exec_sessions] s on s.session_id = r.session_id
where s.[status] in ('Active', 'Idle')
  and s.[app_name] like '%Mashup Engine%'

I was surprised by the outcome of the query. There were 625 sessions and 1,250 SQL requests coming from Power BI in just one refresh. The client was already on DWU1000 just to get 1,024 sessions. In any other source system this would not have been an issue, but in Synapse it is a huge problem.

While there were many problems, such as no query folding, reference tables being loaded into the model, and complex M transformations, what surprised me most was the number of metadata-related queries generated by Power BI. Out of all 1,250 queries, 625 were "USE [DataWarehouse]" statements. Of the remaining 625 queries, only 75 were actual data queries; the rest were metadata queries generated by Power BI. Even those 75 included duplicates generated by Power BI for referenced queries, which I knew I could fix by optimising the data model. However, when 70% of the sessions were created by metadata queries, I was pulling my hair out wondering what to do about those sessions and queries. In other words, Power BI should only need 75 sessions for the 75 queries it generates, but it creates 625 sessions just to accommodate these metadata queries 😦

Let's replicate this problem by taking one table as an example. The following are the queries generated by Power BI when loading a single table. In this case it creates 14 queries and 7 sessions. If this table is then referenced inside another query, the same set of queries is generated all over again by Power BI, which means 14 sessions just to load one simple table. I was really frustrated, wondering why Power BI can't reuse the metadata it already retrieved instead of repeating the work for every table from the same data source. Then again, I'm not here to tell the Power BI team what they should be doing.

USE [Contoso DW]
select @@version + ' EngineEditon: ' + cast(SERVERPROPERTY('EngineEdition') as varchar(4))
USE [Contoso DW]
select t.[TABLE_CATALOG], t.[TABLE_SCHEMA], t.[TABLE_NAME], t.[TABLE_TYPE], tv.create_date [CREATED_DATE], tv.modify_date [MODIFIED_DATE], cast(e.value as nvarchar(max)) [DESCRIPTION] from [INFORMATION_SCHEMA].[TABLES] t join sys.schemas s on s.name = t.[TABLE_SCHEMA] join sys.objects tv on tv.name = t.[TABLE_NAME] and tv.schema_id = s.schema_id and tv.parent_object_id = 0 left outer join sys.extended_properties e on tv.object_id = e.major_id and e.minor_id = 0 and e.class = 1 and e.name = 'MS_Description'
USE [Contoso DW]
select r.[ROUTINE_SCHEMA], r.[ROUTINE_NAME], r.[ROUTINE_TYPE], p.create_date [CREATED_DATE], p.modify_date [MODIFIED_DATE], cast(e.value as nvarchar(max)) [DESCRIPTION] from [INFORMATION_SCHEMA].[ROUTINES] r join sys.schemas s on s.name = r.[ROUTINE_SCHEMA] join sys.objects p on p.name = r.[ROUTINE_NAME] and p.schema_id = s.schema_id and p.parent_object_id = 0 left outer join sys.extended_properties e on p.object_id = e.major_id and e.minor_id = 0 and e.class = 1 and e.name = 'MS_Description'
USE [Contoso DW]
select s.name [TABLE_SCHEMA], o.name [TABLE_NAME], c.name [COLUMN_NAME], cast(c.column_id as bigint) [ORDINAL_POSITION], c.is_nullable [IS_NULLABLE], case when (t.is_user_defined = 0 and t.name is not null) then t.name when (c.system_type_id = 240 or t.name is null) then 'udt' else t_system.name end [DATA_TYPE], case when (c.system_type_id in (59, 62)) then 2 when (c.system_type_id in (48, 52, 56, 60, 104, 106, 108, 122, 127)) then 10 else null end [NUMERIC_PRECISION_RADIX], c.precision [NUMERIC_PRECISION], case when (c.system_type_id in (59, 62)) then null else c.scale end [NUMERIC_SCALE], case when (c.system_type_id in (40, 41, 42, 43, 58, 61)) then c.scale else null end [DATETIME_PRECISION], case when (c.system_type_id in (231, 239)) then floor(c.max_length / 2) when (c.system_type_id in (165, 167, 173, 175)) then c.max_length else null end [CHARACTER_MAXIMUM_LENGTH], cast(e.value as nvarchar(max)) [DESCRIPTION], d.definition [COLUMN_DEFAULT], cc.definition [COLUMN_EXPRESSION], case when c.is_identity = 1 or c.is_computed = 1 or t.system_type_id = 189 then 0 else 1 end [IS_WRITABLE] from sys.objects o join sys.schemas s on s.schema_id = o.schema_id join sys.columns c on o.object_id = c.object_id left join sys.types t on c.user_type_id = t.user_type_id left join sys.types t_system on t.system_type_id = t_system.user_type_id left join sys.default_constraints d on d.object_id = c.default_object_id left join sys.computed_columns cc on c.object_id = cc.object_id and c.column_id = cc.column_id left join sys.extended_properties e on o.object_id = e.major_id and c.column_id = e.minor_id and e.class = 1 and e.name = 'MS_Description' where 1=1;
USE [Contoso DW]
select convert(nvarchar, fk.object_id) [FK_NAME], cast(f.constraint_column_id as bigint) [ORDINAL], s1.name [TABLE_SCHEMA_1], o1.name [TABLE_NAME_1], c1.name [PK_COLUMN_NAME_1], s2.name [TABLE_SCHEMA_2], o2.name [TABLE_NAME_2], c2.name [PK_COLUMN_NAME_2], f.constraint_object_id, f.constraint_column_id from sys.foreign_key_columns f join sys.foreign_keys fk on f.constraint_object_id = fk.object_id join sys.objects o1 on o1.object_id = f.parent_object_id join sys.schemas s1 on s1.schema_id = o1.schema_id join sys.objects o2 on o2.object_id = f.referenced_object_id join sys.schemas s2 on s2.schema_id = o2.schema_id join sys.columns c1 on c1.object_id = o1.object_id and c1.column_id = f.parent_column_id join sys.columns c2 on c2.object_id = o2.object_id and c2.column_id = f.referenced_column_id where 1=1 union select convert(nvarchar, fk.object_id) [FK_NAME], cast(f.constraint_column_id as bigint) [ORDINAL], s1.name [TABLE_SCHEMA_1], o1.name [TABLE_NAME_1], c1.name [PK_COLUMN_NAME_1], s2.name [TABLE_SCHEMA_2], o2.name [TABLE_NAME_2], c2.name [PK_COLUMN_NAME_2], f.constraint_object_id, f.constraint_column_id from sys.foreign_key_columns f join sys.foreign_keys fk on f.constraint_object_id = fk.object_id join sys.objects o1 on o1.object_id = f.parent_object_id join sys.schemas s1 on s1.schema_id = o1.schema_id join sys.objects o2 on o2.object_id = f.referenced_object_id join sys.schemas s2 on s2.schema_id = o2.schema_id join sys.columns c1 on c1.object_id = o1.object_id and c1.column_id = f.parent_column_id join sys.columns c2 on c2.object_id = o2.object_id and c2.column_id = f.referenced_column_id where 1=1 order by f.constraint_object_id, f.constraint_column_id;
USE [Contoso DW]
select s.name [TABLE_SCHEMA], o.name [TABLE_NAME], i.name [INDEX_NAME], s.name [TABLE_SCHEMA], o.name [TABLE_NAME], cc.name [COLUMN_NAME], cast(ic.key_ordinal as bigint) [ORDINAL_POSITION], i.is_primary_key [PRIMARY_KEY] from sys.objects o join sys.schemas s on s.schema_id = o.schema_id join sys.indexes as i on i.object_id = o.object_id join sys.index_columns as ic on ic.object_id = i.object_id and ic.index_id = i.index_id join sys.columns as cc on ic.column_id = cc.column_id and ic.object_id = cc.object_id where (i.is_primary_key = 1 or i.is_unique_constraint = 1 or i.is_unique = 1) and o.type in ('U', 'V') and ic.key_ordinal <> 0 and 1=1 order by i.name, s.name, o.name;
USE [Contoso DW]
select [$Table].[AccountKey] as [AccountKey], [$Table].[ParentAccountKey] as [ParentAccountKey], [$Table].[AccountLabel] as [AccountLabel], [$Table].[AccountName] as [AccountName], [$Table].[AccountDescription] as [AccountDescription], [$Table].[AccountType] as [AccountType], [$Table].[Operator] as [Operator], [$Table].[CustomMembers] as [CustomMembers], [$Table].[ValueType] as [ValueType], [$Table].[CustomMemberOptions] as [CustomMemberOptions], [$Table].[ETLLoadID] as [ETLLoadID], [$Table].[LoadDate] as [LoadDate], [$Table].[UpdateDate] as [UpdateDate] from [dbo].[DimAccount] as [$Table]

As you can see, only the last query returns any data; the rest are there to retrieve metadata about the Synapse data warehouse. Let's focus on how I loaded the data within the M query. Just like most of us do, the first step defines the source and the next step navigates to the required table, in this case the "DimAccount" table.

If we check the M query for this, it looks roughly as follows.
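A minimal sketch of such a navigation-based query; the server and database names below are placeholders, not the client's real ones:

let
    // Connect to the Synapse dedicated SQL endpoint (placeholder server and database names)
    Source = Sql.Database("contoso.sql.azuresynapse.net", "Contoso DW"),
    // Navigate to the DimAccount table; this navigation step is what drives the metadata queries above
    dbo_DimAccount = Source{[Schema = "dbo", Item = "DimAccount"]}[Data]
in
    dbo_DimAccount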

I felt that all these metadata queries were generated by these two steps, because they tell Power BI to go and search Synapse to find the DimAccount table. Why should I do that? I can tell Power BI to simply bring the data from the SQL endpoint using a SELECT statement, without doing any searching.

Therefore, I changed my M query along the lines shown below and checked which queries Power BI creates when the data is refreshed.
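A minimal sketch of that change, assuming the native SELECT statement is passed through the Query option of Sql.Database (server and database names are placeholders):

let
    // Passing a native SELECT through the Query option skips the navigation step,
    // but it also breaks query folding for any steps added on top of this one
    Source = Sql.Database(
        "contoso.sql.azuresynapse.net",
        "Contoso DW",
        [Query = "SELECT * FROM [dbo].[DimAccount]"]
    )
in
    Source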

Luckily, it only creates 3 sessions instead of 7, yet it executes the same data query twice: once for the actual data and once to get the metadata Power BI requires. And, most importantly, it breaks "Query Folding"!

USE [Contoso DW]
select @@version + ' EngineEditon: ' + cast(SERVERPROPERTY('EngineEdition') as varchar(4))
USE [Contoso DW]
SELECT * FROM [dbo].[DimAccount]
USE [Contoso DW]
SELECT * FROM [dbo].[DimAccount]

Therefore, I looked for help from the one and only Chris Webb and saw that I can still keep query folding if I use the Value.NativeQuery function.

This time, I changed the query as shown below and checked which queries are sent from Power BI to Synapse.
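A sketch of the Value.NativeQuery version; the EnableFolding option is what keeps folding alive, and the server and database names are again placeholders:

let
    Source = Sql.Database("contoso.sql.azuresynapse.net", "Contoso DW"),
    // Value.NativeQuery with EnableFolding = true lets later steps fold back into the native SQL
    DimAccount = Value.NativeQuery(
        Source,
        "SELECT * FROM [dbo].[DimAccount]",
        null,
        [EnableFolding = true]
    )
in
    DimAccount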

Surprisingly, the second query, which actually returned data in the previous method, no longer returns any data because it is wrapped in a SELECT TOP 0 * statement, and query folding is back!

USE [Contoso DW]
select @@version + ' EngineEditon: ' + cast(SERVERPROPERTY('EngineEdition') as varchar(4))
USE [Contoso DW]
select top 0 * from ( SELECT * FROM [dbo].[DimAccount] ) as [_]
USE [Contoso DW]
SELECT * FROM [dbo].[DimAccount]

As you can see, with this simple method I reduced 7 sessions to 3. I wish I could do more and cut it down to 1 session, or at least 2, but Power BI still needs metadata about the table and the Synapse DW version, so reducing it further is probably not possible. Given that the same table can be referenced multiple times, and many tables like that together create hundreds of sessions in Synapse, I think this is good enough for me.

Update: Note that when I use this query as a reference query, Power BI introduces another metadata query that fetches the top 1,000 rows required for its internal optimisations. In this case I added two steps: the first removes a column and the second groups the data (a rough sketch follows). At the end of the two steps, I can see that query folding is still in place.
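A rough sketch of such a referencing query, assuming the base query above is saved as a query named DimAccount; the removed column and the grouping column are illustrative choices that line up with the generated SQL shown further down:

let
    // Reference the base DimAccount query built with Value.NativeQuery above
    Source = DimAccount,
    // Step 1: remove a column (ETLLoadID is used here purely as an example)
    #"Removed Columns" = Table.RemoveColumns(Source, {"ETLLoadID"}),
    // Step 2: group the rows; this folds into the GROUP BY query that Synapse receives
    #"Grouped Rows" = Table.Group(
        #"Removed Columns",
        {"AccountType"},
        {{"Count", each Table.RowCount(_), Int64.Type}}
    )
in
    #"Grouped Rows"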

This time it creates 4 sessions and 8 queries, as shown below. If I do exactly the same thing the very first way, using navigation, it generates 16 queries and 8 sessions: double the number of sessions! There may be other ways to optimise this further, but for the moment this gives me some breathing space. 🙂

USE [Contoso DW]
select @@version + ' EngineEditon: ' + cast(SERVERPROPERTY('EngineEdition') as varchar(4))
USE [Contoso DW]
select top 0 * from ( SELECT * FROM [dbo].[DimAccount] ) as [_]
USE [Contoso DW]
select top 1000 [$Table].[AccountKey] as [AccountKey], [$Table].[ParentAccountKey] as [ParentAccountKey], [$Table].[AccountLabel] as [AccountLabel], [$Table].[AccountName] as [AccountName], [$Table].[AccountDescription] as [AccountDescription], [$Table].[AccountType] as [AccountType], [$Table].[Operator] as [Operator], [$Table].[CustomMembers] as [CustomMembers], [$Table].[ValueType] as [ValueType], [$Table].[CustomMemberOptions] as [CustomMemberOptions], [$Table].[ETLLoadID] as [ETLLoadID], [$Table].[LoadDate] as [LoadDate], [$Table].[UpdateDate] as [UpdateDate] from ( SELECT * FROM [dbo].[DimAccount] ) as [$Table]
USE [Contoso DW]
select [rows].[AccountType] as [AccountType], count(1) as [Count] from ( select [AccountType] from ( SELECT * FROM [dbo].[DimAccount] ) as [$Table] ) as [rows] group by [AccountType]

Thank you very much for reading and stay safe!


Azure Data Factory: Best Practices for Security

Data is the most important asset of an organization! Safeguarding organizational data is of paramount importance, and companies spend millions of dollars on it. In this post, I try to cover how to configure Azure Data Factory in a secure way to protect the data it processes. In the diagram below, I lay out my way of securing organizational data when Azure Data Factory is used in a data integration solution. Depending on your organizational requirements, you might or might not be able to use all the components I have listed in this post.

One thing to remember is that in this post I will not go into detail on any of these features. Nevertheless, I will add the Microsoft documentation related to each section so that you can read more about it.

Figure: A secured way of implementing Azure Data Factory

Multiple ADF Instances

Designing Azure resource groups and Azure resources properly is very important for an ADF implementation; a proper resource design determines how secure your data is in the cloud. The first thing to consider is not to share an ADF instance across multiple projects or multiple environments. Having multiple ADF instances does not cost you anything extra. On the other hand, it helps you restrict access to each environment to only specific people. For example, if you have projects called XYZ and ABC, create separate resource groups named XYZ-Dev, XYZ-Test, XYZ-Prod, ABC-Dev and so on. Within each resource group, you can create a separate ADF instance. That way you can control access to storage accounts, key vaults and ADF instances and grant the least possible access.

Figure: Creating different RGs and resources for a project

ExpressRoute or Point-to-Site / Site-to-Site VPN

In most ADF implementations, ADF uses a Self-Hosted Integration Runtime (SHIR) to connect to the on-premises network and transfer data from on-premises data sources into a staging environment or the final destination in the cloud. If you have not set up an ExpressRoute or a VPN, that data is transferred via the public internet. Although the data is transferred using TLS over TCP and HTTPS, the most secure option is to create a private connection between the on-premises network and Azure, using ExpressRoute or an IPSec VPN between the corporate network and Azure.

You can read more about this in the Microsoft links below:

How SHIR should be configured: https://docs.microsoft.com/en-us/azure/data-factory/data-movement-security-considerations/?WT.mc_id=DP-MVP-5004277

How to create express route: https://docs.microsoft.com/en-us/azure/expressroute/expressroute-howto-circuit-portal-resource-manager/?WT.mc_id=DP-MVP-5004277

How to create Site to Site VPN:https://docs.microsoft.com/en-us/azure/vpn-gateway/vpn-gateway-howto-site-to-site-classic-portal/?WT.mc_id=DP-MVP-5004277

How to create Point to Site VPN: https://docs.microsoft.com/en-us/azure/vpn-gateway/vpn-gateway-howto-point-to-site-classic-azure-portal/?WT.mc_id=DP-MVP-5004277

Figure: Configure Express Route with ADF
Figure: Configure IPsec VPN with ADF

Configure Corporate Firewall Rules

If your organization does not have an ExpressRoute or IPSec VPN configured, you can use the corporate firewall to restrict access from outside and make sure only the ADF control channel can communicate with the on-premises network. You can get the domain names used by Azure Data Factory by going to your SHIR in ADF and clicking "View Service URLs". It provides a domain name list, as shown in the image below, and you can add those fully qualified domain names to your corporate firewall's allow list. Note that this is not the only list of domain names you will have to allow. Read the links below on setting up firewall rules to understand what other steps you need to take.

Read more on setting up firewall rules: https://docs.microsoft.com/en-us/azure/data-factory/create-self-hosted-integration-runtime?tabs=data-factory#get-url-of-azure-relay/?WT.mc_id=DP-MVP-5004277

More on getting Azure Relay FQDNs: https://docs.microsoft.com/en-us/azure/data-factory/create-self-hosted-integration-runtime?tabs=data-factory#get-url-of-azure-relay/?WT.mc_id=DP-MVP-5004277

Figure: Getting FQDNs list for SHIR
Figure: Example of Azure Relay FQDNs

Managed Private Endpoints

Managed Private Endpoints (MPE) are still in preview in ADF, but for me this is huge when it comes to security. When you want to secure your data in the cloud, the main thing you do is encapsulate those resources within a VNet and disable connectivity from the public internet. In that case, if you don't use MPE, ADF will not be able to connect to those sources. If you enable MPE for the integration runtime, ADF creates a private endpoint connection between the sources and ADF, and data does not flow through the shared Azure network.

While you can set this up for any new integration runtime you create, you can also enable it for the AutoResolve integration runtime while creating the ADF instance.

Read more on ADF Managed Private end points: https://docs.microsoft.com/en-us/azure/data-factory/managed-virtual-network-private-endpoint?WT.mc_id=Portal-Microsoft_Azure_CreateUIDef/?WT.mc_id=DP-MVP-5004277

Figure: Set up managed private end point for auto resolve IR

IP Whitelisting in Azure Resources

If your cloud data sources are not protected using a VNet, the next available option is IP whitelisting, which only allows traffic coming from the Azure integration runtime to communicate with the data sources. However, note that these IP addresses differ based on the integration runtime location. In the case of the Azure AutoResolve IR this gets tricky, as we don't have control over the integration runtime's location. Microsoft has published a document with the fixed IP address ranges used by each Azure integration runtime; these IP addresses vary based on the data centre in which you create the Azure IR. Also note that this static IP address list might be updated, so you may want to keep an eye on it on a regular basis.

Note that in the case of storage accounts, if the IR and the storage account are in the same region, these IP rules will not be effective.

You can download the JSON file that contains the IP addresses of Data Factory and add those to your network allow list.

Download link: https://docs.microsoft.com/en-us/azure/virtual-network/service-tags-overview#discover-service-tags-by-using-downloadable-json-files/?WT.mc_id=DP-MVP-5004277

Figure: Example IP list for Australia East Data Factory

Managed Identity

When ADF connects to cloud data sources, there are multiple ways to authenticate. While the easiest and most common practice is to use a user name and password, or tokens/keys, it is also the least secure way to connect to your data. Whenever the data source supports it, use Managed Identity to connect to cloud data sources. For each ADF instance, Azure creates a managed identity with the name of the ADF instance. You can use it to authenticate by granting RBAC access to resources with the least required permissions. For example, if you want to read data from a blob container, add the ADF managed identity to a reader role on the respective blob container.

More on ADF Managed Identity: https://docs.microsoft.com/en-us/azure/data-factory/data-factory-service-identity/?WT.mc_id=DP-MVP-5004277

Azure Key Vault

When data sources do not support Managed Identity, you might have to use passwords or keys for authentication. In such cases, don't use this information in plain text or hard-code it in ADF. Always store credentials in a Key Vault and reference it within ADF. One key mistake people make is maintaining a single Key Vault with multiple keys for different environments. That compromises security and also makes it difficult to move ADF artifacts between environments.

More on using Key Vault to store credentials for ADF: https://docs.microsoft.com/en-us/azure/data-factory/store-credentials-in-key-vault/?WT.mc_id=DP-MVP-5004277

Bring your own key

Unless you configure your own key, ADF uses system-generated keys to encrypt any information related to ADF, such as connection information, metadata and any cached data. While this is secure enough for most cases, you can always use your own keys, stored in an Azure Key Vault, to encrypt Azure Data Factory.

More on bring your own key: https://docs.microsoft.com/en-us/azure/data-factory/enable-customer-managed-key/?WT.mc_id=DP-MVP-5004277

Figure: Setting up bring your own key while creating an ADF instance.

Encryption at Rest

Encryption at rest is not specific to ADF, which is why I kept it for last, but it is very important when it comes to data security. ADF does not store any data within itself except for caching requirements. Therefore, when data is stored in a staging environment or the final destination, make sure to use the encryption capabilities that come with the respective storage type. For example, for blob storage accounts, you can enable encryption at rest and even choose to use your own keys for that.

I will keep updating this post with other security features we can consider for a secure ADF implementation. Feel free to add your ideas as comments. Thanks for reading and stay safe!

How to implement a branching strategy for Azure Data Factory?

Using source control and branching is not a new concept for anyone with some exposure to software engineering. However, data engineering and software engineering rarely cross paths, and I have seen some ETL developers and data engineers struggle to understand and use branching properly within their code repositories. Although data engineers configure a source control system, such as an Azure DevOps Git repo, within ADF, I have often seen them use only the default main branch for all development. Therefore, in this post, I try to cover the most common branching strategies I have come across and the pros and cons of each. One thing to remember is that although ADF now supports automated publishing, depending on how you implement branching you might or might not be able to use that feature. To learn more about automated publishing, refer to the link below.

Automated Publish in ADF: https://asankap.wordpress.com/2021/03/02/automated-publish-in-azure-data-factory-using-devops/

Okay, now let's have a look at different branching strategies used with Azure Data Factory.

Approach 1: Main branch only

This is the simplest way. When you configure a repository in ADF, by default it creates the main branch and the adf_publish branch. You can use the main branch for all development, and once it's ready to move to a higher environment (test or prod in this case), you manually publish the ADF artifacts by clicking the "Publish" button in ADF. During the publish process, ADF creates ARM template files and puts them in the adf_publish branch. You can then use those ARM templates for an ARM template deployment within Azure DevOps as part of your CI/CD implementation.

CI/CD for Azure Data Factory: https://docs.microsoft.com/en-us/azure/data-factory/continuous-integration-deployment/?WT.mc_id=DP-MVP-5004277

Pros and Cons with this approach:

  • Easy to implement
  • Only suitable for small implementations
  • Cannot be used when multiple engineers develop simultaneously
  • Cannot use automated publishing, as CI and PR triggers are unavailable
  • Difficult for continuous development and issue fixing, as the higher environments and the main branch are in different states at any given time

Approach 2: Main Branch as Collaboration branch

In this approach, the main branch is used as the collaboration branch and each developer has their own branch to do the development. Any developer joining the team creates their own branch from the main branch to get the latest ADF artifacts and start development. Once development is done, he/she makes a pull request to merge the changes back to the main branch. Once all the changes are in the main branch, either a manual publish or an automated publish can be triggered. If using automated publishing, the release pipeline should make sure that only selected pull requests trigger a release to a higher environment.

  • Multiple developers can work simultaneously
  • Difficult to use with automated publishing, as multiple developers update the collaboration branch (main branch) at different times
  • Cannot break development into features; all development must be released as one unit

Approach 3: Feature branches as Collaboration branches

In this approach, multiple teams can work in the same ADF instance to develop different features. Team members working on a specific feature create their own branches from the feature branch, which is itself created from the main branch. Once a developer is ready, he/she creates a PR (pull request) to merge their work into the feature branch. When the feature is ready to go to a higher environment, it is moved to the main branch using a PR. If required, a feature branch can be configured for automated deployment; that way feature releases can be created rather than waiting for the entire development effort to finish. If not, the main branch can be used for either manual or automated publishing, as indicated in the diagram below.

Pros and cons of this approach:

  • Multiple teams and developers can work on different features simultaneously
  • Individual features can go to higher environments without any dependency on other developers' work
  • More complex branching and release management

Approach 4: Main branch as Production equivalent branch

This approach is an extension of approach 3. Here, the main branch is maintained as a production equivalent. In other words, releases are done using a different collaboration branch, and once test and UAT are completed and the artifacts are moved into production, the collaboration branch is merged into the main branch to keep it in sync with the production environment. That way, if something goes wrong in prod, you always have the production code at hand, and it's just a matter of cloning the code and creating a new ADF instance to troubleshoot the issue.

Pros and cons of this approach:

  • Easy to fix production issues
  • Production code is safe and always at hand
  • Complex implementation

How you create branches for ADF depends purely on how big your ADF implementation is and how big the development team is. There are no hard rules saying you should create branches one way or another. In this post, I wanted to show you a couple of ways you can implement branching in your ADF projects and how each approach handles the different problems we face in release management. If you have implemented branching differently to manage other scenarios, please feel free to comment and share it with others. Thank you for reading and stay safe.

Power BI Tips: Struggling to work with large models on your PC? This could help you

Working with large data volumes in Power BI Desktop can be tricky if you don't have a high-spec machine. When you use Import mode to create models in Power BI Desktop, it loads all the data into memory and uses the CPU to process any changes you make to the model. In that case, if you have millions of records, there is a high possibility that your PC will struggle for resources, and sometimes the Power BI Desktop application can crash as well. In this post I'm going to show a simple trick that helps you reduce the number of records in the Power BI model on the desktop while keeping all the records in the Power BI service. That way you don't have to worry about your PC configuration and can keep doing development on a low-end PC.

To demonstrate, I have used Contoso Retail DW and created a simple star schema using FactSales and its related dimensions. The FactSales table has close to 3.5 million records, and I struggle to load all of them into the PBIX file on my development PC as there is not enough memory and CPU power available.

Therefore, what I'm trying to do is minimise the number of records I work with in the PBIX file, while making sure it still works without any issue in the Power BI service once deployed. First, I create a parameter called "RecordCount" in the Power Query Editor and set the value to 2000. Note that I have set the type as Text.
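Behind the scenes, a parameter is just another query. Opened in the Advanced Editor, the RecordCount parameter looks roughly like this single M expression (the exact metadata record Power BI generates may differ slightly):

"2000" meta [IsParameterQuery = true, Type = "Text", IsParameterQueryRequired = true]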

Then I open the FactSales table in the Power Query Editor and click the top-left corner to open the transformations menu. In the menu, click "Keep Top Rows" and set the number of rows to 2000. This filters the table and only loads the top 2000 records into the PBIX file.

Next, right-click the FactSales table in the Queries panel on the left and click "Advanced Editor". This opens the Advanced Editor, where I modify the "Kept First Rows" step and change it to an if statement. The logic is: if the value of the parameter is "ALL", get all the records from the FactSales table; if not, get the first N records based on the value set for the "RecordCount" parameter. A sketch of the modified query follows.
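A minimal sketch of the full query after the change, assuming the table comes straight from a Sql.Database navigation step (the server, database and schema names are placeholders):

let
    Source = Sql.Database("your-sql-server.database.windows.net", "ContosoRetailDW"),
    FactSales = Source{[Schema = "dbo", Item = "FactSales"]}[Data],
    // Keep every row when the parameter is "ALL" (as set later in the service),
    // otherwise keep only the first N rows for development on the desktop.
    // RecordCount is a text parameter, so it is converted to a number here.
    #"Kept First Rows" =
        if RecordCount = "ALL" then
            FactSales
        else
            Table.FirstN(FactSales, Number.FromText(RecordCount))
in
    #"Kept First Rows"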

The next important thing I want to check is whether my change affects query folding in Power Query. Therefore, I right-click on the "Kept First Rows" step and check "View Native Query". There seems to be no issue with the native query.

When I check the row count on the Power BI visual side, I can verify that it only loads the number of rows I specified in the parameter. Then I can modify the model and create the required measures, columns and visuals without Power BI Desktop consuming too many resources on my PC.

Once all the development is done in Power BI Desktop, I publish the Power BI file to the service. However, in the service I don't want only 2000 rows from the FactSales table; I want all the records in the fact table. For that, click the three dots next to the dataset in the service and select "Settings" from the menu.

This opens the settings area of the dataset, where there is an option called "Parameters". Change the value of the parameter to "ALL". Remember the if statement I used in the Advanced Editor: that is what allows me to get all the data from the fact table when I refresh the dataset now.

When I refresh the dataset in the Power BI service and check the row count now, it shows that the model has all the source data, not the 2000 records in the Power BI file on my PC.

This way you can create Power BI models with large data volumes without worrying about the configuration of your PC. Thanks for reading and stay safe! Cheers.

How to transition users from an existing reporting solution to Power BI?

With Power BI becoming more and more popular, many organizations are trying to move from their current reporting platform to a Power BI based solution. However, the success of this migration does not depend mainly on the features of Power BI; the most important thing is how to make Power BI adoption succeed within the organization. You can find lots of articles about how an organization should roll out an adoption plan. For a start, refer to the article below by the Microsoft Power BI team.

https://powerbi.microsoft.com/fr-fr/blog/driving-power-bi-adoption-in-your-organization-learn-how-microsoft-does-this-at-scale/

However, in this post I am not going to talk about how to create an adoption plan and roll it out; you can find enough resources for that on the internet, and it is a long-term, complex process. Rather, I want to share my experience of how we transitioned business users from an existing reporting platform to Power BI. I hope it helps you if you are planning to replace your reporting solution with Power BI. There are certain things you can do to help your business users get used to Power BI.

Do not fix what is not broken.

An existing reporting solution has already gone through many iterations of development before reaching its current state. It is not only about business logic; it's also about what information is available and how it is represented within reports. So, when you introduce Power BI reports, do not re-create all the reports from scratch. Try to cover what is available in the current reports as much as possible. I know it might be a large table or matrix and not as cool as fancy Power BI reports. However, if you change everything overnight, business users will struggle to cope with the changes. Give them what they have been using so far in the same format, maybe with the same look and feel. That is the first step of a long journey; changes will come.

Understand how they extract insights from old reports.

Sometimes you won't understand how business users extract insights from a lengthy report until you talk with them. They may be doing their own calculations by extracting data from the current reports; I have seen this a lot, with business users extracting data and creating their own Excel files for further analysis. Sometimes they have their own way of reading a lengthy paginated report. Help them by creating those analyses within Power BI. You can make those analyses the landing page of your report and keep a replica of the current view as the detail view. Eventually business users will stop using the replica of the current view, as they no longer need to perform the analysis on their own. It will be a slow process to move them away from the old way of analysing data.

Monitor the usage.

Monitor report usage from the Power BI admin portal. That way you can identify who is using which sections of the report most. If you succeeded with the previous step, you should see a pattern where users spend more time on the summary or insight pages you introduced and less time on the replicas of the previous reports. If that is not the case, have a discussion with them and see what is not working. Why do they feel they still need to go back to the old way of analysing the data? Get their feedback on how the report should look and evolve the reports you created based on that feedback.

Workshops

It does not matter how cool the Power BI report you created is if business users cannot extract the information they want. In most cases, it is not that your newly created report pages don't provide the information; it's just that users can't understand how they work. Conduct workshops with your business users. Show them how the new reports work and how to extract insights from those report pages rather than going to the replicas of the old reports.

Be upfront about limitations and provide alternative solutions.

When business users move from one reporting platform to another, the first thing they do is compare the two and work out how to do the things they did in the previous platform. Help them with that comparison and train them on how to achieve those functionalities in Power BI. There could be instances where some of those features are not available in Power BI. For example, people who use macros in Excel reporting might want to use the old macros they are used to in Power BI. Be upfront about it and work out a solution. If something is achievable using a workaround, explain it in the workshops you run to transition business users.

Lots of training

I believe training is the key. You just cannot implement Power BI in an organization without giving users comprehensive training on how to use it. When you organize training, remember that it is not a one-size-fits-all scenario. Within an organization there are different kinds of Power BI user groups.

  • Business users: they do not create reports or dashboards. Others create the reports, and they only analyse and drill down into details. This category only requires training on how to use Power BI to get insights and how to share content with others. The top management of an organization is a good example of this category.
  • Power BI developers: users who create reports and models for business users. They are the ones who write M queries, build models, and create complex measures using DAX. They need advanced training on topics like data modelling, DAX, third-party tools, etc.
  • Power BI admins: the people who administer the Power BI tenant might seem like a very small group, maybe only a couple of users. Nevertheless, they are the guardians of the most valuable asset of an organization: the DATA! Providing proper training on how to administer the tenant is of paramount importance.

Based on these categories, organize customized training for your users. In the training, try to use actual data, or data familiar to the business users, as much as possible. That will make the sessions more interesting.

Identify Super users

Within a group or department, you will be able to identify people who are keen on the transition and can adapt to Power BI more quickly than others. We normally call these business users super users. Recognize them among the other users and help them as much as possible with their queries. Once they are on board, they can surely help other users within the group or department; in other words, they become the go-to person within a group. Business users are more comfortable talking to and getting assistance from people within their own group than from the IT department. Going further, we can train them to be trainers within the organization so they can conduct Power BI training within their group.

Again, I know how hard it is to move business users from an existing reporting solution to Power BI, but it is not an overnight process, and you cannot expect everyone to onboard at the same pace. Follow these steps and try to understand business users by collecting continuous feedback. If you have more ideas that have worked for you, please comment below so it can help everyone. Thanks for reading, stay safe, cheers!

Creating multi-stage build pipeline to deploy Azure Data Factory to different environments

Implementing a CI/CD process can be a cumbersome task for a data engineer, as many are not familiar with (or don't enjoy) concepts such as DevOps. DevOps is an essential part of software engineering, but for data engineering it's a fairly new concept. Nevertheless, it is a part of DataOps, and data engineers are expected to use DevOps in their projects.

Since it's a bit of a hassle to understand DevOps concepts and implement a release pipeline, I thought of creating a YAML file that defines a multi-stage build pipeline. That way, anyone who wants to create a CI/CD pipeline can copy this YAML file, change the sections as commented, and create a pipeline that builds the ARM templates as well as deploys them to different environments.

STEP 1: Create environments for each stage

In a typical project we have at least three environments: development, test and production. For each of these, you need to create an environment within DevOps. Go to your project in Azure DevOps and create one environment for each.

Then, within each environment, click on "Approvals and checks" and add the users or user groups who can approve the release for that environment. For example, if you add a user to the Production environment, only that user can approve the production ADF release.

STEP 2: Modify the YAML file below

I have created a YAML file that does the following tasks:

  1. Build the ADF artifacts into an ARM template
  2. Deploy the built ARM template to the test ADF instance
  3. Deploy the built ARM template to the production ADF instance

Please read the comments inside the YAML file and replace the values marked with XXX characters with appropriate values for your environment. In my case, I use Azure Key Vault to store all the credential information.

# Sample YAML file to validate and export an ARM template into a Build Artifact
# Requires a package.json file located in the target repository

trigger:
- master # name of the branch which triggers the build
stages:
- stage: 'Build' 
  displayName: 'Build ARM template files to deploy'
  jobs: 
  - job: 'Build'
    displayName: 'Build ARM Template'
    pool:
      vmImage: 'ubuntu-latest'
    steps:
    # Installs Node and the npm packages saved in your package.json file in the build

    - task: NodeTool@0
      inputs:
        versionSpec: '10.x'
      displayName: 'Install Node.js'

    - task: Npm@1
      inputs:
        command: 'install'
        verbose: true
      displayName: 'Install npm package'

    # Validates all of the ADF resources in the repository. You will get the same validation errors as when "Validate All" is clicked
    # Enter the appropriate subscription and name for the source factory 

    - task: Npm@1
      inputs:
        command: 'custom'
        customCommand: 'run build validate $(Build.Repository.LocalPath) /subscriptions/XX[Subscription Id]XX/resourceGroups/XX[resource group name]XX/providers/Microsoft.DataFactory/factories/XX[name of the dev ADF instance]XX'
      displayName: 'Validate'

    # Validate and then generate the ARM template into the destination folder. Same as clicking "Publish" from UX
    # The ARM template generated is not published to the ‘Live’ version of the factory. Deployment should be done using a CI/CD pipeline. 

    - task: Npm@1
      inputs:
        command: 'custom'
        customCommand: 'run build export $(Build.Repository.LocalPath) /subscriptions/XX[subscription id]XX/resourceGroups/XX[name of the resource group]XX/providers/Microsoft.DataFactory/factories/XX[dev adf instance name]XX "ArmTemplate"'
      displayName: 'Validate and Generate ARM template'

    # Publish the Artifact to be used as a source for a release pipeline

    - task: PublishPipelineArtifact@1
      inputs:
        targetPath: '$(Build.Repository.LocalPath)/ArmTemplate'
        artifact: 'ArmTemplates'
        publishLocation: 'pipeline'
- stage: 'Test'
  displayName: 'Deploy Changes to Test Environment'
  dependsOn: Build
  condition: succeeded('Build')
  jobs: 
  - deployment: DeployToTest
    pool:
      vmImage: 'vs2017-win2016'
    environment: Test 
    strategy:
      runOnce:
        deploy:
          steps: 
          - task: DownloadPipelineArtifact@2 #downloading artifacts created in build stage
            inputs:
              source: 'current'
              path: '$(Pipeline.Workspace)'
          - task: AzureResourceManagerTemplateDeployment@3
            displayName: 'ARM Template deployment: Resource Group scope'
            inputs:
              azureResourceManagerConnection: 'XX[Azure service connection name]XX'
              subscriptionId: 'XX[subscription id]XX' #your subscription id
              resourceGroupName: 'XX[resource group of Test ADF instance]XX' #resource group of the target ADF instance 
              location: 'XX[Location of Test ADF instance]XX' #location of the target ADF instance 
              csmFile: '$(Pipeline.Workspace)/ArmTemplates/ARMTemplateForFactory.json' # Location of the ARM template in downloaded build
              csmParametersFile: '$(Pipeline.Workspace)/ArmTemplates/ARMTemplateParametersForFactory.json' #Location of the parameter ARM template of build
              overrideParameters: '-factoryName "XX[Test adf instance name]XX" -ls_azuresql_asankap_properties_typeProperties_connectionString_secretName "ConnectionString-ContosoSQL" -ls_blob_asankastorage_properties_typeProperties_serviceEndpoint "https://XXXX.blob.core.windows.net" -ls_keyvault_properties_typeProperties_baseUrl "https://kv-XXXX-dev.vault.azure.net/"'
        #XXXX values needs to be replaced for target ADF instance, this include name of the target adf instance, key vault url and etc.  
- stage: 'Production'
  displayName: 'Deploy Changes to Production'
  dependsOn: Test
  condition: succeeded('Test')
  jobs: 
  - deployment: DeployToProduction
    pool:
      vmImage: 'vs2017-win2016'
    environment: Production 
    strategy:
      runOnce:
        deploy:
          steps:
          - task: DownloadPipelineArtifact@2 #downloading artifacts created in build stage
            inputs:
              source: 'current'
              path: '$(Pipeline.Workspace)'
          - task: AzureResourceManagerTemplateDeployment@3
            displayName: 'ARM Template deployment: Resource Group scope'
            inputs:
              azureResourceManagerConnection: 'XX[Service connection Name]XX'
              subscriptionId: 'XX[subscription id]XX' #your subscription id
              resourceGroupName: 'XX[Production RG name]XX' #resource group of the target ADF instance 
              location: 'XX[Prod ADF Instance location]XX' #location of the target ADF instance 
              csmFile: '$(Pipeline.Workspace)/ArmTemplates/ARMTemplateForFactory.json' # Location of the ARM template in downloaded build
              csmParametersFile: '$(Pipeline.Workspace)/ArmTemplates/ARMTemplateParametersForFactory.json' #Location of the parameter ARM template of build
              overrideParameters: '-factoryName "XX[Prod ADF instance name]XX" -ls_azuresql_asankap_properties_typeProperties_connectionString_secretName "ConnectionString-ContosoSQL" -ls_blob_asankastorage_properties_typeProperties_serviceEndpoint "https://XX.blob.core.windows.net" -ls_keyvault_properties_typeProperties_baseUrl "https://kv-XX-dev.vault.azure.net/"'
              #values needs to be replaced for target ADF instance, this include name of the target adf instance, key vault url and etc.  
        

STEP 3: Create build pipeline

The next step is to create a YAML pipeline. Go to Pipelines in Azure DevOps and select "New pipeline". Then select Azure Repos Git as the code location and select where you want to store the YAML file you modified in the previous step.

Next, in the configure section, select "Starter pipeline". This opens an editor window; paste in the YAML you modified in STEP 2.

Click "Save and run", and if you modified the YAML file correctly, you will see DevOps build the ARM template and deploy it to the different environments based on the approval process you set up.

Thanks for reading. Cheers!

Automated publish (CI/CD) for Azure Data Factory using DevOps

The ADF team recently announced that they support automated publishing for ADF, something I had wanted for a while in my day-to-day ADF work. Before this, when someone wanted to create a release in DevOps, that person had to go to the master branch and click the publish button manually to generate the ARM template files used in the ARM template deployment in DevOps. With this announcement, whenever you make a pull request to your collaboration branch (in most cases the master branch), a new build is created, so manual intervention is not required anymore. You can read more about this in the Microsoft documentation below: https://docs.microsoft.com/en-us/azure/data-factory/continuous-integration-deployment-improvements/?WT.mc_id=DP-MVP-5004277

The reason I wanted to write this post is that I felt the documentation is not comprehensive enough for someone new to DevOps to read it and configure everything. Therefore, I'm going to create a step-by-step guide.

In my case I have my master branch and a feature branch in my dev ADF instance. The idea is that whenever I make a pull request from my working branch ("AsankaP") to master, it should trigger a new build and release, pushing the changes to QA.

Branch set up in Dev ADF

Folder Structure of the Repository

To set up the build pipeline, you need to understand how I have set up the folder structure of my repository. Within my Moana repository, I have created a folder called "ADFIntegration" and used it to store all the ADF artifacts. In other words, my root folder is "ADFIntegration".

GIT Repository configuration of Azure Data Factory

Within that root folder, I have created a folder called "DevOpsFiles" to store all the files required for this build pipeline setup.

Repository File Structure

STEP 1: Copy package.json file to your master branch repo.

Copy and paste the JSON code below and save it as package.json inside your dev repository. Although you can save this file in any location within the repo, create a folder within your ADF repository folder and save it inside that folder to avoid any confusion. In my case, I stored it within the "DevOpsFiles" folder. The npm package version is 0.1.5 at the time I am updating this post, but it might change over time. Check for the latest version of the ADF npm package using the link below and change the version to the latest stable one.

Check for the latest npm version: https://www.npmjs.com/package/@microsoft/azure-data-factory-utilities

{
    "scripts":{
        "build":"node node_modules/@microsoft/azure-data-factory-utilities/lib/index"
    },
    "dependencies":{
        "@microsoft/azure-data-factory-utilities":"^0.1.5"
    }
} 

Store Package.json file in repo

STEP 2: Create a Build Pipeline

Go to DevOps, click on Pipelines and create a new pipeline. Select Azure Repos Git as your code location and select where you want to save the YAML file.

Select Azure Repo Git for YAML file location

Then you need to select the repo in which you want to save the YAML file; I select the same repo where I have my ADF code.

Select the same repo as ADF repo

Then it asks how you want to configure your pipeline; select "Starter pipeline".

Select Starter Pipeline

Replace the default template YAML with the code below and change the areas I have mentioned within square brackets. Read the comments in the YAML to understand where you need to change the information.

You will need to change:

  • Azure Subscription Id
  • Dev ADF Instance Name
  • Root folder name [folder location of ADF Artifacts]
  • Folder Location of the package.json file

Apart from that, if your collaboration branch has a different name, change the branch name in the trigger section (in the sample below it is "main").

# Sample YAML file to validate and export an ARM template into a build artifact
# Requires a package.json file located in the target repository

trigger:
- main #collaboration branch [if the branch is not main, change it to the respective branch name]

pool:
  vmImage: 'ubuntu-latest'
steps:

# Installs Node and the npm packages saved in your package.json file in the build

- task: NodeTool@0
  inputs:
    versionSpec: '10.x'
  displayName: 'Install Node.js'

- task: Npm@1
  inputs:
    command: 'install'
    workingDir: '$(Build.Repository.LocalPath)/ADFIntegration/DevOpsFiles' #replace with the package.json folder; if the package file is saved within the root folder, remove "DevOpsFiles"; if there is no root folder, only keep Build.Repository.LocalPath
    verbose: true
  displayName: 'Install npm package'

# Validates all of the Data Factory resources in the repository. You'll get the same validation errors as when "Validate All" is selected.
# Enter the appropriate subscription and name for the source factory.

- task: Npm@1
  inputs:
    command: 'custom'
    workingDir: '$(Build.Repository.LocalPath)/ADFIntegration/DevOpsFiles' #replace with the package.json folder; if the package file is saved within the root folder, remove "DevOpsFiles"; if there is no root folder, only keep Build.Repository.LocalPath
    customCommand: 'run build validate $(Build.Repository.LocalPath)/ADFIntegration /subscriptions/[subscription id of your ADF instance]/resourceGroups/[resource group of your ADF instance]/providers/Microsoft.DataFactory/factories/[name of your ADF instance]' # Change "ADFIntegration" to the name of your root folder; if there is no root folder, remove that part and keep Build.Repository.LocalPath only.
  displayName: 'Validate'

# Validate and then generate the ARM template into the destination folder, which is the same as selecting "Publish" from the UX.
# The ARM template generated isn't published to the live version of the factory. Deployment should be done by using a CI/CD pipeline. 

- task: Npm@1
  inputs:
    command: 'custom'
    workingDir: '$(Build.Repository.LocalPath)/ADFIntegration/DevOpsFiles' #replace with the package.json folder; if the package file is saved within the root folder, remove "DevOpsFiles"; if there is no root folder, only keep Build.Repository.LocalPath
    customCommand: 'run build export $(Build.Repository.LocalPath)/ADFIntegration /subscriptions/[subscription id of your ADF instance]/resourceGroups/[resource group of your ADF instance]/providers/Microsoft.DataFactory/factories/[name of your ADF instance] "ArmTemplate"'  # Change "ADFIntegration" to the name of your root folder; if there is no root folder, remove that part.
  displayName: 'Validate and Generate ARM template'

# Publish the artifact to be used as a source for a release pipeline.

- task: PublishPipelineArtifact@1
  inputs:
    targetPath: '$(Build.Repository.LocalPath)/ADFIntegration/DevOpsFiles/ArmTemplate' #replace with the package.json folder; if the package file is saved within the root folder, remove "DevOpsFiles"; if there is no root folder, only keep Build.Repository.LocalPath
    artifact: 'ArmTemplates'
    publishLocation: 'pipeline'

Once all the modifications are done, click "Save and run" to see whether the build pipeline works correctly. If there are no errors, it should generate artifacts in the build folder, in my case the ArmTemplate folder.

Build pipeline execution

If you click on the highlighted section, "1 artifact produced", you should be able to see the ARM template files generated by the build.

ARM template files generated from the build pipeline

STEP 3: Create a Release Pipeline

The next step is to create a release pipeline. Go to Releases in DevOps and create a new release pipeline. Select "Empty job" to start with.

Select empty job in release pipeline.

Click on "Add an artifact" and select Build as the source type. Select the name of the build pipeline you created in the previous step. You can keep "Latest" as the default version or change it based on your release requirements.

Select Build as source type

Then click on the "Continuous deployment trigger" and enable continuous deployment as shown below. This makes sure that when a new build is available, a release is triggered from the release pipeline. Make sure you select the correct build branch, in my case the master branch.

Enable continuous deployment trigger

Then go to the stage and add an agent job. In the agent job, click the + icon and search for "ARM template". Select "ARM template deployment" and click Add.

Add ARM template deployment task to agent

In the ARM template deployment task, provide the subscription information, Azure Resource Manager connection and resource group. By this time, you should have created ADF instances for QA and prod. In my case the QA ADF instance is asankap-QA.

Click on "Template" and select "ARMTemplateForFactory.json" as the template file, as shown below.

Select template file

Click on "Template parameters" and select "ARMTemplateParametersForFactory.json" as the parameter file.

Select Template parameter file

Then click on "Override template parameters" and set the parameters manually. Unfortunately, since the build artifact is only accessible at run time, the parameters are not auto-populated and you will have to add them manually one by one. Check the image below to see how to add the parameters you want to override; you can get the parameter names from the parameter file you specified previously. Make sure you change factoryName to your QA or production ADF instance name.

Manually adding template parameter values.

Save the release pipeline, make a modification in the dev ADF instance and create a pull request to update the master branch.

Create a pull request to trigger a release

If you check DevOps, you will see that the build pipeline is triggered, and once the build is ready, the release pipeline is triggered to move the changes to the QA ADF instance.

Build pipeline is triggered by the pull request.

Release pipeline is triggered and a release is created.

I hope this post helps you to create fully automated release pipelines for your ADF environment. Thanks for reading and stay safe! Cheers!

Power BI Dataflow vs Azure Data Factory Power Query

Very recently the Microsoft Azure Data Factory team announced that they have renamed "Data Wrangling Dataflow" to "Power Query", and this created quite a bit of excitement among BI professionals. While Data Wrangling Dataflow has been available for a while as a preview feature in ADF, the renaming itself made people wonder whether they can use Azure Data Factory Power Query to replace Power BI Dataflow, especially to overcome certain limitations of Power BI Dataflow such as refresh rate and computation power. However, these two features have different architectures, and one needs to understand how they work before making any such decision. In this post I try to cover the main differences between the two, and I hope it helps you get a better idea of where you can use each component within a BI solution.

Processing Architecture

When it comes to architecture, both Power BI Dataflow and ADF Power Query use the Power Query Online mashup editor to allow users to perform data transformations. Because of this, one can mistakenly assume that both features work the same way. However, ADF converts the M query generated by the editor into Spark code and performs data processing at scale. On the other hand, Power BI Dataflow executes the M query within a mashup container process (I believe a .NET process). This means that even though the front end looks similar, the back-end processes are entirely different.

Processing Architecture Difference

ETL vs ELT

Power BI Dataflow acts as an ETL (Extract, Transform and Load) tool within the self-service BI space. It allows non-technical users to read data from multiple sources, perform data transformations, and finally store the results in a shared or dedicated Azure Data Lake Storage Gen2 account. Since Power BI Dataflow is coupled with Power BI, the data stored inside the data lake can easily be analyzed with Power BI.

However, since ADF Power Query does not support on-premises data sources, or even most cloud-based data sources, it has to follow an ELT (Extract, Load and Transform) approach. In other words, you will have to use ADF copy activities to load data into one of the supported data stores (check the image below) and then use Power Query to perform transformations on the loaded data. Once the data is processed, you can persist it into a storage account or a cloud data warehouse.

Supported Source System Types for Power Query in ADF

Parallel vs Distributed Processing

As mentioned in the architecture section, Power BI Dataflow uses mashup container processes to transform data. In the Power BI service, depending on the license, you get either shared or dedicated capacity to run these processes in parallel within a single server. If you have dedicated capacity, you can scale up the processing power by increasing the capacity and the container memory size.

On the other hand, ADF Power Query uses Spark, the well-known distributed processing engine. In this case, data is processed across multiple worker nodes and collected on the master node. When you use Power Query within ADF, you can select the size of the cluster used to perform the data transformations. If you need to scale out, it is simply a matter of selecting a cluster with a higher number of nodes.

Parallel vs Distributed Processing

Matured vs Preview

Power BI Dataflow has been around for a while and is used by many people in production. Apart from that, it supports a wide range of data transformations and a large number of connectors, both on-premises and cloud.

However, ADF Power Query, or Wrangling Dataflow, has been in preview for a while now, and I don’t believe anyone would or should use a preview feature in production. Additionally, as of now, Power Query in ADF only supports a limited set of data transformation functions. Although all the transformations appear to be available in the UI, when you save a Power Query it will tell you that many of them are not yet supported. I hope this will change when the feature becomes generally available. Apart from that, again as of now, Power Query only supports a limited number of source systems.

Feature Comparison

Feature | Power BI Dataflow | ADF Power Query
Data Source | On-premises and most cloud-based data sources | Limited; a few Azure data sources
Destination | ADLS Gen2 (shared or dedicated) | Cloud data warehouses, Azure storage accounts
Data Load | Small to medium data volume | Medium to large data volume
M Query Transformations Support | Extensive | Limited
Power BI Support | Yes | No (Power BI can still connect to the sink using a connector)
Maximum Data Refresh Rate | 8 per day in shared / 48 per day in dedicated capacity | No limit
Linked Entities | Supported | Not supported
Common Data Model (CDM) | Supported | Not supported
Cost | Cheap (comes free with Power BI licensing) | Expensive (depends on cluster size and execution time)

Conclusion

Azure Data Factory already has Mapping Dataflow to process data using Spark clusters, and the ADF team has been focusing on enhancing it. However, they also seem interested in letting users use Power Query for data processing. I believe that, with the familiar Power Query experience, the ADF team is trying to bring business users (citizen data integrators) onto the Azure Data Factory platform. Business users who are already using Power BI Dataflow must understand that ADF Power Query is not the same as what they have been using. ADF Power Query allows business users to process data at scale without writing any code, using the same UI they are used to within Power BI. But in my personal experience, Power BI Dataflow cannot (as of now) be replaced by Power Query in ADF, even though the two have some similarities.

Thank you for reading and stay safe! Cheers!

How traditional data warehousing concepts have changed/should change with Power BI?

Ten years ago, when I started my career as a BI engineer, the first task I was given was to study the Kimball methods of dimensional modeling. At that time, data warehousing was a specialized domain within the IT industry, and knowing Kimball data warehousing concepts was a must for people who worked in data and analytics. However, with Power BI becoming increasingly popular among both IT and business users in the data and analytics field, I feel that certain DW concepts have changed, or should change, under its influence.

What makes the difference?

Well, back then, almost all reports were created by the IT department, and only the IT department knew how the data was calculated and populated and which tables/fields contained what information. However, with Power BI, business users are more likely to create their own reports and perform their own analysis; the IT department only needs to provide the data to them. I believe this change has a profound impact on the design and implementation of a data warehouse in an organization.


Figure 01: How the usage of a DW has changed with Power BI

Apart from that, certain features available in tools like Power BI have invalidated some data warehousing concepts, while other features require new concepts to be introduced into data warehouse design and implementation. Let’s talk about how Power BI has changed DW design patterns.

Star or Snowflake

First and foremost, when designing a DW you need to decide whether to use a star schema or a snowflake schema for dimensional modeling. In traditional data warehousing this decision is based purely on the data volume and which method is easier to implement. However, it is commonly accepted that Power BI works best with a star schema rather than a snowflake schema, due to how data is stored in the tabular engine and how relationships are maintained. Apart from that, a star schema is simpler than a snowflake schema, which makes business users’ lives much easier when they perform data modeling and analysis using Power BI.
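
As a rough sketch of the difference (table and column names here are purely illustrative), a star schema flattens the snowflaked category tables into the product dimension itself:

-- Star schema: category attributes are denormalized into the product dimension
-- instead of being kept in separate DimSubCategory/DimCategory tables.
CREATE TABLE dbo.DimProduct
(
    ProductKey       INT           NOT NULL,  -- surrogate key
    ProductName      NVARCHAR(100) NOT NULL,
    SubCategoryName  NVARCHAR(50)  NULL,      -- denormalized from DimSubCategory
    CategoryName     NVARCHAR(50)  NULL       -- denormalized from DimCategory
);

This gives Power BI a single one-to-many relationship from DimProduct to the fact table instead of a chain of snowflaked relationships.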

Surrogate Keys

In data warehousing, there are a few common ways to implement surrogate keys:

  1. Auto increment integer values
  2. Auto generated unique identifier values
  3. Hash value of natural keys

In methods 2 and 3, surrogate keys are populated with string values, and in Power BI these columns become high-cardinality fields. As we need to use memory as efficiently as possible within Power BI, it is best to avoid high-cardinality columns as much as possible. Therefore, I believe that using auto-increment integer (or small integer) values as surrogate keys is the best method if you are using Power BI with the data warehouse.

The fact table surrogate key is another concept that has changed with Power BI. Having high-cardinality columns with no business value only consumes the precious, limited memory available within Power BI.
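
As a minimal sketch (table and column names are illustrative only), this is the kind of integer surrogate key that compresses well in the Power BI model, contrasted with a hash-based key:

-- Auto-increment integer surrogate key: low memory footprint in VertiPaq
CREATE TABLE dbo.DimCustomer
(
    CustomerKey   INT IDENTITY(1,1) NOT NULL,  -- surrogate key used in relationships
    CustomerCode  NVARCHAR(20)      NOT NULL,  -- natural/business key from the source
    CustomerName  NVARCHAR(100)     NULL
);

-- In contrast, a hash-based surrogate key such as
--   CONVERT(CHAR(64), HASHBYTES('SHA2_256', CustomerCode), 2)
-- produces long, unique strings that become high-cardinality columns in Power BI.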

Accumulated/Periodic Snapshot Fact Tables

In traditional data warehousing, there are a few fact table types:

  1. Transaction Fact Tables
  2. Periodic Snapshot Fact Tables
  3. Accumulated Fact Tables

The idea is to extract data at the most granular level from the source systems, store it in transaction fact tables, and then create periodic and accumulated fact tables based on that data by grouping it over different periods and stages. Populating these two types of fact tables requires additional ETL jobs. However, in Power BI data is aggregated by default and time-series calculations are straightforward to implement. Along with features such as Power BI aggregations, implementing and maintaining periodic and accumulated fact tables has become a resource overkill.
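
To make this concrete, the sketch below (all object names are hypothetical) shows the kind of extra ETL step a monthly periodic snapshot traditionally required; a measure or a Power BI aggregation over the transaction fact table can now produce the same grouping on the fly:

-- Monthly periodic snapshot built from the transaction-level fact table
INSERT INTO dbo.FactSalesMonthlySnapshot (MonthKey, ProductKey, TotalQuantity, TotalSalesAmount)
SELECT
    d.MonthKey,
    f.ProductKey,
    SUM(f.Quantity),
    SUM(f.SalesAmount)
FROM dbo.FactSales f
JOIN dbo.DimDate d
    ON d.DateKey = f.DateKey
GROUP BY
    d.MonthKey,
    f.ProductKey;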

Precalculated Time Intelligence Columns

Implementing calculated columns for time intelligence values such as YTD/MTD fields is a common practice in traditional data warehousing. Since the time-series values are pre-calculated, they do not incur any performance penalty when business users query the DW. However, with Power BI, time calculations are easy to implement and the VertiPaq engine is capable of calculating these values on the fly in no time. Hence, pre-calculated time-series fields are no longer required in a data warehouse.
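
For example, a pre-calculated YTD column similar to the sketch below (names are illustrative) used to be persisted in the DW; in Power BI the same value can be calculated on the fly with a time intelligence measure:

-- Running year-to-date sales persisted as a column in the DW (now redundant)
SELECT
    f.DateKey,
    f.ProductKey,
    f.SalesAmount,
    SUM(f.SalesAmount) OVER (
        PARTITION BY f.ProductKey, YEAR(d.CalendarDate)
        ORDER BY d.CalendarDate
        ROWS UNBOUNDED PRECEDING
    ) AS SalesAmountYTD
FROM dbo.FactSales f
JOIN dbo.DimDate d
    ON d.DateKey = f.DateKey;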

Table/Field Naming

In data warehousing, we used to follow certain naming conventions when naming tables and fields, and those conventions were only used among technical users. We never worried about how readable the names were or how well they aligned with business terms, as we knew business users were never going to touch the DW. However, things have changed with Power BI, as anyone can explore the DW now. Using prefixes or suffixes such as Dim/Fact to name tables makes business users’ lives hard. Moreover, camelCase or PascalCase naming for tables and columns is not ideal; more readable business names that business users can easily understand should be used instead. Apart from that, keeping the same names for primary keys and foreign keys helps with modeling in Power BI, as it automatically creates relationships based on name and data type.

Date Dimension

The date dimension is the most important dimension in any data warehouse. In a DW implementation, we maintain a separate ETL process to populate and update the date dimension. However, Power BI has multiple ways to implement a date dimension, and it does not require any manual intervention from the user. Therefore, unless there is a business requirement, I believe having and maintaining a date dimension is no longer mandatory if you are only using Power BI to analyze data in the DW.

ETL Audit Fields

Having audit information such as a data load date/update date in fact tables is very helpful, as Power BI incremental refresh can leverage that information. Apart from that, business users get visibility into how up to date their data is and when, and by whom, changes were made. Therefore, they don’t have to come back to the IT team whenever there is a doubt about the accuracy of the data.
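
As a rough sketch (column names are my own), these are the kinds of audit columns meant here, together with the filter Power BI incremental refresh folds onto them:

-- Fact table with ETL audit columns (names illustrative)
CREATE TABLE dbo.FactSales
(
    SalesKey     BIGINT        NOT NULL,
    DateKey      INT           NOT NULL,
    SalesAmount  DECIMAL(18,2) NOT NULL,
    LoadDate     DATETIME2     NOT NULL,  -- when the row was loaded by ETL
    UpdateDate   DATETIME2     NULL,      -- when the row was last modified
    UpdatedBy    NVARCHAR(50)  NULL       -- which process/user modified it
);

-- Power BI incremental refresh generates a folded filter similar to:
--   WHERE LoadDate >= @RangeStart AND LoadDate < @RangeEnd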

Simple Calculations

When I was implementing DWs back then, I used to store all calculated fields within fact tables, irrespective of how simple or complex those calculations were. Implementing calculated fields such as Min, Max, and Average within the ETL process and storing them in the DW is no longer required, as Power BI handles those calculations out of the box.

Conclusion

If there is one thing in this world that does not change, it is change itself. Some DW concepts have changed and new concepts have come into play with tools such as Power BI. As the Power BI team adds more features to the product, I believe data warehousing concepts will change even more in the near future. Well, we might not even need a data warehouse at all.

I hope this post gives you some idea on how Power BI has changed some of the traditional data warehousing concepts. Thanks for reading and stay safe. Cheers!

Power BI: One for All? All for One?


This is a question I have always had, but with composite models now supporting DirectQuery for Power BI datasets, I thought I would write about it. I feel that with the composite models feature in Power BI, we can break Power BI development into two parts: front-end and back-end development. The question I had was about the “Power BI Developer”: is it one role, or is one person performing multiple roles? Don’t you think a Power BI developer wears different hats when developing a single report?

A decade ago in the software industry, a software engineer had to perform different roles such as UX engineer, UI engineer, software developer, database administrator, infrastructure engineer, business analyst, and all the other roles we now find in the software development world. But things have changed over time, and we now see each person performing one specific task within a software development project. However, if you look at an organization today, you will see that a Power BI developer is performing all the roles I’m going to discuss in this post. I believe this one-for-all approach leads to most of the problems we see in Power BI implementations in a lot of organizations. Let’s see what the different roles performed by a Power BI developer are today.

Power BI Back-End Developer

The role of a Power BI back-end developer is the most critical one. A properly designed model can make a huge difference in an organization, as it makes end users’ lives easy. A Power BI back-end developer must know:

  • Data modeling techniques such as star schema and snowflake schema.
  • The DAX language and associated concepts such as row context, filter context, etc.
  • The M query language.
  • How to use different relationship types and how relationship direction works.
  • How the tabular engine works: the formula engine and the storage engine.
  • How the VertiPaq engine stores data.
  • Standards and best practices for data modeling.
  • How to troubleshoot performance issues.
  • How to use Power BI external tools for data modeling.

I have seen how not having a developer with solid knowledge of all the above areas creates problems for Power BI implementations.

Problems with the one-for-all approach:

  • Badly written DAX code causing performance issues.
  • Bad data model design leading to performance and usability issues.
  • Little or no attention given to standards and naming conventions within data models, making business users’ lives harder.
  • Badly written M queries leading to data loading issues.
  • Incorrect use of calculated columns and datasets leading to bigger data models, which in turn cause performance issues when refreshing data.
  • A one-dataset-per-report approach leading to complexity and strain on source systems.

Power BI Front-End Developer

It doesn’t matter how nice a data model design is; business users normally don’t pay much attention to it unless they see performance issues in reports. What they really care about is: how nice do their reports look? How easy are those reports to use? How quickly can you navigate between reports, dashboards, and visuals? How easy is it to obtain insights from a report? To cater to all these user expectations, a Power BI front-end developer must know:

  • Proper usage of colors, font styles, and icons.
  • A good understanding of features such as buttons, images, bookmarks, and drilldown/drill-through available within Power BI.
  • UX concepts such as report layout.
  • Proper usage of background images, shadows, and other UI-related properties.

When a Power BI developer doesn’t have good taste in UI, it can lead to a lot of problems for a Power BI implementation, which could ultimately result in business users creating their own reports or going back to Excel.

Problems with the one-for-all approach:

  • Poorly designed dashboards/reports.
  • Bad choices of colors and fonts leading to misinterpretation of data.
  • Bad user experience for business users.
  • Wrong use of visuals making it hard for business users to get insights out of a report/dashboard.

Power BI Administrator

Power BI administrator is a separate role which mainly focuses on maintaining and monitoring the Power BI environment within an organization. A Power BI administrator must have good knowledge of:

  • How the different tenant settings work within the admin portal.
  • How to plan and maintain Premium capacities within the tenant.
  • The different security options available at the Power BI tenant level to make sure organizational data is secured.
  • How to monitor usage patterns within the tenant.

Problems with the one-for-all approach:

  • Security-related issues, as a single developer controls the whole organization’s data and its distribution.
  • Data security issues, as little to no attention is given to security within the Power BI tenant.
  • Misuse of resources within the tenant by business users.

Business Analyst

From my experience, there is often a gap between the business user and the Power BI developer. While business users don’t understand the technical details behind a report, report developers don’t clearly understand the business scenarios business users are trying to analyze within a report. However, in most organizations it is the developer who talks with the end user, tries to gather requirements for reports, and provides different options on how to analyze the data. When the developer performs the role of a business analyst, he or she needs to:

  • Be a good listener and understand the business cases explained by the business users.
  • Provide suggestions to business users about how to use different Power BI features to analyze data effectively.
  • Help business users create meaningful data stories using Power BI features.

Problems with the one-for-all approach:

  • Invalid interpretations of data.
  • Invalid or meaningless storytelling within reports.
  • Wrong calculations and drill-through/drilldown scenarios.

Conclusion


In this post I’m not saying that an organization must have multiple people working in the Power BI team, each with only one task to do. But I believe that each member of a data and analytics team should put in more time and focus to become an expert in at least one of these roles. For example, there can be one developer in a team who is an expert in data modeling, while another person has a good sense of UI and UX.

Again, all for one or one for all? Have a think about it. Thanks for reading and stay safe! Cheers!