Very recently Microsoft Azure Data Factory team announced that they have renamed “Data Wrangling Dataflow” into “Power Query” and this made a quite a bit of excitement among BI professionals. While Data Wrangling Dataflow was there for a while as a preview feature in ADF, this renaming it-self made people to wonder whether they can use Azure Data Factory Power Query to replace Power BI Dataflow, specially to overcome certain limitations with Power BI Dataflow such as refresh rate and computation power. However, these two features have different architectures and hence one must needs to understand how these two features work before making any such decisions. In this post I’m trying to cover main differences between these two and hope this will helps you to have a better idea on where you can use each component within a BI solution.
When it comes to architecture, both Power Dataflow and ADF Power Query use Power Query Online Mashup Editor to allow users to perform data transformations. Because of this one can misunderstand that both features work similar way. However, ADF converts M query generated by the editor into Spark code and perform data processing at scale. On the other hand, Power BI Dataflow executes M query within a Mashup container process (I believe .NET Process). Which means even though front end looks similar, backend process entirely different to each other.
ETL vs ELT
Power BI Dataflow act as a ETL (Extract, Transform and Load) tool within self-service BI space. It allows non- technical users to read data from multiple sources and perform data transformations and finally store in a shared or dedicated Azure Data Lake Gen2 storage account. Since Power BI Dataflow is coupled with Power BI, Power BI allows to analyze data stored inside the data lake easily.
However, since ADF Power Query does not support any on-premises data sources or even most of the cloud-based data sources, it needs to accommodate ELT (Extract, Load and Transform) approach. In other words, you will have to use ADF copy actives to load data to any of the supporting data sources ( check the image below) and then use Power Query to perform transformation on loaded data. Once data is processed you can persist data into a storage account or a cloud data warehouse.
Parallel vs Distributed Processing
As mentioned in the architecture section, BI dataflow uses Mashup Container Process to transform data. In Power BI service, based on the license, one might get shared capacity or dedicated capacity to run these processes parallel within a single server. If you have dedicated capacity, you must be able to scale up the processing power by increasing capacity and the container memory size.
On the other hand, ADF Power Query uses Spark, the well-known distributed processing engine. In this case data is processed using multiple worker nodes and collected inside the master node. When you use Power Query within ADF, it will allow you to select the size of cluster which is used to perform data transformations. If someone needs to scale out, it is a matter of selecting a cluster with higher number of nodes.
Matured vs Preview
Power BI dataflow has been there for a while and it has been used by many people in production. Apart from that it allows to perform lot of data transformations and supports a lot of connectors, both on premises and cloud.
However, ADF Power Query or Wrangling Dataflow has been in preview mode for a while now. I don’t believe that anyone would or should use Preview feature for in production. Additionally, as of now, Power Query in ADF only supports limited data transformation functions. Although you see that all the transformations are available in UI, when you save a Power Query, it will let you know most of the transformations are not supported as of now. I hope this will change when the go to production. Apart from that, again as of now, Power Query only supports limited number of data source systems.
|Feature||Power BI Dataflow||ADF Power Query|
|Data Source||On-Premise and Most Cloud based Data Sources||Limited, Few Azure data sources|
|Destination||ADLS Gen2, Shared or Dedicated||Cloud Data warehouses, Azure Storage Accounts|
|Data Load||Small to Medium Data Volume||Medium to Large Data Volume|
|M Query Transformations Support||Extensive||Limited|
|Power BI Support||Yes||No ( Power BI still can connect to a sink using a connector)|
|Maximum data Refresh Rate||8 in Shared/ 48 in Dedicated||No limit|
|Linked Entity||Support||Not Support|
|Common Data Model (CDM)||Support||Not Support|
|Cost||Cheap( Comes free with Power BI Licensing)||Expensive (Depending on the cluster size and the execution time)|
Azure Data Factory already have Mapping Dataflow to process data using Spark cluster and ADF team has been focusing more to enhancing it. However, they seems to have interest about allowing users to use Power Query for data processing. I believe from that Power Query experience, ADF team is trying to bring business users ( citizen data integrators) to Azure Data Factory platform. These business users who are already using Power BI Dataflow must know that ADF Power Query is not the same as what they have been using. ADF Power Query allows business users to process data at scale without writing any code using same UI they have been using within Power BI. In my personal experience, Power BI dataflow can not be replaced by Power Query in ADF even if these two have some similarities.(as of now)
Thank you for reading and stay safe! Cheers!