Data is the most important asset of an organization! Safeguarding it is of paramount importance, and companies spend millions of dollars doing so. In this post, I cover how we can configure Azure Data Factory (ADF) securely to protect the data it processes. The diagram below shows how I approach securing organizational data when Azure Data Factory is used in a data integration solution. Depending on your organizational requirements, you might or might not be able to use all the components I list in this post.
One thing to remember is that I will not go into detail on any of these features in this post. Instead, I link to the Microsoft documentation for each section so that you can read more about it.
Multiple ADF Instances
Designing Azure resource groups and Azure resources is very important when it comes to an ADF implementation. A proper resource design determines how secure your data is in the cloud. The first thing to consider in the design is not to share an ADF instance between multiple projects or multiple environments. Having multiple ADF instances does not cost you anything extra. On the other hand, it lets you restrict access to each environment to specific people only. For example, if you have projects called XYZ and ABC, create different resource groups named XYZ-Dev, XYZ-Test, XYZ-Prod, ABC-Dev, etc. Within each resource group (RG), you can create a separate ADF instance. That way you can control access to storage accounts, key vaults and ADF instances and grant the least possible access.
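As a small illustration of the naming scheme above, here is a minimal Python sketch that lists one resource group and one ADF instance per project and environment. The resource group names follow the XYZ-Dev pattern from this post; the ADF naming pattern (adf-xyz-dev) is only my assumption for the example, so use whatever convention your organization prefers.

```python
from itertools import product


def plan_resources(projects, environments):
    """Return one resource group / ADF instance pair per project and environment."""
    plan = []
    for project, env in product(projects, environments):
        plan.append(
            {
                # Resource group naming follows the XYZ-Dev pattern from the post.
                "resource_group": f"{project}-{env}",
                # Hypothetical ADF naming pattern, used only for illustration.
                "data_factory": f"adf-{project}-{env}".lower(),
            }
        )
    return plan


plan = plan_resources(["XYZ", "ABC"], ["Dev", "Test", "Prod"])
for item in plan:
    print(item["resource_group"], "->", item["data_factory"])
```

Each entry can then get its own access assignments, so that a developer with rights on XYZ-Dev has none on XYZ-Prod.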
ExpressRoute or Point-to-Site / Site-to-Site VPN
In most ADF implementations, ADF uses a Self-Hosted Integration Runtime (SHIR) to connect to the on-prem network and transfer data from on-prem data sources into a staging environment or to the final destination in the cloud. In that case, if you have not set up ExpressRoute or a VPN, data is transferred via the public internet. Although the data is encrypted in transit using HTTPS or TLS over TCP, the most secure way is to create a private connection between the on-prem network and Azure using ExpressRoute or an IPSec VPN between the corporate network and Azure.
You can read more on this in the Microsoft links below:
How SHIR should be configured: https://docs.microsoft.com/en-us/azure/data-factory/data-movement-security-considerations/?WT.mc_id=DP-MVP-5004277
How to create Site to Site VPN: https://docs.microsoft.com/en-us/azure/vpn-gateway/vpn-gateway-howto-site-to-site-classic-portal/?WT.mc_id=DP-MVP-5004277
How to create Point to Site VPN: https://docs.microsoft.com/en-us/azure/vpn-gateway/vpn-gateway-howto-point-to-site-classic-azure-portal/?WT.mc_id=DP-MVP-5004277
Configure Corporate Firewall Rules
If your organization does not have ExpressRoute or an IPSec VPN configured, you can use the corporate firewall to restrict access from outside so that only the ADF control channel can communicate with the on-prem network. You can get the domain names used by Azure Data Factory by going to your SHIR in ADF and clicking View Service URLs. It provides the list of domain names shown in the image below, and you can add those Fully Qualified Domain Names to your corporate firewall’s allow list. Note that this is not the only list of domain names you will have to whitelist. Read the link below on setting up firewall rules to understand what other steps you need to take.
Read more on setting up firewall rules: https://docs.microsoft.com/en-us/azure/data-factory/create-self-hosted-integration-runtime?tabs=data-factory#get-url-of-azure-relay/?WT.mc_id=DP-MVP-5004277
Managed Private Endpoints
At the time of writing, Managed Private Endpoints (MPE) are still in preview in ADF. But for me, this is huge when it comes to security. When you want to secure your data in the cloud, the main thing you do is encapsulate those resources within a VNet and disable connectivity from the public internet. In that case, if you don’t use MPE, ADF will not be able to connect to those sources. If you enable MPE for an integration runtime, ADF creates a private endpoint connection between the sources and ADF, and data will not flow through the Azure shared network.
While you can set this up for any new integration runtime you create, you can also set it for the AutoResolve integration runtime when creating the ADF instance.
Read more on ADF Managed Private Endpoints: https://docs.microsoft.com/en-us/azure/data-factory/managed-virtual-network-private-endpoint?WT.mc_id=Portal-Microsoft_Azure_CreateUIDef/?WT.mc_id=DP-MVP-5004277
IP Whitelisting in Azure Resources
If your cloud data sources are not protected using a VNet, the next available option is IP whitelisting, which only allows traffic coming from the Azure Integration Runtime to communicate with the data sources. However, note that these IP addresses differ based on the integration runtime location. In the case of the Azure AutoResolve IR this is tricky, as we don’t have control over the location of the integration runtime. Microsoft has published a document with the fixed IP address ranges used by each Azure Integration Runtime. These IP addresses change based on the datacenter in which you create the Azure IR. Also note that this static IP address list might be updated, so you might want to check it on a regular basis.
Note that in the case of storage accounts, if the IR and the storage account are in the same region, these IP rules will not be effective.
You can download the JSON file which contains the IP addresses of Data Factory and add those to the network whitelist.
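Picking the right ranges out of the downloaded JSON file by hand is error-prone, so here is a minimal Python sketch of how it could be filtered. It assumes the file follows the Azure service tags layout (a top-level "values" list whose entries have a "name" such as DataFactory.EastUS and a "properties.addressPrefixes" list); please verify that against the file you actually download. The sample document and its IP ranges are made-up placeholders, not real Data Factory addresses.

```python
import ipaddress
import json

# Made-up sample in the assumed service tags layout; the ranges are
# documentation placeholder addresses, NOT real Data Factory IPs.
SAMPLE_SERVICE_TAGS = """
{
  "values": [
    {"name": "DataFactory.EastUS",
     "properties": {"systemService": "DataFactory",
                    "addressPrefixes": ["203.0.113.0/28", "198.51.100.64/27"]}},
    {"name": "DataFactory.WestEurope",
     "properties": {"systemService": "DataFactory",
                    "addressPrefixes": ["192.0.2.0/25"]}},
    {"name": "Storage.EastUS",
     "properties": {"systemService": "AzureStorage",
                    "addressPrefixes": ["192.0.2.128/25"]}}
  ]
}
"""


def data_factory_prefixes(service_tags, region):
    """Return the sorted address prefixes of the DataFactory tag for one region."""
    wanted = f"DataFactory.{region}".lower()
    prefixes = set()
    for tag in service_tags.get("values", []):
        if tag.get("name", "").lower() != wanted:
            continue
        for prefix in tag.get("properties", {}).get("addressPrefixes", []):
            # ip_network raises ValueError if an entry is not a valid CIDR range.
            ipaddress.ip_network(prefix)
            prefixes.add(prefix)
    return sorted(prefixes)


service_tags = json.loads(SAMPLE_SERVICE_TAGS)
east_us = data_factory_prefixes(service_tags, "EastUS")
print(east_us)
```

With a real download you would replace SAMPLE_SERVICE_TAGS with the contents of the file and pass the region where your Azure IR is created.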
Managed Identity
When ADF connects to cloud data sources, there are multiple ways to authenticate. While the easiest and most common practice is to use a user name and password, or tokens/keys, it is also the least secure way to connect to your data. Whenever the data source supports it, use Managed Identity to connect to cloud data sources. For each ADF instance, Azure creates a Managed Identity with the name of the ADF instance. You can use it to authenticate by granting it RBAC access to resources with the least required permissions. For example, if you want to read data from a blob container, assign the ADF managed identity a read-only data role such as Storage Blob Data Reader on the respective blob container.
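To make this concrete, here is a minimal Python sketch that builds a Blob Storage linked service definition with no secret in it. As far as I understand the connector, supplying only a serviceEndpoint (no connection string, account key or SAS token) makes ADF authenticate with its managed identity, but treat that as an assumption and check it against the connector documentation. The linked service name and storage account name are hypothetical.

```python
import json


def blob_linked_service(name, storage_account):
    """Build a Blob Storage linked service definition that carries no credentials."""
    return {
        "name": name,
        "properties": {
            "type": "AzureBlobStorage",
            "typeProperties": {
                # Only the endpoint is given; no key, SAS token or connection
                # string, so nothing secret is stored in the definition.
                "serviceEndpoint": f"https://{storage_account}.blob.core.windows.net/"
            },
        },
    }


# Hypothetical names, used only for illustration.
linked_service = blob_linked_service("StagingBlobLS", "xyzdevstaging")
print(json.dumps(linked_service, indent=2))
```

The role assignment on the storage side still has to be done separately; the definition above only tells ADF where the storage account is.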
More on ADF Managed Identity: https://docs.microsoft.com/en-us/azure/data-factory/data-factory-service-identity/?WT.mc_id=DP-MVP-5004277
Azure Key Vault
If a data source does not support Managed Identity, you might have to use passwords or keys for authentication. In such cases, don’t store this information in plain text or hard-code it in ADF. Always store credentials in a Key Vault and reference them from ADF. One key mistake people make is maintaining a single Key Vault that holds the keys for several environments. That compromises security and also makes it difficult to move ADF artifacts between environments.
More on using Key Vault to store credentials for ADF: https://docs.microsoft.com/en-us/azure/data-factory/store-credentials-in-key-vault/?WT.mc_id=DP-MVP-5004277
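Here is a minimal Python sketch of the one-Key-Vault-per-environment idea. It builds a Key Vault linked service for a given environment and a secret reference in the AzureKeyVaultSecret shape that ADF linked services use. The vault naming pattern (kv-xyz-dev), the linked service name and the secret name are hypothetical; the point is that the secret reference is identical in every environment and only the vault URL changes.

```python
def key_vault_linked_service(project, env):
    """Build the Key Vault linked service for one project and environment."""
    # Hypothetical vault naming pattern: one vault per project and environment.
    vault_name = f"kv-{project}-{env}".lower()
    return {
        "name": "KeyVaultLS",  # same name everywhere, so references stay unchanged
        "properties": {
            "type": "AzureKeyVault",
            "typeProperties": {"baseUrl": f"https://{vault_name}.vault.azure.net"},
        },
    }


def key_vault_secret(secret_name, linked_service_name="KeyVaultLS"):
    """Build a secret reference to use in place of a plain-text password."""
    return {
        "type": "AzureKeyVaultSecret",
        "store": {"referenceName": linked_service_name, "type": "LinkedServiceReference"},
        "secretName": secret_name,
    }


dev_vault = key_vault_linked_service("XYZ", "Dev")
prod_vault = key_vault_linked_service("XYZ", "Prod")
password_ref = key_vault_secret("sql-password")  # hypothetical secret name

print(dev_vault["properties"]["typeProperties"]["baseUrl"])
print(prod_vault["properties"]["typeProperties"]["baseUrl"])
print(password_ref)
```

Because every environment uses the same secret name in its own vault, the ADF artifacts that reference the secret can be promoted from Dev to Prod without edits.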
Bring your own key
Unless you configure ADF to use your own key, it uses system-generated keys to encrypt any information related to ADF, such as connection information, metadata and any cached data. While this is secure enough, you can always use your own keys stored in an Azure Key Vault to encrypt the Azure Data Factory.
More on bring your own key: https://docs.microsoft.com/en-us/azure/data-factory/enable-customer-managed-key/?WT.mc_id=DP-MVP-5004277
Encryption at Rest
Encryption at rest is not specific to ADF, and hence I kept it for last. But it is very important when it comes to data security. ADF does not store any data within it except for caching. Therefore, when data is stored in a staging environment or the final destination, make sure to use the encryption capabilities that come with the respective storage type. For example, for blob storage accounts, you can enable encryption at rest and even choose to use your own keys for it.
I will keep updating this post with other security features we can consider for a secure ADF implementation. Feel free to add your ideas on this post as comments. Thanks for reading and stay safe!