Source control and branching are not new concepts for anyone who has some exposure to software engineering. However, data engineering and software engineering rarely cross paths, and I have seen some ETL developers and data engineers struggle to understand and use branching properly within their code repositories. Even when data engineers configure source control, such as an Azure DevOps Git repo, within Azure Data Factory (ADF), I have seen them use only the default main branch for all development. Therefore, in this post I try to cover the most common branching strategies I have come across and the pros and cons of each. One thing to remember: although ADF now supports Automated Publishing, depending on how you implement branching, you may or may not be able to use that feature. To learn more about automated publishing, refer to the link below.
Automated Publish in ADF: https://asankap.wordpress.com/2021/03/02/automated-publish-in-azure-data-factory-using-devops/
Okay, now let’s have a look at the different branching strategies used in Azure Data Factory.
Approach 1: Main branch only
This is the simplest way. When you configure a repository in ADF, by default it creates the main branch and the adf_publish branch. You can use the main branch for all development, and once it is ready to move to a higher environment (test or prod in this case), you have to publish the ADF artifacts manually by clicking the “Publish” button in ADF. During the publish process, ADF generates ARM template files and puts them in the “adf_publish” branch. You can then use those ARM templates for ARM template deployment within Azure DevOps as part of a CI/CD implementation.
CI/CD for Azure Data Factory: https://docs.microsoft.com/en-us/azure/data-factory/continuous-integration-deployment/?WT.mc_id=DP-MVP-5004277
Pros and Cons with this approach:
- Easy to implement
- Only suitable for small implementations
- Cannot use when multiple engineers develop simultaneously
- Cannot use Automated Publishing as CI and PR triggers are unavailable
- Difficult to combine continuous development with issue fixing, as the higher environments and the main branch are in different states at any given time.
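To make the publish step a bit more concrete, here is a minimal Python sketch that summarises what an ARM template in the adf_publish branch contains before a release pipeline deploys it. It assumes the usual ARM template layout with a top-level "resources" array (the generated file is typically named ARMTemplateForFactory.json). The function name and the sample pipeline and dataset names are made up for illustration and are not part of ADF.

```python
import json
from collections import Counter


def summarize_arm_template(template: dict) -> dict:
    """Count the resources in an ARM template by their short type name."""
    counts = Counter()
    for resource in template.get("resources", []):
        # Types look like "Microsoft.DataFactory/factories/pipelines";
        # keep only the last segment ("pipelines").
        short_type = resource.get("type", "unknown").rsplit("/", 1)[-1]
        counts[short_type] += 1
    return dict(counts)


# Small in-memory stand-in for a published template (made-up content).
sample_template = json.loads("""
{
  "resources": [
    {"type": "Microsoft.DataFactory/factories/pipelines", "name": "CopySales"},
    {"type": "Microsoft.DataFactory/factories/pipelines", "name": "LoadDim"},
    {"type": "Microsoft.DataFactory/factories/datasets", "name": "SalesCsv"},
    {"type": "Microsoft.DataFactory/factories/linkedServices", "name": "Blob"}
  ]
}
""")

print(summarize_arm_template(sample_template))
```

In a real repository you would load the JSON file from a checkout of the adf_publish branch instead of the inline sample.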
Approach 2: Main Branch as Collaboration branch
In this approach, the main branch is used as the collaboration branch and each developer has their own branch for development. Any developer who joins the team has to make a pull request from the main branch to their developer branch to get the latest ADF artifacts before starting development. Once development is done, the developer makes a pull request to merge the changes back into the main branch. Once all the changes are in the main branch, either a manual publish or an automated publish can be triggered. If you use automated publish, the release pipeline should make sure that only selected pull requests trigger a release to a higher environment.
Pros and Cons with this approach:
- Multiple developers can work simultaneously
- Difficult to use with Automated Publishing as multiple developers update the collaboration branch (main branch) at different times.
- Cannot break development into features; all development must be released as one feature.
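The idea that only selected pull requests should trigger a release can be sketched as a small gate function. This is only an illustration in Python: the PullRequest shape and the "release" label are conventions I am assuming for the example, not something ADF or Azure DevOps defines. In practice you would express the same rule with branch filters and approvals in your release pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class PullRequest:
    """Minimal stand-in for a pull request (hypothetical shape)."""
    source_branch: str
    target_branch: str
    merged: bool
    labels: set = field(default_factory=set)


def should_trigger_release(pr: PullRequest,
                           collaboration_branch: str = "main",
                           release_label: str = "release") -> bool:
    """Return True only for merged PRs into the collaboration branch
    that the team has explicitly marked for release."""
    return (
        pr.merged
        and pr.target_branch == collaboration_branch
        and release_label in pr.labels
    )


# A routine merge from a developer branch: does not start a release.
routine = PullRequest("dev/alice", "main", merged=True)
# A merge the team has marked as release-ready: starts a release.
marked = PullRequest("dev/bob", "main", merged=True, labels={"release"})

print(should_trigger_release(routine), should_trigger_release(marked))
```

With a gate like this, developers can keep merging into main at different times without every merge pushing changes to test or prod.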
Approach 3: Feature branches as Collaboration branches
In this approach, multiple teams can work in the same ADF instance to develop different features. Each feature branch is created from the main branch, and the team members working on a specific feature do their development against that feature branch. Once a developer is ready, they create a PR (Pull Request) to merge their development into the feature branch. When the feature is ready to go to a higher environment, it is moved to the main branch using a PR. If required, the feature branch can be configured for Automated Deployment. That way, feature releases can be created rather than waiting for the whole development to finish. If not, the main branch can be used for either manual or automated publishing, as indicated in the diagram below.
Pros and Cons in this approach:
- Multiple teams/developers can work on different features simultaneously
- Individual features can go to higher environments without any dependencies on other developers’ work
- Complex implementation in branching and release management
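The merge flow in this approach (developer branch to feature branch, feature branch to main) can be written down as a simple rule. The Python sketch below assumes a naming convention of dev/<developer>/<feature> and feature/<feature>. That convention is my own assumption for the example; ADF does not require any particular branch names.

```python
def pr_target(branch: str) -> str:
    """Return the branch a pull request from `branch` should target,
    under the assumed dev/<developer>/<feature> naming convention."""
    parts = branch.split("/")
    if parts[0] == "dev" and len(parts) == 3:
        # A developer's work is merged into the matching feature branch.
        return f"feature/{parts[2]}"
    if parts[0] == "feature" and len(parts) == 2:
        # A finished feature is promoted to main.
        return "main"
    raise ValueError(f"no pull request target defined for {branch!r}")


for branch in ("dev/alice/sales-load", "feature/sales-load"):
    print(branch, "->", pr_target(branch))
```

A rule like this is also a handy thing to encode as branch policies, so that nobody merges a developer branch straight into main by mistake.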
Approach 4: Main branch as Production equivalent branch
This approach is a kind of extended implementation of approach 3. Here, the main branch is maintained as the production equivalent. In other words, releases are done using a different collaboration branch, and once test and UAT are completed and the artifacts are moved into production, the collaboration branch is merged into the main branch to keep the main branch equal to the production environment. That way, if something goes wrong in prod, you always have the production code in hand, and it is just a matter of cloning the code and creating a new ADF instance to troubleshoot the prod issue.
Pros and Cons in this approach:
- Easy to fix production issues
- Production code is safe and always in hand
- Complex implementation
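If you keep main as the production equivalent, it is worth checking now and then that the two really are equal. The Python sketch below compares the "resources" arrays of two ARM templates, for example one generated from the main branch and one exported from the production factory. The function names and sample resources are made up for illustration, and a real comparison would also have to account for environment-specific parameters.

```python
def index_resources(template: dict) -> dict:
    """Index an ARM template's resources by (type, name)."""
    return {(r["type"], r["name"]): r for r in template.get("resources", [])}


def compare_templates(main_template: dict, prod_template: dict) -> dict:
    """Report resources that differ between the two templates."""
    main_res = index_resources(main_template)
    prod_res = index_resources(prod_template)
    shared = main_res.keys() & prod_res.keys()
    return {
        "only_in_main": sorted(main_res.keys() - prod_res.keys()),
        "only_in_prod": sorted(prod_res.keys() - main_res.keys()),
        "changed": sorted(k for k in shared if main_res[k] != prod_res[k]),
    }


PIPELINE = "Microsoft.DataFactory/factories/pipelines"

# Made-up templates: LoadDim differs, and Hotfix exists only in prod.
main_template = {"resources": [
    {"type": PIPELINE, "name": "CopySales", "properties": {"activities": 2}},
    {"type": PIPELINE, "name": "LoadDim", "properties": {"activities": 3}},
]}
prod_template = {"resources": [
    {"type": PIPELINE, "name": "CopySales", "properties": {"activities": 2}},
    {"type": PIPELINE, "name": "LoadDim", "properties": {"activities": 4}},
    {"type": PIPELINE, "name": "Hotfix", "properties": {"activities": 1}},
]}

drift = compare_templates(main_template, prod_template)
print(drift)
```

An empty result in all three lists would mean the main branch and production are in the same state, which is exactly what this approach aims for.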
How you create branches in ADF depends purely on how big your ADF implementation is and how big the development team is. There are no hard and fast rules that say you should create branches this way or that way. In this post, I wanted to show you a couple of ways you can implement branching in your ADF projects and how each approach handles different problems we face in release management. If you have implemented branching differently to manage different scenarios, please feel free to comment and share it with others. Thank you for reading and stay safe.