Git best practices with Azure Data Factory
31 May 2023
Are you intimidated by all of the options when trying to add Git tracking to Azure Data Factory (“ADF”)? Check out my top tips!
Benefits include:
- Easier collaboration when many people need to work on the same factory at the same time.
- Easier auto-synchronization of content between closely related factories.
TL;DR quick pointers
- Create a brand new git repository before syncing an ADF factory to it.
- If you’re more familiar with traditional git-based development than you are with ADF, here’s a tip: think of the “ADF Studio” website as an ADF-specific IDE.
- (If you’re familiar with ADF but new to git, ignore what I just said. ADF Studio will magically auto-sync your point-and-click work into files stored in your git repository, and vice-versa, and it’s perfectly OK for you to just throw up your hands and say “Cool, magic.”)
- Edit your ADF factories with a “trunk-based” branching strategy, which means:
  - Configure your git repository so that it’s impossible for ADF Studio to make direct edits to the contents of your “main” branch.
  - Never let any branches of your git repository exist for longer than a few hours unless they’re named “main” or “adf_publish.”
    - (Yes, hours! You can do it! I’ll coach you!)
- In ADF Studio, when you set up auto-syncing to your git repository:
  - Make “main” the “collaboration branch” (the only one with a working “publish” button in ADF Studio).
  - Make “adf_publish” the “publish branch” (don’t worry what it’s for).
Read on for all of the juicy details and a hands-on exercise.
Set up a new git repository
- Create a blank Git repository in a hosting service like Azure DevOps (“ADO”) Repos or GitHub.
- Add a file called README.md to it (it doesn’t matter what the contents of the file are, just add one), making sure that the branch that gets created to contain this file is called “main.”
- Spin a branch off of the “main” branch called “adf_publish.”
  - ADO new branch instructions
  - GitHub new branch instructions
  - (You won’t use it in the rest of this article, but ADF Studio is going to want to know it exists.)
- IMPORTANT: Lock down the “main” branch from further direct edits. (Don’t bother locking down the “adf_publish” branch from direct edits – those should stay enabled.)
  - ADO branch protection instructions
    - Important: If you don’t set project-wide “default branch” policies, be sure to require at least 1 reviewer for “main” at the repository level. If you just “lock” the main branch without putting any “policies” in place, it won’t actually be protected.
  - GitHub branch protection instructions
    - Important: You must select all 3 of the following options:
      - “Require a pull request before merging”
      - “Lock branch”
      - “Do not allow bypassing the above settings”
    - I think you might also need to require approvals? Not sure. If ADF Studio lets you save anything to “main” once you start editing, then come back and also require approvals.
You always want ADF Studio to error out when you try to save any work done in the context of the “main branch.” Just trust me on this.
If you click save while working in the “main branch” of ADF Studio, and if you have branch protection correctly configured on the “main” branch of your git repository, then ADF Studio should briefly flash an error message in the upper-right corner of the screen reading:
“Failed to save in the Git repository: Unable to save (whatever you just tried to save). Either select another branch or resolve the permissions in the Git repository.”
Reminder:
- “Git” is a free, open-source version control system for tracking changes to a set of files.
- “GitHub” is a company that helps you host copies of sets of files whose changes you are tracking using Git.
  - (So does the sorta-kinda-competing service, Azure DevOps Repos. “Sorta-kinda” because Microsoft owns them both.)
I hear a lot of people say they “don’t use git” because they’re Azure DevOps users who don’t use GitHub. They tend to presume that “Git” is short for “GitHub.” It’s not.
GitHub is merely one of many vendors that make tools to facilitate using Git.
Create an ADF factory
Create a blank ADF factory and open it in ADF Studio.
Connect your ADF factory to your git repository
Toward the top of ADF Studio, to the right of the alert saying “Azure Data Factory allows you to configure a Git repository with either Azure DevOps or GitHub. Git is a version control system that allows for easier change tracking and collaboration,” click the “Set up code repository” button.
In the flyout pane at the right-hand side of the screen:
- Set Repository Type to either “Azure DevOps Git” or “GitHub.”
  - If you choose “Azure DevOps Git:”
    - Make sure that “Azure Active Directory” is set to the Azure subscription in which your Azure DevOps organization lives, then click the “Continue” button at the bottom of the flyout pane.
    - Leaving the radio options set to “Select repository,” pick the name of your Azure DevOps organization from “Azure DevOps organization name” and pick the name of your Azure DevOps project from “Project name.”
      - Troubleshooting: As of late 2023, the “Cloud” and “Cloud (cross-tenant sign-in)” radio options will be grayed out with a tooltip reading “Azure DevOps Git (Cloud) is not supported in government clouds” if your ADF resource lives in Azure Government Cloud, leaving only “Server (on-premise)” as an option. It makes sense that “Cloud” wouldn’t be an option since there is no cloud version of ADO that lives in Government Cloud (there’s just dev.azure.com, not dev.azure.gov). With any luck, Microsoft will work on making the cross-tenant option available sooner rather than later.
  - If you choose “GitHub:”
    - Make sure that “GitHub repository owner” is set to the organization name or username that owns your git repository.
    - Presuming your web browser isn’t blocking popups, when you click the “Continue” button at the bottom of the flyout pane, you should see a new browser window open titled, “Sign in to GitHub to continue to AzureDataFactory” if your web browser isn’t already signed into GitHub. Enter your username and password and click the “Sign in” button.
      - Troubleshooting: If your web browser isn’t blocking popups but you don’t get a sign-in prompt, try closing out your browser session and clearing out your cookies and cache and starting over with these steps. Alternatively, try opening a brand new incognito/private web browser session and starting over with these steps.
      - Troubleshooting: If your ADF resource lives in Azure Government Cloud, before you can see a GitHub.com sign-in popup, first you’ll have to fill out a “Government Cloud BYOA” prompt. Whereas Microsoft publishes a GitHub.com OAuth application named “AzureDataFactory” that connects commercial-cloud ADF resources to GitHub.com, you have to build your own GitHub.com OAuth application if you’re trying to connect a government-cloud ADF resource to GitHub.com. The “Government Cloud BYOA” prompt is where you tell your ADF resource details about the GitHub.com OAuth application you created. After you fill out that form, you’ll be able to proceed with logging into GitHub.com just like people using commercial-cloud ADF resources do, only perhaps instead of granting Microsoft’s AzureDataFactory GitHub.com OAuth application permission to log into GitHub.com as you, you’ll be granting permission to the OAuth application you handmade.
    - In the same popup window, you might be taken to a page titled “Authorize AzureDataFactory” if the AzureDataFactory “OAuth application” (published by Microsoft) hasn’t yet been granted access to both your GitHub username and, if applicable, the organization that owns the repository you’d like to synchronize to. Go ahead and click the “Authorize AzureDataFactory” button.
      - Note: Before clicking the button, if you entered an organization name as the “GitHub repository owner” earlier, about halfway down the popup, the name of that organization should be listed. (Do not click the little red “x” and remove that organization – you want it there.)
    - Back in the “Configure a repository” flyout pane in ADF, leave the “Select repository” radio button checked and choose the appropriate repository from the “Repository name” picklist.
      - Troubleshooting: If there are none, and if you specified an organization name earlier as the repository owner:
        - Make sure your GitHub username actually has read-write access to any repositories owned by the GitHub organization.
        - Make sure someone with “owner” rights to that GitHub organization has approved the “AzureDataFactory” OAuth application in GitHub to have full control of private repositories (https://github.com/organizations/YOUR-ORG-NAME-HERE/settings/oauth_application_policy – and maybe more directly at https://github.com/orgs/YOUR-ORG-NAME-HERE/policies/applications/815702 but I’m not sure – substituting for YOUR-ORG-NAME-HERE, of course).
        - Once you’ve ensured both of these, click the “Back” button at the bottom of the ADF flyout pane followed by the “Continue” button and see if the “Repository name” picklist now has a list of repositories in it to which you have read/write access.
- Pick “main” for Collaboration branch.
- Leave Publish branch as the default “adf_publish.”
- Leave Root folder as the default “/.”
- Leave Use custom comment checked.
- Uncheck Import existing resources to repository.
  - (Working with a legacy ADF factory instead of a freshly-created one? Leave this unchecked anyway – I’ve got some tips at the bottom of this article just for you.)
- Click the “Apply” button at the bottom of the flyout pane.
- If the flyout pane changes to be titled “Set working branch,” put the radio button to “Use existing” and pick “main” and hit the “Save” button at the bottom of the flyout pane.
Edit your factory
Working with a legacy ADF factory instead of a freshly-created one?
Create a new factory to practice these steps on while you follow along.
Refrain from doing the steps in this section against your legacy factory until you’ve read this whole article, including my tips at the bottom of the article about importing legacy factories.
Importing ends up just being a fancy variation on the process of editing. However, you’ll enjoy the experience more if importing is the very first edit you make to a legacy ADF factory after adding a new git repository.
- In ADF Studio, click the “Author” icon in the far left-navigation pane of ADF Studio (it looks like a pencil).
- Up toward the top left, click the picklist currently titled “main branch” (to the left of the “Validate all,” “Save,” and “Publish” buttons).
- From that picklist, click “New branch.”
- In the “Create a new branch” popup, enter a “Branch name” like “an-awesome-idea” and leave “Base on” set to “main branch.” Click the “Create” button.
  - (Note: if you were to open your repository in a different tab and look at the contents of the new “an-awesome-idea” branch, it’d have a “commit” history with 1 change in it, and you’d now see one file in the branch called “readme.md” with textual content reading “Initialized by Azure Data Factory!”)
- In the “Factory Resources” navigation pane toward the left, click the three dots to the right of “Pipelines” and click “New pipeline.”
- “Pipelines” should now be expanded and have a new pipeline element in it, likely named “pipeline1.” Toward the far right side of the screen, in the “Properties” pane’s “General” tab, change the “Name” to “just-testing.”
- Click “Save.”
  - You might also see a flyout pane at right asking if you’d like to make a custom comment about this change, which you can go ahead and do, and then click “OK” to actually make the “Save” button properly save your changes.
  - (Note: In the real world, before you clicked “Save,” you probably would have made the pipeline actually do something useful and validated it.)
  - (Note: if you were to open your repository in a different tab and look at the contents of the new “an-awesome-idea” branch, its “commit” history would have grown longer, and you’d now also see a folder in the branch called “pipeline” with one file in it called “just-testing.json” with the following textual contents:)
    {
        "name": "just-testing",
        "properties": {
            "annotations": []
        }
    }
- Up toward the top left, click the picklist currently titled “an-awesome-idea branch,” and from that picklist, click “Create pull request.” A new tab will open.
- Validate that the pull request is going from your “an-awesome-idea” branch (labeled “compare” in GitHub) and to your “main” branch (labeled “base” in GitHub). If you’re using GitHub, click the “Create pull request” button to proceed to a page where you can set a title, description, and reviewers. (In ADO, you’re already on such a page.)
- Give your pull request a meaningful Title and Description. If your team has other things you need to do, like manually tag certain people as important for reviewing it, do so. Then click the “Create” button (ADO) or “Create pull request” button (GitHub).
- Merge the pull request from an-awesome-idea into main and delete an-awesome-idea. Here’s how:
  - In ADO:
    - Have a colleague look over and click the “Approve” button for the pull request you just created. (Or approve it yourself if your team allows doing so.)
    - Click the “Complete” button.
    - Be sure to leave the “Delete an-awesome-idea after merging” checkbox checked.
    - Click the “Complete merge” button in the bottom right corner of the right-side flyout pane.
  - In GitHub:
    - Have a colleague formally review and sign off on the pull request you just created (or review it yourself if that’s allowed on your team). (Sorry, I can’t test exactly how this works because I’m on a free GitHub account and branch protection doesn’t work properly on free accounts.)
    - Click the “Merge pull request” button.
    - Edit the short-message and long-message for the git “commit” to main that you’re about to make, if applicable, and click the “Confirm merge” button.
    - Click the “Delete branch” button that appears to the right of the message “Pull request successfully merged and closed” (which also tells you that the an-awesome-idea branch can be safely deleted).
- Return to your ADF Studio web browser tab.
- Up toward the top left, click the picklist currently titled “an-awesome-idea branch,” and from that picklist, click “main branch.”
- Note that you now have a pipeline called “just-testing” in the “main branch” view of ADF Studio.
  - Yay!
- Up toward the top left, click the picklist currently titled “main branch” and note that there’s no longer any option titled “an-awesome-idea branch,” which is a good thing.
  - (You can click elsewhere to close the picklist now.)
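Because ADF Studio saves each pipeline as a JSON file in a “pipeline” folder, as described in the note about just-testing.json above, you can poke at a local checkout of your repository with ordinary tools. Here’s a minimal Python sketch of that idea. The list_pipeline_names function is my own invention for illustration, and the temporary folder merely stands in for a real clone of your repository.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def list_pipeline_names(repo_root: Path) -> List[str]:
    """Return the "name" of every pipeline JSON file under repo_root/pipeline."""
    pipeline_dir = repo_root / "pipeline"
    if not pipeline_dir.is_dir():
        return []
    names = []
    for json_file in sorted(pipeline_dir.glob("*.json")):
        definition = json.loads(json_file.read_text(encoding="utf-8"))
        # Fall back to the file name if the JSON has no "name" key.
        names.append(definition.get("name", json_file.stem))
    return names


# Demo: build a stand-in for a checkout containing the file from this exercise.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "pipeline").mkdir()
    (root / "pipeline" / "just-testing.json").write_text(
        json.dumps({"name": "just-testing", "properties": {"annotations": []}}),
        encoding="utf-8",
    )
    demo_names = list_pipeline_names(root)

print(demo_names)
```

Pointing list_pipeline_names at a real clone should list whatever pipelines exist in the branch you have checked out.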
If you need to make further edits to your ADF factory in a few hours, just repeat these steps, only when you create a new branch, call it something like “another-awesome-idea.”
Obviously, you’ll want to come up with a better naming standard than this for branches.
Maybe something like a datestamp, followed by a hyphen, followed by a word or two that sums up what kinds of edits you’re trying to make to the configuration of your ADF factory.
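If you’d like that naming standard to be hard to get wrong, you could generate branch names with a tiny helper. The sketch below is only an illustration: the make_branch_name function, the YYYYMMDD datestamp format, and the four-word cap are assumptions of mine, not anything ADF Studio or git requires.

```python
import re
from datetime import date
from typing import Optional


def make_branch_name(summary: str, today: Optional[date] = None, max_words: int = 4) -> str:
    """Build a branch name like '20230531-add-sales-trigger' (hypothetical convention)."""
    today = today or date.today()
    # Lowercase the summary and keep only runs of letters and digits as words.
    words = re.findall(r"[a-z0-9]+", summary.lower())
    if not words:
        raise ValueError("summary must contain at least one letter or digit")
    slug = "-".join(words[:max_words])
    return f"{today:%Y%m%d}-{slug}"


print(make_branch_name("Add sales trigger", today=date(2023, 5, 31)))
```

With the inputs shown, this prints 20230531-add-sales-trigger. Leading with the datestamp means leftover branches sort by age, which makes forgotten ones easy to spot.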
Trunk-based git branching
Creating and destroying branches off of “main” every time you want to edit your factory’s configuration in ADF Studio might feel like an inefficient way to edit the contents of “main,” but I promise it’s considered an industry best practice.
The steps you just took are known as a “trunk-based” strategy for letting multiple developers safely collaborate on editing a shared codebase.
Even outside of an ADF context, trunk-based development looks like taking the following steps:
- Create a new branch off of the git-tracked repository’s primary shared branch and give it a meaningful name related to the work being done.
- Edit the contents of this newly-created branch using your favorite code-editing tool (in this case, that’d be ADF Studio’s point-and-click builder).
- Within a few hours (or possibly days, but preferably hours), use your git repository hosting service’s “pull request” process to merge the changes you’ve made from your recently-created branch back into the primary shared branch.
- Note that your team might require colleagues to review and approve your work as part of this process.
- Immediately delete the recently-created branch from your git repository. Don’t leave it lying around as clutter.
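If your team wants a nudge to keep branches short-lived, the idea can be sketched in a few lines of Python. Treat this as an illustration only: the find_stale_branches function, the 8-hour threshold, and the hand-typed branch timestamps are all hypothetical. In real life you’d have to get branch names and timestamps from your git hosting service.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

# Branches that are allowed to live forever in this article's setup.
LONG_LIVED = {"main", "adf_publish"}


def find_stale_branches(
    branch_started_at: Dict[str, datetime],
    now: datetime,
    max_age: timedelta = timedelta(hours=8),  # assumed threshold, pick your own
) -> List[str]:
    """Return names of short-lived branches that have existed longer than max_age."""
    return sorted(
        name
        for name, started_at in branch_started_at.items()
        if name not in LONG_LIVED and now - started_at > max_age
    )


# Hypothetical data: when each branch was spun off.
now = datetime(2023, 5, 31, 17, 0, tzinfo=timezone.utc)
branches = {
    "main": now - timedelta(days=30),
    "adf_publish": now - timedelta(days=30),
    "an-awesome-idea": now - timedelta(hours=2),
    "forgotten-experiment": now - timedelta(days=3),
}
print(find_stale_branches(branches, now))
```

With this sample data, only forgotten-experiment gets flagged. The two long-lived branches are skipped no matter how old they are.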
If two developers create separate branches off of “main” on the same day, the second developer to try and merge a “pull request” back into main may find the user interface for doing so just a bit more cluttered than the first developer did. (Because the first developer’s changes are now also part of “main” and if the second developer “stepped on their toes,” they might have a bit of “conflict resolution” to fiddle with.)
However, in my experience, “trunk-based” development is still the tidiest, least cluttered way to have two or more developers work on one codebase at the same time, compared to alternatives such as:
- A free-for-all in ADF Studio without any connection to a git-tracked repository at all, hoping two people aren’t working on the same factory at the same time.
- Leaving branches not named “main” or “adf_publish” in existence for any longer than a few hours/days.
  - (Trust me, long-lived branches are just more conflict resolution to deal with in the long term.)
  - (Do your work in short, few-hour spurts separated from each other by the creation, pull-request-merging, and deletion of new branches that don’t exist very long. It’s like cleaning and drying all of your cookware and counters between each part of a complicated, multi-part recipe. You’ll be glad you did.)
So please take a leap of faith and give it a try!
For trunk-based skeptics
Still not sold?
Need to save your work on a bunch of ideas for editing pipelines that you won’t be allowed to “publish” into your live ADF Studio (and therefore want to keep out of main while you wait) for weeks?
Try this:
- Go ahead and create a new branch off of main whose name makes it obvious you’d like no one else to touch it for a few weeks.
- Play with it to your heart’s content in ADF Studio for a few weeks. Hit the “debug” button on pipelines you’ve put into it, etc. etc. etc.
- Remember that the work you’re doing in ADF Studio against this weeks-long branch is just editing a bunch of plaintext files in that branch of your git repository – files named things like /pipeline/just-testing.json.
- Realize that this means that copying your changes from your weeks-long branch into yet another branch is just a matter of moving files around.
- When it’s time to think about bringing the changes you spent weeks working on into main, don’t do a “pull request” directly from your weeks-long branch into main.
  - (Heck, none of your teammates want to have to deal with reviewing the massive file difference between your weeks-long branch, which was spun off of “main” weeks ago, and “main” as it exists today. They’ll thank you for what you’re about to do.)
- Instead, create a brand new branch off of main with your usual naming standard for branches that are doomed to die within a few hours, and then do a “pull request” from your weeks-long branch to your freshly-created hours-long branch.
  - The conflict-resolution process will be every bit as complex as if you were merging into main, because you’re comparing changes you made to a weeks-old copy of main against a fresh copy of main.
  - However, the consequences of making a mistake during the conflict-resolution process will be much smaller.
    - If you make a total mess of your brand-new hours-long branch, just delete it and try again.
    - Tip: In this case, don’t let the “pull request” process auto-delete your weeks-long branch when you finish merging the pull request into your hours-long branch. After all, you might need to delete the hours-long branch and try again. You don’t want to have accidentally deleted both the weeks-long and the hours-long branches when your confidence is low. You could lose weeks of hard work!
      - (Although note that most git repository hosting services have ways to restore a deleted branch into existence, so you probably wouldn’t have lost your weeks-long work forever. You’d just have to do more steps to get it back.)
- When you’re happy with the way your brand-new hours-long branch looks (back in ADF Studio, set the editor to be on your hours-long branch, look around, debug pipelines, etc.), congratulate yourself.
  - You now have an “hours-long” branch whose contents look exactly as if you had clicked weeks’ worth of ADF Studio buttons in the span of a few minutes.
  - You can now safely manually delete your “weeks-long” branch.
- From here, just work as usual off of your “hours-long” branch, pretending you were simply an extremely productive ADF Studio clicker in the last few minutes, and pull-request-merge it into “main” the way you would with work that only actually takes a few hours to complete.
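To see why “moving files around” is all that’s really going on, here’s a rough Python sketch that compares two snapshots of a repository. It’s illustrative only: diff_snapshots and the dictionaries mapping file paths to file contents are hypothetical stand-ins for a fresh copy of main and your weeks-long branch, and the file contents are made up.

```python
from typing import Dict, List


def diff_snapshots(base: Dict[str, str], incoming: Dict[str, str]) -> Dict[str, List[str]]:
    """Classify files as added, modified, or removed going from base to incoming."""
    return {
        "added": sorted(path for path in incoming if path not in base),
        "modified": sorted(
            path for path in incoming if path in base and incoming[path] != base[path]
        ),
        "removed": sorted(path for path in base if path not in incoming),
    }


# Hypothetical file contents; real ones would be full ADF JSON definitions.
fresh_main = {
    "pipeline/just-testing.json": '{"name": "just-testing"}',
    "pipeline/nightly-load.json": '{"name": "nightly-load", "v": 1}',
}
weeks_long = {
    "pipeline/just-testing.json": '{"name": "just-testing"}',
    "pipeline/nightly-load.json": '{"name": "nightly-load", "v": 2}',
    "pipeline/big-idea.json": '{"name": "big-idea"}',
}

changes = diff_snapshots(fresh_main, weeks_long)
print(changes)
```

A plain two-way comparison like this can’t tell whether a difference came from your weeks of work or from changes your teammates merged into main in the meantime. Making that call is exactly what conflict resolution asks of you, which is why it’s nice to do it in a throwaway hours-long branch.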
That said, please do try breaking up your work into bite-sized pieces that are safe to put into main (and risk a colleague clicking the “Publish” button against) every few hours.
For example, maybe you release a new pipeline into main and publish it live in one short burst of work with its own hours-long branch.
And then a few weeks later, you edit that pipeline and add a trigger to it in another short burst of work with its own hours-long branch.
Think of the way Subway or Chipotle chop all of their lettuce and store it into Cambro containers before the store even opens for business.
No one is eating the lettuce until a customer walks in the door and asks, “Lettuce, please.”
Can you change the way you think about work that needs to be done in ADF Studio? Can you break up your to-do list into “prep” and “go-live” steps?
If so, you can stick with a purer form of “trunk-based” branching and I promise you’ll find conflict resolution to be pretty hassle-free.
(P.S. There’s a similar concept in programming called “feature flagging” that’s more analogous to the cashier refusing to hand customers their burrito bowl until they’ve paid, but that’s a more complex concept than a simple reminder to “try thinking of your work in ‘prep’ vs. ‘do’ steps.”)
Publish branch
- Q: Say, Katie, why didn’t we talk about the “adf_publish” branch?
- A: Don’t worry about it. We want it to exist, but we can ignore what’s going on inside of it for now.
Every time you click the “Publish” button in ADF Studio, ADF Studio does two things behind the scenes:
- ADF Studio goes operationally live with whatever configuration you’re currently looking at in the “main branch” of your ADF Studio interface.
  - You’ve always had to click “Publish” to get things to “go live” and become the configuration that runs when you hit “Trigger now” on a pipeline, when you let a pipeline run on a schedule, etc.
  - This hasn’t changed by adding git synchronization to your factory.
    - (That is, there’s nothing special about your factory’s configurations being merely saved into “main.” Things in ADF Studio’s “main branch” are as much in draft-mode, until you hit the Publish button, as things in ADF Studio’s older non-git-backed “live mode” always were.)
- ADF Studio edits the contents of the “adf_publish” branch of your Git repository to contain a backup copy of the most-recently-published state of your factory.
If you later decide to set up auto-synchronization between ADF factories (such as from a nonproduction factory into its equivalent production factory), you’ll end up learning to use Azure DevOps or GitHub to create “CI/CD pipelines” that auto-execute every time ADF Studio auto-edits “adf_publish.”
For example, you write a “pipeline” whose job it is to overwrite the configuration of some other factory (e.g. a “production” one) and make it look exactly like the configuration of the factory (e.g. a “nonproduction” one) that’s connected to your git-tracked repository (the repository whose “adf_publish” branch’s contents just got auto-updated).
Don’t worry about how powerful this level of inter-factory automation sounds – when you write “pipelines,” you can insert “wait for a human to approve that it’s a good idea to let this pipeline run in full” steps into them. That is, you’ll be able to write a pipeline that doesn’t try to auto-overwrite the production factory (each time the nonproduction factory’s “publish” button gets clicked) without human approval, if that makes you happy.
But as I said, don’t worry about what’s going on behind the scenes in “adf_publish” for now.
Just let ADF Studio do whatever it wants in “adf_publish” and ignore it until you need to pay attention to it.
Import your legacy factory
If you need to add git tracking to a legacy “live mode” ADF factory containing lots of published resources:
- Over in ADO or GitHub, spin off one more branch from “main.” You can call it something like “initial-import.”
- In ADF Studio, follow all of the “Connect your ADF factory to your git repository” steps above, including the one about unchecking Import existing resources to repository.
- Click the “Manage” icon in the far left-navigation pane of ADF Studio (it looks like a toolbox with a wrench in it).
- Click the “Git configuration” option from the near-left-navigation pane.
- Click the “Import resources” button at the top of the main pane of the screen.
- In the “Import existing resources” popup, set “Select branch” to “initial-import” (not main!) and click the “Import” button.
- Click the “Author” icon in the far left-navigation pane of ADF Studio (it looks like a pencil).
- Up toward the top left, make sure the picklist is set to “initial-import branch.”
- Look around the “Factory Resources” pane and make sure everything’s there.
  - Troubleshooting: You might have to hard-reload your browser tab, switch to viewing the “main branch” and then go back to viewing the “initial-import branch,” etc. It seems ADF Studio needs a little prodding to go look at the “initial-import” branch of your git repository after an import.
- Up toward the top left, click the picklist currently titled “initial-import branch,” and from that picklist, click “Create pull request.”
- Merge the contents of “initial-import” into “main” within your git repository. Do the “pull request” just like you would for the day-to-day editing process I described earlier in this article.
  - Be sure to let the pull request completion process delete your “initial-import” branch – you don’t need it anymore.
- Up toward the top left, click the picklist currently titled “initial-import branch,” and from that picklist, click “main branch.”
- Note that your previous work is now visible in the “main branch” view of ADF Studio.
  - Yay!
- You might want to hit “Publish” for good measure, but I don’t think you should need to.
  - (Warning: I’m not actually sure if “Import resources” imports the last-saved work from Live Mode or the last-published work from Live Mode. Double-check that there isn’t anything sitting around in the “main branch” view of ADF Studio that you don’t want published before clicking “Publish.”)