best kept secrets: What can source control do for you? (Part 2)

In Part 1, I covered the basic mechanics of git and listed a few ways in which source control can be helpful. I mentioned easy recovery of lost files and previous versions, tracing the source of errors while debugging, and keeping a record of your accomplishments. In this post I'll further explore the reasons you should use source control, then I'll briefly go over git branching and sharing code using GitHub.

Is source control really worth the effort?

If you've spent much time in a research environment, you have probably discovered the benefits of keeping a good lab notebook. Lab notebooks remind you of what you did and what you plan to do, and they also keep track of your justification for those decisions. They help you remember important details both during data collection and when it comes time to publish. It takes just a little extra time every day to keep your lab notebook thorough and up to date, and most of the time you don't really need it to remember what you did. But once in a while, when you really need to remember something important, that notebook is priceless.

Source control is like a lab notebook, but for code. Initially it may seem like an annoying time sink to keep track of all your changes, but if you make source control a consistent part of your workflow it will eventually prove its worth many times over.

However, despite these many apparent benefits, most researchers do not use source control. They may see it as a tool to use only occasionally when they have to, or as something that would be nice if they had time. Or, even worse, they are convinced that they don't really need source control and that their current methods work just fine.

If these viewpoints seem reasonable to you, consider this: you're probably already using source control, but poorly! For example... Do you have lots of old commented code hanging around in your files "just in case"? Do you have multiple versions of many files (e.g. script.txt, script2.txt, script_OLD.txt, and final_script.txt) all in the same directory? Do you use the "Date Modified" metadata to check if you've recently made any changes to a file? These haphazard methods of source control are equivalent to using dozens of scribbled notes on pieces of paper scattered around your desk as a lab notebook. It's crazy, it's horribly inefficient, and it's asking for trouble! Well-designed tools for source control (like git) are like structured lab notebooks; they are worth your time and effort.

Exploring variations with branching and merging

OK, back to learning git. In addition to the basic function of tracking changes, git also has a more advanced feature called branching. This works exactly as it sounds – it makes a duplicate branch at a particular point in history, and subsequent changes to both branches cause them to diverge. The command to create a new branch is simply:

$ git branch NEW_BRANCH_NAME

To switch branches, use the command:

$ git checkout BRANCH_NAME

(Note that git will refuse to switch branches if you have uncommitted changes that the switch would overwrite, so commit or stash your work first.) This seems pretty straightforward, right? In fact, we might envision the branch structure to be something like a phylogenetic tree.
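To make this concrete, here is a minimal sketch you can run in a terminal. It builds a throwaway repository in a temporary directory, so nothing real is touched. The branch names, file name, file contents, and the name and email are all made up for illustration.

```shell
#!/bin/sh
set -e

# Work in a throwaway repository so nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Researcher"       # placeholder identity
git config user.email "researcher@example.com"  # placeholder identity
git checkout -q -b current_experiment           # hypothetical branch name

# One commit on the starting branch.
echo "stimulus_duration = 500" > experiment.txt
git add experiment.txt
git commit -q -m "Add experiment settings"

# Create a new branch at this point in history, then switch to it.
git branch new_experiment
git checkout -q new_experiment

# A commit made here exists only on new_experiment.
echo "n_blocks = 8" >> experiment.txt
git commit -q -am "Add block count for the new variation"

# Switching branches swaps the file contents to match each branch.
git checkout -q current_experiment
lines_on_current=$(wc -l < experiment.txt | tr -d ' ')
git checkout -q new_experiment
lines_on_new=$(wc -l < experiment.txt | tr -d ' ')
echo "current_experiment: $lines_on_current line(s), new_experiment: $lines_on_new line(s)"
```

The file has a different number of lines depending on which branch is checked out, which is the divergence described above.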

However, there is also a way to recombine branches, called merging. Merging finds the common ancestor of two branches and combines the changes made on each branch since that point, resulting in a final merged branch that shares all the traits of both of its parents.

To merge two branches, first check out the branch that you want to contain the merged changes (you are merging into this branch), then merge:

$ git checkout MERGE_BRANCH_NAME

$ git merge OTHER_BRANCH_NAME

Now OTHER_BRANCH_NAME has remained unchanged, but MERGE_BRANCH_NAME contains all the combined edits from both branches. If there are any conflicts (i.e., the exact same code was edited differently in the two branches), you will instead see a merge conflict error, along with a list of files containing conflicts. In each file, git marks the conflicting region with marker lines (starting with <<<<<<<, =======, and >>>>>>>) that separate each branch's version of the code. You may resolve the merge conflict by deleting the version you do not want to keep (along with the marker lines, of course!), saving the file, and then adding and committing your changes.
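Here is a small sketch of what a conflict looks like in practice. It deliberately edits the same line differently on two branches in a throwaway repository, triggers the conflict, and then resolves it. The branch names, file name, values, and identity are hypothetical, and which version to keep is an arbitrary choice for the example.

```shell
#!/bin/sh
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Researcher"       # placeholder identity
git config user.email "researcher@example.com"  # placeholder identity
git checkout -q -b merge_branch                 # hypothetical branch name

echo "stimulus_duration = 500" > settings.txt
git add settings.txt
git commit -q -m "Add settings"

# Edit the same line differently on each branch.
git branch other_branch
echo "stimulus_duration = 750" > settings.txt
git commit -q -am "Lengthen stimulus"

git checkout -q other_branch
echo "stimulus_duration = 250" > settings.txt
git commit -q -am "Shorten stimulus"

# Merge other_branch into merge_branch.
# git merge exits with a non-zero status when it hits a conflict.
git checkout -q merge_branch
if git merge other_branch > /dev/null 2>&1; then
    echo "merged cleanly"
else
    echo "conflict in: $(git diff --name-only --diff-filter=U)"
    cat settings.txt   # both versions, separated by the marker lines
fi

# Resolve by writing the version to keep (no marker lines), then add and commit.
echo "stimulus_duration = 750" > settings.txt
git add settings.txt
git commit -q -m "Merge other_branch, keeping the longer stimulus"
```

In real work you would open the conflicted file in an editor instead of overwriting it with echo, but the add-and-commit step at the end is the same.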

So really, the code doesn't look like a typical phylogenetic tree at all. There can be plenty of transfer back and forth between branches, and the result probably looks a lot more like primordial soup instead. If your code looked like this, I bet you'd be pretty happy to have git automatically keeping track of it for you.

How to use branching and merging in research

Why are branching and merging useful? Consider this example: let's say you have one branch of experiment code that works and which you're currently using in an experiment. But say you also have a new variation of this experiment that you plan to run in the future. You can make a new branch from your current experiment code and start making lots of big changes for the new experiment. It's easy to switch between branches, so you can use both variations on the same computer as needed.

But now let's imagine that while running your current experiment you find a bug and fix it. And while you're at it, maybe you make some other subtle changes like saving an extra variable on each trial or changing the stimulus duration. If you were manually keeping track of these changes, you would have to immediately make the corresponding edits in your new branch. You might forget, or make a mistake, or you might even have to make the edits multiple times if you have multiple branch variations. This method is tedious and error-prone. With merging you can just merge your current experiment branch into the new branch, and git will automatically take care of the changes and let you know if there are any conflicts.
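The scenario above might look something like the following sketch. Again it runs in a throwaway repository, and the branch names, file names, settings, and identity are invented for the example. The fix and the new variation touch different files here, so the merge goes through without conflicts.

```shell
#!/bin/sh
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Researcher"       # placeholder identity
git config user.email "researcher@example.com"  # placeholder identity
git checkout -q -b current_experiment           # hypothetical branch name

# The experiment that works and is currently in use.
printf 'stimulus_duration = 500\nsave_rt = false\n' > settings.txt
echo "run trials" > task.txt
git add settings.txt task.txt
git commit -q -m "Working experiment"

# Start the new variation and make a big change there.
git checkout -q -b new_experiment
echo "run trials with a second condition" > task.txt
git commit -q -am "Add second condition"

# Meanwhile, fix something on the branch that is in use.
git checkout -q current_experiment
printf 'stimulus_duration = 500\nsave_rt = true\n' > settings.txt
git commit -q -am "Save reaction time on each trial"

# Bring the fix into the new variation with one merge.
git checkout -q new_experiment
git merge -q -m "Merge fixes from current_experiment" current_experiment
cat settings.txt task.txt
```

After the merge, new_experiment has both the second condition and the fix, while current_experiment is untouched and still safe to run.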

Collaborating across computers and people

So far we've only covered using git on your local computer, but it's also possible to share code via an external code repository on GitHub. GitHub has a list of resources and tutorials for how to use its site, so I won't get too much into the details here. The main difference is the push and pull commands. Pushing means sending your committed changes to a central version of the code on GitHub, and pulling downloads changes that someone else has committed and pushed. This is useful for two or more people who are writing and editing shared code simultaneously, and it's also useful for a single researcher who makes edits to experimental code from multiple computers.
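If you want to try push and pull without creating an account, here is a rough sketch that fakes the whole setup on one machine. A bare repository in a temporary directory stands in for the central copy that would normally live on GitHub, and two ordinary directories stand in for two computers. All directory names, the branch name, the file contents, and the identity are assumptions for the example; with a real GitHub repository, the remote would be the URL that GitHub gives you.

```shell
#!/bin/sh
set -e

work=$(mktemp -d)
cd "$work"

# A bare repository plays the role of the central copy on GitHub.
git init -q --bare shared.git

# "Lab computer": create a repository, commit, and push.
git init -q lab_computer
cd lab_computer
git config user.name "Example Researcher"       # placeholder identity
git config user.email "researcher@example.com"  # placeholder identity
git checkout -q -b main                         # hypothetical branch name
echo "stimulus_duration = 500" > settings.txt
git add settings.txt
git commit -q -m "Add settings"
git remote add origin "$work/shared.git"
git push -q -u origin main
cd "$work"

# "Laptop": clone the shared copy, commit a change, and push it back.
git clone -q -b main "$work/shared.git" laptop
cd laptop
git config user.name "Example Researcher"
git config user.email "researcher@example.com"
echo "n_blocks = 8" >> settings.txt
git commit -q -am "Add block count"
git push -q origin main
cd "$work"

# Back on the "lab computer": pull to pick up the laptop's change.
cd lab_computer
git pull -q --ff-only origin main
cat settings.txt
```

The same three steps (commit, push, pull) apply whether the other copy belongs to a collaborator or to you on a second computer.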

To recap – source control has lots of uses! It helps individual researchers keep track of what they've done and troubleshoot errors and bugs, it makes experimenting with variations far less painful and error-prone, and it also allows researchers to merge changes across computers and easily collaborate with others. What can source control do for you? As it turns out, quite a bit!

Wednesday, July 30, 2014
