best kept secrets: best programming practices


Monday, August 11, 2014

How to write multi-purpose code

I often make the claim that writing good clean code saves you time, and one big source of savings is the ability to reuse code. But how is this different from reusing messy, poorly written code? How can you write your code in such a way that it becomes easy to repurpose?

Writing multi-purpose code requires thoughtfulness. Typically when you first start writing a script, you are just focused on getting the computer to produce the desired output (like a stimulus on the screen or a chart in an analysis). You fiddle with the code until it does what you want, and then you stop. But wait! At this point it's likely that the code only does what you want, and getting it to do anything else would be a serious challenge. Instead of just focusing on the output, consider going a little further and incorporating some of the following seven principles to help make your code more reusable in the future.

1) Break up the problem

When writing code for an experiment, often there are many steps that need to be completed to produce the final output. Segregating the code into each of these component steps will help you figure out which steps are unique to this experiment and which steps are more generally useful. If you segregate the code into its component parts (using functions or subscripts, for example), it will be easier to pull out sections of code for modularizing or repurposing later.
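As a rough sketch of what this can look like in Python, the example below splits a made-up analysis into small functions. The function names, the trial format, and the 0.2 second cutoff are all hypothetical, chosen only for illustration.

```python
def load_reaction_times(trials):
    """General step: pull the reaction time (in seconds) out of each trial record."""
    return [trial["reaction_time"] for trial in trials]


def remove_fast_guesses(reaction_times, minimum_time):
    """Experiment-specific step: drop responses faster than the cutoff."""
    return [rt for rt in reaction_times if rt >= minimum_time]


def mean(values):
    """General step: reusable in almost any analysis."""
    assert len(values) > 0, "cannot take the mean of an empty list"
    return sum(values) / float(len(values))


def analyze(trials, minimum_time):
    """The script itself becomes a short list of named steps."""
    reaction_times = load_reaction_times(trials)
    kept = remove_fast_guesses(reaction_times, minimum_time)
    return mean(kept)


# Made-up data for this example.
example_trials = [
    {"reaction_time": 0.10},
    {"reaction_time": 0.50},
    {"reaction_time": 0.70},
]
mean_reaction_time = analyze(example_trials, minimum_time=0.2)
```

Once the steps are separated like this, it's easier to see that something like mean() could move into a shared file, while remove_fast_guesses() probably belongs to this one experiment.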

2) Use clear variable names

This may seem obvious, but using descriptive variable names is an important part of writing code that is easy to understand and reuse. There is a strong tendency when writing code to want to use short, simple variable names like 'a' for 'aperture'. This can be ok if the variable is within small, self-contained bits of code like functions. In research, however, it's often the case that variables get called in multiple disparate contexts throughout a very long script and it's also common for researchers to return to their code after months of absence. In those cases it's important to use longer variable names that clearly communicate what they represent.
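To illustrate, here is the same calculation written twice in Python. The quantities (an aperture width and a viewing distance) and their values are made up for this example.

```python
import math

# Short names: quick to type, hard to decode months later.
a = 2.0
d = 57.0
v = 2 * math.degrees(math.atan(a / (2 * d)))

# Descriptive names: the same calculation, but it explains itself.
aperture_width_cm = 2.0
viewing_distance_cm = 57.0
aperture_visual_angle_deg = 2 * math.degrees(
    math.atan(aperture_width_cm / (2 * viewing_distance_cm))
)
```

Both versions compute the same number, but only the second one tells you what the number means and what units it is in.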

3) Avoid "magic numbers"

Magic numbers are values that are hard-coded into the way the code works. Unlike variables, which are clearly defined and easily changed, magic numbers are baked into the structure of the code. For example, a researcher might code the number of loops to be '8' (e.g., for i = 1:8) because they "know" that there will always be 8 items to iterate through. Or they might perform a slicing operation by selecting specific items (e.g., list[251:252]) because they "know" they will always want those particular items out of that list. But if you repurpose the code these assumptions may change, and it can be painfully difficult to find and change all the magic numbers in the code later on. It's often very simple to make variables for the values in question, and if you also clearly define the variables with descriptive names it's easier to check your work and modify the code later on.
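Here is a small Python sketch of the idea. The stimulus names and the practice-trial variables are hypothetical and only there to make the example concrete.

```python
stimulus_names = ["face", "house", "chair", "shoe",
                  "cat", "bottle", "scissors", "scramble"]

# Magic number version (avoid): goes wrong if a stimulus is added or removed.
#     for stimulus_index in range(8): ...

# Derive the count from the data so the loop always matches the list.
number_of_stimuli = len(stimulus_names)
presented = []
for stimulus_index in range(number_of_stimuli):
    presented.append(stimulus_names[stimulus_index])

# Name the values behind a slice instead of hard-coding them.
first_practice_trial = 0
number_of_practice_trials = 2
practice_stimuli = stimulus_names[
    first_practice_trial:first_practice_trial + number_of_practice_trials
]
```

If the number of practice trials changes for the next experiment, there is now exactly one line to edit, and its name says what it controls.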

4) Use asserts liberally

Speaking of assumptions, I often find that scientists make a lot of assumptions about the input for their experimental code (probably because they are usually the only one using it). For example, a researcher might assume that a particular input value is always divisible by two, or always greater than zero, or less than the maximum size of the list. Usually these sorts of assumptions are not even acknowledged, or when they are acknowledged it is through the use of comments. Unfortunately it is a sad fact of programming that comments will almost always become outdated at some point, and nobody ever reads them anyhow. The point is: when in doubt, use an assert.

Here is an example in Python (and Matlab's syntax is very similar):

a = 25

b = 50

assert b/a > 1, "b must be greater than a"

In this example we want to assume that 'b/a' is greater than one (for some other purpose later on in the code) so we write an assert that explicitly states the requirement that needs to be satisfied. If 'b' is greater than 'a' then all goes well and the program continues. However, it's possible that we've changed 'b' and didn't realize it, perhaps because we re-wrote the code for a new experiment. Now 'b' is no longer greater than 'a', so the program stops at the assert with an AssertionError and the message 'b must be greater than a'. Without the assert it's conceivable that we wouldn't have noticed our error and we could have gotten some weird output value as a result. Depending on how sure we are of what the output value should be, we may or may not catch this error. Even if we did catch the error (for example if it caused a fatal problem later on in the code), the assert allows us to pinpoint the location of the error for faster debugging. This explicit method of programming makes reusing your code safer and easier.
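The same idea works for the input assumptions mentioned earlier (divisible by two, greater than zero, smaller than the size of the list). The sketch below is one way to write them down in Python; the function select_trial_pair and its arguments are hypothetical.

```python
def select_trial_pair(trials, pair_index):
    """Return the two trials that make up pair number `pair_index`."""
    assert len(trials) > 0, "trials must not be empty"
    assert len(trials) % 2 == 0, "number of trials must be divisible by two"
    assert 0 <= pair_index < len(trials) // 2, "pair_index is out of range"
    start = 2 * pair_index
    return trials[start:start + 2]


second_pair = select_trial_pair(["a1", "a2", "b1", "b2"], 1)

# A violated assumption stops the program at the assert, with a clear message.
try:
    select_trial_pair(["a1", "a2", "b1"], 0)
    error_message = None
except AssertionError as error:
    error_message = str(error)
```

One caveat: Python skips assert statements when it is run with the -O flag, so asserts are best for catching your own mistaken assumptions rather than for checks that must always run.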

5) Don't duplicate code

One major impediment to reusing code is the time it takes to edit the code for its new purpose. To minimize the cost of reusing your code later, try to avoid duplicating code. If you find yourself wanting to repeat several lines of code with minor changes, consider making those changes into variables (e.g., a list of values) then iterating through the values using a loop. On the other hand, if you find you are copying blocks of code from one file to another with minor changes, consider turning the code into a function that you can call in both files.
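A minimal Python sketch of both suggestions, using made-up contrast levels and a hypothetical describe_condition function:

```python
def describe_condition(contrast, duration_ms):
    """One function replaces several near-identical copies of the same lines."""
    return "contrast=%.2f, duration=%d ms" % (contrast, duration_ms)


# The values that used to differ between the copied lines now live in one list.
contrast_levels = [0.25, 0.50, 1.00]
stimulus_duration_ms = 200

descriptions = []
for contrast in contrast_levels:
    descriptions.append(describe_condition(contrast, stimulus_duration_ms))
```

Adding a fourth contrast level now means adding one value to the list instead of copying and editing another block of code.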

6) Code defensively

If you really want to write multi-use code, it's important to make your code as robust as possible to different inputs. You can think of this like "defensive coding", where you try to anticipate how your code might break and address those situations ahead of time. You have to consider edge cases such as dividing by zero, inputs that are the wrong size or shape, or inputs which call values that are out of bounds. This is a skill that requires practice, creativity, and experience with actually breaking code. For example, it's a classic mistake to define variables using different units (e.g. centimeters and meters) and then either make a mistake in the conversion process or even forget to make the conversion at all. To anticipate this type of error, you might decide to define all your variables using only meters, or perhaps even write (and test!) a short function to do the conversion for you.
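For instance, a short conversion function with a couple of quick tests might look like the sketch below. The function names and the example values are hypothetical; the point is only that the conversion lives in one tested place and the divide-by-zero case is handled up front.

```python
CENTIMETERS_PER_METER = 100.0


def centimeters_to_meters(length_cm):
    """Convert a length from centimeters to meters."""
    assert length_cm >= 0, "length must not be negative"
    return length_cm / CENTIMETERS_PER_METER


def speed_in_meters_per_second(distance_m, duration_s):
    """Guard against the divide-by-zero edge case before it happens."""
    if duration_s <= 0:
        raise ValueError("duration_s must be greater than zero")
    return distance_m / duration_s


# Quick tests of the conversion, including an edge case.
assert centimeters_to_meters(0) == 0.0
assert centimeters_to_meters(250) == 2.5

screen_distance_m = centimeters_to_meters(57)
speed = speed_in_meters_per_second(centimeters_to_meters(300), 2.0)
```

Putting the unit in each variable name (length_cm, distance_m) is a cheap extra defense against mixing units by accident.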

7) Clean up the code after the deadline

Things tend to get a bit hairy around deadlines, and code quality often suffers. If you think there's any chance you might reuse your messy last-minute code, clean it up immediately after the deadline. It's much faster and simpler to do this sooner rather than later so you don't forget how the code works.

Learning how to incorporate these principles in your code may take extra time and effort at first, but with practice you can develop good habits that will continue to pay off in the future. Breaking up the problem, avoiding "magic numbers", and using clear variable names will save you time if you refactor or reuse your code later. Even if you don't reuse the code, asserts, minimal code duplication, and defensive coding will reduce your chance of errors now.

Wednesday, July 30, 2014

What can source control do for you? (Part 2)

In Part 1, I covered the basic mechanics of git and listed a few ways in which source control can be helpful. I mentioned easy recovery of lost files and previous versions, tracing the source of errors while debugging, and keeping a record of your accomplishments. In this post I'll further explore the reasons you should use source control, then I'll briefly go over git branching and sharing code using github.

Is source control really worth the effort?

If you've spent much time in a research environment, you have probably discovered the benefits of keeping a good lab notebook. Lab notebooks remind you of what you did and what you plan to do, and they also keep track of your justification for those decisions. They help you remember important details both during data collection and when it comes time to publish. It takes just a little extra time every day to keep your lab notebook thorough and up to date, and most of the time you don't really need it to remember what you did. But once in a while, when you really need to remember something important, that notebook is priceless.

Source control is like a lab notebook, but for code. Initially it may seem like an annoying time sink to keep track of all your changes, but if you make source control a consistent part of your workflow it will eventually prove its worth many times over.

However, despite these many apparent benefits, most researchers do not use source control. They may see it as a tool to use only occasionally when they have to, or as something that would be nice if they had time. Or, even worse, they are convinced that they don't really need source control and that their current methods work just fine.

If these viewpoints seem reasonable to you, consider this: you're probably already using source control, but poorly! For example... Do you have lots of old commented code hanging around in your files "just in case"? Do you have multiple versions of many files (e.g. script.txt, script2.txt, script_OLD.txt, and final_script.txt) all in the same directory? Do you use the "Date Modified" metadata to check if you've recently made any changes to a file? These haphazard methods of source control are equivalent to using dozens of scribbled notes on pieces of paper scattered around your desk as a lab notebook. It's crazy, it's horribly inefficient, and it's asking for trouble! Well-designed tools for source control (like git) are like structured lab notebooks; they are worth your time and effort.

Exploring variations with branching and merging

OK, back to learning git. In addition to the basic function of tracking changes, git also has a more advanced feature called branching. This works exactly as it sounds – it makes a duplicate branch at a particular point in history, and subsequent changes to both branches cause them to diverge. The command to create a new branch is simply:

$ git branch NEW_BRANCH_NAME

To switch branches, use the command:

$ git checkout BRANCH_NAME

(Note that git will refuse to switch branches if doing so would overwrite changes you haven't committed, so it's safest to commit or stash your changes first.) This seems pretty straightforward, right? In fact, we might envision the branch structure to be something like a phylogenetic tree.

However, there is also a way to recombine branches, called merging. Merging will start at the common ancestor of two branches and apply all the changes that were made for both branches in order, resulting in a final merged branch that shares all the traits of both of its parents.

To merge two branches, first checkout the branch that you want to contain the merged changes (you are merging onto this branch), then merge:

$ git checkout MERGE_BRANCH_NAME

$ git merge OTHER_BRANCH_NAME

Now OTHER_BRANCH_NAME has remained unchanged, but MERGE_BRANCH_NAME contains all the combined edits from both branches. If there are any conflicts (i.e., the exact same code was edited differently in the two branches), you will instead see a merge conflict error, along with a list of files containing conflicts. In each file, the conflict will be marked with lines of <<<<<<<, =======, and >>>>>>> that separate the version of the code from each branch. You may resolve the merge conflict by deleting the version you do not want to keep (along with the marker lines, of course!), saving the file, adding it with git add, and committing your changes.
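To get a feel for what a merge decides, here is a toy Python model that merges three dictionaries of experiment settings instead of files. This is only a sketch of the idea of comparing both branches against their common ancestor; it is not git's actual algorithm, and the function merge_settings and the setting names are hypothetical.

```python
def merge_settings(ancestor, ours, theirs):
    """Toy three-way merge of settings dictionaries (not git's real algorithm).

    A missing key is treated as the value None in this simplified version.
    """
    merged = {}
    conflicts = []
    for key in sorted(set(ancestor) | set(ours) | set(theirs)):
        base = ancestor.get(key)
        our_value = ours.get(key)
        their_value = theirs.get(key)
        if our_value == their_value:
            value = our_value          # both branches agree
        elif our_value == base:
            value = their_value        # only the other branch changed it
        elif their_value == base:
            value = our_value          # only our branch changed it
        else:
            conflicts.append(key)      # both changed it differently
            continue
        if value is not None:
            merged[key] = value
    return merged, conflicts


ancestor = {"duration_ms": 200, "n_trials": 100}
ours = {"duration_ms": 250, "n_trials": 100}
theirs = {"duration_ms": 200, "n_trials": 120, "save_rt": True}

merged, conflicts = merge_settings(ancestor, ours, theirs)
```

Here each branch changed a different setting, so both changes survive and there are no conflicts. If both branches had changed duration_ms to different values, that key would land in the conflicts list, which is the situation git asks you to resolve by hand.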

So really, the code doesn't look like a typical phylogenetic tree at all. There can be plenty of transfer back and forth between branches, and the result probably looks a lot more like primordial soup instead. If your code looked like this, I bet you'd be pretty happy to have git automatically keeping track of it for you.

How to use branching and merging in research

Why are branching and merging useful? Consider this example: let's say you have one branch of experiment code that works and which you're currently using in an experiment. But say you also have a new variation of this experiment that you plan to run in the future. You can make a new branch from your current experiment code and start making lots of big changes for the new experiment. It's easy to switch between branches, so you can use both variations on the same computer as needed.

But now let's imagine that while running your current experiment you find a bug and fix it. And while you're at it, maybe you make some other subtle changes like saving an extra variable on each trial or changing the stimulus duration. If you were manually keeping track of these changes, you would have to immediately make the corresponding edits in your new branch. You might forget, or make a mistake, or you might even have to make the edits multiple times if you have multiple branch variations. This method is tedious and error-prone. With merging you can just merge your current working branch onto your experimental branch, and git will automatically take care of the changes and let you know if there are any conflicts.

Collaborating across computers and people

So far we've only covered using git on your local computer, but it's also possible to share code via an external code repository on github. Github has a list of resources and tutorials for how to use its site, so I won't get too much into the details here. The main difference is the push and pull commands. Pushing means sending your committed changes to a central version of the code on github, and pulling downloads changes that someone else has committed and pushed. This is useful for two or more people who are writing and editing shared code simultaneously, and it's also useful for a single researcher who makes edits to experimental code from multiple computers.

To recap – source control has lots of uses! It helps individual researchers keep track of what they've done and troubleshoot errors and bugs, it makes experimentation less painful and less error-prone, and it also allows researchers to merge changes across computers and easily collaborate with others. What can source control do for you? As it turns out, quite a bit!

Monday, July 21, 2014

What can source control do for you? (Part 1)

Programmers often have strong feelings about source control, so I'd like to preface this post with a disclaimer: the entirety of my programming experience has been in an academic research setting, and my opinions and examples in this series on best programming practices are likely to reflect that experience. I hope that nonetheless someone can benefit from what I have learned.

What is source control? And more importantly, why should you care?

Source control (also known as revision control, or version control) is a way to keep track of files: what changes were made, who made them, and when they were made. But source control can be much more than that – it can also provide a systematic way for you to experiment with files and share them with others. In this post I'll go over the basics of using one type of source control (git) and talk a little about how researchers in particular can benefit from source control.

Git, SVN, and Mercurial are three programs which are commonly used for source control in the tech world right now. I'm going to restrict my post to git because it's what I know and use. To me, the best thing about git is that it has a really shallow learning curve at the beginning, but it's also capable of some very complicated maneuvers if you take the time to learn more. Not only is it free and open source, it's also a useful thing to learn since many companies and professional programmers use it. I like git a lot, and I hope you will, too!

Installation

Anyway, speaking of that shallow learning curve, git is mercifully easy to install. If you are using a package manager, you can install git using a command such as brew install git for Mac, or sudo apt-get install git for Linux. (If the words "package manager" and/or "command line" are unfamiliar to you, you'll probably want to check out my earlier post on how to learn programming from scratch.) If you want something even easier (and probably more up to date, actually), you can download a package here that will install git for you automagically.

As an aside, there is also a GUI (Graphical User Interface, or basically an "App") available through github. As with many other things programming related, there are pros and cons to both the CLI (Command Line Interface) and the GUI. Which one you decide to use is really up to you, dependent on your background and needs, with one caveat: there is more support for the CLI, since that's what most people use. If at some point you search the internet for help with git, you'll probably find that there are a lot more people who can help you find the right command than can help you find the right menu button. I haven't used the GUI very much myself, but based on my limited experience I would suggest using the command line interface instead if you can because it's a useful thing to learn.

After you install git, take a few minutes to customize your git environment. If you haven't already, follow the instructions in my previous post on how to change your default command line text editor. Trust me on this – you do not want to use vi. You can also add your name and email address, especially if you intend to use github to share your code. To change your git config settings, replace the example name and email in the two commands below with your name and email (include the quotation marks in your command, but not the dollar signs):

$ git config --global user.name "Jane Doe"

$ git config --global user.email janedoe@example.com

Another nice customization is adding color to your git environment, which makes git output much easier to read. To do this, type:

$ git config --global color.ui auto

To find out more about git configuration, you can also check out the documentation on the git website.

How to create a repository

Ok, now that you have git installed, let's do an example. Suppose you are given someone else's code and you are told to modify it in some way. Since you don't really know how this code works yet, it would be nice to keep track of the original state of the code when it was given to you. If (when) you have to debug the code in the future, you can look back and try to see what changed between your current version and the original working version you started with.

So let's create a git repository: open up the command line and use the "cd" command to change your current directory to where the new code is located. For example, if your new code is in Documents > programming > mycode, you type:

$ cd ~/Documents/programming/mycode

Now, to create a repository we simply type:

$ git init

The next step is to add all the files you want to track to your git repository. If you want specific files (for example, file1.txt and file2.txt) you can type in the names of each of the files manually, separated by spaces like this:

$ git add file1.txt file2.txt

If instead you just want to add all the files in that folder, you can just type:

$ git add *

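(The * notation is a shell wildcard, also called a glob: the shell expands it to the names of the files in the current folder, skipping hidden files whose names start with a dot, before git ever sees it. It looks similar to the * used in regular expressions, which are incredibly useful, but it works differently, and I don't have time to talk about either here.)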

Finally, the last step is to save this current state of the code in the repository's history. To do this, we "commit" the files that we just added. A commit is like a save point that we can access later if needed. When you commit code, you should also add a commit message describing what changes you made since the last commit. Commit messages are the most important and most useful aspect of source control. If you fail to write thorough and informative commit messages, at some point you'll be forced to dig through your code to try to figure out what you did and why, and you probably won't save any time compared to not using source control at all.

Useful commit messages include a verbal description of the specific problems that were fixed. For example, "Fixed bugs, updated parameters" is not a useful commit message. Instead, the message should be much more specific, like this:

"Fixed off-by-one error in file 2, added escape key to for-loop in file 1, changed initial conditions from 3 to 2 to improve performance."

Remember, your most important collaborator is your future self! If you write a thorough commit message now, you'll thank yourself later.

Ok, let's commit this code. It's as simple as:

$ git commit

After you type this command you'll be presented with the commit file in your default command line text editor. The lower part of the file is a block of lines starting with # that lists which files will be committed (git ignores these lines when it saves the message). At the top of the file there is a blank section where you can type your commit message. You might write something like "First commit, initial state of the code", then save and exit. That's it! From now on, git is keeping track of your changes.

Viewing your tracked changes

So how do we access this tracking information? First, let's make sure everything is working correctly. To check the status of your git repository, type the command:

$ git status

You should get an output that says something like "nothing to commit, working directory clean". Now let's edit one of the files that we added earlier and see what happens. For example, make a small change to file1.txt, save the file, and then call git status again. You should see something like this:

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   file1.txt

no changes added to commit (use "git add" and/or "git commit -a")

This tells us that since the last time we committed, changes have only been made in a single file, file1.txt. To find out what was changed, we can type:

$ git diff

We will see some text showing the name of the file and a few lines before and after the spot that was changed. The change itself will be highlighted with plus signs (+) where text was added, and minus signs (-) where text was deleted. (By the way, any time the output for git is longer than a page, you'll see a colon (:) at the bottom of the terminal window instead of the usual dollar sign ($). To scroll, use the up and down arrows. To quit and return to the command line, press the q key.)
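If you'd like to see what this plus and minus format looks like without touching a repository, Python's standard difflib module can produce a similar "unified" diff. This is only an illustration of the format, not what git runs internally, and the file contents below are made up.

```python
import difflib

# Two made-up versions of the same three-line file.
old_version = ["stimulus_duration = 200", "number_of_trials = 100", "save_data = True"]
new_version = ["stimulus_duration = 250", "number_of_trials = 100", "save_data = True"]

diff_lines = list(
    difflib.unified_diff(
        old_version,
        new_version,
        fromfile="a/file1.txt",
        tofile="b/file1.txt",
        lineterm="",
    )
)
print("\n".join(diff_lines))
```

The changed line appears twice, once with a minus sign (the old text) and once with a plus sign (the new text), while the unchanged lines around it are shown with a leading space for context.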

Ok, let's try one more thing. What if we wanted to look at the difference between the current state of the code and an even older version of the code from an earlier commit? For our current example we first need to have more than one commit. Go ahead and stage your changes to file1.txt with git add file1.txt, then commit them with git commit, including a commit message that describes what you changed.

Next, we need to choose the older commit we want to compare against. To view older commits, we use this command:

$ git log

You'll see a list of two commits, each labeled with a hash (this is basically a code that allows you to uniquely identify that commit). You should also see your name and email (since you made the commits), the date and time of the commit, and the commit message. Highlight and copy the hash of the initial commit we made earlier, then return to the command line and type the following command (replace HASHCODE with the hash that you just copied):

$ git diff HASHCODE

Voila! You can now view all the changes since that first commit.

How can source control help me?

Now that you understand the basics of how git works, let's talk a little about what it can do. Even for an individual, git can be very useful. There are many practical benefits, like easy recovery for accidental code and file deletion, and the ability to quickly and easily roll back code to a previous state. (In my next post I'll also talk about branching, which allows you to quickly and easily switch between variations of your code.)

Additionally, I've already mentioned that git can be helpful for finding the source of errors when debugging. This is particularly useful if you make notes in your commit messages of which commits have been tested and confirmed to be working. If the code stops working later on, you know that it must be something you did between now and your most recent working commit, so there are fewer edits to sift through to find the error.

Finally, one more benefit of tracking changes with git is that you have a record of what you've accomplished over time. Commit logs show how your time was spent and describe in detail exactly how you solved the challenging problems you faced along the way. If you spend a lot of time working on your code alone and without recognition, as many researchers do, using git can be really encouraging!

Stay tuned for part 2: merging and sharing code

Monday, July 14, 2014

Why you should write good code

I have seen a lot of bad code. Disorganized, repetitive, poorly-commented, inefficient bad code. And I'll admit – I have written some bad code, too. As a researcher I understand that the incentives on how to spend your time often run counter to your best intentions, and that sometimes bad code is unavoidable.

But I am confident that most of the time writing good code is not only possible, it's also well worth the effort. Clean, well-written, well-maintained code actually saves you time, reduces errors, and makes it easier to share your code with other people.

But perhaps you think this doesn't apply to you... perhaps you really don't have time to write good code, or your shoddy code at least isn't hurting anyone?

Perhaps. But perhaps not. In this post I'll examine four common excuses (ahem, reasons) that I've heard for why programmers write bad code, and by the end I hope you'll agree that in fact you really can't afford not to write good code.

Reason #1: "I don't have time"

Not having enough time is by far the most common reason I hear for not writing good code. And it's true – good code often demands a higher initial cost of time and effort. But this kind of thinking is very short-sighted; in the long term, poorly written code will actually weigh you down like a load of bricks.

Poorly-written code often runs slower, which costs anywhere from a few seconds to a few minutes every time you run your program. It's also more difficult to read and understand, which means more time debugging when (not if!) things go wrong. Hastily written code usually only works for a specific case and must be manually tweaked for every variation. This kind of rigid structure is very difficult to modify and repurpose. And finally, returning to your code after an absence is almost always confusing and frustrating, and discourages you from reusing your code in the future.

Reason #2: "My code may be messy, but it works just fine"

When I hear this reason I wonder whether this person has ever seen another person's code. Software bugs are so ubiquitous they have even been found in NASA's space explorations, Intel's Pentium chips, and United States missiles and military aircraft. Even for well-written code, it's almost certain that if the code is long enough it has at least one bug in it.

How confident are you that your messy, hastily written code, which only you have reviewed and tested, has no bugs? How many programs have you ever written in a hurry? What are the odds that there are no bugs in any of them? Pretty slim, I'm guessing.

Writing good code not only helps prevent mistakes in the first place, it also makes it easier to find the mistakes later.

Reason #3: "I'm only going to use this code once"

OK, I will partially concede that this is a reasonable excuse for not writing great code. But stop and consider for a moment how often this is really true. How often do you return to that code you said you would only use once? Even if you don't use the entire script or function verbatim, how often do you copy sections of code to use elsewhere?

Is there a better alternative to writing single-use code and cannibalizing it? Perhaps you could write a smaller, cleaner, more abstract function, and turn that single-use code into multi-use code.

Reason #4: "I'm the only one who ever uses my code"

While you may be the only one using your code now, it's worth considering whether anyone else might inherit your code in the future. If you are part of an organization or group, what are the chances that some poor soul will take over your project one day and have to sift through your incomprehensible code? This has happened to me, and let me tell you – it sucks!

And even if you really are the only one using your code, I hope you'll agree that you are actually pretty important. As I mentioned in reasons 1 and 2, writing good code saves you time and reduces the number of errors you make. After all, "Your most important collaborator is your future self." And hey – if your code is clean and easy to use, maybe you won't be the only one using it for very long. Developing good programming habits opens up doors to collaboration and contribution to larger projects in the future.

So, is it worth it to write good code? If you consider the time saved, errors prevented, code reused, and opportunities for collaboration, the benefits seem to greatly outweigh the initial costs. I hope this post inspires you to improve your code habits! If you're looking for more information on how to write better code, I'll be writing a series on best programming practices that's geared toward young researchers. Stay tuned!
