best kept secrets: How to write multi-purpose code

I often make the claim that writing good clean code saves you time, and one big source of savings is the ability to reuse code. But how is this different from reusing messy, poorly written code? How can you write your code in such a way that it becomes easy to repurpose?

Writing multi-purpose code requires thoughtfulness. Typically when you first start writing a script, you are just focused on getting the computer to produce the desired output (like a stimulus on the screen or a chart in an analysis). You fiddle with the code until it does what you want, and then you stop. But wait! At this point it's likely that the code only does what you want, and getting it to do anything else would be a serious challenge. Instead of just focusing on the output, consider going a little further and incorporating some of the following seven principles to help make your code more reusable in the future.

1) Break up the problem

When writing code for an experiment, often there are many steps that need to be completed to produce the final output. Segregating the code into each of these component steps will help you figure out which steps are unique to this experiment and which steps are more generally useful. If you segregate the code into its component parts (using functions or subscripts, for example), it will be easier to pull out sections of code for modularizing or repurposing later.

2) Use clear variable names

This may seem obvious, but using descriptive variable names is an important part of writing code that is easy to understand and reuse. There is a strong tendency when writing code to want to use short, simple variable names like 'a' for 'aperture'. This can be ok if the variable is within small, self-contained bits of code like functions. In research, however, it's often the case that variables get called in multiple disparate contexts throughout a very long script and it's also common for researchers to return to their code after months of absence. In those cases it's important to use longer variable names that clearly communicate what they represent.

3) Avoid "magic numbers"

Magic numbers are values that are hard-coded into the way the code works. Unlike variables, which are clearly defined and easily changed, magic numbers are baked into the structure of the code. For example, a researcher might code the number of loops to be '8' (e.g., for i = 1:8) because they "know" that there will always be 8 items to iterate through. Or they might perform a slicing operation by selecting specific items (e.g., list[251:252]) because they "know" they will always want those particular items out of that list. But if you repurpose the code these assumptions may change, and it can be painfully difficult to find and change all the magic numbers in the code later on. It's often very simple to make variables for the values in question, and if you also clearly define the variables with descriptive names it's easier to check your work and modify the code later on.

4) Use asserts liberally

Speaking of assumptions, I often find that scientists make a lot of assumptions about the input for their experimental code (probably because they are usually the only one using it). For example, a researcher might assume that a particular input value is always divisible by two, or always greater than zero, or less than the maximum size of the list. Usually these sorts of assumptions are not even acknowledged, or when they are acknowledged it is through the use of comments. Unfortunately it is a sad fact of programming that comments will almost always become outdated at some point, and nobody ever reads them anyhow. The point is: when in doubt, use an assert.

Here is an example in Python (and Matlab's syntax is very similar):

a = 25

b = 50

assert b/a > 1, "b must be greater than a"

In this example we want to assume that 'b/a' is greater than one (for some other purpose later on in the code) so we write an assert that explicitly states the requirements that need to be satisfied. If 'b' is greater than 'a' then all goes well and the program continues. However, it's possible that we've changed 'b' and didn't realize it, perhaps because we re-wrote the code for a new experiment. Now 'b' is no longer greater than 'a' and we get the error message 'b must be greater than a'. Without the assert it's conceivable that we wouldn't have noticed our error and we could have gotten some weird output value as a result. Depending on how sure we are of what the output value should be, we may or may not catch this error. Even if we did catch the error (for example if it caused a fatal problem later on in the code), the assert allows us to pinpoint the location of the error for faster debugging. This explicit method of programming makes reusing your code safer and easier.

5) Don't duplicate code

One major impediment to reusing code is the time it takes to edit the code for its new purpose. To minimize the cost of reusing your code later, try to avoid duplicating code. If you find yourself wanting to repeat several lines of code with minor changes, consider making those changes into variables (i.e. a list of values) then iterating through the values using a loop. On the other hand, if you find you are copying blocks of code from one file to another with minor changes, consider turning the code into a function that you can call in both files.

6) Code defensively

If you really want to write multi-use code, it's important to make your code as robust as possible to different inputs. You can think of this like "defensive coding", where you try to anticipate how your code might break and address those situations ahead of time. You have to consider edge cases such as dividing by zero, inputs that are the wrong size or shape, or inputs which call values that are out of bounds. This is a skill that requires practice, creativity, and experience with actually breaking code. For example, it's a classic mistake to define variables using different units (e.g. centimeters and meters) and then either make a mistake in the conversion process or even forget to make the conversion at all. To anticipate this type of error, you might decide to define all your variables using only meters, or perhaps even write (and test!) a short function to do the conversion for you.

7) Clean up the code after the deadline

Things tend to get a bit hairy around deadlines, and often times code quality suffers. If you think there's any chance you might reuse your messy last-minute code, clean it up immediately after the deadline. It's much faster and simpler to do this sooner rather than later so you don't forget how the code works.

Learning how to incorporate these principles in your code may take extra time and effort at first, but with practice you can develop good habits that will continue to pay off in the future. Breaking up the problem, avoiding "magic numbers", and using clear variable names will save you time if you refactor or reuse your code later. Even if you don't reuse the code, asserts, minimal code duplication, and defensive coding will reduce your chance of errors now.

best kept secrets

Monday, August 11, 2014

How to write multi-purpose code

No comments:

Post a Comment

About Me

Blog Archive