Sunday, November 2, 2014

Where have I been?

I haven't been updating this blog for a while and there's a good reason – I'm actually getting a chance to practice what I preach and write my own code from scratch! This fall I decided to move more towards computer graphics, and I'm taking CS294-26: Computational Photography and Image Processing. You can check out my portfolio here: www.rachelalbert.com/image_processing.html

I am still very interested in finding out more about programming best practices, and in the spring I plan to volunteer with Software Carpentry, an organization specifically created for the purpose of helping researchers write better code. 

This fall I also had the opportunity to volunteer for the Biophysics department and teach a one-month Introduction to Python class. It was a blast! I learned so much about all the different ways people think, how to account for those differences, and how to communicate clearly about code. I also had the unique experience of watching 30 people sit in a room together and use pandas data frames for the first time! The class helped me appreciate how valuable the experience of the Software Carpentry group is. No matter how much you try to think ahead, someone will almost always interpret the material differently. Teaching helps you get out of your own head and into the minds of others, and it clarifies your thoughts and assumptions. I highly recommend it!

Monday, August 11, 2014

How to write multi-purpose code

I often make the claim that writing good clean code saves you time, and one big source of savings is the ability to reuse code. But how is this different from reusing messy, poorly written code? How can you write your code in such a way that it becomes easy to repurpose?

Writing multi-purpose code requires thoughtfulness. Typically when you first start writing a script, you are just focused on getting the computer to produce the desired output (like a stimulus on the screen or a chart in an analysis). You fiddle with the code until it does what you want, and then you stop. But wait! At this point it's likely that the code only does what you want, and getting it to do anything else would be a serious challenge. Instead of just focusing on the output, consider going a little further and incorporating some of the following seven principles to help make your code more reusable in the future. 

1) Break up the problem

When writing code for an experiment, often there are many steps that need to be completed to produce the final output. Segregating the code into each of these component steps will help you figure out which steps are unique to this experiment and which steps are more generally useful. If you segregate the code into its component parts (using functions or subscripts, for example), it will be easier to pull out sections of code for modularizing or repurposing later.
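For example, here is a minimal sketch of that idea (the file name, steps, and numbers are all made up, not from a real experiment):
def load_data(filename):
    """Read one number per line into a list of floats."""
    with open(filename) as f:
        return [float(line) for line in f if line.strip()]

def remove_outliers(values, cutoff=3.0):
    """Drop values more than `cutoff` standard deviations from the mean."""
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [v for v in values if abs(v - mean) <= cutoff * sd]

def summarize(values):
    """Return the mean of the cleaned values."""
    return sum(values) / len(values)

# The top level just strings the steps together, so any single step can be
# swapped out or reused in another project:
# print(summarize(remove_outliers(load_data('session1.txt'))))
Because each step stands alone, the generic pieces (like the loading and outlier-removal functions) can be lifted straight into the next project, while only the experiment-specific parts need to change.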

2) Use clear variable names

This may seem obvious, but using descriptive variable names is an important part of writing code that is easy to understand and reuse. There is a strong tendency when writing code to want to use short, simple variable names like 'a' for 'aperture'. This can be ok if the variable is within small, self-contained bits of code like functions. In research, however, it's often the case that variables get called in multiple disparate contexts throughout a very long script and it's also common for researchers to return to their code after months of absence. In those cases it's important to use longer variable names that clearly communicate what they represent.
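A quick illustration (the values are made up):
# Fine inside a short, self-contained function, but cryptic in a 500-line script:
a = 2.5
t = 0.2

# Much easier to decipher after six months away from the code:
aperture_diameter_mm = 2.5
stimulus_duration_s = 0.2
Note that the longer names also sneak the units into the code, which helps with the unit-conversion mistakes discussed in point 6 below.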

3) Avoid "magic numbers"

Magic numbers are values that are hard-coded into the way the code works. Unlike variables, which are clearly defined and easily changed, magic numbers are baked into the structure of the code. For example, a researcher might code the number of loops to be '8' (e.g., for i = 1:8) because they "know" that there will always be 8 items to iterate through. Or they might perform a slicing operation by selecting specific items (e.g., list[251:252]) because they "know" they will always want those particular items out of that list. But if you repurpose the code these assumptions may change, and it can be painfully difficult to find and change all the magic numbers in the code later on. It's often very simple to make variables for the values in question, and if you also clearly define the variables with descriptive names it's easier to check your work and modify the code later on.
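Here is a small made-up example of the difference:
# Magic number baked into the loop:
responses = [0.8, 0.6, 0.9, 0.7, 0.5, 0.85, 0.75, 0.65]
for i in range(8):                   # why 8? what happens when a condition is added?
    print(responses[i])

# The same loop with the assumption named and defined in one place:
num_conditions = len(responses)
for i in range(num_conditions):
    print(responses[i])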

4) Use asserts liberally

Speaking of assumptions, I often find that scientists make a lot of assumptions about the input for their experimental code (probably because they are usually the only one using it). For example, a researcher might assume that a particular input value is always divisible by two, or always greater than zero, or less than the maximum size of the list. Usually these sorts of assumptions are not even acknowledged, or when they are acknowledged it is through the use of comments. Unfortunately it is a sad fact of programming that comments will almost always become outdated at some point, and nobody ever reads them anyhow. The point is: when in doubt, use an assert.

Here is an example in Python (and Matlab's syntax is very similar):
a = 25 
b = 50 
assert b/a > 1, "b must be greater than a"
In this example we want to assume that 'b/a' is greater than one (for some other purpose later on in the code) so we write an assert that explicitly states the requirements that need to be satisfied. If 'b' is greater than 'a' then all goes well and the program continues. However, it's possible that we've changed 'b' and didn't realize it, perhaps because we re-wrote the code for a new experiment. Now 'b' is no longer greater than 'a' and we get the error message 'b must be greater than a'. Without the assert it's conceivable that we wouldn't have noticed our error and we could have gotten some weird output value as a result. Depending on how sure we are of what the output value should be, we may or may not catch this error. Even if we did catch the error (for example if it caused a fatal problem later on in the code), the assert allows us to pinpoint the location of the error for faster debugging. This explicit method of programming makes reusing your code safer and easier.

5) Don't duplicate code

One major impediment to reusing code is the time it takes to edit the code for its new purpose. To minimize the cost of reusing your code later, try to avoid duplicating code. If you find yourself wanting to repeat several lines of code with minor changes, consider making those changes into variables (i.e. a list of values) then iterating through the values using a loop. On the other hand, if you find you are copying blocks of code from one file to another with minor changes, consider turning the code into a function that you can call in both files.
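Here is a rough sketch of both ideas (the file names and the "analysis" itself are just placeholders):
# Instead of pasting the same block once per condition, put the differences
# in a list and loop over it:
condition_files = ['condition_A.txt', 'condition_B.txt', 'condition_C.txt']
for filename in condition_files:
    print('analyzing', filename)      # the shared analysis code goes here

# And if several scripts need the same block, define it once as a function
# and call it from each script:
def analyze_condition(filename):
    print('analyzing', filename)      # shared analysis code, written once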

6) Code defensively

If you really want to write multi-use code, it's important to make your code as robust as possible to different inputs. You can think of this as "defensive coding", where you try to anticipate how your code might break and address those situations ahead of time. You have to consider edge cases such as dividing by zero, inputs that are the wrong size or shape, or indices that fall out of bounds. This is a skill that requires practice, creativity, and experience with actually breaking code. For example, it's a classic mistake to define variables using different units (e.g. centimeters and meters) and then either make a mistake in the conversion process or even forget to make the conversion at all. To anticipate this type of error, you might decide to define all your variables using only meters, or perhaps even write (and test!) a short function to do the conversion for you.
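For instance, a small conversion helper along these lines (a sketch, not from any real experiment) makes the assumption explicit and easy to test:
def cm_to_m(length_cm):
    """Convert a length in centimeters to meters."""
    assert length_cm >= 0, "lengths should not be negative"
    return length_cm / 100.0

# A couple of quick checks double as a test of the conversion:
assert cm_to_m(250) == 2.5
assert cm_to_m(0) == 0.0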

7) Clean up the code after the deadline

Things tend to get a bit hairy around deadlines, and code quality often suffers. If you think there's any chance you might reuse your messy last-minute code, clean it up immediately after the deadline. It's much faster and simpler to do this sooner rather than later so you don't forget how the code works.

Learning how to incorporate these principles in your code may take extra time and effort at first, but with practice you can develop good habits that will continue to pay off in the future. Breaking up the problem, avoiding "magic numbers", and using clear variable names will save you time if you refactor or reuse your code later. Even if you don't reuse the code, asserts, minimal code duplication, and defensive coding will reduce your chance of errors now.

Wednesday, July 30, 2014

What can source control do for you? (Part 2)

In Part 1, I covered the basic mechanics of git and listed a few ways in which source control can be helpful. I mentioned easy recovery of lost files and previous versions, tracing the source of errors while debugging, and keeping a record of your accomplishments. In this post I'll further explore the reasons you should use source control, then I'll briefly go over git branching and sharing code using github. 

Is source control really worth the effort?

If you've spent much time in a research environment, you have probably discovered the benefits of keeping a good lab notebook. Lab notebooks remind you of what you did and what you plan to do, and they also keep track of your justification for those decisions. They help you remember important details both during data collection and when it comes time to publish. It takes just a little extra time every day to keep your lab notebook thorough and up to date, and most of the time you don't really need it to remember what you did. But once in a while, when you really need to remember something important, that notebook is priceless.


Source control is like a lab notebook, but for code. Initially it may seem like an annoying time sink to keep track of all your changes, but if you make source control a consistent part of your workflow it will eventually prove its worth many times over.

However, despite these many apparent benefits, most researchers do not use source control. They may see it as a tool to use only occasionally when they have to, or as something that would be nice if they had time. Or, even worse, they are convinced that they don't really need source control and that their current methods work just fine. 

If these viewpoints seem reasonable to you, consider this: you're probably already using source control, but poorly! For example... Do you have lots of old commented code hanging around in your files "just in case"? Do you have multiple versions of many files (e.g. script.txt, script2.txt, script_OLD.txt, and final_script.txt) all in the same directory? Do you use the "Date Modified" metadata to check if you've recently made any changes to a file? These haphazard methods of source control are equivalent to using dozens of scribbled notes on pieces of paper scattered around your desk as a lab notebook. It's crazy, it's horribly inefficient, and it's asking for trouble! Well-designed tools for source control (like git) are like structured lab notebooks; they are worth your time and effort.

Exploring variations with branching and merging

OK, back to learning git. In addition to the basic function of tracking changes, git also has a more advanced feature called branching. This works exactly as it sounds – it makes a duplicate branch at a particular point in history, and subsequent changes to both branches cause them to diverge. The command to create a new branch is simply:
$ git branch NEW_BRANCH_NAME
To switch branches, use the command:
$ git checkout BRANCH_NAME
(Note that your changes must be either committed or stashed to switch branches.) This seems pretty straightforward, right? In fact, we might envision the branch structure to be something like a phylogenetic tree.



However, there is also a way to recombine branches, called merging. Merging will start at the common ancestor of two branches and apply all the changes that were made for both branches in order, resulting in a final merged branch that shares all the traits of both of its parents. 

To merge two branches, first checkout the branch that you want to contain the merged changes (you are merging onto this branch), then merge:
$ git checkout MERGE_BRANCH_NAME
$ git merge OTHER_BRANCH_NAME
Now OTHER_BRANCH_NAME remains unchanged, but MERGE_BRANCH_NAME contains all the combined edits from both branches. If there are any conflicts (i.e., the exact same code was edited differently in the two branches), you will instead see a merge conflict error, along with a list of files containing conflicts. In each file, the conflicting section will be marked to show the version of the code from each branch. You can resolve the merge conflict by deleting the version you do not want to keep (along with the conflict markers, of course!) and then saving and committing your changes.
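For reference, the conflict markers inside the file look something like this (the variable and branch name here are made up):
<<<<<<< HEAD
stimulus_duration = 0.2
=======
stimulus_duration = 0.5
>>>>>>> new_experiment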

So really, the code doesn't look like a typical phylogenetic tree at all. There can be plenty of transfer back and forth between branches, and the result probably looks a lot more like primordial soup. If your code history looked like this, I bet you'd be pretty happy to have git automatically keeping track of it for you.



How to use branching and merging in research

Why are branching and merging useful? Consider this example: let's say you have one branch of experiment code that works and which you're currently using in an experiment. But say you also have a new variation of this experiment that you plan to run in the future. You can make a new branch from your current experiment code and start making lots of big changes for the new experiment. It's easy to switch between branches, so you can use both variations on the same computer as needed.

But now let's imagine that while running your current experiment you find a bug and fix it. And while you're at it, maybe you make some other subtle changes like saving an extra variable on each trial or changing the stimulus duration. If you were manually keeping track of these changes, you would have to immediately make the corresponding edits in your new branch. You might forget, or make a mistake, or you might even have to make the edits multiple times if you have multiple branch variations. This method is tedious and error-prone. With merging you can just merge your current working branch onto your experimental branch, and git will automatically take care of the changes and let you know if there are any conflicts.
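As a rough sketch of that workflow (assuming the running experiment lives on the default master branch and the new variation on a branch called new_experiment):
$ git branch new_experiment       # start the variation from the current working code
$ git checkout new_experiment     # develop the new experiment on this branch
$ git checkout master             # switch back, fix the bug, and commit it as usual
$ git checkout new_experiment
$ git merge master                # bring the bug fix into the new variation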

Collaborating across computers and people

So far we've only covered using git on your local computer, but it's also possible to share code via an external code repository on github. Github has a list of resources and tutorials for how to use its site, so I won't get too much into the details here. The main difference is the push and pull commands. Pushing means sending your committed changes to a central version of the code on github, and pulling downloads changes that someone else has committed and pushed. This is useful for two or more people who are writing and editing shared code simultaneously, and it's also useful for a single researcher who makes edits to experimental code from multiple computers.
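In command form (assuming the github copy has been set up as a remote called origin and you're working on the master branch), that looks like:
$ git push origin master     # upload your committed changes to github
$ git pull origin master     # download and merge changes that others have pushed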

To recap – source control has lots of uses! It helps individual researchers keep track of what they've done and troubleshoot errors and bugs, it makes experimentation painless and error-free, and it also allows researchers to merge changes across computers and easily collaborate with others. What can source control do for you? As it turns out, quite a bit!

Monday, July 21, 2014

What can source control do for you? (Part 1)

Programmers often have strong feelings about source control, so I'd like to preface this post with a disclaimer: the entirety of my programming experience has been in an academic research setting, and my opinions and examples in this series on best programming practices are likely to reflect that experience. I hope that nonetheless someone can benefit from what I have learned.

What is source control? And more importantly, why should you care?

Source control (also known as revision control, or version control) is a way to keep track of files: what changes were made, who made them, and when they were made. But source control can be much more than that – it can also provide a systematic way for you to experiment with files and share them with others. In this post I'll go over the basics of using one type of source control (git) and talk a little about how researchers in particular can benefit from source control.

Git, SVN, and Mercurial are three programs which are commonly used for source control in the tech world right now. I'm going to restrict my post to git because it's what I know and use. To me, the best thing about git is that it has a really shallow learning curve at the beginning, but it's also capable of some very complicated maneuvers if you take the time to learn more. Not only is it free and open source, it's also a useful thing to learn since many companies and professional programmers use it. I like git a lot, and I hope you will, too!

Installation

Anyway, speaking of that shallow learning curve, git is mercifully easy to install. If you are using a package manager, you can install git with a command such as brew install git on a Mac (with Homebrew) or sudo apt-get install git on Linux. (If the words "package manager" and/or "command line" are unfamiliar to you, you'll probably want to check out my earlier post on how to learn programming from scratch.) If you want something even easier (and probably more up to date, actually), you can download a package here that will install git for you automagically.

As an aside, there is also a GUI (Graphical User Interface, or basically an "App") available through github. As with many other things programming related, there are pros and cons to both the CLI (Command Line Interface) and the GUI. Which one you decide to use is really up to you, dependent on your background and needs, with one caveat: there is more support for the CLI, since that's what most people use. If at some point you search the internet for help with git, you'll probably find that there are a lot more people who can help you find the right command than can help you find the right menu button. I haven't used the GUI very much myself, but based on my limited experience I would suggest using the command line interface instead if you can because it's a useful thing to learn.

After you install git, take a few minutes to customize your git environment. If you haven't already, follow the instructions in my previous post on how to change your default command line text editor. Trust me on this – you do not want to use vi. You can also add your name and email address, especially if you intend to use github to share your code. To change your git config settings, replace the example name and email in the two commands below with your name and email (include the quotation marks in your command, but not the dollar signs):
$ git config --global user.name "Jane Doe" 
$ git config --global user.email janedoe@example.com
Another nice customization is adding color to your git environment, which makes git output much easier to read. To do this, type:
$ git config --global color.ui auto
To find out more about git configuration, you can also check out the documentation on the git website.

How to create a repository

Ok, now that you have git installed, let's do an example. Suppose you are given someone else's code and you are told to modify it in some way. Since you don't really know how this code works yet, it would be nice to keep track of the original state of the code when it was given to you. If (when) you have to debug the code in the future, you can look back and try to see what changed between your current version and the original working version you started with.

So let's create a git repository: open up the command line and use the "cd" command to change your current directory to where the new code is located. For example, if your new code is in Documents > programming > mycode, you type:
$ cd ~/Documents/programming/mycode
Now, to create a repository we simply type:
$ git init
The next step is to add all the files you want to track to your git repository. If you want specific files (for example, file1.txt and file2.txt) you can type in the names of each of the files manually, separated by spaces like this: 
$ git add file1.txt file2.txt
If instead you just want to add all the files in that folder, you can just type:
$ git add *
(The * is a shell wildcard, or "glob" pattern, that matches every file in the folder – incredibly useful, but I don't have time to talk about it here.)

Finally, the last step is to save the current state of the code in the repository. To do this, we "commit" the files that we just added. A commit is like a save point that we can return to later if needed. When you commit code, you should also add a commit message describing what changes you made since the last commit. Commit messages are the most important and most useful aspect of source control. If you fail to write thorough and informative commit messages, at some point you'll be forced to dig through your code to try to figure out what you did and why, and you probably won't save any time compared to not using source control at all.

Useful commit messages include a verbal description of the specific problems that were fixed. For example, "Fixed bugs, updated parameters" is not a useful commit message. Instead, the message should be much more specific, like this:
"Fixed off-by-one error in file 2, added escape key to for-loop in file 1, changed initial conditions from 3 to 2 to improve performance."
Remember, your most important collaborator is your future self! If you write a thorough commit message now, you'll thank yourself later.

Ok, let's commit this code. It's as simple as:
$ git commit
After you type this command you'll be presented with the commit file in your default command line text editor. At the bottom of the file is a bunch of commented-out text telling you which files are about to be committed. At the top of the file there is a blank section where you can type your commit message. You might write something like "First commit, initial state of the code", then save and exit. That's it! Git will keep track of all your changes from now on.

Viewing your tracked changes

So how do we access this tracking information? First, let's make sure everything is working correctly. To check the status of your git repository, type the command:
$ git status
You should get an output that says something like "nothing to commit, working directory clean". Now let's edit one of the files that we added earlier and see what happens. For example, make a small change to file1.txt, save the file, and then call git status again. You should see something like this:
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   file1.txt

no changes added to commit (use "git add" and/or "git commit -a")
This tells us that since the last time we committed, changes have only been made in a single file, file1.txt. To find out what was changed, we can type:
$ git diff
We will see some text showing the name of the file and a few lines before and after the spot that was changed. The change itself will be highlighted with plus signs (+) where text was added, and minus signs (-) where text was deleted. (By the way, any time the output for git is longer than a page, you'll see a colon (:) at the bottom of the terminal window instead of the usual dollar sign ($). To scroll, use the up and down arrows. To quit and return to the command line, press the q key.) 

Ok, let's try one more thing. What if we wanted to look at the difference between the current state of the code and an even older version of the code from an earlier commit? For our current example we first need to have more than one commit. Go ahead and commit your changes to file1.txt, including a commit message that describes what you changed.

Next, we need to choose the older commit we want to compare against. To view older commits, we use this command:
$ git log
You'll see a list of two commits, each followed by a hash (this is basically a code that allows you to uniquely identify that commit). You should also see your name and email (since you made the commits), the date and time of the commit, and the commit message. Highlight and copy the hash of the initial commit we made earlier, then return to the command line and type the following command (replace HASHCODE with the hash that you just copied):
$ git diff HASHCODE
Voila! You can now view all the changes since that first commit.

How can source control help me?

Now that you understand the basics of how git works, let's talk a little about what it can do. Even for an individual, git can be very useful. There are many practical benefits, like easy recovery for accidental code and file deletion, and the ability to quickly and easily roll back code to a previous state. (In my next post I'll also talk about branching, which allows you to quickly and easily switch between variations of your code.)

Additionally, I've already mentioned that git can be helpful for finding the source of errors when debugging. This is particularly useful if you make notes in your commit messages of which commits have been tested and confirmed to be working. If the code stops working later on, you know that it must be something you did between now and your most recent working commit, so there are fewer edits to sift through to find the error.

Finally, one more benefit of tracking changes with git is that you have a record of what you've accomplished over time. Commit logs show how your time was spent and describe in detail exactly how you solved the challenging problems you faced along the way. If you spend a lot of time working on your code alone and without recognition, as many researchers do, using git can be really encouraging!

Stay tuned for part 2: merging and sharing code

Monday, July 14, 2014

Why you should write good code

I have seen a lot of bad code. Disorganized, repetitive, poorly-commented, inefficient bad code. And I'll admit – I have written some bad code, too. As a researcher I understand that the incentives on how to spend your time often run counter to your best intentions, and that sometimes bad code is unavoidable.

But I am confident that most of the time writing good code is not only possible, it's also well worth the effort. Clean, well-written, well-maintained code actually saves you time, reduces errors, and makes it easier to share your code with other people. 

But perhaps you think this doesn't apply to you... perhaps you really don't have time to write good code, or your shoddy code at least isn't hurting anyone? 

Perhaps. But perhaps not. In this post I'll examine four common excuses (ahem, reasons) that I've heard for why programmers write bad code, and by the end I hope you'll agree that in fact you really can't afford not to write good code.

Reason #1: "I don't have time"

Not having enough time is by far the most common reason I hear for not writing good code. And it's true – good code often demands a higher initial cost of time and effort. But this kind of thinking is very short-sighted; in the long term, poorly written code will actually weigh you down like a load of bricks.

Poorly-written code often runs slower, which costs anywhere from a few seconds to a few minutes every time you run your program. It's also more difficult to read and understand, which means more time debugging when (not if!) things go wrong. Hastily written code usually only works for a specific case and must be manually tweaked for every variation. This kind of rigid structure is very difficult to modify and repurpose. And finally, returning to your code after an absence is almost always confusing and frustrating, and discourages you from reusing your code in the future. To summarize:


Reason #2: "My code may be messy, but it works just fine"

When I hear this reason I wonder whether this person has ever seen another person's code. Software bugs are so ubiquitous that they have even been found in NASA's space explorations, Intel's Pentium chips, and United States missiles and military aircraft. Even for well-written code, it's almost certain that if the code is long enough it has at least one bug in it.

How confident are you that your messy, hastily written code, which only you have reviewed and tested, has no bugs? How many programs have you ever written in a hurry? What are the odds that there are no bugs in any of them? Pretty slim, I'm guessing. 

Writing good code not only helps prevent mistakes in the first place, it also makes it easier to find the mistakes later.

Reason #3: "I'm only going to use this code once"

OK, I will partially concede that this is a reasonable excuse for not writing great code. But stop and consider for a moment how often this is really true. How often do you return to that code you said you would only use once? Even if you don't use the entire script or function verbatim, how often do you copy sections of code to use elsewhere? 

Is there a better alternative to writing single-use code and cannibalizing it? Perhaps you could write a smaller, cleaner, more abstract function, and turn that single-use code into multi-use code. 

Reason #4: "I'm the only one who ever uses my code"

While you may be the only one using your code now, it's worth considering whether anyone else might inherit your code in the future. If you are part of an organization or group, what are the chances that some poor soul will take over your project one day and have to sift through your incomprehensible code? This has happened to me, and let me tell you – it sucks!

And even if you really are the only one using your code, I hope you'll agree that you are actually pretty important. As I mentioned in reasons 1 and 2, writing good code saves you time and reduces the number of errors you make. After all, "Your most important collaborator is your future self." And hey – if your code is clean and easy to use, maybe you won't be the only one using it for very long. Developing good programming habits opens up doors to collaboration and contribution to larger projects in the future. 

So, is it worth it to write good code? If you consider the time saved, errors prevented, code reused, and opportunities for collaboration, the benefits seem to greatly outweigh the initial costs. I hope this post inspires you to improve your code habits! If you're looking for more information on how to write better code, I'll be writing a series on best programming practices that's geared toward young researchers. Stay tuned!

Sunday, July 6, 2014

How to learn programming from scratch

In many ways programming is one of the most easily accessible skills you can learn. There is no formal education required to get started, and there is a vast ocean of resources available for free on the internet. In fact, many professional programmers are self-taught.

At the same time, if you've never done any kind of programming before it can be really intimidating at first. How do you know where to start or who to ask when you have questions? Not that long ago, you just had to tinker around with code trying to figure out how it worked on your own. Now there are many tutorials and forums specifically geared toward people with zero programming experience.

Choosing a language

So where do you start? First you need to figure out what language you want to learn. There's a good chance that eventually you'll want to learn more than one language, but just stick with one for now. The language you choose really depends on what your goal is for learning to program. In the table below I list a few example languages for different purposes. If you don't have a particular goal I recommend starting with something easy like HTML or Python.
Example programming languages by purpose:
  • Front End (web stuff): HTML, CSS, PHP, Javascript, Ruby on Rails
  • Back End (server stuff): Python, C, C++, C#, Scala, Java, Ruby on Rails
  • Phone Apps: Objective C / Cocoa (iPhone), Java (Android)
  • Databases: SQL (pronounced "sequel"), PostgreSQL
  • Research: MATLAB, R, Python

Once you've picked a language you'll probably want to jump right in with the installation and online tutorials – but wait! There's one important step that will make your life much easier. You should learn a little about the command line.

The command line

The command line is a text-only interface to your computer. Not only do you use it to install and update things, you can also use it to navigate around your computer (including to hidden files and folders), and to automatically rename, copy, move, and delete large batches of files (there are also many other cool things that it can do which I won't cover here).
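To give a flavor, here are a few of the everyday commands you'll learn first (the file and folder names are just examples):
$ pwd                          # print which folder you're currently in
$ ls                           # list the files in that folder
$ cd Documents                 # move into the Documents folder
$ cp results.txt backup.txt    # copy a file
$ mv backup.txt old_files/     # move a file into another folder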

On UNIX operating systems (i.e. Mac and Linux) the command line is accessible through applications called Terminal (for Mac) and Konsole (for Linux). The default "language" (called a shell) is Bash. (Windows is something different entirely, I don't know much about that, sorry!)

It may seem like a pain to add this extra step, but I promise it will be worth it later on. Unfortunately I've found that many great programming resources on the internet assume that you already know how to do basic things on the command line, so not knowing about it can be a huge barrier for someone just starting out. I suggest these two tutorials: An Introduction to the Linux Command Shell for Beginners by Victor Gedris, and Learning the Shell by William E. Shotts, Jr. They are both very short and clearly written, and will give you a solid foundation for navigating the command line.

EDIT: I realized belatedly that I neglected to include an optional but important step here – setting your default command line text editor. Once in a while you may want to edit a file using the command line (such as hidden files or git commit messages), and unfortunately the default text editor for UNIX systems is something called vi. Vi is really obtuse and I do not recommend it. (Honestly, why should you have to google how to save and exit in your default text editor? That's the opposite of user-friendly.) Instead, I recommend a simple editor called nano. To change your default editor, follow these steps:
1. Open the command line (see above) and type the command: 
open ~/.profile 
    (this will open the hidden profile file in your default text editor) 
2. At the bottom of the file, add this line: 
export EDITOR=nano 
    and then save and close the file  
3. Repeat steps 1 and 2 above, but this time replace .profile with .bashrc 
4. In the command line, type these two commands: 
source ~/.profile 
source ~/.bashrc
5. Enjoy using nano

Installation

Next step: installation! This part is actually pretty simple, and it's different for every language, so I recommend that you just search for the language you want to learn and follow their instructions. The only side note I will add is that there are things called package managers which make installation and updates easier by packaging together all the other dependencies your program needs to run. You're not required to have a package manager to install a language, but they can sometimes save you a lot of time.

Tutorials

Whenever you're just starting out in a new programming language it's a good idea to take a tour, familiarize yourself with the syntax of the language, and learn about some of its basic features. This is the kind of information that a tutorial provides. Programming tutorials are very diverse; each one assumes a particular level of experience somewhere between "I've never typed a command before" and "I already know five languages, what's special about this one?" It's important to find one that meets you at your level.

There are many, many tutorials available for free on the internet for just about every language you might want to learn. Here are just a few that begin at a very basic level.

Learn Python The Hard Way
The thing I like most about this tutorial is that it starts off very simple and each step is very incremental so you don't get lost along the way. I also like that it tackles some difficult concepts, like debugging, object oriented programming, and inheritance. Essentially, this tutorial doesn't just try to teach you how to program, it also tries to teach you how to think like a programmer. (The tutorial is free online, and there are also videos and other extra materials available for purchase.)

Codecademy
The best part about Codecademy is the way it makes programming fun and interesting by giving you useful, realistic projects. Each lesson is accompanied by a tiny fully functional project or game that helps you practice what you just learned. For example, the very first project for Python is a tip calculator – what a creative way to teach basic number operations! Codecademy tutorials are available for HTML/CSS, Javascript, jQuery, Python, Ruby, PHP, and APIs.

Khan Academy
If you're interested in free self-guided learning online, Khan Academy is one of the best resources available (maybe THE best?). The founder, Salman Khan, gave a nice TED talk in 2011 about the origins of the site. A couple years ago they added a complete Javascript tutorial, and it looks amazing! The highlight of Khan Academy is the free videos, which are easy to follow and much more engaging than reading a book or tutorial online. Since they are in Javascript, the lessons are also much more visual and interactive. Finally, I want to point out my favorite part of the Khan Academy tutorial – at the very end there is a section on Becoming a Better Programmer! I think it's important for tutorials (like this blog, even) to recognize that they won't be able to cover everything, and to point people towards good principles and other helpful resources they can turn to when they get stuck.

(A short note about text editors: many tutorials will suggest that you use the default plain text editor on your system, e.g. Notepad or Textedit. That's fine, but there are certainly better programs out there. By far the best one I have used is Sublime. Serious programmers also use editor "languages" which allow quick navigation and editing through keyboard shortcuts. The two most popular are Emacs and Vim. But really, just stick with Sublime, it's awesome.)

Next step: make something!

The one piece of advice pretty much every programmer will give you is that programming is a skill that you learn by doing. Once you have the basics, the next step is to make something real that works. It's even better if you can think of something you might use in your everyday life. If you don't have any ideas, there are lots of great suggestions online. For example, here are two project lists: one for front end programming, and one for back end.

How to find help

When you get stuck (and you will!) there are several good places to ask for help. First, programming languages usually have a help function that brings up a short written explanation of each command, often with examples of how to use it. Use google to find out how to access the help function for your language.

Second, if you're still stuck, search for your problem online. Try to include some key words, like the name of the language, the function, and the error you're getting if there is one. It also sometimes helps to type your query as a question (like "how do I add two numbers in python?") or even paste your error into the search box directly. Chances are someone else has faced this problem before and some other nice person has kindly taken the time to answer their question. If you don't find the answer right away, keep looking and maybe try changing your search terms. Like everything else, it takes practice to search for the right question, too.

If you still can't find your answer, you can post a question on a support forum like StackOverflow. A word of caution for beginners: it's sometimes hard to express your question in a way that other people can understand and help you with. And just as there are good people on the internet, there are rude and obnoxious people as well. Be patient with the strangers giving you free advice, and if they seem frustrated with you, try to understand why. At the same time, don't take it personally if they're rude!

Lastly, if you're learning a particular library within a language there are often community support forums and email listservs for users of that library. These smaller communities are often friendlier and more responsive to questions, and sometimes you'll even get a reply from the original author of the library themselves.

There are many more topics regarding learning to program which I didn't cover here, so feel free to comment if you think I missed something or have a suggestion. Thanks for reading!

Thursday, June 26, 2014

MATLAB or Python?

If you’re new to research or looking to expand your skill set, you might be asking yourself “Should I learn Python?” I think the answer to this question is usually a resounding “Yes.”
Here are 6 reasons you may want to consider learning this up-and-coming language:
0) Python is open source
This means it’s freely available and can be modified and built upon by anyone. Open source software works like this: programmers from lots of different backgrounds contribute code, which is then reviewed and edited by other programmers and eventually approved and packaged into the Python language.
Using open source software for research has many benefits. In the same way that open access journals make published scientific papers available to everyone, open source software makes it easier for people to benefit from and contribute to scientific code. The fact that open source software is free eliminates the need for pesky and expensive institutional licenses and creates the potential for contributors who are not affiliated with a university to make tools that researchers can use. 
1) Python libraries adapt quickly
In addition to the core of the language there are also a large number of extensions called "libraries". Libraries are similar to MATLAB's Toolboxes except there are many more of them, they tend to be much smaller and more specific, and they are usually distributed independently. Python libraries develop very organically: someone realizes the need for a certain feature for their project, writes the code, and makes it available for anyone to download. Users then submit bug reports, suggest features, and sometimes contribute to the project directly. 
As more Python tools have become available for scientific research, more researchers have started using python, and the quality and diversity of the tools is increasing. This decentralized model allows researchers to start using new features in a matter of weeks (instead of years) and it encourages specialized research communities to work together on developing tools for new scientific endeavors.
2) Numpy, Scipy, and Matplotlib have most of what you need
These three libraries form the backbone of Python for researchers. 
  • Numpy allows you to do all sorts of mathematical operations and data restructuring. 
  • Scipy does pretty much everything else: importing and exporting data (including to and from MATLAB), statistical functions, matrix operations, FFTs and other signal processing, linear algebra, image processing, and a large number of other specialized mathematical and scientific features. 
  • Matplotlib is a flexible library for creating research-style plots similar to those in MATLAB. The library has a very wide range in terms of functionality (i.e. types of plots), and is sometimes easier to customize than MATLAB. (However, I am not a big fan of either MATLAB or Matplotlib in terms of style. I’m still waiting for a plotting library that doesn’t require me to use Adobe Illustrator for my final posters and presentations.)
That being said, pretty much anything you would want to do with basic MATLAB you can do with Numpy, Scipy, and Matplotlib.
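As a tiny taste, here's the kind of MATLAB-style workflow they give you (just a toy plot):
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)   # evenly spaced values, like linspace in MATLAB
plt.plot(x, np.sin(x))
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.show()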
3) Pandas is an amazing time saver
The number one reason I recommend Python is the Pandas library. Pandas is a tool for manipulating and analyzing data. It uses a structure called a data frame (copied from R), which allows you to sort, filter, and group your data. Pandas is seamlessly integrated with Numpy, Scipy, and Matplotlib. This means you can easily run statistical tests on your data in various groupings and with various filters. Pandas also has a number of tools for dealing with missing data and time series data which are particularly suited to many research purposes.
The Pandas workflow is really much simpler and more efficient than anything I've seen or done in MATLAB. Here is an example:

Raw data may be imported from a .mat file like this
import scipy.io
import pandas 
mat = scipy.io.loadmat( 'file.mat' )
df = pandas.DataFrame( mat[ 'saved_variable_name' ] )
The imported matrix is organized as a data frame. The next step is to rename the columns. This is as easy as
df.columns = ['trial', 'stim_value', 'response', 'etc']
Now it's no longer necessary to remember which columns are which variables, and the likelihood of thoughtless errors later in the analysis is greatly reduced. Let's say we only want to look at the data where the stimulus value is positive. We simply say
df_filtered = df[ df['stim_value'] > 0 ]
You can then analyze the data something like this
df_sum = df_filtered.sum()
And finally, we can plot the data as a bar graph with
df_sum.plot(kind='bar')
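Grouping works just as smoothly. For example, to get the mean of every other column for each response value (using the made-up column names above), it's just
df_means = df.groupby('response').mean()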
That's it! Pandas is really easy to use, and it's by far my first choice for data analysis. It's also a good place to start if you want to begin using Python in your research. Often data analysis is separate from the hardware and proprietary software demands of the experiment itself, and researchers tend to code up the analysis on their own and only share the output with their collaborators, thus alleviating some of the costs of switching to a new language (more about that below). You can also check out this 10 minute introduction to Pandas.
4) IPython Notebook makes writing and sharing code easy
IPython is an interactive command shell (like MATLAB's command window) that allows you to import libraries and data into a temporary workspace. IPython Notebook is a huge and beautiful extension of this idea of interactive computing. The Notebook is a rich text web interface for IPython, which means it runs in the browser (e.g. Firefox). It allows you to create annotated code snippets, including inline plots and mathematical formulas with LaTeX support. Some people have even created blogs using IPython Notebook. I can't really do it justice, so if you want more information check out these links!
5) Transferable skills
Lastly, in the event you are considering a career outside of academia, any transferable skills you can develop while doing research are incredibly valuable. Python is widely used in industry, from tiny startups to major companies like Google and Dropbox. Learning it might be your ticket to a new career.

Of course, there are some costs to switching. In many fields MATLAB is still the predominant language, making it difficult to use Python when collaborating with others. Additionally, there is the issue of legacy code. Some MATLAB toolboxes and packages (for example Psychtoolbox and MrVista) have yet to be completely replicated in Python (although that is changing - check out Psychopy and Lyman). A tougher challenge, however, is the legacy code passed down from researcher to researcher in individual labs. Having to use and maintain legacy lab code often forces new researchers to learn MATLAB whether they want to or not. Yet, even with the larger initial cost of learning two languages, I would still argue that learning Python saves you time in the long-term.

Need even more reasons to learn Python? Here are two other blog posts extolling Python's virtues from an entirely different perspective:

Convinced, but don't know where to begin? Here are some resources for learning Python from the beginning and for switching from MATLAB.

  • You can download Python and all of its major scientific libraries for free through Enthought. Other libraries can be downloaded directly from their respective websites or through a software package manager such as homebrew or pip.
  • A great tutorial for switching from MATLAB to Python may be found here.
  • Two excellent introductory Python tutorials: here, and here.
  • To learn more, I recommend reading through the documentation for various libraries you might use and following along with the exercises on your own computer (for example, here is the Pandas documentation).

Happy coding!