
Beginning Python for Bioinformatics

by Patrick O'Brien

Bioinformatics, the use of computers in biological research, is the newest wrinkle on one of the oldest pursuits--trying to uncover the secret of life. While we may not know all of life's secrets, at the very least computers are helping us understand many of the biological processes that take place inside of living things. In fact, the use of computers in biological research has risen to such a degree that computer programming has now become an important and almost essential skill for today's biologists.

The purpose of this article is to introduce Python as a useful and viable development language for the computer programming needs of the bioinformatics community. In this introduction, we'll identify some of the advantages of using Python for bioinformatics. Then we'll create and demonstrate examples of working code to get you started. In subsequent articles we'll explore some significant bioinformatics projects that make use of Python.

A Bit of Background

Because scientists have long relied on the open availability of each other's research results, it was only natural that they would turn to Open Source software when it came time to apply computer processes to the study of biological processes. One of the first Open Source languages to gain popularity among biologists was Perl. Perl gained a foothold in bioinformatics based on its strong text processing facilities, which were ideally suited to analyzing early sequence data. To its credit, Perl has a history of successful use in bioinformatics and is still a very useful tool for biological research.

In comparison to Perl, Python is a relative newcomer to bioinformatics, but it is steadily gaining in popularity. A few of the reasons for this popularity are described below.

The Python language was designed to be as simple and accessible as possible, without giving up any of the power needed to develop sophisticated applications. Python's clean, consistent syntax leaves it free from the subtleties and nuances that can make other languages difficult to learn and programs written in those languages difficult to comprehend.

Python's dynamic nature adds to its accessibility. For example, Python doesn't require you to declare variables before you use them, and the same variable can refer to objects of different types over the course of its existence. Python can also be used interactively, allowing you to familiarize yourself with the language or any Python modules in an interactive session where each command produces immediate results.
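
To make the dynamic behavior concrete, here is a minimal sketch written in Python 3 syntax. The variable names (sample, kind_before, kind_after) are illustrative only and are not part of the original examples.

```python
# No declaration is needed: a name comes into existence when it is assigned.
sample = 'ACGT'                       # sample refers to a string
kind_before = type(sample).__name__

sample = list(sample)                 # the same name now refers to a list
kind_after = type(sample).__name__

print(kind_before)
print(kind_after)
```

The name sample is never given a type of its own; only the objects it refers to have types.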

Python also has excellent support for the object-oriented style of programming. We'll show an example of this capability at the end of this article, but the basic idea is that object-orientation often provides a better way to organize the data and functionality within your programs. As the data and analytical techniques used in bioinformatics have become more complex, the value of object-oriented language features has risen.

In addition, Python integrates well with systems written in other languages, such as C, C++, Java and Fortran. One of the main benefits of C is speed. When a programmer needs an algorithm to run as fast as possible, they can code it in C or C++ and make it available to Python as an extension module. To the programmer, these are indistinguishable from pure Python modules. Similar utilities exist that make the large body of scientific algorithms coded in Fortran accessible to Python programs.

Java has become popular as a cross-platform and Web development language. The Python interpreter is now available in two variations: one version written in C, and the other version, known as Jython, written in Java. Jython allows Java programmers to write programs using the Python syntax and dynamic language features, and it allows Python programmers to use existing code developed in Java. These are just a few examples of the many ways Python is able to leverage and extend existing code written in other languages.

So while Perl is better established in the bioinformatics community, many biologists and bioinformaticians are also turning to Python as it gains in popularity. To get a better sense of what Python has to offer, we'll look at examples of Python code that highlight some of its features. But first, we need to cover some of the basic biology that we'll touch on in the examples.

A Bit of Biology

One of the goals of molecular biology is to understand the processes that take place within the cells of living organisms. One such process is the creation of proteins, some of the most basic raw materials of all living things. Almost every process within a living creature makes use of, or is influenced by, these large, complex molecular structures. There are thousands of different proteins and we have barely begun to understand them in any detail. One thing we do know is that the creation of proteins is determined by the information encoded within the genetic material in each cell, called DNA.

DNA is a linear structure made up of a sequence of molecules called nucleotides or bases. Four nucleotides appear in DNA: adenine, cytosine, guanine, and thymine. These nucleotides are usually represented by their initials, A, C, G and T. DNA is actually composed of two strands of these nucleotides wound around each other in the famous double helix shape.

The sequence of a single strand of DNA can be represented as a sequence of alphabetical characters identifying each base in the sequence, such as ACCTTGGCACCT. Due to their chemical attractions, the nucleotides always appear in pairs, also called base pairs, such that adenine (A) always pairs up with thymine (T), and cytosine (C) always pairs up with guanine (G). Because of this base-pairing characteristic, we can easily determine the complementary, or opposite, strand of any single-stranded DNA sequence.

A simplified view of how DNA determines the creation of a protein goes something like this. A section of DNA called a gene contains the encoded information about the protein to create. Through the process of transcription, the two DNA strands along a gene separate, and the gene is copied. This single-stranded copy is called RNA, or, more precisely, messenger RNA. It is identical to the original gene sequence, except that the nucleotide uracil (U) appears in place of thymine (T).

Once formed, the messenger RNA moves to a structure in the cell known as a ribosome. The ribosome moves along the messenger RNA, reading its sequence three nucleotides at a time. Each group of three nucleotides, called a codon, determines which of 20 amino acids gets assembled by the ribosome into a protein. Like DNA and RNA, proteins are linear structures that can be represented by a text string. Where DNA and RNA use a four-character alphabet, proteins require a 20-character alphabet to represent each amino acid in a protein sequence.

A Bit of Python

Now that we've covered some of the basics of molecular biology, let's take a look at Python to see how its language features can be used to deal with biological research data. We mentioned that DNA, RNA, and proteins are all linear sequences that can be easily represented in a computer-friendly fashion. Python has several built-in structures for handling sequences. Three that we will look at are strings, lists, and dictionaries. To do that, we first need to introduce the Python shell.

There are a couple of different ways to run Python code. One way, which should be familiar to anyone with experience in another language, is to enter lines of Python code into a text file and save it with a .py extension. That program file can then be run from an operating system prompt, or by double-clicking on the file, depending on your setup. The other way is to interact with the Python interpreter in a Python shell, where you can enter lines of code, hit return, and get an immediate response back from Python.
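
As a rough illustration of the first approach, the following could be the entire contents of a program file. The file name first.py and the variable names are hypothetical, and the code is written in Python 3 syntax.

```python
# Contents of a hypothetical file named first.py.
# From an operating system prompt it would typically be run as: python first.py
dna = 'CTGACCACTTTACGAGGTTAGC'

# Build the message first so that it can be inspected as well as printed.
message = 'Sequence length: %d' % len(dna)
print(message)
```

Typing the same three statements into a Python shell, one at a time, produces the same result, which is why the shell is convenient for experimenting.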

The Python shell is a great environment in which to learn the Python language and to explore new programming concepts. There are even graphical Python shells that will colorize your code, pop up a list of autocompletion options as you type, display all the variables currently available to your program, and help out in any number of other ways. The Python shell that we will use here is called PyCrust, and it comes with the wxPython GUI toolkit.

When you start a Python shell, you will be prompted to enter a line of Python code. The main prompt is ">>> " (without the quotes). If the Python code you are entering requires more than one line, subsequent lines will display the secondary prompt of "... " Let's see what this looks like in PyCrust.

The initial view of the PyCrust shell.

After we've entered some examples of Python code in the PyCrust shell, it may look like this:

A popup listing available methods for the 'dna' object.

Python Strings

Let's take a look at the example code in more detail. The first thing we did was to create a string and assign it to a variable. Strings in Python are sequences of characters. You create a string literal by enclosing the characters in single ('), double (") or triple (''' or """) quotes. In the example we assigned the string literal CTGACCACTTTACGAGGTTAGC to the variable named dna.

>>> dna = 'CTGACCACTTTACGAGGTTAGC' 

Then we simply typed the name of the variable, and Python responded by displaying the value of that variable, surrounding the value with quotes to remind us that the value is a string.

>>> dna 
'CTGACCACTTTACGAGGTTAGC' 

A Python string has several built-in capabilities. One of them is the ability to return a copy of itself with all lowercase letters. These capabilities are known as methods. To invoke a method of an object, use the dot syntax. That is, you type the name of the variable (which in this case is a reference to a string object) followed by the dot (.) operator, then the name of the method followed by opening and closing parentheses.

>>> dna.lower() 
'ctgaccactttacgaggttagc' 

You can access part of a string using the indexing operator s[i]. Indexing begins at zero, so s[0] returns the first character in the string, s[1] returns the second, and so on.

>>> dna[0] 
'C' 
>>> dna[1] 
'T' 
>>> dna[2] 
'G' 
>>> dna[3] 
'A' 
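
Indexing and methods combine naturally when exploring sequence data. The sketch below, in Python 3 syntax, is only an illustration: it reuses the dna string from the example and tries two other string methods, count() and find(), with variable names chosen for this sketch.

```python
dna = 'CTGACCACTTTACGAGGTTAGC'

first_base = dna[0]     # indexing starts at zero
last_base = dna[-1]     # a negative index counts back from the end

# count() reports how many times a substring occurs.
a_count = dna.count('A')

# find() returns the index of the first match, or -1 when there is none.
first_acg = dna.find('ACG')
missing = dna.find('GGGG')

# Strings are immutable, so lower() and upper() return new strings.
round_trip = dna.lower().upper()
```

None of these calls changes dna itself; each one returns a new value.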

The final line in our screen shot shows PyCrust's autocompletion feature, whereby a list of valid methods (and properties) of an object is displayed when a dot is typed following an object variable. As you can see, Python strings have many built-in capabilities that you can experiment with in the Python shell. Now let's look at one of the other Python sequence types, the list.

Python Lists

Where Python strings are limited to characters, Python lists have no limitations. Python lists are ordered sequences of arbitrary Python objects, including other lists. In addition, you can insert, delete and replace elements in a list. Lists are written as a series of objects, separated by commas, inside of square brackets. Let's look at some lists, and some operations you can perform on lists.

>>> bases = ['A', 'C', 'G', 'T'] 
>>> bases 
['A', 'C', 'G', 'T'] 
>>> bases.append('U') 
>>> bases 
['A', 'C', 'G', 'T', 'U'] 
>>> bases.reverse() 
>>> bases 
['U', 'T', 'G', 'C', 'A'] 
>>> bases[0] 
'U' 
>>> bases[1] 
'T' 
>>> bases.remove('U') 
>>> bases 
['T', 'G', 'C', 'A'] 
>>> bases.sort() 
>>> bases 
['A', 'C', 'G', 'T']

In this example we created a list of single characters that we called bases. Then we added an element to the end, reversed the order of all the elements, retrieved elements by their index position, removed an element with the value 'U', and sorted the elements. Removing an element from a list illustrates a situation where we need to supply the remove() method with an additional piece of information, namely the value that we want to remove from the list. As you can see in the picture below, PyCrust takes advantage of Python's ability to let us know what is required for most operations by displaying that information in a call tip pop-up window.

A tooltip showing usage of the 'remove' method.
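
The list session above shows appending, removing, reversing, and sorting. Inserting, replacing, and deleting by position work along the same lines. The following is a small sketch in Python 3 syntax; the placeholder values 'N' and 'U' and the variable removed are chosen only for illustration.

```python
bases = ['A', 'C', 'G', 'T']

bases.insert(1, 'N')    # insert before index 1: ['A', 'N', 'C', 'G', 'T']
bases[1] = 'U'          # replace the element at index 1
del bases[1]            # delete by position rather than by value

# remove() raises ValueError when the value is absent, so guard the call.
try:
    bases.remove('U')
    removed = True
except ValueError:
    removed = False
```

After these steps the list is back to its original four bases, and the guarded remove() call reports that there was no 'U' left to remove.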

We've talked about objects having methods, such as the remove() method of a list object, and how a method performs a task and, perhaps, returns a result. Python has another very similar feature, called a function. About the only difference between a function and a method is that a function isn't associated with a particular object.

Note: Whether something should be defined as a function or a method is, in part, a design choice. In fact, we're going to create several functions below and then re-define them as methods as a way of demonstrating Python's support for object-oriented programming.

Python Functions

Functions perform an operation on one or more values and return a result. Python comes with many pre-defined functions, as well as the ability to define your own functions. Let's look at a couple of the built-in functions: len() returns the number of items in a sequence; dir() returns a list of strings representing the attributes of an object; list() returns a new list initialized from some other sequence.

>>> bases = ['A', 'C', 'G', 'T'] 
>>> len(dna) 
22 
>>> len(bases) 
4 
>>> dir(dna) 
['__add__', '__class__', '__contains__', '__delattr__',  
'__doc__', '__eq__', '__ge__', '__getattribute__', '__getitem__',  
'__getslice__', '__gt__', '__hash__', '__init__', '__le__',  
'__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__',  
'__repr__', '__rmul__', '__setattr__', '__str__', 'capitalize',  
'center', 'count', 'decode', 'encode', 'endswith', 'expandtabs',  
'find', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower',  
'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower',  
'lstrip', 'replace', 'rfind', 'rindex', 'rjust', 'rstrip', 'split',  
'splitlines', 'startswith', 'strip', 'swapcase', 'title',  
'translate', 'upper'] 
>>> dir(bases) 
['__add__', '__class__', '__contains__', '__delattr__',  
'__delitem__', '__delslice__', '__doc__', '__eq__', '__ge__',  
'__getattribute__', '__getitem__', '__getslice__', '__gt__',  
'__hash__', '__iadd__', '__imul__', '__init__', '__le__', '__len__',  
'__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__repr__',  
'__rmul__', '__setattr__', '__setitem__', '__setslice__', '__str__',  
'append', 'count', 'extend', 'index', 'insert', 'pop', 'remove',  
'reverse', 'sort'] 
>>> list(dna) 
['C', 'T', 'G', 'A', 'C', 'C', 'A', 'C', 'T', 'T', 'T',  
'A', 'C', 'G', 'A', 'G', 'G', 'T', 'T', 'A', 'G', 'C'] 

Next, we're going to define some functions of our own that will perform useful operations on biological sequence data.

User-defined Functions

Here is the process for creating your own function in Python. The first line begins with the keyword def, is followed by the name of the function and any arguments (expected input values) surrounded by parentheses, and ends with a colon. Subsequent lines make up the body of the function and must be indented. If a string comment appears in the first line of the body, it becomes part of the documentation for the function. The last line of a function returns a result.

Let's define some functions in the PyCrust shell. Then we can try each function with some sample data and see the result returned by the function.

>>> def transcribe(dna): 
...     """Return dna string as rna string.""" 
...     return dna.replace('T', 'U') 
>>> transcribe('CCGGAAGAGCTTACTTAG') 
'CCGGAAGAGCUUACUUAG' 

In this example we created a function called transcribe that expects a string representing a DNA sequence. Strings have a replace() method that will return a copy of the original string with each occurrence of one character replaced by another. In three lines of code we've given ourselves a consistent way to transcribe a string of DNA into RNA. Let's create another function. How about reverse?

>>> def reverse(s): 
...     """Return the sequence string in reverse order.""" 
...     letters = list(s) 
...     letters.reverse() 
...     return ''.join(letters) 

There are a few new things in this function that need explanation. First, we've used an argument name of "s" instead of "dna". You can name your arguments whatever you like in Python. It is something of a convention to use short names based on their expected value or meaning. So "s" for string is fairly common in Python code. The other reason to use "s" instead of "dna" in this example is that this function works correctly on any string, not just strings representing DNA sequences. So "s" is a better reflection of the generic utility of this function than "dna".

You can see that the reverse function takes in a string, creates a list based on the string, and reverses the order of the list. Now we need to put the list back together as a string so we can return a string. Python string objects have a join() method that joins together a list into a string, separating each list element by a string value. Since we do not want any character as a separator, we use the join() method on an empty string, represented by two quotes ('' or "").
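
The sketch below, in Python 3 syntax, repeats the reverse function and sets it next to an alternative that is not used in this article: a slice with a step of -1. The name reverse_by_slice and the sample values are assumptions made for illustration. It also shows join() with a non-empty separator.

```python
def reverse(s):
    """Return the sequence string in reverse order (the list-based version)."""
    letters = list(s)
    letters.reverse()
    return ''.join(letters)

def reverse_by_slice(s):
    """Hypothetical alternative: a slice with a step of -1."""
    return s[::-1]

seq = 'CCGGAAGAGCTTACTTAG'
reversed_seq = reverse(seq)

# join() places the string it is called on between the list elements.
joined_with_dash = '-'.join(['CCG', 'GAA', 'GAG'])
```

Both functions give the same answer; the list-based version simply spells out each step.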

In order to calculate the complement of a DNA sequence, we need a way to map each of the four bases to its complement. For that, we'll use another built-in Python structure called a dictionary.

Python Dictionaries

A Python dictionary has the same benefit as a regular paper dictionary. It allows you to quickly locate the value (definition) associated with a key (word). Dictionaries are denoted by curly braces and contain a comma-separated sequence of key:value pairs. Dictionaries are not ordered. Instead, dictionary values are accessed by their key value, rather than their position in the sequence. Let's look at some of the methods supported by dictionaries.

>>> basecomplement = {'A': 'T', 'C': 'G', 'T': 'A', 'G': 'C'} 
>>> basecomplement.keys() 
['A', 'C', 'T', 'G'] 
>>> basecomplement.values() 
['T', 'G', 'A', 'C'] 
>>> basecomplement['A'] 
'T' 
>>> basecomplement['C'] 
'G' 
>>> for base in basecomplement.keys(): 
...     print "The complement of", base, "is", basecomplement[base] 
The complement of A is T 
The complement of C is G 
The complement of T is A 
The complement of G is C 
>>> for base in basecomplement: 
...     print "The complement of", base, "is", basecomplement[base] 
The complement of A is T 
The complement of C is G 
The complement of T is A 
The complement of G is C 

In this example we also introduced the concept of a for loop, which cycles over the keys of the basecomplement dictionary. Python's for loop can iterate over any sequence. In this example it assigns the first value from the list returned by keys() to the variable named base, executes the print statement, then repeats the process for each subsequent value in the list. In the second for loop example, you can see that when we simply specify "for base in basecomplement" Python defaults to looping over the basecomplement dictionary's keys.
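
Looping over keys is not the only option. As a hedged sketch in Python 3 syntax, the dictionary's items() method hands back each key together with its value, and get() supplies a fallback for keys that are not present. The names lines, partner, has_u, and fallback are illustrative.

```python
basecomplement = {'A': 'T', 'C': 'G', 'T': 'A', 'G': 'C'}

# items() yields (key, value) pairs, so no second lookup is needed.
lines = []
for base, partner in basecomplement.items():
    lines.append('The complement of %s is %s' % (base, partner))

# Membership tests look at the keys; get() returns a default for unknown keys.
has_u = 'U' in basecomplement
fallback = basecomplement.get('N', 'N')
```

Using get() with a default is one way to avoid a KeyError when a sequence contains a character that the dictionary does not cover.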

More User-defined Functions

The next example will demonstrate one other technique we will need in our complement function. It's a relatively new feature of Python, called list comprehensions.

>>> letters = list('CCGGAAGAGCTTACTTAG') 
>>> [basecomplement[base] for base in letters] 
['G', 'G', 'C', 'C', 'T', 'T', 'C', 'T', 'C',  
'G', 'A', 'A', 'T', 'G', 'A', 'A', 'T', 'C'] 

A list comprehension returns a list and works similarly to a for loop, but in a much more compact and efficient format. In this case it allows us to return a new list where each base in the original list of letters has been replaced with its complement, which we retrieved from the basecomplement dictionary. Let's see how we put this all together.

>>> def complement(s): 
...     """Return the complementary sequence string.""" 
...     basecomplement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'} 
...     letters = list(s) 
...     letters = [basecomplement[base] for base in letters] 
...     return ''.join(letters) 
>>> complement('CCGGAAGAGCTTACTTAG') 
'GGCCTTCTCGAATGAATC' 
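
To make the comparison between a list comprehension and a for loop concrete, here is a small sketch in Python 3 syntax that builds the same list both ways. The names looped and comprehended are chosen for this illustration.

```python
basecomplement = {'A': 'T', 'C': 'G', 'T': 'A', 'G': 'C'}
letters = list('CCGGAAGAGCTTACTTAG')

# Explicit loop: build the result one element at a time.
looped = []
for base in letters:
    looped.append(basecomplement[base])

# List comprehension: the same result in a single expression.
comprehended = [basecomplement[base] for base in letters]
```

The two lists are equal element for element; the comprehension just says it in one line.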

Now that we've got a reverse function and a complement function, we have the building blocks for a reversecomplement function.

>>> def reversecomplement(s): 
...     """Return the reverse complement of the dna string.""" 
...     s = reverse(s) 
...     s = complement(s) 
...     return s 
>>> reversecomplement('CCGGAAGAGCTTACTTAG') 
'CTAAGTAAGCTCTTCCGG' 
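
One quick way to gain confidence in functions like these is to check a property that must hold. The sketch below, in Python 3 syntax, gathers the three functions into one self-contained block and checks that taking the reverse complement twice returns the original sequence. The variable names seq, rc, and round_trip are illustrative.

```python
def reverse(s):
    """Return the sequence string in reverse order."""
    letters = list(s)
    letters.reverse()
    return ''.join(letters)

def complement(s):
    """Return the complementary sequence string."""
    basecomplement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
    letters = [basecomplement[base] for base in list(s)]
    return ''.join(letters)

def reversecomplement(s):
    """Return the reverse complement of the dna string."""
    return complement(reverse(s))

seq = 'CCGGAAGAGCTTACTTAG'
rc = reversecomplement(seq)
round_trip = reversecomplement(rc)   # should equal the original sequence
```

The order of the two steps does not matter here: complementing and then reversing gives the same string as reversing and then complementing.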

It can also be useful to know the percentage of DNA composed of G and C bases. String objects have a count() method that returns the number of character occurrences. With that information, calculating the percentage is a simple matter of applying some mathematical calculations.

>>> def gc(s): 
...     """Return the percentage of dna composed of G+C.""" 
...     gc = s.count('G') + s.count('C') 
...     return gc * 100.0 / len(s) 
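
Here is a short usage sketch in Python 3 syntax. Note that gc() as written divides by len(s), so an empty string raises ZeroDivisionError. The wrapper gc_or_zero is a hypothetical addition, not part of the article's code, showing one possible way to handle that case.

```python
def gc(s):
    """Return the percentage of dna composed of G+C."""
    gc_count = s.count('G') + s.count('C')
    return gc_count * 100.0 / len(s)

def gc_or_zero(s):
    """Hypothetical wrapper: report 0.0 for an empty sequence."""
    if not s:
        return 0.0
    return gc(s)

half = gc('CCGGAAGAGCTTACTTAG')   # 9 of the 18 bases are G or C
empty = gc_or_zero('')
```

Whether an empty sequence should yield 0.0 or an error is a design choice; the wrapper simply makes that choice explicit.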

Since DNA can be divided into three-character segments (codons), a function that returns a list of codons would also be useful. Another simple mathematical calculation determines the ending point for our codons in case the DNA string is not evenly divisible by three. The range() function returns a list of numbers from a beginning point up to, but not including, an ending point, incrementing by some value, in this case 3. This arithmetic progression is used inside a list comprehension combined with string slicing to produce a list of three-character strings.

>>> def codons(s): 
...     """Return list of codons for the dna string.""" 
...     end = len(s) - (len(s) % 3) - 1 
...     codons = [s[i:i+3] for i in range(0, end, 3)] 
...     return codons 
>>> codons('CCGGAAGAGCTTACTTAG') 
['CCG', 'GAA', 'GAG', 'CTT', 'ACT', 'TAG'] 
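
The sketch below, in Python 3 syntax, shows what happens when the sequence length is not a multiple of three. The 20-base string is a made-up example, and list() is wrapped around range() because in Python 3 range() produces a lazy sequence rather than a list.

```python
def codons(s):
    """Return list of codons for the dna string."""
    end = len(s) - (len(s) % 3) - 1
    return [s[i:i+3] for i in range(0, end, 3)]

whole = codons('CCGGAAGAGCTTACTTAG')      # 18 bases, six codons
ragged = codons('CCGGAAGAGCTTACTTAGCA')   # 20 bases, the last two are dropped

# The start index of each codon for either string above.
starts = list(range(0, 17, 3))
```

Because the end point is computed from the length, the trailing partial codon is silently ignored rather than returned as a short string.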

String slicing is similar to string indexing. Instead of retrieving a single character, string slicing allows us to retrieve sections of characters from a starting position up to, but not including, an ending position. The syntax is s[i:j], where i is the starting position and j is the ending position. So s[0:3] returns a string containing the characters in index positions 0, 1, and 2.

>>> s[0:3] 
>>> s[3:6] 
>>> s[6:9] 
>>> s[9:12] 
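
A few more slicing forms are worth trying. In this sketch, written in Python 3 syntax, the value assigned to s is an assumed sample (the same sequence passed to the functions above), and the variable names are illustrative.

```python
s = 'CCGGAAGAGCTTACTTAG'   # assumed sample value

first = s[0:3]      # characters at index 0, 1 and 2
second = s[3:6]
head = s[:3]        # omitting the start begins at index 0
tail = s[15:]       # omitting the end runs to the end of the string
beyond = s[16:30]   # a slice past the end is truncated, not an error
```

Unlike indexing, slicing never raises an error for positions beyond the end of the string; it simply returns whatever characters are available.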

Here is one final, interesting, note about functions. Functions themselves are objects. That means we can examine their attributes using dir(), just like we did for strings and lists. One of the more useful attributes of a function object is its documentation string, which gets stored in its __doc__ property.

>>> dir(transcribe) 
['__call__', '__class__', '__delattr__', '__dict__', '__doc__',  
'__get__', '__getattribute__', '__hash__', '__init__', '__name__',  
'__new__', '__reduce__', '__repr__', '__setattr__', '__str__',  
'func_closure', 'func_code', 'func_defaults', 'func_dict',  
'func_doc', 'func_globals', 'func_name'] 
>>> transcribe.__doc__ 
'Return dna string as rna string.' 

Don't worry if this last example is a bit esoteric. The main point of showing it was to emphasize that Python is very powerful and consistent, that everything in Python is an object, and that objects can be inspected on the fly. The result is that as you learn Python you will find that unfamiliar objects often behave exactly as you would expect them to behave the very first time you use them. This is a powerful feeling that's not experienced often enough when using other programming languages.
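
For readers who do want to experiment a little further, here is a small sketch in Python 3 syntax. It treats functions as ordinary objects by binding one to a second name and storing two in a list. The names to_rna, pipeline, and result are illustrative; __doc__ and __name__ are standard function attributes.

```python
def transcribe(dna):
    """Return dna string as rna string."""
    return dna.replace('T', 'U')

# A function is an object: it can be bound to another name...
to_rna = transcribe

# ...or stored in a list and called later.
pipeline = [str.upper, transcribe]

result = 'ccgtta'
for step in pipeline:
    result = step(result)

doc = transcribe.__doc__
name = to_rna.__name__
```

Note that to_rna still reports the name transcribe, because both names refer to the same function object.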

We've seen how to create simple objects, like strings, lists, dictionaries, and functions. Next we're going to look at how we can create our own custom objects with properties and methods that we define.

Python Classes

To create your own custom objects, you must define a sort of template, or cookie cutter, called a class. You do so in Python using the class statement, followed by the name of the class and a colon. Following this, the body of the class definition contains the properties and methods that will be available for all object instances that are based on this class.

Let's take all the functions that we've created so far and recast them as methods of a DNA class. Then we'll see how to create DNA objects based on our DNA class. While we could do all this from the Python shell, instead we will place this code into a file and show how we can use this file from the Python shell. The contents of our file, which Python calls a module, look like this.

class DNA: 
    """Class representing DNA as a string sequence.""" 
    basecomplement = {'A': 'T', 'C': 'G', 'T': 'A', 'G': 'C'} 
    def __init__(self, s): 
        """Create DNA instance initialized to string s.""" 
        self.seq = s 
    def transcribe(self): 
        """Return as rna string.""" 
        return self.seq.replace('T', 'U') 
    def reverse(self): 
        """Return dna string in reverse order.""" 
        letters = list(self.seq) 
        letters.reverse() 
        return ''.join(letters) 
    def complement(self): 
        """Return the complementary dna string.""" 
        letters = list(self.seq) 
        letters = [self.basecomplement[base] for base in letters] 
        return ''.join(letters) 
    def reversecomplement(self): 
        """Return the reverse complement of the dna string.""" 
        letters = list(self.seq) 
        letters.reverse() 
        letters = [self.basecomplement[base] for base in letters] 
        return ''.join(letters) 
    def gc(self): 
        """Return the percentage of dna composed of G+C.""" 
        s = self.seq 
        gc = s.count('G') + s.count('C') 
        return gc * 100.0 / len(s) 
    def codons(self): 
        """Return list of codons for the dna string.""" 
        s = self.seq 
        end = len(s) - (len(s) % 3) - 1 
        codons = [s[i:i+3] for i in range(0, end, 3)] 
        return codons 

Much of this should look familiar based on our existing functions. Class definitions do add a few new elements that we need to cover. Let's look at how to use this new class before exploring the extra details.

We create object instances by calling the class, much like we would call a function. The first thing we need to do is make the Python shell aware of this class definition. We do that by importing the DNA class definition from our module. Then we create an instance of the DNA class, passing in the initial string value. From that point on the object keeps track of its own sequence value, and we simply call the methods that are defined for that object.

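>>> from bio import DNA 
>>> dna1 = DNA('CGACAAGGATTAGTAGTTTAC') 
>>> dna1.transcribe() 
'CGACAAGGAUUAGUAGUUUAC' 
>>> dna1.reverse() 
'CATTTGATGATTAGGAACAGC' 
>>> dna1.complement() 
'GCTGTTCCTAATCATCAAATG' 
>>> dna1.reversecomplement() 
'GTAAACTACTAATCCTTGTCG' 
>>> dna1.gc() 
38.095238095238095 
>>> dna1.codons() 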
['CGA', 'CAA', 'GGA', 'TTA', 'GTA', 'GTT', 'TAC'] 

Since a class acts as a kind of template that's used to create multiple object instances, we need the ability, inside a class method, to refer to the specific object instance on which the method is called. To accommodate this need, Python automatically passes the object instance as the first argument to each method. The convention in the Python community is to name that first argument "self". That's why you see "self" as the first argument in all the method definitions of our DNA class.

The other thing to note is the __init__() method. Python calls this specially named method when creating instances of the class. In our example, DNA.__init__ expects to receive a string argument, which we then store as a property of the object instance, self.seq.

We made one other change when we moved our functions into class methods. We moved the basecomplement dictionary definition out of the complement() method and into the class definition. As part of the class definition, the dictionary is only created once, rather than each time the method is called. The dictionary is shared by all instances of the class, and it can be used by more than one method. This is in contrast to the seq property, for which each object instance will have its own unique value.
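
The difference between a shared class attribute and a per-instance property can be checked directly. The sketch below, in Python 3 syntax, uses a trimmed-down stand-in for the DNA class with only the pieces needed for the demonstration. The instance names and sample sequences are made up.

```python
class DNA:
    """Trimmed-down stand-in for the article's class, for illustration only."""
    basecomplement = {'A': 'T', 'C': 'G', 'T': 'A', 'G': 'C'}

    def __init__(self, s):
        self.seq = s

dna_a = DNA('ACGT')
dna_b = DNA('TTGA')

# Both instances see the very same dictionary object.
shared = dna_a.basecomplement is dna_b.basecomplement

# The dictionary lives on the class, while seq lives on each instance.
on_class = 'basecomplement' in DNA.__dict__
on_instance = 'basecomplement' in dna_a.__dict__
instance_data = dict(dna_a.__dict__)
```

One consequence of sharing is that a change made to the dictionary through one instance would be visible through every other instance, so shared class data is best treated as read-only.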

As you can see, classes provide an effective way to group related data and functionality. Let's finish our shell session by creating a few more DNA instances.

>>> dna2 = DNA('ACGGGAGGACGGGAAAATTACTAGCACCCGCATAGACTT') 
>>> dna2.codons() 
['ACG', 'GGA', 'GGA', 'CGG', 'GAA', 'AAT', 'TAC', 'TAG',  
'CAC', 'CCG', 'CAT', 'AGA', 'CTT'] 
>>> dna3 = DNA(dna1.seq + dna2.seq) 
>>> dna3.reversecomplement() 
'AAGTCTATGCGGGTGCTAGTAATTTTCCCGTCCTCCCGTGTAAACTACTAATCCTTGTCG' 
>>> dna4 = DNA(dna3.reversecomplement()) 
>>> dna4.codons() 
['AAG', 'TCT', 'ATG', 'CGG', 'GTG', 'CTA', 'GTA', 'ATT',  
'TTC', 'CCG', 'TCC', 'TCC', 'CGT', 'GTA', 'AAC', 'TAC',  
'TAA', 'TCC', 'TTG', 'TCG'] 

Even with this rudimentary class definition, manipulated from the Python shell, we can start to see Python's potential for analyzing biological data in a clear, coherent fashion, with a minimum of syntactic overhead.


Python is a popular, open source programming language with much to offer the bioinformatics community. At the same time, Python came late to the bioinformatics party and may never rise to the level of popularity of Perl. Choice is always a good thing, though, and Python offers a viable, reliable option for biologists and professional programmers alike. We hope this article gives you a reason to take a closer look at Python.

Additional Resources

If you like what you've seen of Python, here are some additional resources to explore.

Patrick O'Brien is an independent software developer and trainer, specializing in the Python programming language. He is the creator of PyCrust, a developer on the PythonCard project, and leader of the PyPerSyst project. He may be reached at


Copyright © 2009 O'Reilly Media, Inc.