Beginning Python for Bioinformaticsby Patrick O'Brien
Bioinformatics, the use of computers in biological research, is the newest wrinkle on one of the oldest pursuits--trying to uncover the secret of life. While we may not know all of life's secrets, at the very least computers are helping us understand many of the biological processes that take place inside of living things. In fact, the use of computers in biological research has risen to such a degree that computer programming has now become an important and almost essential skill for today's biologists.
The purpose of this article is to introduce Python as a useful and viable development language for the computer programming needs of the bioinformatics community. In this introduction, we'll identify some of the advantages of using Python for bioinformatics. Then we'll create and demonstrate examples of working code to get you started. In subsequent articles we'll explore some significant bioinformatics projects that make use of Python.
A Bit of Background
Because scientists have long relied on the open availability of each other's research results, it was only natural that they would turn to Open Source software when it came time to apply computer processes to the study of biological processes. One of the first Open Source languages to gain popularity among biologists was Perl. Perl gained a foothold in bioinformatics based on its strong text processing facilities, which were ideally suited to analyzing early sequence data. To its credit, Perl has a history of successful use in bioinformatics and is still a very useful tool for biological research.
In comparison to Perl, Python is a relative newcomer to bioinformatics, but is steadily gaining in popularity. A few of the reasons for this popularity are the:
- Readability of Python code
- Ability to development applications quickly
- Powerful standard library of functionality
- Scalability from very small to very large programs
The Python language was designed to be as simple and accessible as possible, without giving up any of the power needed to develop sophisticated applications. Python's clean, consistent syntax leaves it free from the subtleties and nuances that can make other languages difficult to learn and programs written in those languages difficult to comprehend.
Python's dynamic nature adds to its accessibility. For example, Python doesn't require you to declare variables before you use them, and the same variable can refer to objects of different types over the course of its existence. Python can be also be used interactively, allowing you to familiarize yourself with the language of any Python modules in an interactive session where each command produces immediate results.
Python also has excellent support for the object-oriented style of programming. We'll show an example of this capability at the end of this article, but the basic idea is that object-orientation often provides a better way to organize the data and functionality within your programs. As the data and analytical techniques used in bioinformatics have become more complex, the value of object-oriented language features has risen.
In addition, Python integrates well with systems written in other languages, such as C, C++, Java and Fortran. One of the main benefits of C is speed. When a programmer needs an algorithm to run as fast as possible, they can code it in C or C++ and make it available to Python as an extension module. To the programmer, these are indistinguishable from pure Python modules. Similar utilities exist that make the large body of scientific algorithms coded in Fortran accessible to Python programs.
Java has become popular as a cross-platform and Web development language. The Python interpreter is now available in two variations: one version written in C, and the other version, known as Jython, written in Java. Jython allows Java programmers to write programs using the Python syntax and dynamic language features, and it allows Python programmers to use existing code developed in Java. These are just a few examples of the many ways Python is able to leverage and extend existing code written in other languages.
So while Perl is more well established in the bioinformatics community, many biologists and bioinformaticians are also turning to Python as it gains in popularity. To get a better sense of what Python has to offer, we'll look at examples of Python code that highlight some of its features. But first, we need to cover some of the basic biology that we'll touch on in the examples.
A Bit of Biology
One of the goals of molecular biology is to understand the processes that take place within the cells of living organisms. One such process is the creation of proteins, some of the most basic raw materials of all living things. Almost every process within a living creature makes use of, or is influenced by, these large, complex molecular structures. There are thousands of different proteins and we have barely begun to understand them in any detail. One thing we do know is that the creation of proteins is determined by the information encoded within the genetic material in each cell, called DNA.
DNA is a linear structure made up of a sequence of molecules called
nucleotides or bases. Four nucleotides appear in DNA: adenine, cytosine,
guanine, and thymine. These nucleotides are usually represented by their
T. DNA is actually composed of two strands of these
nucleotides wound around each other in the famous double helix shape.
The sequence of a single strand of DNA can be represented as a sequence of
alphabetical characters identifying each base in the sequence, such as
ACCTTGGCACCT. Due to their chemical attractions, the nucleotides always appear
in pairs, also called base pairs, such that adenine (
A) always pairs up with
T), and cytosine (
C) always pairs up with guanine (
G). Because of this
base-pairing characteristic, we can easily determine the complementary, or
opposite, strand of any single-stranded DNA sequence.
A simplified view of how DNA determines the creation of a protein goes
something like this. A section of DNA called a gene contains the encoded
information about the protein to create. Through the process of transcription,
the two DNA strands along a gene separate, and the gene is copied. This
single-stranded copy is called RNA, or, more precisely, messenger RNA. It is
identical to the original gene sequence, except that the nucleotide uracil (
appears in place of thymine (
Once formed, the messenger RNA moves to a structure in the cell known as a ribosome. The ribosome moves along the messenger RNA, reading its sequence three nucleotides at a time. Each group of three nucleotides, called a codon, determines which of 20 amino acids gets assembled by the ribosome into a protein. Like DNA and RNA, proteins are linear structures that can be represented by a text string. Where DNA and RNA use a four-character alphabet, proteins require a 20-character alphabet to represent each amino acid in a protein sequence.