dumbmatter.com
Online home of Jeremy Scheff

Retrospective bioinformatics: the feasibility of overlapping genetic codes

This post was chosen as an Editor's Selection for ResearchBlogging.org

In 1957, we knew what DNA was. We were pretty sure that proteins were determined by sequences of DNA. But we didn't know exactly how this happened. In other words, the genetic code was still a mystery back then. This was a particularly perplexing problem, because a very simple question could be stated with no obvious answer: How does a language (DNA sequences) with four letters (the nucleotides A, C, G, and T) get translated into a language (protein sequences) with twenty letters (amino acids)... and furthermore, is there some higher purpose to having these two different alphabets?

Lacking direct experimental results at the time, scientists proposed numerous fascinating hypotheses that all turned out to be completely wrong. The history of these hypotheses and how the real genetic code was eventually discovered is summarized in this excellent article by Brian Hayes; it is one example of the depressing reality of molecular biology over the past 50 years: things just become more and more complicated the closer we investigate them. However, the disproved hypotheses are still quite interesting. One critical question most of the hypothesized genetic codes tried to answer was the alphabet problem. If your "word size" (the number of nucleotides that code for one amino acid) was 1, you could represent only 4 different amino acids (one for each of the 4 nucleotides). For 2 letters, you have 4^2=16 possible words. That's not quite enough to represent 20 amino acids. But for 3 letters, you wind up with 4^3=64 possible combinations, which is a lot more than 20. That seems very inefficient, doesn't it? So, many scientists assumed there must be some deep underlying reason that explains this discrepancy. Personally, I find Crick's comma-free code to be a particularly elegant hypothesis along these lines, but I'm going to focus on another class of explanations.
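The alphabet arithmetic can be checked in a few lines of Perl. This is only a sketch for illustration; the helper name min_word_size is my own and does not come from any of the historical papers.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Smallest word size n such that alphabet_size**n covers all the symbols.
# (min_word_size is a hypothetical helper name used only for this example.)
sub min_word_size {
    my ($alphabet_size, $symbols) = @_;
    my $n = 1;
    $n++ while $alphabet_size**$n < $symbols;
    return $n;
}

# Number of distinct words for word sizes 1 to 3 over a 4-letter alphabet
for my $n (1 .. 3) {
    printf "word size %d: %d possible words\n", $n, 4**$n;
}

printf "minimum word size for 20 amino acids: %d\n", min_word_size(4, 20);
```

It prints 4, 16, and 64 possible words, and reports 3 as the smallest word size that can cover 20 amino acids.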

One hypothesis on the nature of the genetic code was that it could be an overlapping code. For instance, consider a DNA sequence like AGATTC. We now know that these six nucleotides code for two amino acids (if the open reading frame starts at the beginning): AGA is arginine and TTC is phenylalanine. But what if codons (sequences of three nucleotides which code for an amino acid) could overlap? Then, for instance, you could take the same six nucleotides and get four codons: AGA, GAT, ATT, and TTC. This would make our DNA much more compact and thus much more energetically efficient, which was then (before the sequencing of the genome) believed to be of critical importance.
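To make the two readings concrete, here is a small sketch that splits a sequence into triplets with a configurable step. The function name codons and its $step parameter are just illustrative choices of mine.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a DNA string into triplets, advancing $step nucleotides each time.
# $step = 3 gives the real, non-overlapping reading;
# $step = 1 gives the fully overlapping reading.
# (codons is a hypothetical helper written for this example.)
sub codons {
    my ($dna, $step) = @_;
    my @codons;
    for (my $i = 0; $i + 3 <= length($dna); $i += $step) {
        push @codons, substr($dna, $i, 3);
    }
    return @codons;
}

print join(' ', codons('AGATTC', 3)), "\n";  # AGA TTC
print join(' ', codons('AGATTC', 1)), "\n";  # AGA GAT ATT TTC
```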

Even before scientists began to decipher the true genetic code in the 1960s, some aspects of the hypothesized codes could be tested. Consider a dipeptide (two adjacent amino acids). If no restriction is placed on the sequence of amino acids, then there are 20*20=400 possible dipeptides. But for an overlapping code, a dipeptide is defined by just four nucleotides (e.g. AGAT gives AGA and GAT). This means that an overlapping code has at most 4^4=256 possible dipeptides. Along these lines, clever combinatorics could put testable constraints on the feasibility of overlapping codes, which is exactly what Sidney Brenner did:
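The 256 limit can be confirmed by brute force. The sketch below enumerates every four-nucleotide window and collects the pair of overlapping codons it contains; the function name is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count the distinct (codon, next codon) pairs available to a fully
# overlapping triplet code, where adjacent codons share two nucleotides.
# (overlapping_codon_pairs is a hypothetical name for this example.)
sub overlapping_codon_pairs {
    my @nt = qw(A C G T);
    my %pairs;
    for my $n1 (@nt) {
        for my $n2 (@nt) {
            for my $n3 (@nt) {
                for my $n4 (@nt) {
                    my $window = "$n1$n2$n3$n4";
                    # e.g. AGAT gives AGA followed by GAT
                    my $pair = substr($window, 0, 3) . '+' . substr($window, 1, 3);
                    $pairs{$pair} = 1;
                }
            }
        }
    }
    return scalar keys %pairs;
}

printf "codon pairs in an overlapping code: %d\n", overlapping_codon_pairs();
printf "dipeptides with no restriction:     %d\n", 20 * 20;
```

Since 256 is less than 400, any overlapping triplet code has to forbid at least 144 of the possible dipeptides.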

Consider an amino acid, which has adjacent amino acids at both of its ends, called C-neighbors and N-neighbors. As each unique triplet can be preceded by and followed by any one of the four nucleotides, a single codon could have at most four C-neighbors and four N-neighbors. If more than four neighbors exist, then there must be more than one codon coding for that amino acid (remember, we have 64 triplets and 20 amino acids, so that allows for 44 redundant codons). For instance, an amino acid with 13 known C-neighbors and 15 known N-neighbors must have at least 4 different codons: three codons would allow for only 3*4=12 neighbors on each side, while four allow for 4*4=16.
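The neighbor argument boils down to a ceiling division. Here is a minimal sketch of that bound as I read it; the function name min_codons is my own, and this is a simplified restatement rather than Brenner's exact procedure.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Lower bound on the number of codons an amino acid needs in an overlapping
# triplet code, given how many distinct C-neighbors and N-neighbors it has.
# Each codon allows at most 4 neighbors on each side, so we need
# ceil(max(C, N) / 4) codons. (min_codons is a hypothetical helper.)
sub min_codons {
    my ($c_neighbors, $n_neighbors) = @_;
    my $max = $c_neighbors > $n_neighbors ? $c_neighbors : $n_neighbors;
    return int(($max + 3) / 4);   # integer ceiling of $max / 4
}

print min_codons(13, 15), "\n";  # 4
print min_codons(4, 4), "\n";    # 1
```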

Back in 1957, protein sequencing was a very young field, but there were a handful of known sequences. Brenner used sequences of seven known proteins to find the number of C-neighbors and N-neighbors for each amino acid, and then calculated the number of codons that would be needed to represent all of those dipeptides. He found that 70 different codons were required, and since this is more than the 64 that are possible for a simple triplet code, the existence of an overlapping triplet code was disproved.

Now, in 2011, we know the sequences of many more than seven proteins. Brenner's experiment can be performed on much more comprehensive data with just a bit of programming. So let's try it. First, we need some protein sequence data. This can be downloaded from UniProt. I'm going to use the UniProtKB/Swiss-Prot database in FASTA format. Once the .gz file is uncompressed, it becomes apparent that it is just a plain text file in a standard format which has protein sequences.

Then, we have to install Perl and BioPerl. On Ubuntu, that's just an apt-get install bioperl away. Now it's time to code, starting with some boilerplate and module loading:

#!/usr/bin/perl -w

use strict;

use Bio::SeqIO;

We want to calculate the frequency of all dipeptides. This can be done by scanning through each protein sequence and keeping a count of all the dipeptides. I will store all of the counts in a two-dimensional matrix @count with 20 rows and 20 columns. This will replicate Table 2 in Brenner's paper. I also define the hash %labels which contains the 20 amino acids that correspond to the rows and columns of @count.

# 20x20 matrix for dipeptide frequencies
my @count;
for (my $i=0; $i<20; $i++) {
    for (my $j=0; $j<20; $j++) {
        $count[$i][$j] = 0;
    }
}

# Amino acids corresponding to rows/columns of @count
my %labels = (
    A => 0,
    C => 1,
    D => 2,
    E => 3,
    F => 4,
    G => 5,
    H => 6,
    I => 7,
    K => 8,
    L => 9,
    M => 10,
    N => 11,
    P => 12,
    Q => 13,
    R => 14,
    S => 15,
    T => 16,
    V => 17,
    W => 18,
    Y => 19,
);

Then comes the first bit of BioPerl magic, loading the FASTA file. Without standard file formats and standard programming interfaces, this would require custom code to be written to process every different type of file. I am grateful that other hackers came before me so I can just write some simple code like this:

my $seqio = Bio::SeqIO->new(-file => 'uniprot_sprot.fasta');

Then we can use the nice BioPerl data structure to look at all of the protein sequences. The defined() checks are there to ignore anything that is not one of the 20 standard amino acids.

while (my $seq = $seqio->next_seq()) {
    my @aa = split(//, $seq->seq());
    my $size = $seq->length();

    # Count the frequency of dipeptides
    for (my $i=0; $i<$size-1; $i++) {
        if (defined($labels{$aa[$i]}) && defined($labels{$aa[$i+1]})) {
            $count[$labels{$aa[$i]}][$labels{$aa[$i+1]}]++;
        }
    }
}
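To sanity-check the sliding-window logic without downloading anything, the same idea can be run on a short made-up string. This sketch does not use BioPerl and keys the counts by the dipeptide itself instead of using a matrix; count_dipeptides and the test sequence are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Lookup table of the 20 standard amino acid letters
my %is_standard = map { $_ => 1 } split //, 'ACDEFGHIKLMNPQRSTVWY';

# Count adjacent pairs of standard amino acids in one protein sequence.
# (count_dipeptides is a hypothetical helper, not part of BioPerl.)
sub count_dipeptides {
    my ($protein) = @_;
    my %count;
    for my $i (0 .. length($protein) - 2) {
        my $first  = substr($protein, $i, 1);
        my $second = substr($protein, $i + 1, 1);
        # Skip pairs containing non-standard letters such as X
        next unless $is_standard{$first} && $is_standard{$second};
        $count{"$first$second"}++;
    }
    return %count;
}

# Made-up sequence: the X breaks the chain, so KX and XK are not counted
my %count = count_dipeptides('MKKXKK');
print "$_: $count{$_}\n" for sort keys %count;
```

For this toy input, KK is counted twice and MK once.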

Then simply save the output to a CSV file.

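# Write one comma-separated row per amino acid; stop if the file cannot be opened
open(my $out, '>', 'out.txt') or die "Cannot open out.txt: $!";
foreach my $row (@count) {
    print $out join(',', @{$row}), "\n";
}
close($out);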

On my four-year-old laptop, this script takes several minutes to run on the UniProt file I downloaded. The output is the following table.

ACDEFGHIKLMNPQRSTVWY
A16043501951828223981044202552296115888632036188307385706515883393574545293506453136136668744309397887851651089590154667392376
C174130584991387051480161039132277437098513549913073023574945384103432139410990121443791900931302711613623171081394
D8274451327525785937442704460427168552111556849945569231019856220821383099479704332924515621609928495489741802130625350202
E107810114269268643110670854390717743052674158172328916351225847299511550953439135537320745843662946635891866060127733341910
F53380912171845005945099931261254105916954443587538137368745414640131027931160325224935529854166240297048005885220234312
G104394318999069673882557655209910662133201058301208298671215270312678479838512268484335750996864545722355951778158808411564
H308719711301959732245321895243214581389652581381985264372388449316191426030717759924232627382222048227119349896147018
I933722163420696647779148413674775075248568689560665501998339209124505839536561394941586852749331633140745150103013322834
K866387123427597071845021352420674838224900689040833329999805230527499704478812433871618957649117592149734724102558330326
L157701424572410101741209093673188123460141564097310410788861774109368336737749915158726217103877412485159829291164291180014472504
M435247503822452082910771521803165709520425108029092641385011745119968921599617445924277531957726112830435436375108406
N55411810603737893145934830975552459016663252078745326971459216263941865642184229295036821749189538519150313187700248285
P71440610740748166167516934306866993720344444579843873479821517218332593350439435437942664661648747586865335798682260902
Q6494439172433498848552326076746767417879542781443040674624317046428612334735844549344337142113236700149349884418205712
R8198411345785630507306574222766787112560946148005960581017683227775397183460706434279715987614290491147696504117367322554
S90305617794263462672428450028193687628125269123566392211897882516084897576107584770676539191039613671654797089139655353170
T8072061421495165455916593931257731112260376011975007201020373201492376965571417354351496334668873600541739356109986277183
V10960701829937470458873454812248532592651107904777355701233197281514506873606861441079686430835395741120974728129389346900
W14401231093111942116299858581290095046711901411529322599149921911367779099612121738125757998741294963151862207
Y3856748930131697033610724382140802213632331402929034652905510545923780524639523266731895035976728629335282666302190585

Clearly, this is overkill. Rather than a sparsely populated grid like Brenner's Table 2, there are thousands upon thousands of every dipeptide combination. This means that each amino acid has 20 C-neighbors and 20 N-neighbors. Since a single codon allows at most four neighbors on each side, that would take 20/4=5 different triplets for each amino acid, or 100 triplets in total, far more than the 64 available. Thus, even more conclusively than Brenner's original paper, I have disproved the existence of an overlapping triplet genetic code. Of course, we already knew this. A non-obvious thing that these results tell us is that the distribution of dipeptides is certainly not uniform.
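The tally of 100 triplets can be reproduced from any count matrix laid out like @count (rows are the first amino acid of a dipeptide, columns the second). This is a rough sketch under that assumption; codons_required is a name I made up, and the all-ones matrix below simply stands in for "every dipeptide was observed".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Total codons an overlapping triplet code would need, given a square matrix
# where $count->[$i][$j] is how often amino acid $i is followed by $j.
# (codons_required is a hypothetical helper written for this example.)
sub codons_required {
    my ($count) = @_;
    my $size  = scalar @$count;
    my $total = 0;
    for my $i (0 .. $size - 1) {
        # C-neighbors: amino acids seen right after $i (nonzero row entries)
        my $c = grep { $count->[$i][$_] > 0 } 0 .. $size - 1;
        # N-neighbors: amino acids seen right before $i (nonzero column entries)
        my $n = grep { $count->[$_][$i] > 0 } 0 .. $size - 1;
        my $max = $c > $n ? $c : $n;
        $total += int(($max + 3) / 4);   # ceiling of $max / 4
    }
    return $total;
}

# Stand-in matrix in which every one of the 400 dipeptides occurs
my @full = map { [ (1) x 20 ] } 1 .. 20;
print codons_required(\@full), "\n";   # 100
```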

One striking aspect of Brenner's paper is that it is written incredibly confidently. In my own writing, I struggle to convey such confidence (sometimes for good reason). But it is interesting that Brenner does not 100% conclusively prove what he claims, given that posttranslational modification could account for some anomalies in protein sequence data.

References

Brenner, S. (1957). On the Impossibility of all Overlapping Triplet Codes in Information Transfer from Nucleic Acid to Proteins. Proceedings of the National Academy of Sciences, 43 (8), 687-694. DOI: 10.1073/pnas.43.8.687