Retrospective bioinformatics: the feasibility of overlapping genetic codes
In 1957, we knew what DNA was. We were pretty sure that proteins were determined by sequences of DNA. But we didn't know exactly how this happened. In other words, the genetic code was still a mystery back then. This was a particularly perplexing problem, because a very simple question could be stated with no obvious answer: How does a language (DNA sequences) with four letters (the nucleotides A, C, G, and T) get translated into a language (protein sequences) with twenty letters (amino acids)... and furthermore, is there some higher purpose to having these two different alphabets?
Lacking direct experimental results at the time, scientists proposed numerous fascinating hypotheses that all turned out to be completely wrong. The history of these hypotheses and how the real genetic code was eventually discovered is summarized in this excellent article by Brian Hayes. It is one example of the depressing reality of molecular biology over the past 50 years: things just become more and more complicated the more closely we investigate them. However, the disproved hypotheses are still quite interesting. One critical question most of the hypothesized genetic codes tried to answer was the alphabet problem. If your "word size" (the number of nucleotides that code for one amino acid) was 1, you could represent 4 different amino acids, one for each of the 4 nucleotides. For 2 letters, you have 4^2=16 possible words. That's not quite enough to represent 20 amino acids. But for 3 letters, you wind up with 4^3=64 possible combinations, which is a lot more than 20. That seems very inefficient, doesn't it? So, many scientists assumed there must be some deep underlying reason that explains this discrepancy. Personally, I find Crick's comma-free code to be a particularly elegant hypothesis along these lines, but I'm going to focus on another class of explanations.
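The alphabet arithmetic above is easy to check. Here is a minimal sketch in Perl; the sub name min_word_size is my own invention for illustration, not something from the historical literature.

```perl
use strict;
use warnings;

# Smallest word size n such that alphabet_size**n >= symbols_needed.
sub min_word_size {
    my ($alphabet_size, $symbols_needed) = @_;
    my $n = 1;
    $n++ while $alphabet_size**$n < $symbols_needed;
    return $n;
}

# Number of distinct words for each word size over a 4-letter alphabet.
for my $n (1 .. 3) {
    printf "word size %d: %d possible words\n", $n, 4**$n;
}

print "minimum word size for 20 amino acids: ", min_word_size(4, 20), "\n";
```

With 4 nucleotides and 20 amino acids, the smallest workable word size comes out as 3, leaving 64 - 20 = 44 words to spare.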
One hypothesis on the nature of the genetic code was that it could be an overlapping code. For instance, consider a DNA sequence like AGATTC. We now know that these six nucleotides code for two amino acids: AGA is arginine and TTC is phenylalanine. But what if codons (sequences of three nucleotides which code for an amino acid) could overlap? Then, for instance, you could take the same six nucleotides and get four codons: AGA, GAT, ATT, and TTC. This would make our DNA much more compact and thus much more energetically efficient, which was then believed to be of critical importance.
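The difference between the two readings can be sketched in a few lines of Perl. The sub name codons and its step parameter are hypothetical names chosen for this illustration: a step of 3 gives the non-overlapping reading, and a step of 1 gives the fully overlapping one.

```perl
use strict;
use warnings;

# Split a sequence into triplets, advancing $step nucleotides each time.
# $step = 3 is the non-overlapping reading; $step = 1 is fully overlapping.
sub codons {
    my ($dna, $step) = @_;
    my @codons;
    for (my $i = 0; $i + 3 <= length $dna; $i += $step) {
        push @codons, substr($dna, $i, 3);
    }
    return @codons;
}

print join(' ', codons('AGATTC', 3)), "\n";   # AGA TTC
print join(' ', codons('AGATTC', 1)), "\n";   # AGA GAT ATT TTC
```

The same six nucleotides yield two codons in one reading and four in the other, which is the compactness argument in miniature.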
Even before scientists began to decipher the true genetic code in the 1960s, some aspects of the hypothesized codes could be tested. Consider two adjacent amino acids, known as a "dipeptide". If no restriction is placed on the sequence of amino acids, then there are 20*20=400 possible dipeptides. But for an overlapping code, a dipeptide is defined by just four nucleotides (e.g. AGAT gives AGA and GAT). This means that an overlapping code has at most 4^4=256 possible dipeptides. Along these lines, clever combinatorics could put testable constraints on the feasibility of overlapping codes, which is exactly what Sidney Brenner did:
Consider an amino acid, which has adjacent amino acids at each of its ends, called C-neighbors and N-neighbors. As each unique triplet can be preceded by and followed by any one of the four nucleotides, a single codon could have at most four C-neighbors and four N-neighbors. If more than four neighbors exist, then there must be more than one codon coding for that amino acid (remember, we have 64 triplets and 20 amino acids, so that allows for 44 redundant codons). For instance, an amino acid with 13 known C-neighbors and 15 known N-neighbors must have at least 4 different codons, as that would allow for 4*4=16 possible neighbors on each side.
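The neighbor argument boils down to one line of arithmetic: divide the larger of the two neighbor counts by four and round up. Here is a minimal sketch; min_codons is a hypothetical helper name, and this is only the simple bound described above, not necessarily every detail of Brenner's procedure.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Under a fully overlapping triplet code, one codon allows at most 4 distinct
# neighbors on each side, so the side with more neighbors sets the bound.
sub min_codons {
    my ($c_neighbors, $n_neighbors) = @_;
    my $most = $c_neighbors > $n_neighbors ? $c_neighbors : $n_neighbors;
    return ceil($most / 4);
}

print min_codons(13, 15), "\n";   # 4
```

For the example in the text, 15 neighbors on one side need 15/4 = 3.75 codons, rounded up to 4.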
Back in 1957, protein sequencing was a very young field, but there were a handful of known sequences. Brenner used sequences of seven known proteins to find the number of C-neighbors and N-neighbors for each amino acid, and then calculated the number of codons that would be needed to represent all of those dipeptides. He found that 70 different codons were required, and since this is more than the 64 that are possible for a simple triplet code, the existence of an overlapping triplet code was disproved.
Now, in 2011, we know the sequences of many more than seven proteins. Brenner's experiment can be performed on much more comprehensive data with just a bit of programming. So let's try it. First, we need some protein sequence data. This can be downloaded from UniProt. I'm going to use the UniProtKB/Swiss-Prot database in FASTA format. Once the .gz file is uncompressed, it becomes apparent that it is just a plain text file in a standard format which has protein sequences.
Then, we have to install Perl and BioPerl. On Ubuntu, that's just an apt-get install bioperl away. Now it's time to code, starting with some boilerplate and module loading:
#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;
We want to calculate the frequency of all dipeptides. This can be done by scanning through each protein sequence and keeping a count of all the dipeptides. I will store all of the counts in a two-dimensional matrix @count with 20 rows and 20 columns. This will replicate Table 2 in Brenner's paper. I also define the hash %labels which contains the 20 amino acids that correspond to the rows and columns of @count.
# 20x20 matrix for dipeptide frequencies
my @count;
for (my $i=0; $i<20; $i++) {
    for (my $j=0; $j<20; $j++) {
        $count[$i][$j] = 0;
    }
}
# Amino acids corresponding to rows/columns of @count
my %labels = (
A => 0,
C => 1,
D => 2,
E => 3,
F => 4,
G => 5,
H => 6,
I => 7,
K => 8,
L => 9,
M => 10,
N => 11,
P => 12,
Q => 13,
R => 14,
S => 15,
T => 16,
V => 17,
W => 18,
Y => 19,
);
Then comes the first bit of BioPerl magic, loading the FASTA file. Without standard file formats and standard programming interfaces, this would require custom code to be written to process every different type of file. I am grateful that other hackers came before me so I can just write some simple code like this:
my $seqio = Bio::SeqIO->new(-file => 'uniprot_sprot.fasta');
Then we can use the nice BioPerl data structure to look at all of the protein sequences. The two defined checks are there to ignore anything that is not one of the 20 standard amino acids.
while (my $seq = $seqio->next_seq()) {
    my @aa = split(//, $seq->seq());
    my $size = $seq->length();
    # Count the frequency of dipeptides
    for (my $i=0; $i<$size-1; $i++) {
        if (defined($labels{$aa[$i]}) && defined($labels{$aa[$i+1]})) {
            $count[$labels{$aa[$i]}][$labels{$aa[$i+1]}]++;
        }
    }
}
Then simply save the output to a CSV file.
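# Write one comma-separated row per first amino acid; stop if the file cannot be opened
open(my $out, '>', 'out.txt') or die "Cannot open out.txt: $!";
foreach my $row (@count) {
    print $out join(',', @{$row}), "\n";
}
close($out);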
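Before running on the whole database, the counting logic can be sanity-checked on a toy sequence without BioPerl. This is a minimal sketch that swaps in a different data structure: a hash keyed by the two-letter dipeptide string rather than the 20x20 matrix used in the script. The sub name count_dipeptides and the test sequence are made up for the illustration.

```perl
use strict;
use warnings;

# Count adjacent pairs of standard amino acids in one sequence.
# Pairs containing any other character (such as X) are skipped.
sub count_dipeptides {
    my ($sequence) = @_;
    my %count;
    for my $i (0 .. length($sequence) - 2) {
        my $pair = substr($sequence, $i, 2);
        $count{$pair}++ if $pair =~ /^[ACDEFGHIKLMNPQRSTVWY]{2}$/;
    }
    return %count;
}

my %count = count_dipeptides('MKKXMKK');
print "$_ $count{$_}\n" for sort keys %count;
```

Here MK and KK are each counted twice, while the two pairs touching the X are ignored, mirroring what the defined checks do in the matrix version.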
On my four-year-old laptop, this script takes several minutes to run on the UniProt file I downloaded. The output is the following table.
Clearly, this is overkill. Rather than a sparsely populated grid like Brenner's Table 2, there are thousands upon thousands of occurrences of every dipeptide combination. This means that each amino acid has 20 C-neighbors and 20 N-neighbors. Since a single codon allows at most four neighbors on each side, that would take 20/4=5 different triplets for each amino acid, or 100 triplets in total, far more than the 64 available. Thus, even more conclusively than Brenner's original paper, I have disproved the existence of an overlapping triplet genetic code. Of course we already knew this. A non-obvious thing that these results tell us is that the distribution of dipeptides is certainly not uniform.
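The tally above can also be computed rather than reasoned out by hand. Here is a minimal sketch under stated assumptions: the counts are held in a hash keyed by two-letter dipeptide strings (not the @count matrix from the script), the sub name codons_required is hypothetical, and the bound is the simple one used in this post (larger neighbor count divided by four, rounded up, summed over amino acids).

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Given dipeptide counts keyed as "XY" (X followed by Y), return the minimum
# number of codons a fully overlapping triplet code would need.
sub codons_required {
    my (%count) = @_;
    my (%after, %before);
    for my $pair (grep { $count{$_} > 0 } keys %count) {
        my ($first, $second) = split //, $pair;
        $after{$first}{$second}  = 1;   # $second was seen following $first
        $before{$second}{$first} = 1;   # $first was seen preceding $second
    }
    my %seen = map { $_ => 1 } (keys %after, keys %before);
    my $total = 0;
    for my $aa (keys %seen) {
        my $following = scalar keys %{ $after{$aa}  // {} };
        my $preceding = scalar keys %{ $before{$aa} // {} };
        my $most = $following > $preceding ? $following : $preceding;
        $total += ceil($most / 4);      # one codon allows 4 neighbors per side
    }
    return $total;
}

# Every one of the 400 dipeptides observed at least once.
my @aa = split //, 'ACDEFGHIKLMNPQRSTVWY';
my %full;
for my $x (@aa) {
    for my $y (@aa) {
        $full{"$x$y"} = 1;
    }
}
print codons_required(%full), "\n";   # 100
```

When all 400 dipeptides occur, every amino acid has 20 neighbors on each side and the total comes to 100, matching the figure in the text.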
One striking aspect of Brenner's paper is that it is written incredibly confidently. In my own writing, I struggle to convey such confidence (sometimes for good reason). But it is interesting that Brenner does not 100% conclusively prove what he claims, given that posttranslational modification could account for some anomalies in protein sequence data.
References
Brenner, S. (1957). On the Impossibility of all Overlapping Triplet Codes in Information Transfer from Nucleic Acid to Proteins. Proceedings of the National Academy of Sciences, 43 (8), 687-694. DOI: 10.1073/pnas.43.8.687