Retrospective bioinformatics: the feasibility of overlapping genetic codes

June 2, 2011

Tags: Programming, Research Blogging, Science

In 1957, we knew what DNA was. We were pretty sure that proteins were determined by sequences of DNA. But we didn't know exactly how this happened. In other words, the genetic code was still a mystery back then. This was a particularly perplexing problem, because a very simple question could be stated with no obvious answer: How does a language (DNA sequences) with four letters (the nucleotides A, C, G, and T) get translated into a language (protein sequences) with twenty letters (amino acids)... and furthermore, is there some higher purpose to having these two different alphabets?

Lacking direct experimental results at the time, scientists proposed numerous fascinating hypotheses that all turned out to be completely wrong. The history of these hypotheses and how the real genetic code was eventually discovered is summarized in this excellent article by Brian Hayes. It is one example of the depressing reality of molecular biology over the past 50 years: things just become more and more complicated the closer we investigate them. However, the disproved hypotheses are still quite interesting. One critical question most of the hypothesized genetic codes tried to answer was the alphabet problem. If your "word size" (the number of nucleotides that code for one amino acid) was 1, you could represent 4 different amino acids, one for each of the 4 nucleotides. For 2 letters, you have 4^2=16 possible words. That's not quite enough to represent 20 amino acids. But for 3 letters, you wind up with 4^3=64 possible combinations, which is a lot more than 20. That seems very inefficient, doesn't it? So, many scientists assumed there must be some deep underlying reason that explains this discrepancy. Personally, I find Crick's comma-free code to be a particularly elegant hypothesis along these lines, but I'm going to focus on another class of explanations.
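The alphabet arithmetic is easy to check in a few lines of Perl. This is only an illustrative sketch; the helper name min_word_size is my own invention and not part of any library.

```perl
use strict;
use warnings;

# Smallest word length n such that alphabet_size ** n >= targets.
# min_word_size is a hypothetical helper, named here for illustration only.
sub min_word_size {
    my ($alphabet_size, $targets) = @_;
    my $n     = 0;
    my $words = 1;    # alphabet_size ** 0
    while ($words < $targets) {
        $words *= $alphabet_size;
        $n++;
    }
    return $n;
}

# Number of distinct words for word sizes 1 to 3 over a 4-letter alphabet
for my $n (1 .. 3) {
    printf "word size %d: %d possible words\n", $n, 4 ** $n;
}
print "minimum word size for 20 amino acids: ", min_word_size(4, 20), "\n";
```

With four nucleotides and twenty amino acids, the smallest word size that works is 3, which leaves 44 words to spare.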

One hypothesis on the nature of the genetic code was that it could be an overlapping code. For instance, consider a DNA sequence like AGATTC. We now know that these six nucleotides code for two amino acids: AGA is arginine and TTC is phenylalanine. But what if codons (sequences of three nucleotides which code for an amino acid) could overlap? Then, for instance, you could take the same six nucleotides and get AGA, GAT, ATT, and TTC. This would make our DNA much more compact and thus much more energetically efficient, which was then believed to be of critical importance.
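The difference between the two readings can be sketched in Perl. The triplets helper and its $step parameter are names I made up for this illustration: a step of 3 gives the real, non-overlapping reading, and a step of 1 gives the fully overlapping one.

```perl
use strict;
use warnings;

# Split a DNA string into triplets, advancing $step nucleotides each time.
# triplets is a hypothetical helper written for this example.
sub triplets {
    my ($dna, $step) = @_;
    my @codons;
    for (my $i = 0; $i + 3 <= length($dna); $i += $step) {
        push @codons, substr($dna, $i, 3);
    }
    return @codons;
}

print join(' ', triplets('AGATTC', 3)), "\n";    # AGA TTC
print join(' ', triplets('AGATTC', 1)), "\n";    # AGA GAT ATT TTC
```

The same six nucleotides yield two codons in one reading and four in the other, which is where the supposed gain in compactness comes from.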

Even before scientists began to decipher the true genetic code in the 1960s, some aspects of the hypothesized codes could be tested. Consider two adjacent amino acids, which is known as a "dipeptide". If no restriction is placed on the sequence of amino acids, then there are 20*20=400 possible dipeptides. But for an overlapping code, a dipeptide is defined by just four nucleotides (e.g. AGAT gives AGA and GAT). This means that an overlapping code has at most 4^4=256 possible dipeptides. Along these lines, clever combinatorics could put testable constraints on the feasibility of overlapping codes, which is exactly what Sidney Brenner did:

Consider an amino acid, which has adjacent amino acids at each of its ends, called C-neighbors and N-neighbors. As each unique triple can be preceded by and followed by any one of the four nucleotides, a single codon could have at most four C-neighbors and four N-neighbors. If more than four neighbors exist, then there must be more than one codon coding for that amino acid (remember, we have 64 triplets and 20 amino acids, so that allows for 44 redundant codons). For instance, an amino acid with 13 known C-neighbors and 15 known N-neighbors must have at least 4 different codons, as that would allow for 4*4=16 possible neighbors on each side.
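As a rough sketch of this bound, assuming it amounts to dividing the larger neighbor count by four and rounding up (my reading of the argument as summarized above, not a transcription of Brenner's procedure), it can be written as follows. The name min_codons is hypothetical.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Lower bound on the number of codons for one amino acid under a fully
# overlapping triplet code: each codon permits at most 4 neighbors per side,
# so the side with more neighbors sets the bound.
sub min_codons {
    my ($c_neighbors, $n_neighbors) = @_;
    my $most = $c_neighbors > $n_neighbors ? $c_neighbors : $n_neighbors;
    return ceil($most / 4);
}

print min_codons(13, 15), "\n";    # 4
```

Three codons would cover at most 12 neighbors per side, which is too few for 15, so the example amino acid needs a fourth.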

Back in 1957, protein sequencing was a very young field, but there were a handful of known sequences. Brenner used sequences of seven known proteins to find the number of C-neighbors and N-neighbors for each amino acid, and then calculated the number of codons that would be needed to represent all of those dipeptides. He found that 70 different codons were required, and since this is more than the 64 that are possible for a simple triplet code, the existence of an overlapping triplet code was disproved.

Now, in 2011, we know the sequences of many more than seven proteins. Brenner's experiment can be performed on much more comprehensive data with just a bit of programming. So let's try it. First, we need some protein sequence data. This can be downloaded from UniProt. I'm going to use the UniProtKB/Swiss-Prot database in FASTA format. Once the .gz file is uncompressed, it becomes apparent that it is just a plain text file in a standard format which has protein sequences.

Then, we have to install Perl and BioPerl. On Ubuntu, that's just an apt-get install bioperl away. Now it's time to code, starting with some boilerplate and module loading:

#!/usr/bin/perl -w

use strict;

use Bio::SeqIO;

We want to calculate the frequency of all dipeptides. This can be done by scanning through each protein sequence and keeping a count of all the dipeptides. I will store all of the counts in a two-dimensional matrix @count with 20 rows and 20 columns. This will replicate Table 2 in Brenner's paper. I also define the hash %labels which contains the 20 amino acids that correspond to the rows and columns of @count.

# 20x20 matrix for dipeptide frequencies
my @count;
for (my $i=0; $i<20; $i++) {
    for (my $j=0; $j<20; $j++) {
        $count[$i][$j] = 0;
    }
}

# Amino acids corresponding to rows/columns of @count
my %labels = (
    A => 0,
    C => 1,
    D => 2,
    E => 3,
    F => 4,
    G => 5,
    H => 6,
    I => 7,
    K => 8,
    L => 9,
    M => 10,
    N => 11,
    P => 12,
    Q => 13,
    R => 14,
    S => 15,
    T => 16,
    V => 17,
    W => 18,
    Y => 19,
);

Then comes the first bit of BioPerl magic, loading the FASTA file. Without standard file formats and standard programming interfaces, this would require custom code to be written to process every different type of file. I am grateful that other hackers came before me so I can just write some simple code like this:

my $seqio = Bio::SeqIO->new(-file => 'uniprot_sprot.fasta');

Then we can use the nice BioPerl data structure to look at all of the protein sequences. The defined() checks skip any dipeptide that contains a character other than the 20 standard amino acids.

while (my $seq = $seqio->next_seq()) {
    my @aa = split(//, $seq->seq());
    my $size = $seq->length();

    # Count the frequency of dipeptides
    for (my $i=0; $i<$size-1; $i++) {
        if (defined($labels{$aa[$i]}) && defined($labels{$aa[$i+1]})) {
            $count[$labels{$aa[$i]}][$labels{$aa[$i+1]}]++;
        }
    }
}

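Then simply save the output as comma-separated values.

# Write one comma-separated row per amino acid; stop if the file cannot be opened
open(my $out, '>', 'out.txt') or die "Cannot open out.txt: $!";
foreach my $row (@count) {
    print $out join(',', @{$row}), "\n";
}
close($out);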

On my four-year-old laptop, this script takes several minutes to run on the UniProt file I downloaded. The output is the following table, shown here with row and column labels added.

	A	C	D	E	F	G	H	I	K	L	M	N	P	Q	R	S	T	V	W	Y
A	1604350	195182	822398	1044202	552296	1158886	320361	883073	857065	1588339	357454	529350	645313	613666	874430	939788	785165	1089590	154667	392376
C	174130	58499	138705	148016	103913	227743	70985	135499	130730	235749	45384	103432	139410	99012	144379	190093	130271	161362	31710	81394
D	827445	132752	578593	744270	446042	716855	211155	684994	556923	1019856	220821	383099	479704	332924	515621	609928	495489	741802	130625	350202
E	1078101	142692	686431	1067085	439071	774305	267415	817232	891635	1225847	299511	550953	439135	537320	745843	662946	635891	866060	127733	341910
F	533809	121718	450059	450999	312612	541059	169544	435875	381373	687454	146401	310279	311603	252249	355298	541662	402970	480058	85220	234312
G	1043943	189990	696738	825576	552099	1066213	320105	830120	829867	1215270	312678	479838	512268	484335	750996	864545	722355	951778	158808	411564
H	308719	71130	195973	224532	189524	321458	138965	258138	198526	437238	84493	161914	260307	177599	242326	273822	220482	271193	49896	147018
I	933722	163420	696647	779148	413674	775075	248568	689560	665501	998339	209124	505839	536561	394941	586852	749331	633140	745150	103013	322834
K	866387	123427	597071	845021	352420	674838	224900	689040	833329	999805	230527	499704	478812	433871	618957	649117	592149	734724	102558	330326
L	1577014	245724	1010174	1209093	673188	1234601	415640	973104	1078886	1774109	368336	737749	915158	726217	1038774	1248515	982929	1164291	180014	472504
M	435247	50382	245208	291077	152180	316570	95204	251080	290926	413850	117451	199689	215996	174459	242775	319577	261128	304354	36375	108406
N	554118	106037	378931	459348	309755	524590	166632	520787	453269	714592	162639	418656	421842	292950	368217	491895	385191	503131	87700	248285
P	714406	107407	481661	675169	343068	669937	203444	445798	438734	798215	172183	325933	504394	354379	426646	616487	475868	653357	98682	260902
Q	649443	91724	334988	485523	260767	467674	178795	427814	430406	746243	170464	286123	347358	445493	443371	421132	367001	493498	84418	205712
R	819841	134578	563050	730657	422276	678711	256094	614800	596058	1017683	227775	397183	460706	434279	715987	614290	491147	696504	117367	322554
S	903056	177942	634626	724284	500281	936876	281252	691235	663922	1189788	251608	489757	610758	477067	653919	1039613	671654	797089	139655	353170
T	807206	142149	516545	591659	393125	773111	226037	601197	500720	1020373	201492	376965	571417	354351	496334	668873	600541	739356	109986	277183
V	1096070	182993	747045	887345	481224	853259	265110	790477	735570	1233197	281514	506873	606861	441079	686430	835395	741120	974728	129389	346900
W	144012	31093	111942	116299	85858	129009	50467	119014	115293	225991	49921	91136	77790	99612	121738	125757	99874	129496	31518	62207
Y	385674	89301	316970	336107	243821	408022	136323	314029	290346	529055	105459	237805	246395	232667	318950	359767	286293	352826	66302	190585

Clearly, this is overkill. Rather than a sparsely populated grid like Brenner's Table 2, there are thousands upon thousands of occurrences of every dipeptide combination. This means that each amino acid has 20 C-neighbors and 20 N-neighbors. Since a single codon allows at most four neighbors on each side, that would take 5 different triplets for each amino acid, or 100 triplets in total, far more than the 64 available. Thus, even more conclusively than Brenner's original paper, I have disproved the existence of an overlapping triplet genetic code. Of course we already knew this. A non-obvious thing that these results tell us is that the distribution of dipeptides is certainly not uniform.
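For completeness, here is a sketch of how the codon total could be computed from a count matrix rather than by eye. It assumes the same layout as @count (rows are the first residue of a dipeptide, columns the second) and the same four-neighbors-per-codon bound; codons_required is a hypothetical name, and the all-ones matrix merely stands in for the fully populated table above.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Given a square matrix where $m->[$i][$j] counts residue $i followed by
# residue $j, return the total number of codons an overlapping code needs.
sub codons_required {
    my ($m)   = @_;
    my $size  = scalar @$m;
    my $total = 0;
    for my $i (0 .. $size - 1) {
        # Distinct residues seen after residue $i (row $i)
        my $following = grep { $m->[$i][$_] > 0 } 0 .. $size - 1;
        # Distinct residues seen before residue $i (column $i)
        my $preceding = grep { $m->[$_][$i] > 0 } 0 .. $size - 1;
        my $most = $following > $preceding ? $following : $preceding;
        $total += ceil($most / 4);    # at most 4 neighbors per codon per side
    }
    return $total;
}

# Stand-in for a fully populated 20x20 table: every dipeptide seen at least once
my @full = map { [ (1) x 20 ] } 1 .. 20;
print codons_required(\@full), "\n";    # 100
```

Any matrix with no zero entries gives the same answer, so the actual counts only matter for the question of how non-uniform the distribution is.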

One striking aspect of Brenner's paper is that it is written incredibly confidently. In my own writing, I struggle to convey such confidence (sometimes for good reason). But it is interesting that Brenner does not 100% conclusively prove what he claims, given that posttranslational modification could account for some anomalies in protein sequence data.

References

Brenner, S. (1957). On the Impossibility of all Overlapping Triplet Codes in Information Transfer from Nucleic Acid to Proteins. Proceedings of the National Academy of Sciences, 43(8), 687-694. DOI: 10.1073/pnas.43.8.687