Nucleotide Repetition Lengths
After a long hiatus, let’s continue our foray into amateur genomics with Clojure, in which we continue to encounter elegant expressions of Clojure’s power as well as a few additional wrinkles in the Clojure ecosystem (in this case, Incanter).
In our last post, we upgraded our genome decoder and used it to show frequencies of the four nucleotides, illustrating Chargaff’s Rule.
Another simple question to be asked of the data is, What is the
distribution of nucleotide repetition lengths? In other words, if we
represent the number of times any nucleotide is repeated as a histogram,
then AAGGCATTTT
would yield two entries for 1, two entries for 2, zero for 3, and one for 4.
What does this distribution look like for the entire human or yeast genome?
As a “null hypothesis”, consider the likelihood of getting ( N ) A’s when drawing at random from A, G, C or T. The probability for the first selection to be A is 25%; for the first two draws to be AA the probability is ( P(AA) = {1 \over {4^2}} = 0.625 ) , and for ( N ) consecutive draws, ( P(A \times N) = 4^{-N} ) .
You can easily simulate (“Monto Carlo”) such a situation using a purely random sequence:
This function yields a sequence like (:A :A :C :A :T :G :G :T :G :A :C :C ...)
.
To convert this into repetition lengths, we can first divide the
sequence up into repeating elements using partition-by
and the
identity
function:
Extracting the lengths is then trivial: just map count
over the sequences:
(Note: running these functions unmodified will hang your REPL, since
randgenome
yields an infinite sequence! Wrap them with take
if
needed.)
Histogramming the data
The plots in my previous posts were made by the
easy expedient of copying data into a Numbers spreadsheet on my Mac. We’d like to eventually do some more ambitious plotting, however, so let’s investigate a Clojure-based solution.
Incanter provides a broad set of Clojure-based utilities for data analysis, including matrices and linear algebra operations, statistical distributions, and so on. Built on top of JFreeChart, it implements several different kinds of plots, including histograms, and it should be an obvious candidate for displaying our length distribution histogram.
Unfortunately, because of two issues I discovered, Incanter will not work for our purposes. To make a long story short, it does not handle empty (zero) entries in histograms properly when displaying on a logarithmic scale, and it generates out-of-memory errors when given more than a million or so entries. (Coming from high energy physics, where such histograms are extremely common, these both seem like serious flaws to me, which I hope will be remedied in the next Incanter release.)
I wound up using the JFreeChart API directly, though this was very time consuming, as the API has approximately the same complexity as the entire US space program or the Large Hadron Collider. (In what follows below I elide details about false starts, descriptions of hair pulling, and expletives.)
First, I needed to convert the lengths from get-lengths
into a histogram with arbitrarily many entries:
Supplying the minimum and maximum value ahead of time simplifies the histogramming code substantially and allows us to choose the same range for multiple plots, as will be shown below.
Armed with this transformation, the generation of the histogram plot is as follows:
I will omit a detailed explanation of how draw-hist
works because
I’m still not a JFreeChart expert (though if I do many more blog posts
on this topic I may be forced to become one, however reluctantly). The
principal difference with the Incanter implementation of histograms is
that we provide our own set of bin heights and simply plot those. This
follows a better separation of concerns anyways: our fairly simple
binning function remains separate from the visual presentation of the
bin positions and heights (we could, for example, add the counting of
overflows and underflows – this is left as an exercise to the
reader).
(trim-zeros
exists to transform bin values of zero to small positive
numbers to avoid taking the logarithm of zero, which is undefined.)
Armed with these tools, we can now make our first distribution:
Figure 1: Nucleotide repetition lengths for the random hypothesis
As expected by our analysis above, this shows the falling “power-law” distribution of randomly-occuring nucleotide repetitions. Compare this with the genome for the yeast S. Cerviciae:
Figure 2: Nucleotide repetition lengths for the yeast S. Cerviciae
And, for humans (here we did not wait for the result for the entire genome):
Figure 2: Nucleotide repetition lengths for humans
The actual genome data clearly deviate from our null hypothesis of randomness, as one might expect. However, these graphs raise more questions than they answer. In particular, note that there is a sort of bimodal characteristic or extended, secondary bump in the data for humans, with hints of an outlier feature on the tail which may or may not just be a statistical fluctuation.
One complicating factor we neglected is the role of “N-blocks”, which
should be eliminated without introducing artificially longer lengths;
e.g. ANNNNA
with the Ns removed should be considered two 1-blocks
rather than a two-block AA
group. This simple modification to
get-lengths
is left as another exercise to the reader (the change
does not affect the resulting distributions).
A real biologist could no doubt tell us a lot about these distributions; in any case, in the next blog post we will take up a variant of this analysis which may be even more interesting. Meanwhile, the latest code has been pushed to GitHub.
blog comments powered by Disqus