A Two Bit Decoder
The entire human genome is available as a single .2bit file
here (click on
“Full Data Set”, then download hg19.2bit
). Unlike the stellar signal
in His Master’s Voice, the 2bit
format is reasonably
clearly documented.
We want to write Clojure code to:
-
Provide base pairs in symbolic (rather than raw binary) form as lazy sequences – i.e., sequences which need not all fit in memory at once, but can be consumed and processed as needed;
-
Provide “random access” to this data selectively, e.g. by chromosome, rather than always reading through the entire file;
-
Provide access to metadata encoded in the file.
The functionality to do this is posted in the jenome project on GitHub. In this post, we’ll explore this code a little; in following posts, we’ll do some investigating of the actual genome per se.
Capability (2) is provided, in part, by implementing random-access
reads from file fname
of len
bytes starting at offset
:
Armed with this, we can get the .2bit
file header:
file-header
basically just gives us the number of sequences
(usually, chromosomes) in the file, doing some sanity checks along the
way. bytes-to-number
converts an arbitrary sequence of bytes to the
appropriate unsigned integer. (For brevity’s sake, I won’t show every
utility function in this blog post; the source code on
GitHub is reasonably short.)
Figure 1: Header and File Index
The next part of the file, as shown in Figure 1, is called the “file index,” and contains a list of sequences contained the rest of the file. It can be read as follows:
This somewhat imperative code walks through the seqcnt
sequence
portions of the index, pulling out sequence names and lengths as we
go.
It’s here that we introduce a new friend, the yeast Saccharomyces
cerevisiae
(SacCer3), used since antiquity for making bread and fermented
beverages. Relatively small in comparison with the human genome,
SacCer3 will be our “unit test” organism. Available
here
and checked into resources
, the file can be accessed as
(I have imported resource
and as-file
from clojure.java.io
; again, see the source code.)
Our index-reading code yields:
The apparent consistency of these values give us some
initial confidence that we are reading the index correctly. Note,
however, the curious fact that chrIX
appears between IV
and V
.
With this encouraging start, we can now attack the sequences proper. These are laid out as shown in Figure 2, with block metadata prepended to the actual DNA sequences:
Figure 2: Sequence Record Layout
The “N blocks” are blocks of unknown sequences with specified offsets
and lengths. Masked blocks are blocks which are known repetitions
(indicated as lower case “a”, “g”, “c” and “t” in the text-based
‘FASTA’ file format). We are obviously most interested in dnaSize
,
the number of base pairs in the sequence, and the actual sequence
values themselves.
Unpacking the above data format (except the base pairs per se)
makes heavy use of get32
, which just returns the unsigned 32-bit
integer at the specified file location. This code doesn’t need to be
super efficient, since the block headers themselves are quite small. The
metadata for the entire file is returned as a sequence of maps.
A few sanity checks are included in test_core.clj
to make sure we’re
decoding the metadata correctly. Requirement (3) is done!
The final step (Requirement (1)), is to actually get our base pairs (BPs).
Since we have to assume the data set is very large (as is the case
with the human genome), we cannot read the entire DNA sequence at
once. The first part is to break the dna-size
base pairs (remember
we have 2 bits per BP, or 4 BP/byte), starting at dna-offset
. First
we obtain the “coordinates” of the sequences we want to read:
At long last, having obtained the locations and lengths we want to read from, we can get our sequences out:
This function allows us to choose the entire genome, or that for a given offset and number of base pairs (whether from the metadata for an entire chromosome, or some smaller region, thus satisfying (2)).
The next post will focus on verification of correctness of this code; subsequent posts will begin to explore the characteristics of this data for various genomes, human or otherwise.
NEXT: Validataing the Genome Decoder
blog comments powered by Disqus