Debunking myths on genetics and DNA

Thursday, April 19, 2012

Four decades of computational genomics.

"Every genome is the result of a mostly shared, but partly unique, 3.8-billion-year evolutionary journey from the origin of life. Diversity is created mostly by copy errors during replication."
The above is taken from a review in the latest issue of Science [1] that summarizes the progress made in the field of computational genomics since the first sequences were obtained back in the mid-seventies. I highly recommend reading the review. Here, I'd like to highlight a few relevant points.

Zerbino, Paten, and Haussler summarize nicely the different types of DNA edits that over those 3.8 billion years have brought us the genetic diversity we observe today. Replication copy errors give rise to single-base changes that can become fixed in the entire population (substitutions) or remain present in only part of the population (single-nucleotide polymorphisms). Runs of consecutive bases can be inserted or deleted, in which case we talk about indels. Larger rearrangements can also occur, leading to changes in gene copy numbers or even chromosome numbers.
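To make the distinction between single-base changes and indels concrete, here is a minimal Python sketch that scans two already-aligned sequences (with "-" marking a gap) and counts mismatched columns and gap runs. The function name, the toy sequences, and the choice to count each uninterrupted run of gaps as one indel event are my own simplifying assumptions, not something taken from the review.

```python
def classify_differences(aligned_a: str, aligned_b: str) -> dict:
    """Count mismatched columns and gap runs in a gapped pairwise alignment.

    Assumption: "-" marks a gap, and each uninterrupted run of gap
    columns is counted as a single indel event.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    counts = {"mismatches": 0, "indels": 0}
    in_gap = False  # True while we are inside a run of gap columns
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            if not in_gap:
                counts["indels"] += 1  # a new gap run starts here
            in_gap = True
        else:
            in_gap = False
            if a != b:
                counts["mismatches"] += 1  # single-base difference
    return counts


# Hypothetical toy sequences, invented for illustration only.
result = classify_differences("ACGTTGCA", "ACATT--A")
print(result)
```

Whether a mismatched column is a fixed substitution or a polymorphism cannot be decided from two sequences alone; that takes population-level data.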

There's so much more to a DNA sequence than just a string of four letters. Genes are not fully understood until you look at their history throughout evolution and throughout the single individual's life, their regulatory mechanisms, their interactions with other genes (epistasis), their epigenetic pathways, their function, etc. With this in mind, computational genomics has the arduous task of not only efficiently storing and retrieving enormous amounts of data, but also building models that encompass epigenetic mechanisms, metabolic pathways, and gene regulatory networks.
"Combining evolutionary, mechanistic, and functional models, computational genomics interprets genomic data along three dimensions. A gene is simultaneously a DNA sequence evolving in time (history), a piece of chromatin that interacts with other molecules (mechanism), and, as a gene product, an actor in pathways of activity within the cell that affect the organism (function). [. . .] Beyond the basics of storing, indexing, and searching the world's genomes, the three fundamental, interrelated challenges of computational genomics are to explain genome evolution, model molecular phenotypes as a consequence of genotype, and predict organismal phenotype."
Genomic evolution is studied using phylogenetic analyses. This presents its challenges, starting from finding optimal ways to align the sequences: in order to compare different sequences, one has to make sure that there is a one-to-one correspondence between each base in each sequence, as shown in the figure below.
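As a rough illustration of what the alignment step involves, here is a minimal sketch of the classic Needleman-Wunsch global alignment in Python. The scoring values (match +1, mismatch -1, gap -1) and the toy sequences are assumptions chosen for readability; real pipelines use tuned substitution matrices and far more scalable tools.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)

    def sub(i: int, j: int) -> int:
        # Score for placing a[i-1] opposite b[j-1] in the same column.
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two bases
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


# Hypothetical toy input: the second sequence lacks one base.
best_score, row_a, row_b = needleman_wunsch("ACGT", "AGT")
print(best_score)
print(row_a)
print(row_b)
```

The table has one cell per pair of positions, so time and memory grow with the product of the two sequence lengths, which is part of why aligning many long sequences is hard.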
Once aligned, one builds phylogenetic trees in order to represent the evolutionary history of the sequences: moving from the leaves of the tree all the way back to the root, each internal node represents a "coalescent" event in the evolutionary history, in other words the point at which two distinct lineages shared a common ancestor.
"When applied to more than two species or to multiple gene copies within a species, phylogenetic methods provide an explicit order of gene descent through shared ancestry. [. . .] Finding the optimal phylogeny under probabilistic or parsimony models of substitutions (and also of indels) is NP-hard, and considerable effort has been devoted to obtaining efficient and accurate heuristic solutions."
Right now, algorithms that compute phylogenetic trees are computationally intensive and take a long time to run. As sequencing technology advances and it becomes possible to sequence more data and larger regions more efficiently, the challenge is to make the phylogenetic analyses more computationally efficient as well.
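To give a flavor of what a fast heuristic looks like, here is a minimal Python sketch of UPGMA, a simple distance-based clustering method that repeatedly joins the two closest groups of sequences. This is not the method used in the review and it is not one of the probabilistic or parsimony approaches mentioned in the quote: UPGMA implicitly assumes a constant rate of change along all lineages, and the sequence names, toy data, and function names below are made up for illustration.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of differing columns between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(sequences: dict) -> str:
    """Return a Newick-style topology (no branch lengths) built by UPGMA."""
    # Each cluster is identified by its label; sizes tracks its leaf count.
    sizes = {name: 1 for name in sequences}
    dist = {
        frozenset((a, b)): p_distance(sequences[a], sequences[b])
        for a, b in combinations(sequences, 2)
    }
    while len(sizes) > 1:
        # Join the two clusters that are currently closest.
        pair = min(dist, key=dist.get)
        a, b = sorted(pair)
        merged = f"({a},{b})"
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        del dist[pair]
        # Distance from the new cluster to every remaining cluster is the
        # size-weighted average of the two old distances.
        for other in sizes:
            d_a = dist.pop(frozenset((a, other)))
            d_b = dist.pop(frozenset((b, other)))
            dist[frozenset((merged, other))] = (size_a * d_a + size_b * d_b) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes)) + ";"


# Hypothetical, already-aligned toy sequences (not real data).
toy_sequences = {
    "A": "ACGTACGT",
    "B": "ACGTACGA",
    "C": "ACTTGCGA",
    "D": "TCTTGGGA",
}
tree = upgma(toy_sequences)
print(tree)
```

Each internal pair of parentheses in the output corresponds to one inferred common ancestor. A heuristic like this runs quickly, but it trades away the accuracy of a full search over tree space.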

The next big challenge computational genomics embraces is predicting causal variants. Whole-genome studies have to take into account population stratification, because we are a relatively young species and, as such, all related. New databases are emerging to provide epigenetic context, RNA expression data, and protein levels. All of this needs to be folded in to make causal predictions from genotype to phenotype.

The coming together of all this information will benefit medical research on multiple levels. Since nearly all cancers are caused by genetic modifications, computational genomics will help us understand cancer therapeutics and tumorigenesis. Stem cell research will also benefit from progress made in computational genomics as it involves the full understanding of variants and their effects not just on the genome, but also on the epigenome and gene expression.
"To face the challenges of obtaining the maximum information from every sequencing experiment, we must borrow advances from a spectrum of different research fields and tie them together into foundational mathematical models implemented with numerical methods. There is a tension between the comprehensiveness of models and their computational efficiency. [. . .] As a common language develops, shaped by our increasing knowledge of biology, we anticipate that computational genomics will provide enhanced ability to explore and exploit the genome structures and processes that lie at the heart of life."
[1] Zerbino, D., Paten, B., & Haussler, D. (2012). Integrating Genomes. Science, 336(6078), 179-182. DOI: 10.1126/science.1216830


  1. When I read "diversity is created mostly by copy errors during replication", I wondered whether we can be sure. Is there clear evidence that most diversity comes from replication errors? Is this a widely accepted view? What about persistent viral DNA, transposon activity, horizontal gene transfer, hybridization ... ?

    Interesting post ... as always. Thanks again for the steady stream of informative posts.

  2. Thank you, Hollis, I always appreciate your questions!

    You're right, there are other events that bring diversity. What I should've said there is that most of the diversity inherited through "vertical transfer" (i.e., from one generation to the next) comes from replication errors.

    There are also, as you correctly mention, "horizontal transfers", and those include symbiosis (which, as you know, brought mitochondria to our cells), viral DNA acquisition and/or transfers, etc.

    I'm not sure if transposons are considered horizontal transfers, because usually they "jump" within a cell... and I don't think those are inheritable, but I'll double check.

    Thanks for being such an attentive reader! :-)

  3. I'm a bit biased. When I first read about mobile elements and other alternatives to small mutations, I was almost ecstatic. Small infrequent DNA changes have never seemed enough to explain evolution, and yet that was the level of discussion and even "knowledge" for so long, even though no one had a clue re genome-scale variation. So I'm biased towards alternative sources of variation, novelty. A paper by Shapiro really got me thinking about transposons and their role in genome evolution. He argues that mobile elements "increase the efficiency of generating functional genomic novelties".

    Mobile DNA and evolution in the 21st century
    Shapiro Mobile DNA 2010, 1:4

    Maybe it's obvious now why I like your blog so much -- the first line about 10% of our DNA being viral in origin was a great hook!


  4. I see... I'll check that link later when I come back from work, in the meantime I just wanted to say that I think you're going to like my next post on Monday! :-)

    Also, I think that the changes are more frequent than originally thought, it's just that most are silent or not picked up by selection, only random drift...

    TGIF! :-)

