Debunking myths on genetics and DNA

Thursday, September 27, 2012

Healthy habits are easier when you stop thinking about it

Raising health awareness has done little so far in actually improving global health. Humans seem to be stubbornly attached to certain behaviors, even when fully aware that such behaviors pose a health risk.

Currently, the four most prevalent noncommunicable diseases are diabetes, cardiovascular disease, lung disease, and cancer. The risk of death from any of the four can be significantly lowered by changing basic behaviors: lowering the consumption of calories, alcohol and tobacco, while increasing physical activity and the consumption of fruits and vegetables. It sounds simple in theory, but certain behaviors are so ingrained in society that, despite widespread campaigns, we still haven't been able to change people's habits. This is not surprising, since much of our behavior is automatic rather than dictated by conscious thought. As Marteau et al. state in [1], not even personalized risk assessments based on gene variants and other biomarkers have succeeded in dissuading people from certain behaviors.

We are complex beings, constantly shifting from full awareness and reflective, goal-driven behavior to more automated actions where deep thoughts are far removed. The former behavior is more costly in terms of metabolic resources and energy. The latter is more efficient in our daily routine, but it has the disadvantage of taking over even when the consequences are undesired. For example, lab animals that have been trained to repeat certain behaviors will keep repeating them even when unpleasant consequences are introduced in the experimental setting. Therefore, in order to prevent noncommunicable diseases, Marteau et al. argue that we need to target automatic behaviors rather than conscious ones.

How can this be achieved?

Well, for example, by making fruits and vegetables very easy to find at the store and relegating the so-called junk food to some hidden, desolate aisle that requires extra walking to get to. Also (and I know I'm totally going against common economic rules here), making fruits and vegetables cheaper than junk food would cause a huge switch in people's eating habits. If this sounds like too much of a utopia (yeah, I can see that), here are a few more practical things that can be changed: make stairs accessible from everywhere in a building, and hide the elevators. Make the elevators so slow that it's a lot more practical to take the stairs. Make tobacco and alcohol harder to find (though I have doubts about alcohol, since I grew up in a country where alcohol is on the table every day and somehow we seem to handle alcohol addictions better than other countries with lots of rules and prohibitions). Use smaller serving portions, smaller plates, and smaller (but taller) glasses. Marteau et al. even suggest "standing desks" in classrooms to have students burn more calories (this one made me smile).

Here's my two cents. As you know, I grew up in Italy, a country that very much cherishes food and spending social time at the table. I think Italians have pushed things to the far extreme, and now, when I go back to visit, after about one hour of sitting at the table "socializing" I get a little restless. Many of the overweight problems in Italy come from spending too much time at the table. After a while you don't feel hungry anymore, but you just keep eating because food is being offered to you.

On the other hand, I see that the United States has the exact opposite problem. There's no definite time for lunch or dinner, and when you look around you see people eating at any given time of the day. This is just my personal opinion, of course, but I truly believe that introducing fixed eating times during the day could greatly help towards healthier eating behaviors. Also, we should learn from our children. When they are full, they stop eating. Parents tend to get edgy and force them to eat more, whereas maybe it should be the opposite: the children should be telling the parents to stop eating, so we can all get back into the habit of eating only when we're hungry. Unfortunately, because eating is so much a part of our social life and social celebrations, in real life things tend to get more complicated.

[1] Theresa M. Marteau, Gareth J. Hollands, & Paul C. Fletcher (2012). Changing Human Behavior to Prevent Disease: The Importance of Targeting Automatic Processes. Science.

Monday, September 24, 2012

ENCODE sheds light on non-coding variants

Back when I started studying human genetics, we were still doing single-gene associations. Namely, we would type a bunch of variants in a single gene and then do a case-control association study to see which, if any, of those variants marked an increase in disease risk. That's how breast cancer markers such as BRCA1 and BRCA2 have been found.
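To make the case-control idea a bit more concrete, here is a minimal sketch of an allelic association test for a single variant. The allele counts are made up for illustration, and the chi-square test on a 2x2 table is just one common choice, not necessarily what any particular study used.

```python
def allelic_association(case_alt, case_ref, control_alt, control_ref):
    """Return (chi-square statistic, odds ratio) for a 2x2 table of allele counts."""
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    total = case_alt + case_ref + control_alt + control_ref
    row_sums = [sum(row) for row in table]
    col_sums = [case_alt + control_alt, case_ref + control_ref]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected if the allele were unrelated to disease status
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    odds_ratio = (case_alt * control_ref) / (case_ref * control_alt)
    return chi2, odds_ratio


# Hypothetical counts: 200 alleles typed in cases, 200 in controls
chi2, odds_ratio = allelic_association(60, 140, 40, 160)
print(f"chi-square = {chi2:.2f}, odds ratio = {odds_ratio:.2f}")
```

With one degree of freedom, a chi-square statistic above 3.84 corresponds to p < 0.05 for a single test, so in this toy example the variant would look associated with the disease.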

When the Human Genome Project was completed in 2003, scientists started looking for disease risk alleles across the whole genome. The findings were puzzling: more than 90% of the disease-associated variants fell in non-coding regions. Why? One issue I've previously discussed is that when looking at tens of thousands of loci you need huge sample sizes, and when huge sample sizes aren't feasible, these studies are often underpowered. Another possible explanation lies in epistasis, and the detected signal may be the effect of some unknown correlation.
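One way to see why testing many loci demands large samples is to look at how the per-test significance threshold shrinks as the number of tests grows. The sketch below uses the Bonferroni correction, which is my choice for illustration (it is a standard, conservative correction, not something the studies discussed here are claimed to have used).

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


# From a single-gene study to a genome-wide scan
for n_tests in (1, 10_000, 1_000_000):
    cutoff = bonferroni_threshold(0.05, n_tests)
    print(f"{n_tests:>9,} tests -> p must be below {cutoff:.1e}")
```

The smaller the cutoff, the stronger the evidence each variant needs, and for a fixed effect size that means more cases and controls. With one million tests the cutoff is 5e-8, which is the threshold commonly quoted for genome-wide significance.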

However. You knew there was going to be a "however", right? Because thanks to the ENCODE project we now know that if a genetic variant falls in a non-coding region, it doesn't mean it has no effect whatsoever. ENCODE is bound to shed new light on the numerous non-coding risk alleles that genome-wide association studies (GWAS) have found.

Last time I discussed DHSs, or DNase I hypersensitive sites. These are chromatin regions where many regulatory elements have been found. In [1], Maurano et al. show that many of the non-coding variants associated with common diseases are concentrated in regulatory DNA marked by DHSs. The researchers performed genome-wide DNase I mapping across 349 cell and tissue samples. As discussed last week, regions of DNase I accessibility harbor regulatory elements. The researchers also examined the distribution of 5654 non-coding SNPs (single base variants) that had been significantly associated with some disease or trait in genome-wide studies.
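At its core, asking whether a SNP lies within a DHS is an interval lookup. Here is a toy sketch of that lookup on a single chromosome. The coordinates and SNP names are invented, and the intervals are assumed to be non-overlapping and half-open (start included, end excluded); real pipelines work per chromosome on far larger files.

```python
from bisect import bisect_right


def build_index(intervals):
    """Sort non-overlapping (start, end) intervals and split them into two lists."""
    ordered = sorted(intervals)
    return [start for start, _ in ordered], [end for _, end in ordered]


def in_any_interval(position, starts, ends):
    """True if position falls inside one of the half-open intervals."""
    i = bisect_right(starts, position) - 1  # last interval starting at or before position
    return i >= 0 and position < ends[i]


# Hypothetical DHS coordinates and SNP positions on one chromosome
dhs_sites = [(900, 1200), (100, 250), (400, 480)]
snps = {"snp_a": 120, "snp_b": 300, "snp_c": 479, "snp_d": 1200}

starts, ends = build_index(dhs_sites)
hits = {name: in_any_interval(pos, starts, ends) for name, pos in snps.items()}
print(hits)
```

Sorting once and using binary search keeps each lookup fast even with millions of sites, which matters when the list of DHSs is as long as the ones discussed in these papers.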

These are the main findings:
"Fully 76.6% of all noncoding GWAS SNPs either lie within a DHS (57.1%, 2931 SNPs) or are in complete linkage disequilibrium (LD) with SNPs in a near-by DHS (19.5%, 999 SNPs)."
To be in linkage disequilibrium means that the variant is typically inherited together with another variant, in this case a SNP that sits in a DHS. Suppose the true causal variant is at locus A, but you haven't typed locus A, you've typed locus B, and A and B are inherited together. Then B is going to light up as a strong signal in your statistical analysis. So, what Maurano et al. are saying in the above paragraph is that the non-coding SNPs either turned up in a DHS, or there was evidence that they were strongly correlated with a SNP in a nearby DHS.
"Many common disorders have been linked with early gestational exposures or environmental insults. Because of the known role of the chromatin accessibility landscape in mediating responses to cellular exposures such as hormones, we examined if DHSs harboring GWAS variants were active during fetal developmental stages. Of 2931 noncoding disease- and trait-associated SNPs within DHSs globally, 88.1% (2583) lie within DHSs active in fetal cells and tissues. Of DHSs containing disease-associated variation, 57.8% are first detected in fetal cells and tissues and persist in adult cells (“fetal origin” DHSs), whereas 30.3% are fetal stage–specific DHSs."
And finally:
"Enhancers may lie at great distances from the gene(s) they control and function through long-range regulatory interactions, complicating the identification of target genes of regulatory GWAS variants."
GWAS variants may thus affect distant genes, which need not even be on the same chromosome. Furthermore, variants in DHSs tend to alter the allelic chromatin state, thus modulating the accessibility of genes to transcription factors. Disease-linked variants were found to alter such accessibility, resulting in allelic imbalance (one allele gets transcribed more than the other one), possibly explaining their role in altering disease risk or a quantitative trait.
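The linkage disequilibrium mentioned in the first finding above can be quantified. Below is a minimal sketch that computes two standard measures, D and r², from haplotype and allele frequencies at two loci. The frequencies are invented for illustration, and reading "complete LD" as r² = 1 is my assumption about how the term is used.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and allele B at locus 2
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    """
    d = p_ab - p_a * p_b  # zero when the two alleles are inherited independently
    r2 = d ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r2


# Hypothetical case 1: A and B always travel together
d_linked, r2_linked = ld_stats(p_ab=0.3, p_a=0.3, p_b=0.3)

# Hypothetical case 2: A and B are inherited independently
d_indep, r2_indep = ld_stats(p_ab=0.09, p_a=0.3, p_b=0.3)

print(f"linked:      D = {d_linked:.2f}, r2 = {r2_linked:.2f}")
print(f"independent: D = {d_indep:.2f}, r2 = {r2_indep:.2f}")
```

When r² is close to 1, typing locus B tells you almost everything about locus A, which is why an untyped causal variant can still show up as a signal at a neighboring SNP.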

[1] Matthew T. Maurano, Richard Humbert, Eric Rynes, Robert E. Thurman, Eric Haugen, Hao Wang, Alex P. Reynolds, Richard Sandstrom, Hongzhu Qu, Jennifer Brody, Anthony Shafer, Fidencio Neri, Kristen Lee, Tanya Kutyavin, & Sandra Stehling-Sun (2012). Systematic Localization of Common Disease-Associated Variation in Regulatory DNA. Science. DOI: 10.1126/science.1222794

Sunday, September 23, 2012

Fall colors are here!

Winsor Trail, Pecos Wilderness, Santa Fe Basin, Santa Fe, NM.

Thursday, September 20, 2012

The encyclopedia of DNA - Part III

The ENCODE project effectively marked the transition from genomics to functional genomics. The goal of the Human Genome Project was to sequence the entire human genome. Once that was achieved, people realized they had only seen the tip of the iceberg. Today, the goal of functional genomics is to go one step beyond DNA sequences and understand the dynamics of gene expression, transcription, translation and all the complex pathways that lead from DNA to the making of proteins.

In order to do this, the main goal of functional genomics is to annotate regulatory elements of the genome, in other words, elements that regulate gene expression and transcription. For example, proteins called transcription factors bind to regulatory sequences and favor transcription of a gene into mRNA. Last time we learned about regulatory sequences such as promoters and enhancers, and how the ENCODE project has found a vast amount of these sequences, in particular outside and far away from the genes they regulate.

Previously, we also learned about chromatin, the "yarn" of DNA inside the nucleus, and how its configurations affect gene expression. We also learned about transcription factories inside the chromatin, where genes are recruited and transcribed.

In order to transcribe a gene, the two strands of the DNA double helix need to be separated where the gene sits. This allows the RNA polymerase to access the template strand and transcribe it. In other words, in order to be expressed, a gene needs to be accessible. To explore how "accessible" a gene is in a specific chromatin configuration, people have employed the technique of mapping regions called hypersensitive sites. These sites are highly accessible to certain enzymes called nucleases, and promoters and most regulatory elements are found in chromatin sites that are hypersensitive to one endonuclease in particular, called DNase I. Therefore, mapping DNase I hypersensitive sites (DHSs) is an efficient way of identifying regulatory DNA regions.

In [1], Thurman et al. identified nearly 2.9 million genome-wide DHSs across 125 cell types.
"Annotating these elements using ENCODE data reveals novel relationships between chromatin accessibility, transcription, DNA methylation and regulatory factor occupancy patterns. [. . .] Patterning of chromatin accessibility at many regulatory regions is organized with dozens to hundreds of co-activated elements, and the transcellular DNase I sensitivity pattern at a given region can predict cell-type-specific functional behaviours."

Chromatin accessibility is what allows transcription factors to bind to the DNA region to be transcribed. Hence, which sites are accessible and which are not plays an important role in gene expression. When transcription factors bind to their target sites, they initiate chromatin remodeling and the recruitment of other chromatin elements. These local perturbations make certain stretches of DNA accessible to nucleases, DNase I in particular.

[1] Robert E. Thurman, Eric Rynes, Richard Humbert, Jeff Vierstra, Matthew T. Maurano, Eric Haugen, Nathan C. Sheffield, Andrew B. Stergachis, et al. (2012). The accessible chromatin landscape of the human genome. Nature. DOI: 10.1038/nature11232

Monday, September 17, 2012

The encyclopedia of DNA - Part II

Last week I started discussing the exciting news about the six ENCODE papers published in the September 6 issue of Nature. If you haven't already, I highly recommend reading the review ENCODE explained [1], which has a nice summary of the papers and an excellent perspective on what these results mean.

One paragraph in particular is worth quoting:
"The authors report that the space between genes is filled with enhancers (regulatory DNA elements), promoters (the sites at which DNA’s transcription into RNA is initiated) and numerous previously overlooked regions that encode RNA transcripts that are not translated into proteins but might have regulatory roles. Of note, these results show that many DNA variants previously correlated with certain diseases lie within or very near non-coding functional DNA elements, providing new leads for linking genetic variation and disease [1]."
To review what promoters and enhancers are you can take a look at this older post.

I can't stress enough how relevant these findings are. Previously, genes were thought to be the minimal "coding" unit, so much so that the rest of the genome had been dubbed "junk DNA" (and by now you should know how much I hate that unfortunate expression!). In [2], Djebali et al. report that
"about 75% of the genome is transcribed at some point in some cells, and that genes are highly interlaced with overlapping transcripts that are synthesized from both DNA strands [1]."
"The consequent reduction in the length of ‘intergenic regions’ leads to a significant overlapping of neighbouring gene regions and prompts a redefinition of a gene [2]."
Djebali et al. looked at RNA isolates in the whole cell, nucleus and cytosol of 15 different cell lines. They found novel exons, novel splice junctions and sites, and novel transcripts. Many of these elements are in intergenic regions, and many are antisense. They also investigated which of these newly found elements show evidence of protein expression.

When they looked at expression patterns specific to cell lines, they found that gene expression levels were similar across cell lines. The majority of protein-coding genes were expressed across all cell lines, and only a minority (~7%) were specific to certain cell lines. On the other hand, the researchers found that many long non-coding RNAs were largely cell-line specific, while only 10% were expressed across all cell lines. I found this bit to be quite intriguing, as it seems to suggest that RNAs have a large role in controlling gene expression across cell lines.
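To give a feel for what "expressed across all cell lines" versus "cell-line specific" means in practice, here is a toy sketch that labels each transcript by how many cell lines it is detected in. The expression values, the transcript names, and the detection threshold are all invented; the actual criteria used by Djebali et al. may well differ.

```python
def classify_by_breadth(expression, threshold=1.0):
    """Label each transcript by the number of cell lines where it reaches the threshold."""
    labels = {}
    for name, values in expression.items():
        n_expressed = sum(1 for value in values if value >= threshold)
        if n_expressed == len(values):
            labels[name] = "ubiquitous"
        elif n_expressed == 1:
            labels[name] = "cell-line specific"
        elif n_expressed == 0:
            labels[name] = "not expressed"
        else:
            labels[name] = "intermediate"
    return labels


# Hypothetical expression levels in four cell lines (arbitrary units)
expression = {
    "coding_1": [12.0, 9.5, 11.2, 10.1],
    "coding_2": [3.2, 4.1, 2.8, 3.9],
    "lncRNA_1": [0.0, 7.4, 0.1, 0.0],
    "lncRNA_2": [0.2, 0.0, 0.0, 5.5],
    "lncRNA_3": [2.0, 0.0, 3.1, 0.0],
}

labels = classify_by_breadth(expression)
for name, label in labels.items():
    print(f"{name}: {label}")
```

In this made-up table the protein-coding genes come out as ubiquitous and most of the long non-coding RNAs as cell-line specific, mirroring the pattern described above.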

Overall, their findings yield an increased overlap in what they call "genic regions". What were previously thought to be "deserts" between genes aren't so deserted after all; rather, they are populated by lots and lots of regulatory elements. In their final discussion, Djebali et al. conclude:
"The likely continued reduction in the lengths of intergenic regions will steadily lead to the overlap of most genes previously assumed to be distinct genetic loci. This supports and is consistent with earlier observations of a highly interleaved transcribed genome, but more importantly, prompts the reconsideration of the definition of a gene."

[1] Joseph R. Ecker, Wendy A. Bickmore, Inês Barroso, Jonathan K. Pritchard, Yoav Gilad, & Eran Segal (2012). Genomics: ENCODE explained. Nature. DOI: 10.1038/489052a

[2] Sarah Djebali, Carrie A. Davis, Angelika Merkel, Alex Dobin, Timo Lassmann, Ali Mortazavi, Andrea Tanzer, Julien Lagarde, Wei Lin, Felix Schlesinger, et al. (2012). Landscape of transcription in human cells. Nature. DOI: 10.1038/nature11233

Monday, September 10, 2012

The encyclopedia of DNA - Part I

The raw numbers of the human genome: three billion base pairs, of which roughly 1% code for proteins, spread across the roughly 20,000 genes in our genome. So, what's all the extra stuff for?
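Just to put those raw numbers side by side, here is a quick back-of-the-envelope calculation. It uses only the rounded figures quoted above (three billion base pairs, roughly 1%, 20,000 genes), so the results are order-of-magnitude estimates and nothing more.

```python
GENOME_BASE_PAIRS = 3_000_000_000  # rounded genome size quoted above
GENE_FRACTION = 0.01               # the "roughly 1%" quoted above
GENE_COUNT = 20_000                # rounded gene count quoted above

gene_base_pairs = GENOME_BASE_PAIRS * GENE_FRACTION
per_gene = gene_base_pairs / GENE_COUNT
remaining = GENOME_BASE_PAIRS - gene_base_pairs

print(f"base pairs in that 1%:    {gene_base_pairs:,.0f}")
print(f"average per gene:         {per_gene:,.0f}")
print(f"base pairs left over:     {remaining:,.0f}")
```

That leaves about 2.97 billion base pairs unaccounted for, which is the "extra stuff" the rest of this post is about.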

Sequencing the whole human genome, a first draft of which was published in 2001, was only the beginning. The next step in disentangling the puzzle was to assign biochemical functions to those three billion base pairs.
"The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions" [1].
Let's start with a bit of a refresher.

Regulatory regions: these are regions in the genome that regulate gene transcription. Thanks to these regulatory sequences, skin cells only express "skin" genes, brain cells express "brain" genes, and so on. Promoters, for example, are regulatory sequences found immediately before the start of the gene, on the same strand, and they initiate the transcription of the gene. There are other regions, called enhancers, which also promote transcription. However, unlike promoters, enhancers need not be near the gene. They don't even need to be on the same chromosome, and some enhancers have been found in introns, regions of a gene that are removed prior to making mRNA.

Transcription factors: I talked a little bit about them last week. These are proteins that can either promote or block the recruitment of RNA polymerase, and therefore either activate or silence a gene.

And finally, you can review the concepts of chromatin structure and histone modification in a few previous posts.

All these concepts are useful to understand that there's a lot, and I mean A LOT going on, between genes and phenotype. Genes are only the starting point. You can't just look at genes alone in order to try and infer a phenotype.

ENCODE started in 2003 with the aim of annotating all functional regions of the genome, where "functional" doesn't just mean encoding proteins, but also presenting some biochemical signature such as protein binding or a specific chromatin structure. The latest findings published in Nature: over 70,000 promoter regions and nearly 400,000 enhancer regions that regulate gene expression.

You can see the complications and layers to this: while we have one unique genome, which is identical in all nucleated cells, once you start looking for function, you have to look at the whole genome and chromatin structure and RNA transcripts of all cell lines, as each cell line will have its own activated and silenced genes, its own chromatin signatures, and so on ... whew, that's A LOT!

So far the ENCODE Project Consortium has integrated the data from 1,640 experiments involving 147 different cell types. They saw that
"The vast majority (80.4%) of the human genome participates in at least one biochemical RNA- and/or chromatin-associated event in at least one cell type."
Many more cell lines are yet to be explored, and yet these initial results already shed light on puzzling questions, like, for example: why do nearly 90% of SNPs found in whole genome disease association studies fall outside genes?
"Single nucleotide polymorphisms (SNPs) associated with disease by GWAS are enriched within non-coding functional elements, with a majority residing in or near ENCODE-defined regions that are outside of protein-coding genes. In many cases, the disease phenotypes can be associated with a specific cell type or transcription factor."
I can't tell you how excited I am about these results, as I started blogging a little over one year ago raising exactly the point that junk DNA should NOT be called junk DNA.

I'm coming down with the flu (how do you explain to your kids NOT to cough in your face when they have a bug? Sigh), so this will be all for this time. But I've got all the Nature papers printed out and will be talking more about them in the next few weeks. A lot of new (and exciting) stuff to learn!

[1] The ENCODE Project Consortium (2012). An integrated encyclopedia of DNA elements in the human genome. Nature. DOI: 10.1038/nature11247

Friday, September 7, 2012


The current issue of Nature is dedicated to ENCODE, the Encyclopedia of DNA Elements. There are six open access articles that discuss the latest findings concerning the vast genomic area that lies between genes. Many of these "junk DNA" regions do indeed have a function!

I've been swamped at work, but I'll do my best to read the papers in the next couple of weeks and discuss them here. In the meantime, you can find the papers at, they are open access. Enjoy!

Monday, September 3, 2012

Transcription factories for gene expression: the hard working units of the nucleus

You've probably heard it many times already: if you could stretch out the DNA contained in any one nucleated cell in your body, it would be 2 meters (~6 feet) long. Now imagine packing this 2-meter long molecule into a sphere whose diameter is of the order of a few micrometers (a micrometer is one millionth of a meter). Yes, it's going to be tightly packed in there, yet those genes have to be accessible to the "workers" that come in and perform daily tasks such as gene transcription, replication, and DNA repair. Clearly, which genes are accessible and which aren't is going to play a major role in the cell's life and development.
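If you're curious where the 2 meters comes from, here is a quick sketch of the arithmetic. It assumes about three billion base pairs per haploid genome, two copies per cell, and roughly 0.34 nanometers of length per base pair (the usual textbook figure for the DNA double helix). The 6-micrometer nucleus diameter is my own illustrative pick for "a few micrometers".

```python
BASE_PAIRS_HAPLOID = 3_000_000_000  # approximate size of one genome copy
COPIES_PER_CELL = 2                 # one copy from each parent
RISE_PER_BASE_PAIR_M = 0.34e-9      # approximate length per base pair, in meters
NUCLEUS_DIAMETER_M = 6e-6           # assumed value for "a few micrometers"

dna_length_m = BASE_PAIRS_HAPLOID * COPIES_PER_CELL * RISE_PER_BASE_PAIR_M
length_to_diameter = dna_length_m / NUCLEUS_DIAMETER_M

print(f"stretched-out DNA length: {dna_length_m:.2f} m")
print(f"length / nucleus diameter: {length_to_diameter:,.0f}")
```

So the molecule is a few hundred thousand times longer than the compartment it lives in, which is why chromatin packing, and which parts of it stay accessible, matters so much.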

The chromatin, the ensemble of DNA and proteins inside the nucleus, is dynamically regulated. For gene expression, active genes relocate from chromosome regions and cluster into subnuclear compartments called "transcription factories for gene expression."

As you know, transcription is one of the fundamental steps in the making of proteins: the enzyme RNA polymerase II creates a complementary strand of RNA (a precursor of mRNA) from the active gene. This precursor is then processed into mature mRNA, which is translated into the protein's amino acid sequence. The concept of transcription factories comes from the observation that specific regions in the nucleus are highly enriched in RNA polymerase II, and those are the regions from which new RNA transcripts emerge. A second observation is that distant loci, often on different chromosomes, can interact during regulation through long-range regulatory contacts.
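As a reminder of what "creating a complementary strand of RNA" means, here is a toy sketch of the base-pairing rule. It only does the complementary pairing for a short, made-up template sequence, and it leaves out everything else that happens in a real cell (splicing, capping, and so on).

```python
# DNA template base -> RNA base that pairs with it (uracil replaces thymine in RNA)
PAIRING = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5):
    """Return the RNA (written 5' to 3') complementary to a DNA template read 3' to 5'."""
    return "".join(PAIRING[base] for base in template_3_to_5)


# Hypothetical template strand, written 3' to 5'
template = "TACGGCATT"
rna = transcribe(template)
print(rna)
```

Because the template is read 3' to 5' while the RNA grows 5' to 3', no reversal of the string is needed here; if you start from the template written 5' to 3', you would have to reverse it first.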
"Increasing numbers of examples suggest that regulatory DNA elements also seem capable of undergoing functional contacts with genes located on other chromosomes. [...] By contrast, temporarily inactive alleles are positioned away from transcription factories, suggesting that genes migrate to these subnuclear sites in order to be transcribed. Crucially, the number of transcription factories per cell is severely limited compared to the number of expressed genes, compelling genes to share the same transcription factory [1]."

The above figure is a schematic of a transcription factory: active genes from different chromosomes are recruited from the chromatin. As transcription proceeds and new RNAs are formed, the templates are reeled through the factory, bringing nearby downstream genes along with them. Transcripts generated in close proximity within a transcription factory have a greater chance of undergoing trans-splicing; in other words, the two transcripts are joined into one even though they originated from different RNA polymerases. The resulting joint RNA is called chimeric RNA. A few studies have observed proteins generated from chimeric RNAs.

In addition to trans-splicing, close proximity in a transcription factory increases the chances of translocation, i.e. one genomic region being moved to a different locus.
"It is puzzling that a genome conformation that increases the risk of potentially grave translocations can evolutionarily persist. We speculate that three-dimensional gene clustering of transcribed loci must elicit evolutionary advantages that outweigh the dangers of translocations."
As Schoenfelder et al. conclude,
"A major challenge will be to decipher the relation between these genome conformation changes and the numerous epigenetic alterations of the genome, allowing their integration into a comprehensive picture of the spatial and functional organization of the nucleus."

[1] Schoenfelder, Stefan, et al. (2010). The transcriptional interactome: gene expression in 3D. Current Opinion in Genetics & Development. DOI: 10.1016/j.gde.2010.02.002