CHIMERAS: ENCODE


Monday, September 24, 2012

ENCODE sheds light on non-coding variants

Back when I started studying human genetics, we were still doing single-gene associations. That is, we would type a bunch of variants in a single gene and then do a case-control association study to see which, if any, of those variants marked an increase in disease risk. That's how breast cancer markers such as BRCA1 and BRCA2 were found.
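To make the case-control idea concrete, here is a minimal sketch of an allelic association test for a single variant. The allele counts are made up for illustration, and the function name `allelic_association` is my own; real studies use dedicated software and correct for confounders such as ancestry.

```python
from math import erfc, sqrt

def allelic_association(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Return (odds_ratio, chi_square, p_value) for a 2x2 table of allele counts."""
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    total = case_alt + case_ref + ctrl_alt + ctrl_ref
    row_sums = [sum(row) for row in table]
    col_sums = [case_alt + ctrl_alt, case_ref + ctrl_ref]

    # Pearson chi-square: compare observed counts with those expected
    # if the allele were independent of case/control status.
    chi_square = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi_square += (table[i][j] - expected) ** 2 / expected

    odds_ratio = (case_alt * ctrl_ref) / (case_ref * ctrl_alt)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(chi_square / 2))
    return odds_ratio, chi_square, p_value

# Hypothetical counts: risk allele seen on 60 of 200 case chromosomes
# and on 40 of 200 control chromosomes.
odds_ratio, chi_square, p_value = allelic_association(60, 140, 40, 160)
print(f"OR={odds_ratio:.2f} chi2={chi_square:.2f} p={p_value:.3f}")
```

An odds ratio above 1 means the allele is more common among cases than among controls; the p-value says how surprising that difference would be if the allele had nothing to do with the disease.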

When the Human Genome Project was completed in 2003, scientists started looking for disease risk alleles across the whole genome. The findings were puzzling: more than 90% of the disease-associated variants fell in non-coding regions. Why? One issue I've previously discussed is that when looking at tens of thousands of loci, you need huge sample sizes, and when huge sample sizes aren't feasible, these studies are underpowered. Another possible explanation lies in epistasis (interactions between loci): the detected signal may be the effect of some unknown correlation rather than of the variant itself.
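The sample-size problem comes from multiple testing: the more loci you test, the stricter each individual test has to be. A minimal sketch, assuming a simple Bonferroni correction (one common choice, not necessarily what any given study used) and illustrative numbers of tests:

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

def expected_false_positives(alpha, n_tests):
    """Expected number of hits at an uncorrected threshold if no locus has any effect."""
    return alpha * n_tests

# Hypothetical numbers of tested loci.
for n_tests in (1, 50_000, 1_000_000):
    print(
        n_tests,
        bonferroni_threshold(0.05, n_tests),
        expected_false_positives(0.05, n_tests),
    )
```

With a million tests the per-test threshold drops to 5 × 10^-8, and a weak effect only reaches that level of evidence with a very large number of cases and controls.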

However. You knew there was going to be a "however", right? Because thanks to the ENCODE project we now know that if a genetic variant falls in a non-coding region, it doesn't mean it has no effect whatsoever. ENCODE is bound to shed new light on these numerous non-coding risk alleles that genome-wide association studies (GWAS) have found.

Last time I discussed DHSs, or DNase I hypersensitive sites. These are chromatin regions where many regulatory elements have been found. In [1], Maurano et al. show that many of the non-coding variants associated with common diseases are concentrated in regulatory DNA marked by DHSs. The researchers performed genome-wide DNase I mapping across 349 cell and tissue types. As discussed last week, regions of DNase I accessibility harbor regulatory elements. The researchers also examined the distribution of 5654 non-coding SNPs (single base variants) that had been significantly associated with some disease or trait in genome-wide studies.
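At its core, asking how many SNPs fall in DHSs is an interval-overlap question. Here is a minimal sketch on a single toy chromosome; the coordinates are invented, the function name is my own, and this is not the pipeline Maurano et al. used.

```python
from bisect import bisect_right

def fraction_in_intervals(positions, intervals):
    """Fraction of positions that fall inside any interval.

    `intervals` must be sorted, non-overlapping, half-open (start, end) pairs
    on the same chromosome as `positions`.
    """
    starts = [start for start, _ in intervals]
    hits = 0
    for pos in positions:
        # Index of the last interval whose start is <= pos, or -1 if none.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < intervals[i][1]:
            hits += 1
    return hits / len(positions)

# Toy data: three DHSs and seven SNP positions (all hypothetical).
dhs = [(100, 250), (400, 520), (900, 1000)]
snps = [120, 300, 450, 519, 520, 950, 2000]
fraction = fraction_in_intervals(snps, dhs)
print(f"{fraction:.1%} of SNPs lie within a DHS")
```

The binary search keeps each lookup fast even with millions of DHSs, which matters at genome scale.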

These are the main findings:

"Fully 76.6% of all noncoding GWAS SNPs either lie within a DHS (57.1%, 2931 SNPs) or are in complete linkage disequilibrium (LD) with SNPs in a near-by DHS (19.5%, 999 SNPs)."

To be in linkage disequilibrium means that the variant is typically inherited together with a variant in a DHS. Suppose the true causal variant is at locus A, but you haven't typed locus A, you've typed locus B, and A and B are inherited together. Then B is going to light up as a strong signal in your statistical analysis. So, what Maurano et al. are saying in the above paragraph is that the non-coding SNPs either turned up in a DHS, or they found evidence that they were strongly correlated with SNPs in one such site.
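One common way to put a number on "inherited together" is the r² statistic, computed from allele and haplotype frequencies. A minimal sketch with made-up frequencies (the function name and the values are my own illustration):

```python
def ld_r_squared(p_ab, p_a, p_b):
    """r^2 between two biallelic loci.

    p_a, p_b: frequencies of allele A at the first locus and B at the second.
    p_ab: frequency of haplotypes carrying both A and B.
    """
    d = p_ab - p_a * p_b  # D: excess of AB haplotypes over independence
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

# A and B always travel together: every haplotype with A also has B.
r2_complete = ld_r_squared(p_ab=0.3, p_a=0.3, p_b=0.3)

# A and B are independent: AB haplotypes occur at the product frequency.
r2_independent = ld_r_squared(p_ab=0.3 * 0.3, p_a=0.3, p_b=0.3)

print(round(r2_complete, 6), round(r2_independent, 6))
```

When r² is close to 1, typing B tells you almost everything about A, which is why an untyped causal variant can show up through its typed neighbor.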

"Many common disorders have been linked with early gestational exposures or environmental insults. Because of the known role of the chromatin accessibility landscape in mediating responses to cellular exposures such as hormones, we examined if DHSs harboring GWAS variants were active during fetal developmental stages. Of 2931 noncoding disease- and trait-associated SNPs within DHSs globally, 88.1% (2583) lie within DHSs active in fetal cells and tissues. Of DHSs containing disease-associated variation, 57.8% are first detected in fetal cells and tissues and persist in adult cells (“fetal origin” DHSs), whereas 30.3% are fetal stage–specific DHSs."

And finally:

"Enhancers may lie at great distances from the gene(s) they control and function through long-range regulatory interactions, complicating the identification of target genes of regulatory GWAS variants."

In other words, GWAS variants in regulatory DNA can affect distant genes, not just the nearest one. Furthermore, these variants in DHSs tend to alter chromatin state in an allele-specific way, thus modulating the accessibility of genes to transcription factors. Disease-linked variants were found to alter such accessibility, resulting in allelic imbalance (one allele gets transcribed more than the other one), possibly explaining their role in altering the disease risk or quantitative trait.
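Allelic imbalance can be checked at a heterozygous site by counting the sequencing reads that carry each allele and asking whether the split departs from 50/50. A minimal sketch using an exact binomial test; the read counts are invented and this is only one simple way to run such a test, not necessarily the one used in the paper.

```python
from math import comb

def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely than
    observing k successes out of n trials.
    """
    probs = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = probs[k]
    # Small tolerance so that ties are not lost to floating-point rounding.
    return min(1.0, sum(q for q in probs if q <= observed * (1 + 1e-9)))

# Hypothetical heterozygous site covered by 40 reads.
p_imbalanced = binomial_two_sided_p(k=30, n=40)  # 30 reads carry one allele
p_balanced = binomial_two_sided_p(k=20, n=40)    # an even 20/20 split
print(f"{p_imbalanced:.4f} {p_balanced:.4f}")
```

A small p-value for the 30/10 split suggests that one allele really is favored, while the even split gives no evidence of imbalance.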

[1] Matthew T. Maurano, Richard Humbert, Eric Rynes, Robert E. Thurman, Eric Haugen, Hao Wang, Alex P. Reynolds, Richard Sandstrom, Hongzhu Qu, Jennifer Brody, Anthony Shafer, Fidencio Neri, Kristen Lee, Tanya Kutyavin, & Sandra Stehling-Sun (2012). Systematic Localization of Common Disease-Associated Variation in Regulatory DNA. Science. DOI: 10.1126/science.1222794

Thursday, September 20, 2012

The encyclopedia of DNA - Part III

The ENCODE project effectively marked the transition from genomics to functional genomics. The goal of the Human Genome Project was to sequence the entire human genome. Once that was achieved, people realized they had only scratched the surface. Today, the goal of functional genomics is to go one step beyond DNA sequences, and understand the dynamics of gene expression, transcription, translation and all the complex pathways that lead from DNA to the making of proteins.

In order to do this, the main goal of functional genomics is to annotate regulatory elements of the genome, in other words, elements that regulate gene expression and transcription. For example, proteins called transcription factors bind to regulatory sequences and favor transcription of a gene into mRNA. Last time we learned about regulatory sequences such as promoters and enhancers, and how the ENCODE project has found a vast amount of these sequences, in particular outside and far away from the genes they regulate.

Previously, we also learned about chromatin, the "yarn" of DNA inside the nucleus, and how its configurations affect gene expression. We also learned about transcription factories inside the chromatin, where genes are recruited and transcribed.

In order to transcribe a gene, the two strands of the DNA double helix where the gene sits need to be separated. This will allow the RNA polymerase to access the strand where the gene sits and transcribe it. In other words, in order to be expressed, a gene needs to be accessible. To explore how "accessible" a gene is in a specific chromatin configuration, people have employed the technique of mapping regions called hypersensitive sites. These sites are highly accessible to certain enzymes called nucleases, and promoters and most regulatory elements are found in chromatin sites that are hypersensitive to one endonuclease in particular, called DNase I. Therefore, mapping DNase I hypersensitive sites (DHSs) is an efficient way of identifying regulatory DNA regions.

In [1], Thurman et al. identified nearly 2.9 million genome-wide DHSs across 125 cell types.

"Annotating these elements using ENCODE data reveals novel relationships between chromatin accessibility, transcription, DNA methylation and regulatory factor occupancy patterns. [. . .] Patterning of chromatin accessibility at many regulatory regions is organized with dozens to hundreds of co-activated elements, and the transcellular DNase I sensitivity pattern at a given region can predict cell-type-specific functional behaviours."

Chromatin accessibility is what allows transcription factors to bind to the DNA region to be transcribed. Hence, which sites are accessible and which are not plays an important role in gene expression. When transcription factors bind to their target sites, they initiate chromatin remodeling and the recruitment of other chromatin elements. These local perturbations make certain stretches of DNA accessible to nucleases, DNase I in particular.

[1] Robert E. Thurman, Eric Rynes, Richard Humbert, Jeff Vierstra, Matthew T. Maurano, Eric Haugen, Nathan C. Sheffield, Andrew B. Stergachis, et al. (2012). The accessible chromatin landscape of the human genome. Nature. DOI: 10.1038/nature11232

Monday, September 17, 2012

The encyclopedia of DNA - Part II

Last week I started discussing the exciting news about the six ENCODE papers published in the Nature September 6 issue. If you haven't already, I highly recommend reading the review ENCODE explained [1], which has a nice summary of the papers and an excellent perspective on what these results mean.

One paragraph in particular is worth quoting:

"The authors report that the space between genes is filled with enhancers (regulatory DNA elements), promoters (the sites at which DNA’s transcription into RNA is initiated) and numerous previously overlooked regions that encode RNA transcripts that are not translated into proteins but might have regulatory roles. Of note, these results show that many DNA variants previously correlated with certain diseases lie within or very near non-coding functional DNA elements, providing new leads for linking genetic variation and disease [1]."

To review what promoters and enhancers are you can take a look at this older post.

I can't stress enough how relevant these findings are. Previously, genes were thought to be the minimal "coding" unit, so much so that the rest of the genome had been dubbed "junk DNA" (and by now you should know how much I hate that unfortunate expression!). In [2], Djebali et al. report that

"about 75% of the genome is transcribed at some point in some cells, and that genes are highly interlaced with overlapping transcripts that are synthesized from both DNA strands [1]."

"The consequent reduction in the length of ‘intergenic regions’ leads to a significant overlapping of neighbouring gene regions and prompts a redefinition of a gene [2]."

Djebali et al. looked at RNA isolates in the whole cell, nucleus and cytosol of 15 different cell lines. They found novel exons, novel splice junctions and sites, and novel transcripts. Many of these elements are in intergenic regions, and many are antisense. They also investigated which of these newly found elements show evidence of protein expression.

When they looked at expression patterns specific to cell lines, they found that gene expression levels were similar across cell lines. The majority of protein-coding genes were expressed across all cell lines, and only a minority (~7%) were specific to certain cell lines. On the other hand, the researchers found many long non-coding RNAs that were largely cell-line specific, while only 10% were expressed across all cell lines. I found this bit to be quite intriguing, as it seems to suggest that RNAs have a large role in controlling gene expression across cell lines.

Overall, their findings yield an increased overlap in what they call "genic regions". What were previously thought to be "deserts" between genes aren't so deserted after all; rather, they are populated by lots and lots of regulatory elements. In their final discussion, Djebali et al. conclude

"The likely continued reduction in the lengths of intergenic regions will steadily lead to the overlap of most genes previously assumed to be distinct genetic loci. This supports and is consistent with earlier observations of a highly interleaved transcribed genome, but more importantly, prompts the reconsideration of the definition of a gene."

[1] Joseph R. Ecker, Wendy A. Bickmore, Inês Barroso, Jonathan K. Pritchard, Yoav Gilad, & Eran Segal (2012). Genomics: ENCODE explained. Nature. DOI: 10.1038/489052a

[2] Sarah Djebali, Carrie A. Davis, Angelika Merkel, Alex Dobin, Timo Lassmann, Ali Mortazavi, Andrea Tanzer, Julien Lagarde, Wei Lin, Felix Schlesinger, et al. (2012). Landscape of transcription in human cells. Nature. DOI: 10.1038/nature11233

Monday, September 10, 2012

The encyclopedia of DNA - Part I

The raw numbers of the human genome: three billion base pairs, of which roughly 1% fall into the 20,000 genes in our genome. So, what's all the extra stuff for?

Sequencing the whole human genome, with the first draft published in 2001, was only the beginning. The next step in disentangling the puzzle was to assign biochemical functions to those three billion base pairs.

"The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions" [1].

Let's start with a bit of a refresher.

Regulatory regions: these are regions in the genome that regulate gene transcription. Thanks to these regulatory sequences, skin cells only express "skin" genes, brain cells express "brain" genes, and so on. Promoters, for example, are regulatory sequences found immediately before the start of the gene, on the same strand, and they initiate the transcription of the gene. There are other regions, called enhancers, which also promote transcription. However, contrary to promoters, enhancers need not be near the gene. They don't even need to be on the same chromosome, and some enhancers have been found in introns, regions of a gene that are removed prior to making mRNA.

Transcription factors: I talked a little bit about them last week. These are proteins that can either promote or block the recruitment of RNA polymerase, and therefore either activate or silence a gene.

And finally, you can review the concepts of chromatin structure and histone modification in a few previous posts.

All these concepts are useful to understand that there's a lot, and I mean A LOT going on, between genes and phenotype. Genes are only the starting point. You can't just look at genes alone in order to try and infer a phenotype.

Started in 2003, the aim of ENCODE was to annotate all functional regions of the genome, where by "functional" they don't just mean encoding proteins, but also presenting some biochemical signature such as protein binding or a specific chromatin structure. The latest findings published in Nature: over 70,000 promoter-like regions and nearly 400,000 enhancer-like regions that regulate gene expression.

You can see the complications and layers to this: while we have one unique genome, which is identical in all nucleated cells, once you start looking for function, you have to look at the whole genome and chromatin structure and RNA transcripts of all cell lines, as each cell line will have its own activated and silenced genes, its own chromatin signatures, and so on ... whew, that's A LOT!

So far the ENCODE Project Consortium has integrated the data from 1,640 experiments involving 147 different cell types. They saw that

"The vast majority (80.4%) of the human genome participates in at least one biochemical RNA- and/or chromatin-associated event in at least one cell type."

Many more cell lines are yet to be explored, and yet these initial results already shed light on puzzling questions, like, for example: why do nearly 90% of SNPs found in whole genome disease association studies fall outside genes?

"Single nucleotide polymorphisms (SNPs) associated with disease by GWAS are enriched within non-coding functional elements, with a majority residing in or near ENCODE-defined regions that are outside of protein-coding genes. In many cases, the disease phenotypes can be associated with a specific cell type or transcription factor."

I can't tell you how excited I am about these results, as I started blogging a little over one year ago raising exactly the point that junk DNA should NOT be called junk DNA.

I'm coming down with the flu (how do you explain to your kids NOT to cough in your face when they have a bug? Sigh), so this will be all for this time. But I've got all the Nature papers printed out and will be talking more about them in the next few weeks. A lot of new (and exciting) stuff to learn!

[1] The ENCODE Project Consortium (2012). An integrated encyclopedia of DNA elements in the human genome. Nature. DOI: 10.1038/nature11247

Friday, September 7, 2012

ENCODE

The current issue of Nature is dedicated to ENCODE, the Encyclopedia of DNA Elements. There are six open access articles that discuss the latest findings concerning the vast genomic area that lies between genes. Many of these "junk DNA" regions do indeed have a function!

I've been swamped at work, but I'll do my best to read the papers in the next couple of weeks and discuss them here. In the meantime, you can find the papers at www.nature.com; they are open access. Enjoy!