The raw numbers of the human genome: three billion base pairs, of which roughly 1% fall into the 20,000 genes in our genome. So, what's all the extra stuff for?
Sequencing the whole human genome, in 2001, was only the beginning. The next step in solving the puzzle was to assign biochemical functions to those three billion base pairs.
"The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions."
Let's start with a bit of a refresher.
Regulatory regions: these are regions in the genome that regulate gene transcription. Thanks to these regulatory sequences, skin cells only express "skin" genes, brain cells express "brain" genes, and so on. Promoters, for example, are regulatory sequences found immediately before the start of the gene, on the same strand, and they initiate the transcription of the gene. There are other regions, called enhancers, which also promote transcription. However, unlike promoters, enhancers need not be near the gene. They don't even need to be on the same chromosome, and some enhancers have been found in introns, the regions of a gene that are removed before the mature mRNA is made.
Transcription factors: I talked a little bit about them last week. These are proteins that can either promote or block the recruitment of RNA polymerase, and therefore either activate or silence a gene.
And finally, you can review the concepts of chromatin structure and histone modification in a few previous posts.
All these concepts are useful to understand that there's a lot, and I mean A LOT going on, between genes and phenotype. Genes are only the starting point. You can't just look at genes alone in order to try and infer a phenotype.
ENCODE started in 2003 with the aim of annotating all functional regions of the genome, where by "functional" they don't just mean encoding proteins, but also presenting some biochemical signature such as protein binding or a specific chromatin structure. The latest findings published in Nature: over 70,000 promoter regions and nearly 400,000 enhancer regions that regulate gene expression.
You can see the complications and layers to this: while we have one unique genome, which is identical in all nucleated cells, once you start looking for function, you have to look at the whole genome and chromatin structure and RNA transcripts of all cell lines, as each cell line will have its own activated and silenced genes, its own chromatin signatures, and so on ... whew, that's A LOT!
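To make the "layers" idea a bit more concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the 1,000-base "genome", the three cell types, and the coordinates of the active regions are hypothetical toy values, not ENCODE data. It only shows the bookkeeping: each cell type marks its own set of regions, and a base counts as "active" overall if it is marked in at least one cell type.

```python
# Hypothetical toy data: a made-up 1,000-base "genome" and made-up
# active regions per cell type, as half-open (start, end) intervals.
GENOME_LENGTH = 1000
ACTIVE_REGIONS = {
    "skin": [(0, 200), (450, 500)],
    "brain": [(150, 300), (700, 900)],
    "liver": [(480, 600)],
}


def merge(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_fraction(intervals, genome_length):
    """Fraction of the genome covered by at least one interval."""
    return sum(end - start for start, end in merge(intervals)) / genome_length


# Coverage within each cell type on its own.
per_cell_type = {
    name: covered_fraction(regions, GENOME_LENGTH)
    for name, regions in ACTIVE_REGIONS.items()
}

# Coverage when a base only needs to be active in at least one cell type.
all_regions = [r for regions in ACTIVE_REGIONS.values() for r in regions]
any_cell_type = covered_fraction(all_regions, GENOME_LENGTH)

print(per_cell_type)
print(any_cell_type)
```

In this toy example no single cell type covers much of the genome, but the pooled coverage is larger than any one of them, which is why looking at more cell types keeps turning up more active regions.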
So far the ENCODE Project Consortium has integrated the data from 1,640 experiments involving 147 different cell types. They saw that
"The vast majority (80.4%) of the human genome participates in at least one biochemical RNA- and/or chromatin-associated event in at least one cell type."
Many more cell lines are yet to be explored, and yet these initial results already shed light on puzzling questions, like, for example: why do nearly 90% of the SNPs found in genome-wide association studies (GWAS) of disease fall outside genes?
"Single nucleotide polymorphisms (SNPs) associated with disease by GWAS are enriched within non-coding functional elements, with a majority residing in or near ENCODE-defined regions that are outside of protein-coding genes. In many cases, the disease phenotypes can be associated with a specific cell type or transcription factor."
I can't tell you how excited I am about these results, as I started blogging a little over one year ago raising exactly the point that junk DNA should NOT be called junk DNA.
I'm coming down with the flu (how do you explain to your kids NOT to cough in your face when they have a bug? Sigh), so this will be all for this time. But I've got all the Nature papers printed out and will be talking more about them in the next few weeks. A lot of new (and exciting) stuff to learn!
The ENCODE Project Consortium (2012). An integrated encyclopedia of DNA elements in the human genome. Nature. DOI: 10.1038/nature11247