A little while ago, I wrote a tiny bit about the 1000 Genomes Project, in which scientists hoped to sequence the genomes of 1000 individuals and use them as a basis of comparison to pin down the genetic variation contributing to disease. About a week ago, the consortium published their findings, and somehow I missed it: shock horror.
What have they done?
Well, to give you the fine detail, a faintly ridiculous number of collaborators have sequenced the genomes of 1092 people. They have done some low coverage whole genome sequencing (on average each locus is sequenced 2-6x per individual), but also lots of really deep (50-100x) exome sequencing (the exome is the expressed bits of the genome, so it is where we expect most of the interesting variation to be). Overall around 20 Gb (that’s 20 000 000 000 nucleotides) of sequence was generated. They then used an algorithm to align all those sequences and produce a haplotype map. A haplotype is basically a particular pattern of markers: so if two people had the gene sequences
Then their haplotypes would be TCT and CGA: we don’t need to know all of the nucleotides that remain the same, only the ones that are different and therefore important. Haplotypes don’t have to include only these single nucleotide polymorphisms (single-letter changes, known as SNPs). Different versions of genes may also include insertions or deletions (indels), and these are included as part of the haplotype.
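If it helps to see the idea in code, here is a minimal sketch of pulling haplotypes out of two aligned sequences. The two sequences are made up purely for illustration (they are not from the paper), chosen so that the variable sites spell out TCT and CGA, and the function name `haplotypes` is my own. A real pipeline also has to handle indels, phasing and sequencing errors, none of which are attempted here.

```python
def haplotypes(seq_a, seq_b):
    """Return the variable positions and the bases each sequence carries there."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Keep only the positions where the two sequences disagree.
    variable_sites = [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]
    hap_a = "".join(seq_a[i] for i in variable_sites)
    hap_b = "".join(seq_b[i] for i in variable_sites)
    return variable_sites, hap_a, hap_b


# Hypothetical aligned sequences, invented for this example.
person_1 = "AATGCCGTCAAT"
person_2 = "AACGCCGTGAAA"

sites, hap_1, hap_2 = haplotypes(person_1, person_2)
print(sites, hap_1, hap_2)  # [2, 8, 11] TCT CGA
```

Everything that is identical between the two people is thrown away, which is why a haplotype map is so much more compact than the raw sequence.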
How many SNPs and indels are we talking here?
Lots! The final validated haplotype map includes 38 million SNPs, 1.4 million short indels and more than 14,000 larger deletions.
Did they really need to do 1000?
Yes! The coverage needed to identify those 38 million SNPs is actually pretty crazy. Some of them are pretty common, occurring in over 5% of the population and around 95% of these were identified during the pilot stage, but there are plenty of other less common SNPs that didn’t appear in the pilot at all. A major goal of the study was to identify more than 95% of SNPs at 1% frequency, but the authors even hoped to identify SNPs present in <0.1% of the population: i.e. only one or two people!
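To get a feel for why sample size matters so much for rare variants, here is a rough back-of-the-envelope sketch. It assumes each person contributes two independent copies of every locus and that any copy present in the sample would actually be detected. Neither assumption is quite true (low coverage sequencing misses things, and populations are structured), so treat the numbers as an upper bound on what a study could see rather than anything the authors calculated. The function names are my own.

```python
def expected_copies(freq, n_people):
    """Expected number of copies of a variant among 2 * n_people chromosomes."""
    return freq * 2 * n_people


def prob_sampled(freq, n_people):
    """Chance that at least one copy of the variant is present in the sample."""
    return 1 - (1 - freq) ** (2 * n_people)


for freq in (0.05, 0.01, 0.001):
    copies = expected_copies(freq, 1092)
    chance = prob_sampled(freq, 1092)
    print(f"frequency {freq:.1%}: ~{copies:.1f} copies expected, "
          f"{chance:.1%} chance of at least one")
```

At 0.1% frequency you expect only about two copies among all 1092 people, so even before sequencing errors come into it there is a real chance the variant simply isn't in anyone sampled.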
The 1000 genomes dataset contains 50%, 98% and 99.7% of the SNPs at 0.1%, 1% and 5% frequency in the Wellcome Trust funded UK10K project (containing 2500 UK genomes), though bear in mind that the 1000 genomes dataset does contain plenty of UK samples. A similar comparison to 2000 Sardinian genomes found 99.3% of the 5% SNPs, but only 76.9% of the 1% SNPs and just 23.7% of SNPs occurring at 0.1% frequency in the Sardinian population.
Why is this information interesting?
Well, firstly, from a purely academic view, it gives us an idea of how different humans are on average. On average, individuals in the study carried around 2500 SNPs that affected their proteins (non-synonymous variations), including 20-40 SNPs known to be ‘damaging’ and around 150 loss of function variants (i.e. SNPs that made a protein completely cease to work). The general idea behind 1000 genomes is to identify ‘functionally relevant’ variants: differences that we can pin to a phenotypic change. Perhaps the most interesting and important question is whether any of these SNPs or indels are associated with diseases, such as cancer (have a look at Bamshad et al. (2011) Nature Rev. Genet. 12, 745–755 if you’re interested).
An integrated map of genetic variation from 1,092 human genomes
The 1000 Genomes Project Consortium