Genome sequencing has been back in the news this week with the announcement that the genome of the bonobo has been assembled and compared to the human genome. The bonobo is a great ape found in the Democratic Republic of Congo, thought to be as closely related to us as the chimpanzee is. A few weeks ago the tomato genome was also sequenced, and two years ago it was the turn of wheat. All of these are interesting stories, and worth writing about, but for now I’m going to give you a quick biology lesson in the lab work that goes on behind the headlines. The point of the Human Genome Project (and of projects like the bonobo one above) was to write down the entire ‘code’ for making a human. It took over a decade and cost around $3 billion to complete. But what’s so useful about that? How is it that we now hear headlines about genomes being sequenced for just $1000? And what’s all this about the $1000 genome coming with a $10 000 analysis?
To many of my non-scientific friends, the Human Genome Project conjures up images of a recipe book that scientists have faithfully copied out. “Recipe for brown hair”, it says: “atgcctgcagtgggtgccagggcccctctccaccgtccctgctgggcttcggggccacgc…” Once upon a time, de novo sequencing was a bit like this. Old-school Sanger sequencing involves picking out a particular gene and sequencing just that. This is fine if you want to know how the genes for eye colour differ between person A and person B, but less good if you have an entire novel species to contend with. By my reckoning, sequencing just the genes of the human genome once in this way would cost me about £500 000.
Sequencing has become a lot cheaper partly because of advances in technology that let us randomly sequence all the DNA at once. Whereas before you might have had five overlapping sets of sequences, now you have gigabytes of data and no obvious way to piece them together. It’s like a 1000-piece jigsaw where you don’t have the final picture. Or corners. Or, to continue with the current metaphor, it’s like someone has taken your copy of Delia’s Christmas, torn it into tiny pieces, photocopied them so that sometimes you have four copies of a piece and sometimes you have seven, and then asked you to reassemble the book. It’s not humanly possible.
Luckily, nobody’s asking a human to do it. It requires an awful lot of computer power and some snazzy software, but provided you’ve got a shedload of RAM, a working knowledge of Unix, and a copy of CLC Bio, you’re on your way to a good thing.
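If you fancy seeing roughly what that software is up to, here’s a toy sketch in Python. It’s nothing like a real assembler (which has to cope with sequencing errors, repeats and billions of reads), but it shows the core idea: keep greedily merging the two fragments with the biggest overlap until only one sequence is left. The fragments are snipped from the made-up ‘brown hair’ sequence above.

```python
def overlap(a, b):
    """Length of the longest suffix of a that matches a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def assemble(reads):
    """Greedily merge the pair of reads with the biggest overlap until one remains."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, 0, 1)  # (overlap length, left read index, right read index)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        merged = reads[i] + reads[j][n:]  # join, keeping the overlap only once
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads[0]

print(assemble(["atgcctgc", "ctgcagtg", "gtgggtgc"]))  # atgcctgcagtgggtgc
```

Even this toy version is quadratic in the number of reads per merge, which hints at why real assemblies need that shedload of RAM.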
The next problem is annotating it. In the days after the first sequence coverage of the wheat genome was released, you may have read that the work wasn’t even close to finished. Scientists had all the letters but no idea what many of them meant. Because actually, you haven’t just got a shredded copy of your favourite recipe book. You’ve got the French translation.
First there’s the problem of figuring out what you’re reading in the first place. You can spot where the recipes are (as opposed to the glossy adverts) because they have handy markers saying HERE BE GENES (aka start codons and changes in GC content). But what do the genes say? You might recognise the odd word from La Tarte de Meringue de citron, but if your French is as poor as mine then you’re not going to get very far with Ajouter assez d’eau pour mélanger la farine de maïs à une pâte lisse. It gets a lot easier once you have an English recipe, even if it’s a slightly different one, because you can then compare the two. Lemons? Check. Egg whites? Check. Sugar? Check. And so on. Bits that look more or less identical can be annotated as equivalent (or, as we say in the biz, homologous).
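To make the HERE BE GENES idea concrete, here’s a toy Python sketch (real gene-finders are far cleverer, and use reading frames, stop codons and statistical models too): scan a sequence for ‘atg’ start codons and measure its GC content. The sequence is the made-up ‘brown hair’ one from earlier.

```python
def gc_content(seq):
    """Fraction of G and C bases -- gene regions often differ in GC content."""
    return (seq.count("g") + seq.count("c")) / len(seq)

def find_start_codons(seq):
    """Positions of every 'atg' start codon in the sequence."""
    return [i for i in range(len(seq) - 2) if seq[i:i+3] == "atg"]

seq = "atgcctgcagtgggtgccagggcccctctccaccgtccctgctgggcttcggggccacgc"
print(find_start_codons(seq))        # [0]
print(round(gc_content(seq), 2))     # 0.73
```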
Because we now have several fully assembled and annotated cookbooks (model species like the mouse, Drosophila and Arabidopsis), genome sequencing and annotation is a lot faster than it used to be. Using a handy tool called a BLAST search, we can probe a huge database with an unknown sequence and find out what it is most similar to. If a gene codes for eye colour in humans, chances are it codes for eye colour in bonobos too.
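The real BLAST does a sophisticated seed-and-extend search with proper alignment scoring, but the spirit of it can be sketched in a few lines of Python: count how many short ‘words’ (k-mers) an unknown sequence shares with each entry in a database of known genes. The gene names and sequences below are entirely made up for illustration.

```python
def kmers(seq, k=4):
    """The set of all length-k substrings ('words') in a sequence."""
    return {seq[i:i+k] for i in range(len(seq) - k + 1)}

def best_match(query, database, k=4):
    """Return the database entry sharing the most k-mers with the query --
    a crude stand-in for BLAST's word-matching step."""
    return max(database, key=lambda name: len(kmers(query, k) & kmers(database[name], k)))

# Hypothetical mini-database of already-annotated genes
database = {
    "eye_colour_gene": "atgcctgcagtgggtgccaggg",
    "unrelated_gene":  "tttaaacgtacgtaaatttccc",
}
print(best_match("atgcctgcagtgggtgc", database))  # eye_colour_gene
```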
Nowadays, though, we’re not content with just a single genome sequence. What we’re really interested in isn’t the “basic code” for each species; it’s how individuals of that species differ. How is the genome of a black cat different to that of a white cat? How does a high-yielding variety of rice differ from a low-yielding one? What makes one cultivar of barley able to resist barley yellow dwarf virus when others can’t? Why are some humans more likely than others to get certain types of cancer? We want to figure out the differences in the recipes.
This is the remit of the 1000 Genomes Project (as well as lots of smaller, similar projects in other important species like wheat). By looking at lots of ‘recipes’ that differ in a few different ways, scientists hope to spot the things that make the end products different. Imagine you have 100 recipes for sponge: you can split them into those made with caster sugar and those made with light brown sugar. On average, how do they differ? Or those made with one egg and those made with two: on average, how do they differ? In the end (you hope!) you’ll be able to say that all the light brown sugar cakes were slightly sweeter, and all the two-egg cakes were more voluminous. Or that all the humans with a C to T substitution at base 237 of Gene X are more susceptible to developing bowel cancer.
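As a toy illustration of that last step, here’s a Python sketch with entirely made-up data (a real association study needs thousands of individuals and proper statistics, not ten data points): group individuals by the base they carry at one position of a hypothetical gene, and compare how often each group developed the disease.

```python
# Made-up data: the base each of ten individuals carries at one position of a
# hypothetical gene, paired with whether that individual developed the disease.
individuals = [
    ("c", False), ("c", False), ("c", True), ("c", False), ("c", False),
    ("t", True),  ("t", True),  ("t", False), ("t", True), ("t", False),
]

def risk(base):
    """Fraction of carriers of `base` at this site who developed the disease."""
    carriers = [affected for b, affected in individuals if b == base]
    return sum(carriers) / len(carriers)

print(f"C carriers: {risk('c'):.0%} affected")   # C carriers: 20% affected
print(f"T carriers: {risk('t'):.0%} affected")   # T carriers: 60% affected
```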