The Human Genome Project


October 23, 2007



Some Quotes to start things off:

"It is essentially immoral not to get it [the human genome sequence] done as fast as possible," James D. Watson, The New York Times, 5 June 1990, p. C1.

"I've seen a lot of exciting biology emerge over the past 40 years. But chills still ran down my spine when I first read the paper that describes the outline of our genome and now appears on page 860 of this issue...it is a seminal paper, launching the era of post-genomic science." David Baltimore, Nature, Feb. 2001, Our Genome Unveiled

"Humanity has been given a great gift. With the completion of the human genome sequence, we have received a powerful tool for unlocking the secrets of our genetic heritage and for finding our place among the other participants in the adventure of life." Barbara R. Jasny and Donald Kennedy Science, Feb. 2001


"The Human Genome Project has been an amazing adventure into ourselves, to understand our own DNA instruction book, the shared inheritance of all humankind," Francis S. Collins, M.D., Ph.D., April 14, 2003 Press Release


I. How do we Sequence DNA?

1. DNA Sequencing: 1975 - 1991: Sanger dideoxy sequencing (named for its developer, Frederick Sanger, who shared the 1980 Nobel Prize in Chemistry.)

  • What are the main ingredients? What is the purpose of the DNA primer and the "dideoxy" nucleotides (ddNTPs)?

  • A DNA primer = initiates DNA synthesis

  • dNTPs = dATP, dCTP, dGTP, dTTP - in ALL tubes; incorporated into the newly synthesized DNA.

  • DNA polymerase!

  • ddNTPs = "chain terminating," or "dideoxy," nucleotides; each ddNTP is in just ONE tube of the 4. When one is incorporated, DNA synthesis on that one strand STOPS, but it continues on all other strands.

  • A ladder of DNA fragments of different sizes is generated (depending on the location of the chain-terminating nucleotide).

  • Electrophoresis through a thin polyacrylamide gel subjected to an electric field. The shorter the fragment, the faster it moves through the gel. After exposure to X-ray film, the presence of different-sized 'bands' shows where each A, C, G, or T is located.

  • Biggest drawback: it takes hours for DNA fragments to traverse a slab gel, and a typical gel run will read about 500 bases of DNA.
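The four-tube logic above can be sketched in a few lines of Python. This is a toy model, not a lab protocol: the function names `sanger_ladder` and `read_gel` and the example strand are made up for illustration, and it assumes every possible terminated fragment is present and cleanly resolved on the gel.

```python
# Toy model of Sanger dideoxy sequencing (hypothetical helper names).

def sanger_ladder(new_strand):
    """For each ddNTP tube, list the lengths of the fragments that stop there.

    A fragment of length n ends in the tube whose ddNTP matches base n
    of the newly synthesized strand.
    """
    tubes = {"A": [], "C": [], "G": [], "T": []}
    for length, base in enumerate(new_strand, start=1):
        tubes[base].append(length)
    return tubes

def read_gel(tubes):
    """Read the bands from shortest (fastest-moving) to longest fragment."""
    bands = sorted(
        (length, base) for base, lengths in tubes.items() for length in lengths
    )
    return "".join(base for _, base in bands)

ladder = sanger_ladder("GATTACA")
print(ladder)            # which fragment lengths show up in which lane
print(read_gel(ladder))  # the sequence recovered from the ladder
```

Reading the four lanes together, band by band from the bottom of the gel, gives back the order of bases in the new strand.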

2. DNA Sequencing 1992 - present: Automated DNA Sequencing
Also called capillary array electrophoresis

  • Developed for the Human Genome Project at Lawrence Berkeley Lab
  • How do the ingredients differ from regular Sanger sequencing? Uses 4 different fluorescently labeled dideoxy nucleotides ddA, ddC, ddG and ddT to determine the sequence of DNAs. No radioactive nucleotides are needed.

  • Since 4 different fluorophores are used, all 4 reactions can be run in the same tube, greatly increasing 'throughput'.

  • DNA fragments are separated by capillary electrophoresis (in tiny gel-filled capillary tubes, about 100 microns in diameter, bundled together) and read with a laser scanning system. This system is not only more accurate than reading a gel, but longer runs of DNA sequence data can be obtained.

  • Completely Automated!

  • Electropherogram: As each capillary tube moves into the path of the laser beam, fluorescently tagged nucleotides are detected one at a time, producing a color electropherogram. The information is then analyzed by a computer to generate the final sequence data.

Many different companies make automated DNA sequencers, but the 'Big Daddy' is made by Applied Biosystems: the ABI Prism 3700/3730 can sequence roughly 50,000 to 100,000 bases per hour (at $300,000 apiece...) - thousands of times faster than traditional Sanger dideoxy sequencing!

  • The ABI Prism 3700 provides 24-hour unattended operation - the limiting factor in how fast a lab can sequence is how fast it can get DNA samples prepared and ready to go!
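To get a feel for these throughput numbers, here is a back-of-the-envelope calculation. It is only a sketch: the function name `years_for_one_pass` is invented for illustration, the 3.2 billion figure is the genome size quoted in section IV, and it assumes a machine that never stops and reads every base exactly once, with no overlap and no sample-prep delays.

```python
GENOME_BASES = 3_200_000_000   # "3.2 billion pieces of data", as quoted in section IV
HOURS_PER_YEAR = 24 * 365      # assumes true 24-hour unattended operation

def years_for_one_pass(bases_per_hour, machines=1):
    """Years needed to read every base of the genome exactly once."""
    hours = GENOME_BASES / (bases_per_hour * machines)
    return hours / HOURS_PER_YEAR

slow = years_for_one_pass(50_000)    # about 7.3 years on one machine
fast = years_for_one_pass(100_000)   # about 3.7 years on one machine
print(f"One machine, one pass: {fast:.1f} to {slow:.1f} years")
```

Passing a larger `machines` value shows why genome centers ran many sequencers in parallel, and repeating the sequencing 10-12 times for accuracy multiplies every figure accordingly.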

II. The Human Genome Project - 5 phases: (Need a timeline? or another?)

Phase I - Conceptualization / Initiation : (~1985-1990)

In 1985, Charles DeLisi, Department of Energy (DOE), begins discussion of a mammoth project 'of a scale unprecedented in biology' to sequence the complete human genome. [Why the DOE?]

In 1989, the Department of Energy (DOE) moved ahead, soon challenged by the National Institutes of Health (NIH). Result: a 'national' program to sequence the human genome for $3,000,000,000 in government spending over 15 years.

The US Human Genome Project is the result of the combined effort of two government organizations:

1. National Human Genome Research Institute (NHGRI / NIH / HHS) Dr. Francis Collins
2. Department of Energy genomes.org (ORNL / Oak Ridge National Labs) Dr. Ari Patrinos


The IHGSC: International Human Genome Sequencing Consortium: nationwide and worldwide genome centers and more centers

The "Big Science" Grumblers, and the response:


Phase II - The First '5' years: 1990-93

1990-93: NIH establishes itself as lead agency, with funding apportioned 2:1 (NIH:DOE). The first HGP director, James Watson, captured control of the project for NIH and designed both the scientific and organizational strategy for its implementation.

What are some of the potential benefits of human genome research?

Potential benefits = Look at this list in class!


Phase III - Gathering Speed: 1993 - 1998

A new leader, Dr. Francis Collins, MD, PhD, a geneticist from the University of Michigan, takes command of the HGP. With the project falling somewhat behind schedule, the HGP under Collins soon greatly accelerated its pace and grew in scale. This phase ended with another crisis due to an external threat, this time again from J. Craig Venter.

  • Revised 5-Year Research Goals of the U.S. Human Genome Project (1993-1998) Still making progress

  • Landmark technology: High throughput automated DNA sequencers

  • An important policy decision of the time: NHGRI Rapid Data Release Policy: A main 'ground rule' of the HGP: 'The Bermuda principles'. In February 1996, at a meeting in Bermuda, international partners in the genome project agreed to formalize the conditions of data access, "which expressly call for automatic, rapid release (in this case, within 24 hours) of sequence assemblies of 1 to 2 kilobase (kb) or greater to the public domain."

  • The three main ideas of the Bermuda Principles, scribbled on a blackboard by (now Nobel laureate) John Sulston in 1996:

    •  Automatic release of sequence assemblies larger than 1 kb (preferably within 24 hours).
    •  Immediate publication of finished annotated sequences.
    •  Aim to make the entire sequence freely available in the public domain for both research and development in order to maximise benefits to society.
    • "The highest priority of the International Human Genome Sequencing Consortium is ensuring that sequencing data from the human genome is available to the world's scientists rapidly, freely and without restriction." From NHGRI's Data Release and Access Principles and Policy

  • Big Result from this period - 1997: A Gene Map of the Human Genome, Science, 5 September 1997. Woo hoo!
    • 18 months of serious international effort
    • Number of mapped human genes at start = 5131 (1994)
    • Tripled in 22 months = 16,354 genes on the current map, which "may represent one-fifth of all protein-coding genes in our genome".
    • New map has sufficient accuracy and resolution to localize genes to within a few megabases, which corresponds well with the regions typically encountered in disease-gene hunts.
    • Lots of credit given to using ESTs
    • Still only 3% of the human genome actually sequenced, though!

News Flash: May 1998: Race for the Genome: J. Craig Venter, with Perkin-Elmer's Applied Biosystems' Michael Hunkapiller, creates Celera Genomics. Goal: sequence the entire human genome by December 31, 2001 (2 years before the completion by the HGP, and for a mere $300 million) by a method untested in a complex eukaryotic genome: Whole Genome Shotgun Sequencing, using 300 high-speed automated DNA sequencers running in parallel, 24 hours a day. Venter calls the plan a "mutually rewarding partnership between public and private institutions." Press release.

 

...(but its data release policy will not follow the Bermuda principles)....
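The idea behind whole genome shotgun sequencing is to break the DNA into many overlapping random pieces, read each piece, and let a computer stitch the reads back together by their overlaps. The toy sketch below shows that stitching step with a simple greedy merge. It is an illustration only, not Celera's assembler: the function names, the `min_len` cutoff and the example reads are all invented here, and it assumes error-free reads from a sequence with no troublesome repeats.

```python
# Toy greedy overlap assembly (hypothetical names; not a production assembler).

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that matches a prefix of b.

    Returns 0 when the best overlap is shorter than min_len.
    """
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two reads that share the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # no overlaps left: remaining pieces stay as separate contigs
        merged = reads[i] + reads[j][size:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

reads = ["ATGGCGTA", "GCGTACGTT", "ACGTTAGC"]
print(greedy_assemble(reads))
```

Real genomes are full of repeated sequence, which is exactly where a simple greedy merge like this goes wrong and why the method was considered untested for a complex eukaryotic genome.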


Phase IV - Reorientation, October 1998-2001
"In response to what was widely perceived as a race to sequence the human genome, the HGP shifted dramatically to a crash project. Collins reoriented the HGP, altering scientific and organizational strategy."


Phase V - Finishing (2001-2005 and beyond!): Research Milestones
  • 2001-2006: An era of very rapid shotgun sequencing of major genomes including the Mouse, Chimp and Dog, Fugu rubripes (pufferfish), Bos taurus (moo) and hundreds of other species, PLUS the completion of Human Chromosomes (in this order) 22, 21, Y, 7, 6, 20, 14, 13, 19, 10, 9, 5, 16, X, 2, 4, 18, and in 2006: 8, 11, 12, 15, 17, 3, and 1 (May 2006!) wow!!! A large percentage of the genomes were completed by TIGR, Venter's first (and non-profit) company in Rockville, MD (run by his very talented scientist spouse Claire Fraser). Check out this impressive (ever-growing) list at A Quick Guide to Sequenced Genomes. The human chromosomes were typically completed by the NHGRI (The Genome Boys: Francis Collins, Eric Lander, et al.) plus the IHGSC.
  • The Genome Boys (just a sampling of some VERY impressive scientists):
James Watson, PhD
The Original Bad Boy of DNA;
Director, Human Genome Project; Archives
Director, Cold Spring Harbor Labs, (Bio) New York (*Ouch; 14 Oct 2007)

Francis Collins, MD/PhD
Director, National Human Genome Research Institute; CF gene in 1989, NF in 1990, Huntington in 1993, Progeria in 2007

Ari Patrinos, PhD
Left the DOE in 2006
Now with J. Craig Venter Synthetic Genomics Institute

Eric Lander, MIT/Broad
Impressive accomplishments!
20 November 2006: Eric Lander et al. get a few hundred million more in NIH funds! - The Cancer Genome Atlas (TCGA)

George Weinstock, PhD
Human Genome Sequencing Ctr.
Baylor College of Medicine
Honeybee, Sea Urchin, Rhesus Monkey genomes

Sir John Sulston
Sanger Center
Nobel Prize 2002
C. elegans development / lineage

 

And the real meat 'n potatoes: The genomes!

 





A Quick Guide to Sequenced Genomes:


III. The Human Genome Project - What have we learned? A genome summary Like, wow!

Our Genome Unveiled by Nobel Laureate David Baltimore, Cal Tech
Top 10 Things We Learned about the Human Genome Sequence Francis Collins

1. Only 1.4% - 2% of the genome is sequence that actually encodes protein, and the gene-coding regions tend to be clumped together.

  • Genes appear to be concentrated in random areas along the genome, with vast expanses of noncoding DNA between.
  • The human genome's gene-dense "urban centers" are mostly composed of the DNA nucleotides G and C.
  • In contrast, the gene-poor "deserts" are heavy in nucleotides A and T.
  • Stretches of up to 30,000 C/G bases repeating over and over often occur adjacent to gene-rich areas. These CpG islands, including repetitive Alu sequences (500,000/haploid genome), are believed to help regulate gene activity in some way we do not understand yet!
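The "urban center" versus "desert" contrast above comes down to base composition, which is easy to measure. Here is a minimal sketch; the function names are invented for illustration, and the observed/expected CpG ratio shown is one common way to score CpG richness, not necessarily the measure used in the genome papers.

```python
def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def cpg_obs_exp(seq):
    """Observed CpG dinucleotides divided by the number expected by chance.

    The expected count is (number of C) * (number of G) / length.
    """
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    observed = seq.count("CG")
    return observed * len(seq) / (c * g)

for name, seq in [("gene-rich-like", "GCGCGGCCGCGATCGC"), ("desert-like", "ATTATAATTTACATAA")]:
    print(name, round(gc_content(seq), 2), round(cpg_obs_exp(seq), 2))
```

Sliding a window along a chromosome and computing these two numbers for each window is one simple way to see where the GC-rich, CpG-rich regions sit.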

2. The number of genes is much lower than expected.

  • The HGP estimates that there are 31,000 protein-encoding genes in the human genome. Celera finds about 26,000. Updated on October 21, 2004: Only 20,000-25,000 genes predicted.
  • The functions are unknown for over 50% of discovered genes.
  • ~740 identified genes make the tRNAs and rRNAs involved in translation.
  • The number of protein-coding genes in the human genome compares with 6,000 for yeast, 13,000 for Drosophila, 18,000 for C. elegans and 26,000 for Arabidopsis.
  • Humans have on average three times as many kinds of proteins as the fly or worm because of mRNA alternative splicing, novel combinations of protein domains, and chemical modifications to the proteins that can yield different protein products from the same gene. Around 60 per cent of human genes have two or more alternatively spliced transcripts, compared, for example, with only 22 per cent in the worm.
  • Although the number of genes is smaller than expected, the number of gene family members has expanded in humans. For instance, humans have 30 FGF genes (fly and worm have 2 each) and 42 TGF-ßs (fly and worm have 9 and 6, respectively). Humans have 765 genes for antibody production, while the fly has 140, and the worm 64.
  • [It was originally stated that "more than 200 human genes are the result of the horizontal transfer from bacteria", but evidence contrary to this hypothesis has since surfaced: "Phylogenetic analyses do not support horizontal gene transfers from bacteria to vertebrates," Stanhope et al., Nature 411, 940-944 (21 June 2001).]

3. Most of our genes come from our evolutionary past.

  • Only ~94 of 1,278 protein families in our genome appear to be specific to vertebrates.
  • The most elementary of cellular functions (basic metabolism, transcription of DNA into RNA, translation of RNA into protein, DNA replication) evolved just once and have stayed pretty well fixed since the evolution of single-celled yeast and bacteria.

4. Over half of our DNA (53%) consists of repeated sequences of various types:

  • The human genome has a much greater portion (53%) of repeat sequences than Arabidopsis (11%), C. elegans (7%), and Drosophila (3%).
  • 45% in four classes of parasitic Transposable DNA elements (some of presumably viral origin)
  • 3% in repeats of just a few bases, 5% in recent duplications of large segments of DNA.
  • There is evidence that Transposable Elements shaped the evolution of the genome and mediated the creation of new genes. What was once known as "junk DNA" actually provides a "fossil" record of human evolution that looks back 800 million years.
  • Here is a unique perspective on our genome from David Baltimore: "As the co-discoverer of reverse transcriptase, I find it striking that most of the parasitic DNA came about by reverse transcription from RNA. In places, the genome looks like a sea of reverse-transcribed DNA with a small admixture of genes." Like wow...what ARE we, anyway?
  • Image from The Genome by Numbers, the Wellcome Trust. Is this a cool figure, or what? [long interspersed nuclear elements (LINEs), and short interspersed nuclear elements (SINEs)]


 

IV. The Human Genome Project - What is still on the "to do" list? (See also the nice bulleted list at the end of the DOE FAQs page)

 

1. Correct errors and proofread. The original plan was to repeat the sequencing up to 10-12 times to prune away the mistakes that inevitably accompany a project involving 3.2 billion pieces of data.

2. Fill tens of thousands of gaps in the sequence. These holes amounted to about 15 percent of the genome on June 26, 2000. Most gaps lie in stretches of short sequences repeated hundreds or thousands of times, which makes them enormously difficult to get right.

3. Sequence the 7 percent of the human genome that was originally excluded by design. This region is heterochromatin, highly condensed DNA found at the centromeres and telomeres that has long been believed to contain no genes. But human heterochromatin probably contains a few genes too, as well as things we don't yet know about.

4. Finish finding all the genes that make proteins. This step takes place after the sequence is cleaned up and deemed 99.99% accurate. About 38,000 protein-coding genes have been confirmed so far. Recent estimates have tended to fall below 60,000.

5. Find the non-protein-making genes. There are, for instance, genes that make RNA rather than protein. They tend to fall below the threshold of today’s gene-finding software, so new ways of discovering them will have to be devised.  

6. Discover the regulatory sequences that activate a gene and that govern how much of its product to make.  

7. Untangle the genes' intricate interactions with other molecules.
 
8. Identify gene functions. Because a gene may make several proteins, and each protein may perform more than one job, the task will be stupendous.

 

What is next, in a more abstract sense? (From Our Genome Unveiled, by Nobel Laureate David Baltimore)

  • First, get the most precise representation of the genome that we can: cleaning up the errors, and getting rid of the uncertainties....
  • Second, we need to see more genomes, with each one giving us a deeper insight into our own. TIGR, Ensembl
  • Third, we need to take advantage of this book of life. Tools for scanning the activity levels of genes in different cells, tissues and settings...
  • Fourth, we need to turn our new genomic information into an engine of pharmaceutical discovery. Individual humans differ from one another by about one base pair per thousand. These 'single nucleotide polymorphisms' (SNPs) are markers that can uncover the genetic basis of diseases. They can also provide information about personal responses to medicines.

Two recent, ongoing Human Genome related projects - just the basics: It would be nice to teach a whole course on the Human Genome Project!

1. The International HapMap: 27 October 2005: International Consortium Completes Map Of Human Genetic Variation - the HapMap! Free full text: A haplotype map of the human genome

  • What is the HapMap? "A catalog of common genetic variants that occur in human beings. It describes what these variants are, where they occur in our DNA, and how they are distributed among people in different parts of the world." Goal: use it to locate genes involved in medically important traits.
  • (a) Single nucleotide polymorphisms (SNPs) are identified in multiple individuals. SNPs are single nucleotide point mutations in DNA that occur in more than 1% of the population. There are over 10 million different human SNPs! This paper catalogued more than 1 million SNPs in the genome sequences of 269 people drawn from 4 diverse human populations!
  • (b) Adjacent SNPs that are inherited together are compiled into "haplotypes." (Groups of alleles that are linked closely enough to be inherited as a unit.)
  • (c) "Tag" SNPs within haplotypes are identified that uniquely identify those haplotypes (estimated to be ~300,000-600,000, far fewer than the 10 million common SNPs). The long-term task is to translate these data into an understanding of the effects of that variation on human health!
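Steps (a) to (c) can be illustrated with a tiny made-up example: given the haplotypes seen in a region, find the fewest SNP positions you need to genotype to tell them all apart. This is a brute-force sketch for illustration only; the function name `tag_snps` and the four haplotypes are invented, and the real HapMap analysis works from linkage statistics across populations rather than exhaustive search.

```python
from itertools import combinations

def tag_snps(haplotypes):
    """Smallest set of SNP positions whose alleles distinguish every haplotype.

    Each haplotype is a string with one allele per SNP site; all strings
    are assumed to have the same length.
    """
    distinct = sorted(set(haplotypes))
    n_sites = len(distinct[0])
    for size in range(n_sites + 1):
        for sites in combinations(range(n_sites), size):
            signatures = {tuple(h[i] for i in sites) for h in distinct}
            if len(signatures) == len(distinct):
                return list(sites)
    return list(range(n_sites))

# Four hypothetical haplotypes across five adjacent SNP sites.
haplotypes = ["ACGTA", "ACGCG", "TTATA", "TTACG"]
print(tag_snps(haplotypes))  # positions of the tag SNPs (0-based)
```

Because sites 0, 1 and 2 always travel together here, as do sites 3 and 4, genotyping one SNP from each group is enough, which is the same reason a few hundred thousand tag SNPs can stand in for millions of common SNPs.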

2. The ENCODE Project: Nature, 14 June 2007: the Encyclopedia Of DNA Elements, launched in September 2003 by NHGRI. Goal: initially, a pilot project looking at 1% of the genome to determine how the information coded in DNA is turned into functioning systems in the living cell

Free full text: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. From the article: "4 main findings

a. The genome is pervasively transcribed, with many non-protein-coding transcripts that extensively overlap one another.
b. New understanding about transcription start sites, regulatory sequences, chromatin accessibility and histone modification.

c. A more sophisticated view of chromatin structure and its inter-relationship with DNA replication and transcriptional regulation.
d. New mechanistic and evolutionary insights concerning the functional landscape of the human genome."


V. Tools for Searching the Genome

"Biology has belatedly realized that it is, itself, an information technology" Drowning in data, The Economist, 351: 93-94, 1999.


Finding the needle in the haystack - Genome databases:


(NCBI/ NLM/ NIH)

1. GenBank is the National Center for Biotechnology Information (NCBI) / NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. GenBank supports the most basic of gene-finding tools: finding DNA sequence homology between a newly sequenced piece of DNA and previously sequenced DNA, and predicting protein sequence from nucleotide sequence.
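Predicting protein sequence from nucleotide sequence means reading the DNA three bases at a time and looking up each codon in the genetic code. A minimal sketch follows, assuming the standard genetic code and a sequence that is already in the correct reading frame; the names `CODON_TABLE` and `translate` are chosen here for illustration and are not part of any GenBank tool.

```python
# Build the standard genetic code: codons in TCAG order mapped to amino acids.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna):
    """Translate DNA (coding strand, frame 1) until the first stop codon."""
    dna = dna.upper().replace(" ", "")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # X = unrecognized codon
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("ATGGCCTAA"))
```

In practice a newly sequenced piece of DNA has six possible reading frames (three on each strand), so gene-finding software translates all of them and looks for long open stretches.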

Bioinformatics in 1980: GenBank, the DOE's repository for genome sequences. Each A, C, G & T entered by hand!

Bioinformatics in 1990: GenBank transferred to NIH / NCBI; scientists deposit sequences into GenBank directly; BLAST developed.
Milestone: 1996 - First Billion Base Pairs in GenBank!

Bioinformatics in 2000: First draft done... as of September 2000: ~7 billion bases in GenBank; ~13 B bases in GenBank in 2001, ~28 B in 2003, 44 B in 2004. (Growth of GenBank)
Milestone: 2005 - 100 Billion Base Pairs (100,000,000,000 = 100 'Gig'!) in 2005
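How fast is that growth? A quick estimate of the doubling time, using only the 2000 and 2005 figures above and assuming smooth exponential growth between them (the function name `doubling_time` is made up for this sketch):

```python
import math

def doubling_time(start_bases, end_bases, years):
    """Years per doubling, assuming steady exponential growth."""
    return years * math.log(2) / math.log(end_bases / start_bases)

# ~7 billion bases in 2000 and 100 billion in 2005, as quoted above.
estimate = doubling_time(7e9, 100e9, 5)
print(f"GenBank doubled roughly every {estimate:.1f} years")
```

That works out to a doubling about every 1.3 years over this period.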

 

2. BLAST - Basic Local Alignment Search Tool - DNA and protein sequence search tools. Submit a DNA sequence...allow BLAST to search through well over 100,000,000,000 bases, receive results moments later...it's cool...it's free...it's like Google for DNA!

Try it! 1. aacgtcacct ttgaggacgc cggggagtac acctgcctgg cgggcaattc tattgggttt ...and the human gene is...

  • Gene Myers: With a name like Gene, he's got to be good! One of the scientists behind the BLAST algorithm and the proprietary algorithm for Celera's assembly of the human genome. Check out his list of accomplishments and papers!

3. Entrez - NCBI's database of "metasearch" tools
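How can a search through billions of bases come back in moments? The core trick is "seed and extend": index short words from the database, look up the words in your query, and only extend the places where a word hits. The sketch below is a heavily simplified illustration of that idea, not the real BLAST: the function names, the word size `k=4` and the example sequences are invented here, and it only finds exact, ungapped matches, with none of BLAST's scoring or statistics.

```python
def build_index(database, k):
    """Map every k-letter word in the database to its start positions."""
    index = {}
    for i in range(len(database) - k + 1):
        index.setdefault(database[i:i + k], []).append(i)
    return index

def seed_and_extend(query, database, k=4):
    """Return (length, query_start, db_start) of the longest exact match found."""
    index = build_index(database, k)
    best = (0, -1, -1)
    for q in range(len(query) - k + 1):
        for d in index.get(query[q:q + k], []):
            # Extend the seed to the left while the letters keep matching.
            left = 0
            while (q - left > 0 and d - left > 0
                   and query[q - left - 1] == database[d - left - 1]):
                left += 1
            # Extend the seed to the right while the letters keep matching.
            right = k
            while (q + right < len(query) and d + right < len(database)
                   and query[q + right] == database[d + right]):
                right += 1
            if left + right > best[0]:
                best = (left + right, q - left, d - left)
    return best

database = "TTGACCGTACGGATCCA"
query = "GGCCGTACGGTT"
print(seed_and_extend(query, database))
```

Building the word index once and reusing it for every query is what lets the search skip almost all of the database instead of comparing the query against every position.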

4. Ensembl - Sanger Center, UK / European Bioinformatics Institute (EBI).

Three GREAT Genome Resources - keep for your future! :)
1.   NCBI - Genomic Biology - Homo sapiens From NCBI. The definitive resource for the human genome.
2.   Interactive Center Wellcome Trust A GREAT chromosome/gene browser - find genes, gene functions, disease states, on each chromosome
3.   Nature’s guide to the human genome: Another GREAT chromosome/gene browser - find original research papers associated with genes on each chromosome


Objectives: HGP Part 1:

  1. DNA Sequencing: Make sure you know the basics of Sanger dideoxy sequencing
  2. Infrastructure: Describe the two main government organizations that together make up the HGP, and list the 5 major HGP sequencing sites: The "G5"
  3. Explain the Bermuda Principles and their importance in the public (and private) project
  4. Describe the 7 HGP main goals (overall) AND the 7 major scientific goals that guided the project's first ~10 years
  5. Describe the 5 main phases of the HGP - summarize the major milestones of each phase. Who are: James Watson, Francis Collins, Ari Patrinos, Eric Lander, John Sulston and George Weinstock in terms of the HGP?
  6. What have we learned? Be able to describe major genome features detailed above - and be able to list a few genomes completed, perhaps ones with beautiful cover photos!
  7. Provide examples of what is being done to complete the sequence of the human genome. What is the difference between the February 2001, April 13, 2003 and October 2004 papers in terms of "finishing" the genome?
  8. According to the article: Welcome to the Genomic Era: (Francis Collins) What is Genetics, and what is Genomics? List some of the 'Promise of genomics to medicine' as described in the article or on the HGP website
  9. (Briefly) What are the Hap Map and the ENCODE projects and why are they important?
  10. What is GenBank, and how do BLAST and ENTREZ fit into GenBank? Who is Gene Myers?
