Cancer Genome Anatomy Project

The Cancer Genome Anatomy Project (CGAP), created by the National Cancer Institute (NCI) in 1997 and introduced by Al Gore, is an online database of normal, pre-cancerous and cancerous genomes. It also provides tools for viewing and analysing the data, allowing identification of genes involved in various aspects of tumor progression. The goal of CGAP is to characterize cancer at a molecular level by providing a platform with readily accessible, up-to-date data and a set of tools so that researchers can easily relate their findings to existing knowledge. There is also a focus on developing software tools that improve the use of large and complex datasets. [1] [2] The project is directed by Daniela S. Gerhard and includes several sub-projects or initiatives, notably the Cancer Chromosome Aberration Project (CCAP) and the Genetic Annotation Initiative (GAI). CGAP contributes to many databases, and organisations such as the NCBI in turn contribute to CGAP's databases.

The eventual outcomes of CGAP include establishing a correlation between a particular cancer's progression and its therapeutic outcome, improved evaluation of treatment, and development of novel techniques for prevention, detection and treatment. This is achieved by characterising the mRNA products of biological tissue.

Research

Background

The fundamental cause of cancer is the inability of a cell to regulate its gene expression. To characterise a specific type of cancer, either the proteins produced from the altered gene expression or their mRNA precursors can be examined. CGAP works to associate a particular cell's expression profile (also called its molecular signature or transcriptome, essentially the cell's fingerprint) with the cell's phenotype. Expression profiles are therefore recorded with respect to cancer type and stage of progression. [3]

Sequencing

CGAP's initial goal was to establish a Tumor Gene Index (TGI) to store the expression profiles, contributing to both new and existing databases. [4] The work fed two types of libraries, dbEST and later dbSAGE. This was performed in a series of steps: [3]

The TGI focused at first on prostate, breast, ovarian, lung and colon cancers, and CGAP later extended its research to other cancers. In practice, issues arose that CGAP addressed as new technologies became available.

Many cancers occur in tissues with multiple cell types. Traditional techniques took the whole tissue sample and produced bulk tissue cDNA libraries. This cellular heterogeneity made gene expression information less accurate in terms of cancer biology. An example is prostate cancer tissue, where epithelial cells, which have been shown to be the only cell type to give rise to cancer, make up only 10% of the cell count. This led to the development of laser capture microdissection (LCM), a technique that can isolate individual cell types or individual cells, which gave rise to cDNA libraries of specific cell types. [4]

The sequencing of cDNA will produce the entire mRNA transcript that generated it. In practice, only part of the sequence is required to uniquely identify the associated mRNA or protein. The resulting part of the sequence was termed the expressed sequence tag (EST) and is always at the end of the sequence, close to the poly-A tail. EST data are stored in a database called dbEST. ESTs only need to be around 400 bases long, but with next-generation sequencing (NGS) techniques this will still produce low-quality reads. Therefore, an improved method called serial analysis of gene expression (SAGE) is also used. For each cDNA transcript molecule produced from a cell's gene expression, this method identifies a region only 10–14 bases long, anywhere along the read sequence, that is sufficient to uniquely identify that cDNA transcript. These bases are cut out and linked together, then incorporated into bacterial plasmids as mentioned above. SAGE libraries have better read quality and generate a larger amount of data when sequenced, and since transcripts are compared in absolute rather than relative levels, SAGE has the advantage of requiring no normalisation of data via comparison with a reference. [1] [4]
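The idea that a short tag can stand in for a whole transcript, and that counting tags gives absolute expression levels, can be illustrated with a minimal Python sketch. It is not CGAP's pipeline: the function names are hypothetical, and the choice to take a fixed 10-base tag immediately after the last `CATG` site in each cDNA is an assumption borrowed from commonly described SAGE protocols rather than something stated above.

```python
from collections import Counter


def extract_tag(cdna, anchor="CATG", tag_len=10):
    """Return the tag that follows the 3'-most anchor site, or None.

    Assumption for illustration: the tag is the `tag_len` bases directly
    downstream of the last occurrence of `anchor` in the cDNA.
    """
    pos = cdna.rfind(anchor)
    if pos == -1:
        return None  # no anchor site, so no tag can be taken
    start = pos + len(anchor)
    tag = cdna[start:start + tag_len]
    # Discard tags that run off the end of the sequence
    return tag if len(tag) == tag_len else None


def count_tags(cdnas):
    """Count how often each tag occurs across a collection of cDNAs."""
    tags = (extract_tag(seq) for seq in cdnas)
    return Counter(tag for tag in tags if tag is not None)


if __name__ == "__main__":
    library = [
        "AAACATGACGTACGTACGGTTT",
        "TTCATGACGTACGTACAAA",
        "GGGG",  # no anchor site, contributes no tag
    ]
    for tag, n in count_tags(library).items():
        print(tag, n)
```

Because each transcript molecule contributes one tag, the raw counts are themselves the expression measurement; mapping a tag back to its gene is a separate lookup step.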

Resources

Following sequencing and establishment of libraries, CGAP incorporates the data along with existing data sources and provides various databases and tools for analysis. A detailed description of tools and databases created or used by CGAP can be found on NCI's CGAP website. Below are some of the initiatives or research tools provided by CGAP.

Genetic Annotation Initiative

The goal of the Cancer Genome Anatomy Project Genetic Annotation Initiative (CGAP-GAI) is to discover and catalogue single nucleotide polymorphisms (SNPs) that correlate with cancer initiation and progression. [4] CGAP-GAI has created a variety of tools for the discovery, analysis and display of SNPs. SNPs are valuable in cancer research because they can be used in several different kinds of genetic study, commonly to track transmission, identify alternate forms of genes, and analyze complex molecular pathways that regulate cell metabolism, growth, or differentiation. [5]

SNPs in the CGAP-GAI are found either by resequencing genes of interest in different individuals or by searching existing human EST databases and making comparisons. [2] The initiative examines transcripts from healthy individuals, individuals with disease, tumour tissue and cell lines from a large set of individuals; the database is therefore more likely to include rare disease mutations in addition to high-frequency variants. [6] A common challenge in SNP detection is distinguishing sequencing errors from actual polymorphisms. Candidate SNPs undergo statistical analysis using the CGAP SNP pipeline to calculate the probability that the variant is in fact a polymorphism. High-probability SNPs are validated, and tools are available that predict whether function is altered. [2]

To make the data easily accessible, CGAP-GAI has a number of tools that can display both a sequence alignment and an assembly overview in the context of the sequences from which the SNPs were predicted. SNPs are annotated, and integrated genetic/physical maps are often determined. [6]
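The kind of reasoning involved in separating sequencing errors from true polymorphisms can be sketched with a toy Bayesian calculation in Python. This is not the CGAP SNP pipeline: the two-hypothesis model, the function name and every default value (`error_rate`, `allele_freq`, `prior`) are assumptions chosen purely for illustration.

```python
def snp_posterior(n_reads, n_variant, error_rate=0.01,
                  allele_freq=0.2, prior=0.001):
    """Posterior probability that a site is a real polymorphism.

    Toy model (all parameters are illustrative assumptions):
      - under "sequencing error", each read shows the variant base
        with probability `error_rate`;
      - under "polymorphism", each read shows it with probability
        `allele_freq`;
      - `prior` is the prior probability that a site is polymorphic.
    """
    def likelihood(p):
        # Binomial likelihood without the coefficient, which cancels
        return p ** n_variant * (1 - p) ** (n_reads - n_variant)

    weighted_snp = prior * likelihood(allele_freq)
    weighted_err = (1 - prior) * likelihood(error_rate)
    return weighted_snp / (weighted_snp + weighted_err)


if __name__ == "__main__":
    # The same depth with more variant reads is stronger evidence
    for k in (1, 3, 5):
        print(k, snp_posterior(n_reads=20, n_variant=k))
```

Under this model a single variant read among many is better explained by error, while several concordant variant reads push the posterior towards a genuine polymorphism.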

Cancer Chromosome Aberration Project (CCAP)

Genomic instability is a common feature of cancer; understanding structural and chromosomal abnormalities can therefore give insight into the progression of disease. The Cancer Chromosome Aberration Project (CCAP) is a CGAP-supported initiative for defining chromosome structure and characterizing rearrangements that are associated with malignant transformation. [4] [7] It incorporates the online version of Mitelman's database, another compilation of known chromosomal rearrangements, which was created by Felix Mitelman, Bertil Johansson and Fredrik Mertens before CGAP was established. The CCAP has several goals: [7]

The database contains cytogenetic information from over 64,000 patient cases, including more than 2,000 gene fusions. [1]

As part of this project there is a repository of physically and cytogenetically mapped BAC clones for the human genome, which are physically available through a network of distributors. [1] The CCAP clones have been mapped cytogenetically using FISH at a resolution of 1–2 Mb across the human genome, and physically mapped using sequence-tagged sites (STS). [8] The data for BAC clones are also available through CGAP and NCBI databases.

Other Resources

Listed below are some other resources available through CGAP. [1]

Digital Differential Display

An early technique used by CGAP is digital differential display (DDD), which uses the Fisher exact test to compare libraries against each other, in order to find a significant difference between populations. CGAP ensured that DDD was able to compare between all cDNA libraries in dbEST, and not just those which were generated by CGAP. [4]
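The comparison DDD makes can be illustrated with a minimal Python sketch of a two-sided Fisher exact test on one gene's EST counts in two libraries. This is not CGAP's implementation: the function name and the 2x2 table layout are assumptions, and only the standard library is used.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Assumed layout for a DDD-style comparison:
      a = ESTs for the gene in library A, b = all other ESTs in library A
      c = ESTs for the gene in library B, d = all other ESTs in library B
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability that x of the gene's ESTs fall in A
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(col1, row1)
    # Sum every table that is no more probable than the observed one
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_obs * (1 + 1e-9)
    )


if __name__ == "__main__":
    # Hypothetical counts: 30 of 10,000 ESTs in one library
    # versus 5 of 10,000 ESTs in the other
    print(fisher_exact_two_sided(30, 9970, 5, 9995))
```

A small p-value suggests the gene's representation differs between the two libraries by more than sampling chance; in practice one test is run per gene, so some correction for multiple comparisons would normally follow.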

Mammalian Gene Collection (MGC)

The MGC provides researchers with full-length protein information from cDNA, unlike EST or SAGE databases which only provide the identifying tag. The project includes human and mouse genes, and later cow cDNAs generated by Genome Canada were added. [9]

SAGEmap

SAGEmap is the database used to store SAGE libraries. Over 3.4 million SAGE tags existed as of 2001. Tools can be used to map SAGE tags to UniGene clusters, a database that stores transcriptomes, which makes it easier to identify a SAGE tag's corresponding sequence. In addition, several tools are associated with SAGEmap: [10]

  • Digital Northern is used to measure the expression level of specific genes, [1]
  • SAGE Anatomic Viewer displays this information visually, and compares it between normal and cancerous cells,
  • Ludwig Transcript (LT) Viewer shows alternative transcripts and their possible associated SAGE tags,
  • mSAGE Expression Matrix (mSEM) shows gene expression levels throughout mouse development for different tissue types.

Gene Finder

The CGAP Gene Finder locates a gene or a list of genes based on specified search criteria and provides links to different NCI and NCBI databases. A gene can be searched for specifically, using a unique identifier such as a gene symbol or Entrez gene number, or more generally by function, tissue or keyword. [11]

Other gene tools accessible through the CGAP web interface include the Gene Ontology Browser (GO) and the Nucleotide BLAST tool.

Gene Expression Tools

cDNA xProfiler and cDNA Digital Gene Expression Displayer (DGED) are used together to find genes of interest that are differentially expressed, to a statistically significant degree, between two pools of cDNA libraries; typically the comparison is between normal and cancer tissues. [12] DGED determines statistical significance using a combination of Bayesian statistics and a sequence odds ratio to calculate a probability. cDNA DGED relies on the UniGene relational database, while cDNA xProfiler uses a flat-file database that is not available online. [13]

Outcomes and Future

CGAP is now a centralised location for several genomics tools and genetic databases and is employed widely in cancer and molecular biology research. The databases established by CGAP continue to contribute to knowledge of cancers in terms of their pathways and progression. The transcriptome databases can also be used in research unrelated to cancer, as they contain information that can be used to quickly and easily identify particular sequenced genes. The data also have clinical impact, as cDNAs can be used to create microarrays for diagnosis and treatment comparison purposes. CGAP has been used in many studies, with examples including: [1] [4]

In addition, the vast amount of data generated by CGAP has prompted improvements in data analysis and mining techniques, with examples including: [1]


References

  1. Riggins, G. J. (2001). "Genome and genetic resources from the Cancer Genome Anatomy Project". Human Molecular Genetics. 10 (7): 663–667. doi:10.1093/hmg/10.7.663. ISSN 1460-2083. PMID 11257097.
  2. Strausberg, Robert L.; Buetow, Kenneth H.; Emmert-Buck, Michael R.; Klausner, Richard D. (2000). "The Cancer Genome Anatomy Project: building an annotated gene index". Trends in Genetics. 16 (3): 103–106. doi:10.1016/S0168-9525(99)01937-X. ISSN 0168-9525. PMID 10689348.
  3. "Understanding Cancer". Archived from the original on 2014-08-05. Retrieved 2014-09-04.
  4. Krizman, David B.; Wagner, Lukas; Lash, Alex; Strausberg, Robert L.; Emmert-Buck, Michael R. (1999). "The Cancer Genome Anatomy Project: EST Sequencing and the Genetics of Cancer Progression". Neoplasia. 1 (2): 101–106. doi:10.1038/sj.neo.7900002. ISSN 1476-5586. PMC 1508126. PMID 10933042.
  5. Clifford, R. (2000). "Expression-based Genetic/Physical Maps of Single-Nucleotide Polymorphisms Identified by the Cancer Genome Anatomy Project". Genome Research. 10 (8): 1259–1265. doi:10.1101/gr.10.8.1259. ISSN 1088-9051. PMC 310932. PMID 10958644.
  6. Clifford, Robert J.; Edmonson, Michael N.; Nguyen, Cu; Scherpbier, Titia; Hu, Ying; Buetow, Kenneth H. (2004). "Bioinformatics Tools for Single Nucleotide Polymorphism Discovery and Analysis". Annals of the New York Academy of Sciences. 1020 (1): 101–109. Bibcode:2004NYASA1020..101C. doi:10.1196/annals.1310.011. ISSN 0077-8923. PMID 15208187. S2CID 19088027.
  7. "The Cancer Chromosome Aberration Project (CCAP)". Retrieved 2014-09-05.
  8. "All About the FISH-mapped BACs". Retrieved 2014-09-07.
  9. "Mammalian Gene Collection". Archived from the original on 2015-02-25. Retrieved 2014-09-07.
  10. "SAGE genie". Retrieved 2014-09-07.
  11. "Gene Finder". Retrieved 2014-09-07.
  12. "CGAP How to: Tools". Retrieved 2014-09-07.
  13. Milnthorpe, Andrew T; Soloviev, Mikhail (2011). "Errors in CGAP xProfiler and cDNA DGED: the importance of library parsing and gene selection algorithms". BMC Bioinformatics. 12 (1): 97. doi:10.1186/1471-2105-12-97. ISSN 1471-2105. PMC 3094240. PMID 21496233.
  14. St. Croix, B. (2000). "Genes Expressed in Human Tumor Endothelium". Science. 289 (5482): 1197–1202. Bibcode:2000Sci...289.1197S. doi:10.1126/science.289.5482.1197. ISSN 0036-8075. PMID 10947988.
  15. Loging, W. T. (2000). "Identifying Potential Tumor Markers and Antigens by Database Mining and Rapid Expression Screening". Genome Research. 10 (9): 1393–1402. doi:10.1101/gr.138000. ISSN 1088-9051. PMC 310902. PMID 10984457.
  16. C. D. Hough; C. A. Sherman-Baust; E. S. Pizer; F. J. Montz; D. D. Im; N. B. Rosenshein; K. R. Cho; G. J. Riggins; P. J. Morin (November 2000). "Large-scale serial analysis of gene expression reveals genes differentially expressed in ovarian cancer". Cancer Research. 60 (22): 6281–6287. PMID 11103784.
  17. G. Vasmatzis; M. Essand; U. Brinkmann; B. Lee; I. Pastan (January 1998). "Discovery of three genes specifically expressed in human prostate by expressed sequence tag database analysis". Proceedings of the National Academy of Sciences of the United States of America. 95 (1): 300–304. Bibcode:1998PNAS...95..300V. doi:10.1073/pnas.95.1.300. PMC 18207. PMID 9419370.
  18. U. Brinkmann; G. Vasmatzis; B. Lee; N. Yerushalmi; M. Essand; I. Pastan (September 1998). "PAGE-1, an X chromosome-linked GAGE-like gene that is expressed in normal and neoplastic prostate, testis, and uterus". Proceedings of the National Academy of Sciences of the United States of America. 95 (18): 10757–10762. Bibcode:1998PNAS...9510757B. doi:10.1073/pnas.95.18.10757. PMC 27968. PMID 9724777.
  19. D. J. Stekel; Y. Git; F. Falciani (December 2000). "The comparison of gene expression from multiple cDNA libraries". Genome Research. 10 (12): 2055–2061. doi:10.1101/gr.gr-1325rr. PMC 313085. PMID 11116099.
  20. Schmitt, A. O.; Specht, T.; Beckmann, G.; Dahl, E.; Pilarsky, C. P.; Hinzmann, B.; Rosenthal, A. (1999). "Exhaustive mining of EST libraries for genes differentially expressed in normal and tumour tissues". Nucleic Acids Research. 27 (21): 4251–4260. doi:10.1093/nar/27.21.4251. ISSN 0305-1048. PMC 148701. PMID 10518618.
  21. V. E. Velculescu; S. L. Madden; L. Zhang; A. E. Lash; J. Yu; C. Rago; A. Lal; C. J. Wang; G. A. Beaudry; K. M. Ciriello; B. P. Cook; M. R. Dufault; A. T. Ferguson; Y. Gao; T. C. He; H. Hermeking; S. K. Hiraldo; P. M. Hwang; M. A. Lopez; H. F. Luderer; B. Mathews; J. M. Petroziello; K. Polyak; L. Zawel; K. W. Kinzler (December 1999). "Analysis of human transcriptomes". Nature Genetics. 23 (4): 387–388. doi:10.1038/70487. PMID 10581018. S2CID 29173492.