Variome

Last updated

The variome is the whole set of genetic variations found in populations of species that have gone through a relatively short evolution change. For example, among humans, about 1 in every 1,200 [ citation needed ] nucleotide bases differ. The size of human variome in terms of effective population size is claimed to be about 10,000 individuals. This variation rate is comparatively small compared to other species. For example, the effective population size of tigers which perhaps has the whole population size less than 10,000 in the wild is not much smaller than the human species indicating a much higher level of genetic diversity although they are close to extinction in the wild. In practice, the variome can be the sum of the single nucleotide polymorphisms (SNPs), indels, and structural variation (SV) of a population or species. The Human Variome Project seeks to compile this genetic variation data worldwide. Variomics is the study of variome and a branch of bioinformatics.

Contents

Ethnic variomes

The human variome can be subdivided into smaller ethnicity specific variomes. Each variome can have a utility in terms of filtering out common ethnic specific variants in the analyses of cancer normal variants filtering for more efficient detection of somatic mutations that can be relevant to certain anti-cancer drugs. KoVariome is one such ethnic specific variome where the project uses the term variome to denote their identity and connection to the concept of the broader human variome. KoVariome founders have been affiliated with HVP since early days of HVP where Prof. Richard Cotton initiated various efforts to compile the human variome resources.

Many curated databases has been established to document the impact of clinically significant sequence variations, such as dbSNP [1] or ClinVar. [2] Similarly, many services have been developed by the bioinformatics community to search the literature for variants. [3]

Etymology

The blend word 'variome' is from genetic variant (“a version of a gene that differs from other versions of the same gene which may or may not have an effect on human health”) and the suffix –ome (“the complete whole of a class of substances for a species or an individual”). ‘Variomics’ (which appeared in the literature before ‘variome’) was first coined by Professor Richard ‘Dick’ Cotton in 2002 to describe (“The systematic study of the effect of genetic variation on human health”). ‘Variome’ has since been most commonly used in reference to the ‘Human Variome Project’ founded four years in 2006 after ‘variomics’ was first coined.

Projects

There are a number of international project studying the human variome, including the International HapMap Project and the Human Variome Project (HVP). [4] The HapMap Project aims to identify and catalog genetic similarities and differences among humans. The HVP aims to collect data on all human genetic variation. The Korean Genome Project, which aims to collect all the East Asian ethnic Korean genetic variations has produced KoVariome that contains currently over 1,000 Korean whole genome variation information. [5] Turkish Genome Project has plans to analyze genomes of 100,000 people in Turkey. [6]

See also


Notes

  1. Sirotkin, Sherry (2001). "dbSNP: the NCBI database of genetic variation". Nucleic Acids Research. 1 (29): 308–311. doi:10.1093/nar/29.1.308. PMC   29783 . PMID   11125122.
  2. Kattman, Landrum (2019). "ClinVar: improvements to accessing data". Nucleic Acids Research. 48 (1): D835–D844. doi:10.1093/nar/gkz972. PMC   6943040 . PMID   31777943.
  3. Ruch, Pasche (2022). "Variomes: a high recall search engine to support the curation of genomic variants". Bioinformatics. 38 (9): 2595–2601. doi:10.1093/bioinformatics/btac146. PMC   9048643 . PMID   35274687.
  4. Vizzini, Casimiro (March 19, 2015). "The Human Variome Project: Global Coordination in Data Sharing". Science & Diplomacy. 4 (1).
  5. Jeon, Sungwon; Bhak, Youngjune; Choi, Yeonsong; Jeon, Yeonsu; Kim, Seunghoon; Jang, Jaeyoung; Jang, Jinho; Blazyte, Asta; Kim, Changjae; Kim, Yeonkyung; Shim, Jungae (2020-05-01). "Korean Genome Project: 1094 Korean personal genomes with clinical information". Science Advances. 6 (22): eaaz7835. Bibcode:2020SciA....6.7835J. doi: 10.1126/sciadv.aaz7835 . ISSN   2375-2548. PMC   7385432 . PMID   32766443.
  6. "Turkish Genome Project discussed at GTU". Gebze Technical University. n.d. Retrieved 1 February 2024.

Related Research Articles

<span class="mw-page-title-main">Human genome</span> Complete set of nucleic acid sequences for humans

The human genome is a complete set of nucleic acid sequences for humans, encoded as DNA within the 23 chromosome pairs in cell nuclei and in a small DNA molecule found within individual mitochondria. These are usually treated separately as the nuclear genome and the mitochondrial genome. Human genomes include both protein-coding DNA sequences and various types of DNA that does not encode proteins. The latter is a diverse category that includes DNA coding for non-translated RNA, such as that for ribosomal RNA, transfer RNA, ribozymes, small nuclear RNAs, and several types of regulatory RNAs. It also includes promoters and their associated gene-regulatory elements, DNA playing structural and replicatory roles, such as scaffolding regions, telomeres, centromeres, and origins of replication, plus large numbers of transposable elements, inserted viral DNA, non-functional pseudogenes and simple, highly repetitive sequences. Introns make up a large percentage of non-coding DNA. Some of this non-coding DNA is non-functional junk DNA, such as pseudogenes, but there is no firm consensus on the total amount of junk DNA.

<span class="mw-page-title-main">Single-nucleotide polymorphism</span> Single nucleotide in genomic DNA at which different sequence alternatives exist

In genetics and bioinformatics, a single-nucleotide polymorphism is a germline substitution of a single nucleotide at a specific position in the genome that is present in a sufficiently large fraction of considered population.

The International HapMap Project was an organization that aimed to develop a haplotype map (HapMap) of the human genome, to describe the common patterns of human genetic variation. HapMap is used to find genetic variants affecting health, disease and responses to drugs and environmental factors. The information produced by the project is made freely available for research.

<span class="mw-page-title-main">Human genetic variation</span> Genetic diversity in human populations

Human genetic variation is the genetic differences in and among populations. There may be multiple variants of any given gene in the human population (alleles), a situation called polymorphism.

In molecular biology, SNP array is a type of DNA microarray which is used to detect polymorphisms within a population. A single nucleotide polymorphism (SNP), a variation at a single site in DNA, is the most frequent type of variation in the genome. Around 335 million SNPs have been identified in the human genome, 15 million of which are present at frequencies of 1% or higher across different populations worldwide.

A tag SNP is a representative single nucleotide polymorphism (SNP) in a region of the genome with high linkage disequilibrium that represents a group of SNPs called a haplotype. It is possible to identify genetic variation and association to phenotypes without genotyping every SNP in a chromosomal region. This reduces the expense and time of mapping genome areas associated with disease, since it eliminates the need to study every individual SNP. Tag SNPs are useful in whole-genome SNP association studies in which hundreds of thousands of SNPs across the entire genome are genotyped.

<span class="mw-page-title-main">Human Variome Project</span>

The Human Variome Project (HVP) is the global initiative to collect and curate all human genetic variation affecting human health. Its mission is to improve health outcomes by facilitating the unification of data on human genetic variation and its impact on human health.

<span class="mw-page-title-main">Genome-wide association study</span> Study of genetic variants in different individuals

In genomics, a genome-wide association study, is an observational study of a genome-wide set of genetic variants in different individuals to see if any variant is associated with a trait. GWA studies typically focus on associations between single-nucleotide polymorphisms (SNPs) and traits like major human diseases, but can equally be applied to any other genetic variants and any other organisms.

Minor allele frequency (MAF) is the frequency at which the second most common allele occurs in a given population. They play a surprising role in heritability since MAF variants which occur only once, known as "singletons", drive an enormous amount of selection.

<span class="mw-page-title-main">1000 Genomes Project</span> International research effort on genetic variation

The 1000 Genomes Project (1KGP), taken place from January 2008 to 2015, was an international research effort to establish the most detailed catalogue of human genetic variation at the time. Scientists planned to sequence the genomes of at least one thousand anonymous healthy participants from a number of different ethnic groups within the following three years, using advancements in newly developed technologies. In 2010, the project finished its pilot phase, which was described in detail in a publication in the journal Nature. In 2012, the sequencing of 1092 genomes was announced in a Nature publication. In 2015, two papers in Nature reported results and the completion of the project and opportunities for future research.

dbSNP Genetics database

The Single Nucleotide Polymorphism Database (dbSNP) is a free public archive for genetic variation within and across different species developed and hosted by the National Center for Biotechnology Information (NCBI) in collaboration with the National Human Genome Research Institute (NHGRI). Although the name of the database implies a collection of one class of polymorphisms only, it in fact contains a range of molecular variation: (1) SNPs, (2) short deletion and insertion polymorphisms (indels/DIPs), (3) microsatellite markers or short tandem repeats (STRs), (4) multinucleotide polymorphisms (MNPs), (5) heterozygous sequences, and (6) named variants. The dbSNP accepts apparently neutral polymorphisms, polymorphisms corresponding to known phenotypes, and regions of no variation. It was created in September 1998 to supplement GenBank, NCBI’s collection of publicly available nucleic acid and protein sequences.

WGAViewer is a bioinformatics software tool which is designed to visualize, annotate, and help interpret the results generated from a genome wide association study (GWAS). Alongside the P values of association, WGAViewer allows a researcher to visualize and consider other supporting evidence, such as the genomic context of the SNP, linkage disequilibrium (LD) with ungenotyped SNPs, gene expression database, and the evidence from other GWAS projects, when determining the potential importance of an individual SNP.

<span class="mw-page-title-main">Exome sequencing</span> Sequencing of all the exons of a genome

Exome sequencing, also known as whole exome sequencing (WES), is a genomic technique for sequencing all of the protein-coding regions of genes in a genome. It consists of two steps: the first step is to select only the subset of DNA that encodes proteins. These regions are known as exons—humans have about 180,000 exons, constituting about 1% of the human genome, or approximately 30 million base pairs. The second step is to sequence the exonic DNA using any high-throughput DNA sequencing technology.

Genomic structural variation is the variation in structure of an organism's chromosome. It consists of many kinds of variation in the genome of one species, and usually includes microscopic and submicroscopic types, such as deletions, duplications, copy-number variants, insertions, inversions and translocations. Originally, a structure variation affects a sequence length about 1kb to 3Mb, which is larger than SNPs and smaller than chromosome abnormality. However, the operational range of structural variants has widened to include events > 50bp. The definition of structural variation does not imply anything about frequency or phenotypical effects. Many structural variants are associated with genetic diseases, however many are not. Recent research about SVs indicates that SVs are more difficult to detect than SNPs. Approximately 13% of the human genome is defined as structurally variant in the normal population, and there are at least 240 genes that exist as homozygous deletion polymorphisms in human populations, suggesting these genes are dispensable in humans. Rapidly accumulating evidence indicates that structural variations can comprise millions of nucleotides of heterogeneity within every genome, and are likely to make an important contribution to human diversity and disease susceptibility.

The Variant Call Format (VCF) is a standard text file format used in bioinformatics for storing gene sequence variations. The format was developed in 2010 for the 1000 Genomes Project and has since been used by other large-scale genotyping and DNA sequencing projects. VCF is a common output format for variant calling programs due to its relative simplicity and scalability. Many tools have been developed for editing and manipulating VCF files, including VCFtools, which was released in conjunction with the VCF format in 2011, and BCFtools, which was included as part of SAMtools until being split into an independent package in 2014.

The Functional Element SNPs Database (FESD) is a biological database of single nucleotide polymorphisms in molecular biology. The database is a tool designed to organize functional elements into categories in human gene regions and to output their sequences needed for genotyping experiments as well as provide a set of SNPs that lie within each region. The database defines functional elements into ten types: promoter regions, CpG islands, 5' untranslated regions (5'-UTRs), translation start sites, splice sites, coding exons, introns, translation stop sites, polyadenylation signals, and 3' UTRs. People may reference this database for haplotype information or obtain a flanking sequence for genotyping. This may help in finding mutations that contribute to common and polygenic diseases. Researchers can manually choose a group of SNPs of special interest for certain functional elements along with their corresponding sequences. The database combines information from sources such as HapMap, UCSC GoldenPath, dbSNP, OMIM, and TRANSFAC. Users can obtain information about tag SNPs and simulate LD blocks for each gene. FESD is still a developing database and is not widely known so was unable to find projects that used the database. Research was found using similar databases or databases that are combined in FESD's information pool.

Imputation in genetics refers to the statistical inference of unobserved genotypes. It is achieved by using known haplotypes in a population, for instance from the HapMap or the 1000 Genomes Project in humans, thereby allowing to test for association between a trait of interest and experimentally untyped genetic variants, but whose genotypes have been statistically inferred ("imputed"). Genotype imputation is usually performed on SNPs, the most common kind of genetic variation.

SNV calling from NGS data is any of a range of methods for identifying the existence of single nucleotide variants (SNVs) from the results of next generation sequencing (NGS) experiments. These are computational techniques, and are in contrast to special experimental methods based on known population-wide single nucleotide polymorphisms. Due to the increasing abundance of NGS data, these techniques are becoming increasingly popular for performing SNP genotyping, with a wide variety of algorithms designed for specific experimental designs and applications. In addition to the usual application domain of SNP genotyping, these techniques have been successfully adapted to identify rare SNPs within a population, as well as detecting somatic SNVs within an individual using multiple tissue samples.

Single nucleotide polymorphism annotation is the process of predicting the effect or function of an individual SNP using SNP annotation tools. In SNP annotation the biological information is extracted, collected and displayed in a clear form amenable to query. SNP functional annotation is typically performed based on the available information on nucleic acid and protein sequences.

ANNOVAR is a bioinformatics software tool for the interpretation and prioritization of single nucleotide variants (SNVs), insertions, deletions, and copy number variants (CNVs) of a given genome.

References