Locus Reference Genomic (LRG) [1] [2] is a DNA sequence format that was developed to aid in curating locus-specific databases (LSDBs), which record DNA sequence variation that can result in inherited diseases. LRGs have fixed sequences that are independent of updates to the reference genome assembly, so they provide a stable framework for reporting variants. The LRG format uses Extensible Markup Language (XML) to provide highly structured single records containing the genomic DNA sequence for individual genes along with the mRNAs and proteins encoded by these genes. LRG records are recommended in the Human Genome Variation Society nomenclature guidelines as reference sequences for reporting sequence variants in LSDBs and the literature.
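Because an LRG record is a single structured XML document, its fixed part can be read with any standard XML parser. The sketch below is illustrative only: the sample record is hypothetical and heavily simplified, and the element names (fixed_annotation, id, sequence, transcript) are assumptions loosely modelled on the LRG layout rather than a complete or authoritative rendering of the schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified LRG-like record (not a real LRG entry).
SAMPLE_LRG = """<lrg>
  <fixed_annotation>
    <id>LRG_EXAMPLE</id>
    <sequence>ATGGCCTTAGGC</sequence>
    <transcript name="t1">
      <coordinates start="1" end="12" strand="1"/>
    </transcript>
  </fixed_annotation>
</lrg>"""


def summarize_lrg(xml_text):
    """Return the identifier, sequence length and transcript names of a record."""
    root = ET.fromstring(xml_text)
    fixed = root.find("fixed_annotation")
    if fixed is None:
        raise ValueError("no fixed_annotation section found")
    # Strip whitespace so that pretty-printed records give the true length.
    sequence = (fixed.findtext("sequence") or "").strip()
    return {
        "id": fixed.findtext("id"),
        "sequence_length": len(sequence),
        "transcripts": [t.get("name") for t in fixed.findall("transcript")],
    }


summary = summarize_lrg(SAMPLE_LRG)
print(summary)
```

A real record would also carry an updatable annotation section and far longer sequences; the same parsing approach should apply, but the element paths would need to be checked against the published LRG schema.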
Deoxyribonucleic acid is a molecule composed of two chains that coil around each other to form a double helix carrying the genetic instructions used in the growth, development, functioning, and reproduction of all known organisms and many viruses. DNA and ribonucleic acid (RNA) are nucleic acids; alongside proteins, lipids and complex carbohydrates (polysaccharides), nucleic acids are one of the four major types of macromolecules that are essential for all known forms of life.
In biology, a mutation is the alteration of the nucleotide sequence of the genome of an organism, virus, or extrachromosomal DNA.
Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. The W3C's XML 1.0 Specification and several other related specifications—all of them free open standards—define XML.
The LRG concept was developed by the GEN2PHEN project in conjunction with the European Bioinformatics Institute (EBI) and the National Center for Biotechnology Information (NCBI).
The European Bioinformatics Institute (EMBL-EBI) is an intergovernmental organization (IGO) which, as part of the European Molecular Biology Laboratory (EMBL) family, focuses on research and services in bioinformatics.
The National Center for Biotechnology Information (NCBI) is part of the United States National Library of Medicine (NLM), a branch of the National Institutes of Health (NIH). The NCBI is located in Bethesda, Maryland and was founded in 1988 through legislation sponsored by Senator Claude Pepper.
The LRG homepage provides access to existing LRG sequences and allows the submission of requests for the creation of new LRGs. This page also has a frequently asked questions (FAQs) section.
An FAQ is a list of frequently asked questions (FAQs) and answers on a particular topic. The format is often used in articles, websites, email lists, and online forums where common questions tend to recur, for example through posts or queries by new users related to common knowledge gaps. The purpose of an FAQ is generally to provide information on frequent questions or concerns; however, the format is a useful means of organizing information, and text consisting of questions and their answers may thus be called an FAQ regardless of whether the questions are actually frequently asked.
The human genome is the complete set of nucleic acid sequences for humans, encoded as DNA within the 23 chromosome pairs in cell nuclei and in a small DNA molecule found within individual mitochondria. These are usually treated separately as the nuclear genome and the mitochondrial genome. Human genomes include both protein-coding DNA genes and noncoding DNA. Haploid human genomes, which are contained in germ cells, consist of three billion DNA base pairs, while diploid genomes have twice the DNA content. While there are significant differences among the genomes of human individuals, these are considerably smaller than the differences between humans and their closest living relatives, the chimpanzees and bonobos.
Ensembl genome database project is a joint scientific project between the European Bioinformatics Institute and the Wellcome Trust Sanger Institute, which was launched in 1999 in response to the imminent completion of the Human Genome Project. Ensembl aims to provide a centralized resource for geneticists, molecular biologists and other researchers studying the genomes of our own species and other vertebrates and model organisms. Ensembl is one of several well known genome browsers for the retrieval of genomic information.
An accession number in bioinformatics is a unique identifier given to a DNA or protein sequence record to allow for tracking of different versions of that sequence record and the associated sequence over time in a single data repository. Because of their relative stability, accession numbers can be used as foreign keys for referring to a sequence object, but not necessarily to a unique sequence. All sequence information repositories implement the concept of "accession number" but might do so with subtle variations.
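As an illustration of how a stable accession can be separated from a changing version, the sketch below assumes the common "ACCESSION.VERSION" convention, in which a numeric version follows a dot. This is an assumption: repositories differ in their conventions, and the identifier used here is made up for the example.

```python
def split_accession(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version).

    The version is returned as an int, or None when the identifier
    carries no version suffix. The dot-separated layout is an assumed
    convention, not a rule shared by every repository.
    """
    accession, dot, version = identifier.strip().partition(".")
    if not accession:
        raise ValueError("empty accession")
    if not dot:
        return accession, None
    if not version.isdigit():
        raise ValueError(f"non-numeric version in {identifier!r}")
    return accession, int(version)


# Hypothetical identifier, used only to demonstrate the split.
versioned = split_accession("AB123456.2")
unversioned = split_accession("AB123456")
print(versioned, unversioned)
```

Storing the bare accession as the foreign key and the version as a separate field lets a database follow a record across revisions while still recording which exact sequence a variant was described against.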
Copy number variation (CNV) is a phenomenon in which sections of the genome are repeated and the number of repeats in the genome varies between individuals in the human population. Copy number variation is a type of structural variation: specifically, it is a type of duplication or deletion event that affects a considerable number of base pairs. Although modern genomics research is mostly focused on human genomes, copy number variations also occur in a variety of other organisms, including E. coli. Recent research indicates that approximately two thirds of the entire human genome is composed of repeats and that 4.8–9.5% of the human genome can be classified as copy number variations. In mammals, copy number variations play an important role in generating necessary variation in the population as well as in disease phenotypes.
Genotyping is the process of determining differences in the genetic make-up (genotype) of an individual by examining the individual's DNA sequence using biological assays and comparing it to another individual's sequence or a reference sequence. It reveals the alleles an individual has inherited from their parents. Traditionally genotyping is the use of DNA sequences to define biological populations by use of molecular tools. It does not usually involve defining the genes of an individual.
The Human Genome Project (HGP) was an international scientific research project with the goal of determining the sequence of nucleotide base pairs that make up human DNA, and of identifying and mapping all of the genes of the human genome from both a physical and a functional standpoint. It remains the world's largest collaborative biological project. The US government picked up the idea in 1984, when planning started; the project formally launched in 1990 and was declared complete on April 14, 2003. Funding came from the US government through the National Institutes of Health (NIH) as well as numerous other groups from around the world. A parallel project was conducted outside government by the Celera Corporation, or Celera Genomics, which was formally launched in 1998. Most of the government-sponsored sequencing was performed in twenty universities and research centers in the United States, the United Kingdom, Japan, France, Germany and China.
The Saccharomyces Genome Database (SGD) is a scientific database of the molecular biology and genetics of the yeast Saccharomyces cerevisiae, which is commonly known as baker's or budding yeast.
UniGene is an NCBI database of the transcriptome and thus, despite the name, not primarily a database for genes. Each entry is a set of transcripts that appear to stem from the same transcription locus. Information on protein similarities, gene expression, cDNA clones, and genomic location is included with each entry.
Bardet-Biedl syndrome 5 protein is a protein that in humans is encoded by the BBS5 gene.
SPEG complex locus, also known as SPEG, is a human gene.
The 1000 Genomes Project, launched in January 2008, was an international research effort to establish by far the most detailed catalogue of human genetic variation. Scientists planned to sequence the genomes of at least one thousand anonymous participants from a number of different ethnic groups within the following three years, using newly developed technologies which were faster and less expensive. In 2010, the project finished its pilot phase, which was described in detail in a publication in the journal Nature. In 2012, the sequencing of 1092 genomes was announced in a Nature publication. In 2015, two papers in Nature reported results and the completion of the project and opportunities for future research. Many rare variations, restricted to closely related groups, were identified, and eight structural-variation classes were analyzed.
Genotype to Phenotype Databases: a Holistic Approach (GEN2PHEN) is a European project aiming to develop a knowledge web portal integrating information from the genotype to the phenotype in a unifying portal: The Knowledge Centre.
The Leiden Open Variation Database (LOVD) is a free, flexible web-based open source database developed in the Leiden University Medical Center in the Netherlands, designed to collect and display variants in the DNA sequence. The focus of an LOVD is usually the combination of a gene and a genetic (heritable) disease. All sequence variants found in individuals are collected in the database, together with information about whether they could be causally connected to the disease or not. Specialized doctors use LOVDs to diagnose and advise patients carrying a genetic disease. Ideally, if a patient has been screened for mutations and one has been found, information in LOVD can predict the progress of the disease.
Exome sequencing, also known as whole exome sequencing (WES), is a genomic technique for sequencing all of the protein-coding regions of genes in a genome. It consists of two steps. The first step is to select only the subset of DNA that encodes proteins. These regions are known as exons; humans have about 180,000 exons, constituting about 1% of the human genome, or approximately 30 million base pairs. The second step is to sequence the exonic DNA using any high-throughput DNA sequencing technology.
Mutalyzer is a web-based software tool which was primarily developed to check the description of sequence variants identified in a gene during genetic testing. Mutalyzer applies the rules of the standard human sequence variant nomenclature and can correct descriptions accordingly. Apart from the sequence variant description, Mutalyzer requires a DNA sequence record containing the transcript and protein feature annotation as a reference. Mutalyzer 2 accepts GenBank and Locus Reference Genomic (LRG) records. The annotation is also used to apply the correct codon translation tables and generate DNA and protein variant descriptions for any organism. The Mutalyzer server supports programmatic access via a SOAP Web service described in the Web Services Description Language (WSDL) and an HTTP/RPC+JSON web service.
The Variant Call Format (VCF) specifies the format of a text file used in bioinformatics for storing gene sequence variations. The format was developed with the advent of large-scale genotyping and DNA sequencing projects, such as the 1000 Genomes Project. Existing formats for genetic data, such as the General Feature Format (GFF), stored all of the genetic data, much of which is redundant because it is shared across genomes. With the Variant Call Format, only the variations need to be stored, along with a reference genome.
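To make the idea of storing only the variations concrete, the sketch below parses one tab-separated VCF data line into its eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). It is a minimal illustration rather than a full parser: header lines, per-sample genotype columns and type conversion of INFO values are ignored, and the function and class names are hypothetical.

```python
from typing import NamedTuple


class VcfRecord(NamedTuple):
    chrom: str
    pos: int
    var_id: str
    ref: str
    alt: list
    qual: str
    filt: str
    info: dict


def parse_vcf_line(line):
    """Parse the eight fixed columns of a single VCF data line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 tab-separated columns")
    chrom, pos, var_id, ref, alt, qual, filt, info = fields[:8]
    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, eq, value = item.partition("=")
            # Keys without '=' are flags; record them as True.
            info_dict[key] = value if eq else True
    # ALT may list several comma-separated alternative alleles.
    return VcfRecord(chrom, int(pos), var_id, ref, alt.split(","),
                     qual, filt, info_dict)


# Illustrative data line: a G-to-A substitution at position 14370 of contig 20.
record = parse_vcf_line("20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB")
print(record.chrom, record.pos, record.ref, record.alt, record.info)
```

Only the position, the reference allele and the alternative alleles are recorded; the surrounding sequence is recovered from the reference genome, which is what keeps such files compact.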
A gene is said to be polymorphic if more than one allele occupies that gene’s locus within a population. In addition to having more than one allele at a specific locus, each allele must also occur in the population at a rate of at least 1% to generally be considered polymorphic.
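The 1% criterion can be expressed as a short check on allele counts. The sketch below takes one reading of the definition, namely that a locus counts as polymorphic when at least two alleles each reach the threshold frequency; the function name, the input layout and the sample counts are hypothetical.

```python
def is_polymorphic(allele_counts, threshold=0.01):
    """Return True if at least two alleles each reach the threshold frequency.

    allele_counts maps an allele label to the number of times it was
    observed in the sampled population.
    """
    total = sum(allele_counts.values())
    if total == 0:
        raise ValueError("no alleles observed")
    common = [allele for allele, count in allele_counts.items()
              if count / total >= threshold]
    return len(common) >= 2


# Hypothetical counts from 1000 sampled chromosomes.
common_variant = is_polymorphic({"A": 980, "G": 20})   # G at 2%
rare_variant = is_polymorphic({"A": 999, "G": 1})      # G at 0.1%
print(common_variant, rare_variant)
```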
De novo transcriptome assembly is the de novo sequence assembly method of creating a transcriptome without the aid of a reference genome.
Ensembl Genomes is a scientific project to provide genome-scale data from non-vertebrate species. The project is run by the European Bioinformatics Institute, and was launched in 2009 using the Ensembl technology. The main objective of the Ensembl Genomes database is to complement the main Ensembl database by introducing five additional web pages to include genome data for bacteria, fungi, invertebrate metazoa, plants, and protists. For each of the domains, the Ensembl tools are available for manipulation, analysis and visualization of genome data. Most Ensembl Genomes data is stored in MySQL relational databases and can be accessed by the Ensembl Perl API, virtual machines or online.