KEGG

Content
Description Bioinformatics resource for deciphering the genome
Organisms All
Contact
Research center Kyoto University
Laboratory Kanehisa Laboratories
Primary citation PMID 10592173
Release date 1995
Access
Website www.kegg.jp
Web service URL REST see KEGG API
Tools
Web KEGG Mapper

KEGG (Kyoto Encyclopedia of Genes and Genomes) is a collection of databases dealing with genomes, biological pathways, diseases, drugs, and chemical substances. KEGG is used for bioinformatics research and education, including data analysis in genomics, metagenomics, metabolomics, and other omics studies; modeling and simulation in systems biology; and translational research in drug development.


The KEGG database project was initiated in 1995 by Minoru Kanehisa, professor at the Institute for Chemical Research, Kyoto University, under the then ongoing Japanese Human Genome Program. [1] [2] Foreseeing the need for a computerized resource that can be used for biological interpretation of genome sequence data, he started developing the KEGG PATHWAY database. It is a collection of manually drawn KEGG pathway maps representing experimental knowledge on metabolism and various other functions of the cell and the organism. Each pathway map contains a network of molecular interactions and reactions and is designed to link genes in the genome to gene products (mostly proteins) in the pathway. This has enabled the analysis called KEGG pathway mapping, whereby the gene content in the genome is compared with the KEGG PATHWAY database to examine which pathways and associated functions are likely to be encoded in the genome.

According to the developers, KEGG is a "computer representation" of the biological system. [3] It integrates building blocks and wiring diagrams of the system—more specifically, genetic building blocks of genes and proteins, chemical building blocks of small molecules and reactions, and wiring diagrams of molecular interaction and reaction networks. This concept is realized in the following databases of KEGG, which are categorized into systems, genomic, chemical, and health information. [4]

Databases

Systems information

The KEGG PATHWAY database, the wiring diagram database, is the core of the KEGG resource. It is a collection of pathway maps integrating many entities including genes, proteins, RNAs, chemical compounds, glycans, and chemical reactions, as well as disease genes and drug targets, which are stored as individual entries in the other databases of KEGG. The pathway maps are classified into the following sections: metabolism, genetic information processing, environmental information processing, cellular processes, organismal systems, human diseases, and drug development.

The metabolism section contains aesthetically drawn global maps showing an overall picture of metabolism, in addition to regular metabolic pathway maps. The low-resolution global maps can be used, for example, to compare metabolic capacities of different organisms in genomics studies and different environmental samples in metagenomics studies. In contrast, KEGG modules in the KEGG MODULE database are higher-resolution, localized wiring diagrams, representing tighter functional units within a pathway map, such as subpathways conserved among specific organism groups and molecular complexes. KEGG modules are defined as characteristic gene sets that can be linked to specific metabolic capacities and other phenotypic features, so that they can be used for automatic interpretation of genome and metagenome data.
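The idea of using modules as gene sets for automatic interpretation can be illustrated with a minimal sketch. It assumes a simplified layout in which a module is an ordered list of steps and each step is a set of alternative ortholog identifiers; this layout, the function name `module_completeness`, and the identifiers `K1` to `K9` are hypothetical and are not KEGG's own definition format or data.

```python
from typing import List, Set

# Assumed layout for illustration: one set of alternative orthologs per step.
Module = List[Set[str]]


def module_completeness(module: Module, genome_kos: Set[str]) -> float:
    """Return the fraction of module steps covered by at least one ortholog
    present in the genome (0.0 for an empty module)."""
    if not module:
        return 0.0
    satisfied = sum(1 for step in module if step & genome_kos)
    return satisfied / len(module)


# Toy data: a three-step module and the orthologs annotated in one genome.
toy_module: Module = [{"K1"}, {"K2", "K3"}, {"K4"}]
toy_genome: Set[str] = {"K1", "K3", "K9"}

print(module_completeness(toy_module, toy_genome))
```

A module whose steps are all covered could then be reported as a metabolic capacity likely to be present in the genome or metagenome sample; partially covered modules would need a threshold chosen by the analyst.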

Another database that supplements KEGG PATHWAY is the KEGG BRITE database. It is an ontology database containing hierarchical classifications of various entities including genes, proteins, organisms, diseases, drugs, and chemical compounds. While KEGG PATHWAY is limited to molecular interactions and reactions of these entities, KEGG BRITE incorporates many different types of relationships.
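A hierarchical classification of this kind can be pictured as a tree whose leaves are entries. The following sketch assumes a nested-dictionary representation and looks up the chain of categories leading to one entry; the category names, the entry names, and the function `find_path` are hypothetical and do not reproduce any actual BRITE hierarchy or file format.

```python
from typing import Any, Dict, List, Optional, Tuple

# Assumed representation: inner nodes are dicts, leaves are lists of entries.
toy_hierarchy: Dict[str, Any] = {
    "Enzymes": {
        "Transferases": ["gene_a", "gene_b"],
        "Hydrolases": ["gene_c"],
    },
    "Transporters": ["gene_d"],
}


def find_path(tree: Dict[str, Any], entry: str,
              prefix: Tuple[str, ...] = ()) -> Optional[List[str]]:
    """Depth-first search for the list of category names above an entry."""
    for name, child in tree.items():
        path = prefix + (name,)
        if isinstance(child, dict):
            found = find_path(child, entry, path)
            if found is not None:
                return found
        elif entry in child:
            return list(path)
    return None


print(find_path(toy_hierarchy, "gene_c"))
```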

Genomic information

Several months after the KEGG project was initiated in 1995, the first report of a completely sequenced bacterial genome was published. [5] Since then, all published complete genomes, both eukaryotic and prokaryotic, have been accumulated in KEGG. The KEGG GENES database contains gene/protein-level information and the KEGG GENOME database contains organism-level information for these genomes. The KEGG GENES database consists of gene sets for the complete genomes, and the genes in each set are annotated by establishing correspondences to the wiring diagrams of KEGG pathway maps, KEGG modules, and BRITE hierarchies.

These correspondences are made using the concept of orthologs. The KEGG pathway maps are drawn based on experimental evidence in specific organisms, but they are designed to be applicable to other organisms as well, because different organisms, such as human and mouse, often share identical pathways consisting of functionally identical genes, called orthologous genes or orthologs. All the genes in the KEGG GENES database are grouped into such orthologs in the KEGG ORTHOLOGY (KO) database. Because the nodes (gene products) of KEGG pathway maps, as well as KEGG modules and BRITE hierarchies, are given KO identifiers, the correspondences are established once genes in the genome are annotated with KO identifiers by the genome annotation procedure in KEGG. [4]
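The two-step correspondence described above (gene to KO identifier, KO identifier to pathway node) can be sketched as a pair of lookups. This is a minimal illustration under stated assumptions: the gene names, KO labels, pathway names, and the function `map_genes_to_pathways` are all hypothetical, and the real annotation procedure in KEGG is not reproduced here.

```python
from collections import defaultdict
from typing import Dict, List, Set


def map_genes_to_pathways(
    gene_to_ko: Dict[str, str],
    pathway_nodes: Dict[str, Set[str]],
) -> Dict[str, Dict[str, List[str]]]:
    """For each pathway, list the genome's genes under each KO node they match."""
    # Invert the annotation so every KO identifier points at its genes.
    ko_to_genes: Dict[str, Set[str]] = defaultdict(set)
    for gene, ko in gene_to_ko.items():
        ko_to_genes[ko].add(gene)

    mapped: Dict[str, Dict[str, List[str]]] = {}
    for pathway, kos in pathway_nodes.items():
        mapped[pathway] = {
            ko: sorted(ko_to_genes[ko])
            for ko in sorted(kos)
            if ko in ko_to_genes  # membership test does not create new keys
        }
    return mapped


# Toy annotation: two genes share the same ortholog group K2.
toy_annotation = {"g1": "K1", "g2": "K2", "g3": "K2"}
toy_pathways = {"pathway_X": {"K1", "K2", "K3"}, "pathway_Y": {"K4"}}

print(map_genes_to_pathways(toy_annotation, toy_pathways))
```

In this toy case `pathway_X` has two of its three nodes matched and `pathway_Y` has none, which is the kind of evidence pathway mapping uses to judge which functions a genome is likely to encode.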

Chemical information

The KEGG metabolic pathway maps are drawn to represent the dual aspects of the metabolic network: the genomic network of how genome-encoded enzymes are connected to catalyze consecutive reactions and the chemical network of how chemical structures of substrates and products are transformed by these reactions. [6] A set of enzyme genes in the genome will identify enzyme relation networks when superimposed on the KEGG pathway maps, which in turn characterize chemical structure transformation networks allowing interpretation of biosynthetic and biodegradation potentials of the organism. Alternatively, a set of metabolites identified in the metabolome will lead to the understanding of enzymatic pathways and enzyme genes involved.
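The reverse direction, from observed metabolites to candidate enzymes, can be sketched with a small reaction table. The sketch assumes each reaction records its substrates, products, and catalyzing enzymes; the `Reaction` class, the function `enzymes_for_metabolites`, and all compound, reaction, and enzyme names are hypothetical placeholders rather than KEGG entries.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Set


@dataclass(frozen=True)
class Reaction:
    name: str
    substrates: FrozenSet[str]
    products: FrozenSet[str]
    enzymes: FrozenSet[str]


def enzymes_for_metabolites(reactions: Iterable[Reaction],
                            observed: Set[str]) -> Set[str]:
    """Collect enzymes of every reaction that consumes or produces
    at least one observed metabolite."""
    hits: Set[str] = set()
    for rxn in reactions:
        if (rxn.substrates | rxn.products) & observed:
            hits |= rxn.enzymes
    return hits


# Toy chemical network: A -> B -> C, plus an unrelated reaction D -> E.
toy_reactions = [
    Reaction("R1", frozenset({"A"}), frozenset({"B"}), frozenset({"E1"})),
    Reaction("R2", frozenset({"B"}), frozenset({"C"}), frozenset({"E2"})),
    Reaction("R3", frozenset({"D"}), frozenset({"E"}), frozenset({"E3"})),
]

print(sorted(enzymes_for_metabolites(toy_reactions, {"B"})))
```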

The databases in the chemical information category, which are collectively called KEGG LIGAND, are organized by capturing knowledge of the chemical network. In the beginning of the KEGG project, KEGG LIGAND consisted of three databases: KEGG COMPOUND for chemical compounds, KEGG REACTION for chemical reactions, and KEGG ENZYME for reactions in the enzyme nomenclature. [7] Currently, there are additional databases: KEGG GLYCAN for glycans [8] and two auxiliary reaction databases called RPAIR (reactant pair alignments) and RCLASS (reaction class). [9] KEGG COMPOUND has also been expanded to contain various compounds such as xenobiotics, in addition to metabolites.

Health information

In KEGG, diseases are viewed as perturbed states of the biological system caused by perturbants of genetic factors and environmental factors, and drugs are viewed as different types of perturbants. [10] The KEGG PATHWAY database includes not only the normal states but also the perturbed states of the biological systems. However, disease pathway maps cannot be drawn for most diseases because molecular mechanisms are not well understood. An alternative approach is taken in the KEGG DISEASE database, which simply catalogs known genetic factors and environmental factors of diseases. These catalogs may eventually lead to more complete wiring diagrams of diseases.

The KEGG DRUG database contains active ingredients of drugs approved in Japan, the US, and Europe. They are distinguished by chemical structures and/or chemical components and associated with target molecules, metabolizing enzymes, and other molecular interaction network information in the KEGG pathway maps and the BRITE hierarchies. This enables an integrated analysis of drug interactions with genomic information. Crude drugs and other health-related substances, which are outside the category of approved drugs, are stored in the KEGG ENVIRON database. The databases in the health information category are collectively called KEGG MEDICUS, which also includes package inserts of all marketed drugs in Japan.

Subscription model

In July 2011, KEGG introduced a subscription model for FTP download due to a significant cutback of government funding. KEGG continues to be freely available through its website, but the subscription model has prompted discussion about the sustainability of bioinformatics databases. [11] [12]

References

  1. Kanehisa M, Goto S (2000). "KEGG: Kyoto Encyclopedia of Genes and Genomes". Nucleic Acids Res. 28 (1): 27–30. doi:10.1093/nar/28.1.27. PMC 102409. PMID 10592173.
  2. Kanehisa M (1997). "A database for post-genome analysis". Trends Genet. 13 (9): 375–6. doi:10.1016/S0168-9525(97)01223-7. PMID 9287494.
  3. Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, Katayama T, Araki M, Hirakawa M (2006). "From genomics to chemical genomics: new developments in KEGG". Nucleic Acids Res. 34 (Database issue): D354–7. doi:10.1093/nar/gkj102. PMC 1347464. PMID 16381885.
  4. Kanehisa M, Goto S, Sato Y, Kawashima M, Furumichi M, Tanabe M (2014). "Data, information, knowledge and principle: back to metabolism in KEGG". Nucleic Acids Res. 42 (Database issue): D199–205. doi:10.1093/nar/gkt1076. PMC 3965122. PMID 24214961.
  5. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM, et al. (1995). "Whole-genome random sequencing and assembly of Haemophilus influenzae Rd". Science. 269 (5223): 496–512. Bibcode:1995Sci...269..496F. doi:10.1126/science.7542800. PMID 7542800. S2CID 10423613.
  6. Kanehisa M (2013). "Chemical and genomic evolution of enzyme-catalyzed reaction networks". FEBS Lett. 587 (17): 2731–7. doi:10.1016/j.febslet.2013.06.026. hdl:2433/178762. PMID 23816707. S2CID 40074657.
  7. Goto S, Nishioka T, Kanehisa M (1999). "LIGAND database for enzymes, compounds and reactions". Nucleic Acids Res. 27 (1): 377–9. doi:10.1093/nar/27.1.377. PMC 148189. PMID 9847234.
  8. Hashimoto K, Goto S, Kawano S, Aoki-Kinoshita KF, Ueda N, Hamajima M, Kawasaki T, Kanehisa M (2006). "KEGG as a glycome informatics resource". Glycobiology. 16 (5): 63R–70R. doi:10.1093/glycob/cwj010. PMID 16014746.
  9. Muto A, Kotera M, Tokimatsu T, Nakagawa Z, Goto S, Kanehisa M (2013). "Modular architecture of metabolic pathways revealed by conserved sequences of reactions". J Chem Inf Model. 53 (3): 613–22. doi:10.1021/ci3005379. PMC 3632090. PMID 23384306.
  10. Kanehisa M, Goto S, Furumichi M, Tanabe M, Hirakawa M (2010). "KEGG for representation and analysis of molecular networks involving diseases and drugs". Nucleic Acids Res. 38 (Database issue): D355–60. doi:10.1093/nar/gkp896. PMC 2808910. PMID 19880382.
  11. Galperin MY, Fernández-Suárez XM (2012). "The 2012 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection". Nucleic Acids Res. 40 (Database issue): D1–8. doi:10.1093/nar/gkr1196. PMC 3245068. PMID 22144685.
  12. Hayden EC (2013). "Popular plant database set to charge users". Nature. doi:10.1038/nature.2013.13642.