International Chemical Identifier

InChI
Developer(s): InChI Trust
Initial release: April 15, 2005 [1] [2]
Stable release: 1.06 / December 15, 2020
Operating system: Microsoft Windows and Unix-like
Platform: IA-32 and x86-64
Available in: English
License: IUPAC / InChI Trust Licence
Website: www.inchi-trust.org

The IUPAC International Chemical Identifier (InChI, /ˈɪntʃiː/ IN-chee or /ˈɪŋkiː/ ING-kee) is a textual identifier for chemical substances, designed to provide a standard way to encode molecular information and to facilitate the search for such information in databases and on the web. It was initially developed by IUPAC (International Union of Pure and Applied Chemistry) and NIST (National Institute of Standards and Technology) from 2000 to 2005, and its format and algorithms are non-proprietary.

The identifiers describe chemical substances in terms of layers of information: the atoms and their bond connectivity, tautomeric information, isotope information, stereochemistry, and electronic charge information. [3] Not all layers have to be provided; for instance, the tautomer layer can be omitted if that type of information is not relevant to the particular application. The InChI algorithm converts input structural information into a unique InChI identifier in a three-step process: normalization (to remove redundant information), canonicalization (to generate a unique number label for each atom), and serialization (to give a string of characters).

InChIs differ from the widely used CAS registry numbers in three respects: firstly, they are freely usable and non-proprietary; secondly, they can be computed from structural information and do not have to be assigned by some organization; and thirdly, most of the information in an InChI is human readable (with practice). InChIs can thus be seen as akin to a general and extremely formalized version of IUPAC names. They can express more information than the simpler SMILES notation and differ in that every structure has a unique InChI string, which is important in database applications. Information about the 3-dimensional coordinates of atoms is not represented in InChI; for this purpose a format such as PDB can be used.

The InChIKey, sometimes referred to as a hashed InChI, is a fixed length (27 character) condensed digital representation of the InChI that is not human-understandable. The InChIKey specification was released in September 2007 in order to facilitate web searches for chemical compounds, since these were problematic with the full-length InChI. [4] Unlike the InChI, the InChIKey is not unique: though collisions can be calculated to be very rare, they happen. [5]

In January 2009, version 1.02 of the InChI software was released. This provided a means to generate the so-called standard InChI, which does not allow for user-selectable options in dealing with the stereochemistry and tautomeric layers of the InChI string. The standard InChIKey is then the hashed version of the standard InChI string. The standard InChI simplifies comparison of InChI strings and keys generated by different groups and subsequently accessed via diverse sources such as databases and web resources.

The continuing development of the standard has been supported since 2010 by the not-for-profit InChI Trust, of which IUPAC is a member. The current software version is 1.06 and was released in December 2020. [6] Prior to 1.04, the software was freely available under the open-source LGPL license, [7] but it now uses a custom license called IUPAC-InChI Trust License. [8]

Generation

In order to avoid generating different InChIs for tautomeric structures, before generating the InChI, an input chemical structure is normalized to reduce it to its so-called core parent structure. This may involve changing bond orders, rearranging formal charges and possibly adding and removing protons. Different input structures may give the same result; for example, acetic acid and acetate would both give the same core parent structure, that of acetic acid. A core parent structure may be disconnected, consisting of more than one component, in which case the sublayers in the InChI usually consist of sublayers for each component, separated by semicolons (periods for the chemical formula sublayer.) One way this can happen is that all metal atoms are disconnected during normalization; so, for example, the InChI for tetraethyllead will have five components, one for lead and four for the ethyl groups. [3]

The first, main, layer of the InChI refers to this core parent structure, giving its chemical formula, non-hydrogen connectivity without bond order (/c sublayer) and hydrogen connectivity (/h sublayer.) The /q portion of the charge layer gives its charge, and the /p portion of the charge layer tells how many protons (hydrogen ions) must be added to or removed from it to regenerate the original structure. If present, the stereochemical layer, with sublayers /b, /t, /m and /s, gives stereochemical information, and the isotopic layer /i (which may contain sublayers /h, /b, /t, /m and /s) gives isotopic information. These are the only layers which can occur in a standard InChI. [3]

If the user wants to specify an exact tautomer, a fixed hydrogen layer /f can be appended, which may contain various additional sublayers; this cannot be done in standard InChI though, so different tautomers will have the same standard InChI (for example, alanine will give the same standard InChI whether input in a neutral or a zwitterionic form.) Finally, a nonstandard reconnected /r layer can be added, which effectively gives a new InChI generated without breaking bonds to metal atoms. This may contain various sublayers, including /f. [3]

Format and layers

InChI format
Internet media type: chemical/x-inchi
Type of format: chemical file format

Every InChI starts with the string "InChI=" followed by the version number, currently 1. For a standard InChI, the version number is followed by the letter S; standard InChI is a fully standardized InChI flavor maintaining the same level of attention to structure details and the same conventions for drawing perception. The remaining information is structured as a sequence of layers and sub-layers, with each layer providing one specific type of information. The layers and sub-layers are separated by the delimiter "/" and start with a characteristic prefix letter (except for the chemical formula sub-layer of the main layer). The six layers with important sublayers are:

  1. Main layer
    • Chemical formula (no prefix). This is the only sublayer that must occur in every InChI. Numbers used throughout the InChI are given in the formula's element order excluding hydrogen atoms. For example, “/C10H16N5O13P3” implies that atoms numbered 1–10 are carbons, 11–15 are nitrogens, 16–28 are oxygens, and 29–31 are phosphorus. [9]
    • Atom connections (prefix: "c"). The atoms in the chemical formula (except for hydrogens) are numbered in sequence; this sublayer describes which atoms are connected by bonds to which other ones.
    • Hydrogen atoms (prefix: "h"). Describes how many hydrogen atoms are connected to each of the other atoms.
  2. Charge layer
    • charge sublayer (prefix: "q")
    • proton sublayer (prefix: "p" for "protons")
  3. Stereochemical layer
    • double bonds and cumulenes (prefix: "b")
    • tetrahedral stereochemistry of atoms and allenes (prefixes: "t", "m")
    • type of stereochemistry information (prefix: "s")
  4. Isotopic layer (prefixes: "i", "h", as well as "b", "t", "m", "s" for isotopic stereochemistry)
  5. Fixed-H layer (prefix: "f"); contains some or all of the above types of layers except atom connections; may end with "o" sublayer; never included in standard InChI
  6. Reconnected layer (prefix: "r"); contains the whole InChI of a structure with reconnected metal atoms; never included in standard InChI
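
The delimiter and prefix scheme listed above can be illustrated with a minimal Python sketch. The function name `split_inchi` and the returned dictionary layout are assumptions made for this illustration; they are not part of the InChI software. The sketch only splits the string and does not validate it or group sublayers into their parent layers.

```python
def split_inchi(inchi: str) -> dict:
    """Split an InChI string into version, formula, and prefixed sublayers.

    Hypothetical helper for illustration; it does not validate the InChI.
    """
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    parts = inchi[len("InChI="):].split("/")
    result = {"version": parts[0], "formula": None, "sublayers": []}
    rest = parts[1:]
    # The chemical formula is the only sublayer without a lowercase prefix
    # letter (it may start with a digit for multi-component structures).
    if rest and rest[0] and not rest[0][0].islower():
        result["formula"] = rest[0]
        rest = rest[1:]
    # Keep (prefix, body) pairs in order, because a prefix such as "h"
    # can recur inside later layers (for example the isotopic layer).
    result["sublayers"] = [(seg[0], seg[1:]) for seg in rest if seg]
    return result


ascorbic = ("InChI=1S/C6H8O6/c7-1-2(8)5-3(9)4(10)6(11)12-5"
            "/h2,5,7-10H,1H2/t2-,5+/m0/s1")
layers = split_inchi(ascorbic)
print(layers["version"], layers["formula"])
for prefix, body in layers["sublayers"]:
    print(prefix, body)
```

A version string ending in "S" marks a standard InChI, so `layers["version"].endswith("S")` is enough to distinguish the two flavors in this sketch.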

The delimiter-prefix format has the advantage that a user can easily use a wildcard search to find identifiers that match only in certain layers.
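
As a rough sketch of such a layer-restricted search, the Python snippet below matches every identifier that shares a given leading run of layers, whatever follows. The helper name `layer_pattern` and the small in-memory list are assumptions for illustration; the second entry is simply the ascorbic acid string with its stereochemical layer dropped. A regular expression with escaping is used instead of shell-style wildcards because InChI strings can themselves contain characters such as "*", "?" and "(".

```python
import re


def layer_pattern(inchi_prefix: str) -> re.Pattern:
    """Compile a pattern matching any InChI that begins with the given layers.

    Hypothetical helper for illustration only.
    """
    # re.escape protects characters like "(", "+", "*" and "?" that occur
    # in InChI bodies but are special in regular expressions.
    return re.compile(re.escape(inchi_prefix) + r"(/.*)?$")


identifiers = [
    # L-ascorbic acid, with stereochemical layer
    "InChI=1S/C6H8O6/c7-1-2(8)5-3(9)4(10)6(11)12-5/h2,5,7-10H,1H2/t2-,5+/m0/s1",
    # same main layer, stereochemistry not specified (illustrative)
    "InChI=1S/C6H8O6/c7-1-2(8)5-3(9)4(10)6(11)12-5/h2,5,7-10H,1H2",
    # different compound
    "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
]

# Query on the main layer only: formula, /c and /h sublayers.
query = "InChI=1S/C6H8O6/c7-1-2(8)5-3(9)4(10)6(11)12-5/h2,5,7-10H,1H2"
pattern = layer_pattern(query)
hits = [s for s in identifiers if pattern.match(s)]
print(len(hits))
```

Requiring either the end of the string or a "/" after the query keeps the match on whole layers, so a query never matches a longer sublayer body that merely starts with the same characters.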

Examples
Compound | Standard InChI
Ethanol | InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3
L-ascorbic acid | InChI=1S/C6H8O6/c7-1-2(8)5-3(9)4(10)6(11)12-5/h2,5,7-10H,1H2/t2-,5+/m0/s1
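
The numbering rule described for the chemical formula sublayer can be checked against these examples with a short Python sketch. The function `heavy_atom_numbers` is a hypothetical helper that applies only the simplified rule stated in the text (element order of the formula, hydrogens skipped) to a single-component formula; the actual canonical numbers are assigned by the InChI algorithm itself.

```python
import re


def heavy_atom_numbers(formula: str) -> dict:
    """Map each non-hydrogen element to its (first, last) atom number.

    Assumes a single-component formula such as "C2H6O"; illustration only.
    """
    ranges = {}
    next_number = 1
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        n = int(count) if count else 1
        if symbol == "H":
            continue  # hydrogens are excluded from the numbering
        ranges[symbol] = (next_number, next_number + n - 1)
        next_number += n
    return ranges


print(heavy_atom_numbers("C2H6O"))
print(heavy_atom_numbers("C10H16N5O13P3"))
```

For the first example (formula C2H6O), atoms 1 and 2 are carbons and atom 3 is the oxygen, which is consistent with the connection sublayer "c1-2-3" describing a carbon-carbon-oxygen chain.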

InChIKey

The condensed, 27-character InChIKey is a hashed version of the full InChI (using the SHA-256 algorithm), designed to allow for easy web searches of chemical compounds. [4] The standard InChIKey is the hashed counterpart of standard InChI. Up to 2007, most chemical structures on the Web were represented as GIF files, which are not searchable for chemical content. The full InChI turned out to be too lengthy for easy searching, and therefore the InChIKey was developed. There is a very small but nonzero chance of two different molecules having the same InChIKey; the probability of duplication of the first 14 characters alone has been estimated as one duplication in 75 databases each containing one billion unique structures. With all databases currently having below 50 million structures, such duplication appears unlikely at present. A later study examined the collision rate more extensively and found that the experimental collision rate agrees with theoretical expectations. [10]
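
The quoted estimate can be approximately reproduced with the standard birthday-bound approximation. The sketch below assumes that the first 14-character block encodes 65 bits of the hash; that figure is not stated in this article and is an assumption of the example, as is the function name `collision_probability`.

```python
import math


def collision_probability(n_structures: int, hash_bits: int) -> float:
    """Birthday-bound approximation of at least one collision among n hashes."""
    space = 2 ** hash_bits
    pairs = n_structures * (n_structures - 1) / 2
    return 1.0 - math.exp(-pairs / space)


# Assumption: 65 bits in the first block; one billion unique structures.
p = collision_probability(10**9, 65)
print(f"probability of a first-block collision: {p:.4f}")
print(f"roughly one database in {1 / p:.0f}")
```

Under these assumptions the result is on the order of one collision-containing database in 70 to 80, in line with the figure of 75 given in the text.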

The InChIKey currently consists of three parts separated by hyphens, of 14, 10 and one character(s), respectively, like XXXXXXXXXXXXXX-YYYYYYYYFV-P. The first 14 characters result from a SHA-256 hash of the connectivity information (the main layer and /q sublayer of the charge layer) of the InChI. The second part consists of 8 characters resulting from a hash of the remaining layers of the InChI, a single character indicating the kind of InChIKey (S for standard and N for nonstandard), and a character indicating the version of InChI used (currently A for version 1.) Finally, the single character at the end indicates the protonation of the core parent structure, corresponding to the /p sublayer of the charge layer (N for no protonation, O, P, ... if protons should be added and M, L, ... if they should be removed.) [11] [3]

Example

[Figure: Morphine structure]

The standard InChI for morphine is InChI=1S/C17H19NO3/c1-18-7-6-17-10-3-5-13(20)16(17)21-15-12(19)4-2-9(14(15)17)8-11(10)18/h2-5,10-11,13,16,19-20H,6-8H2,1H3/t10-,11+,13-,16-,17-/m0/s1 and the standard InChIKey for morphine is BQJCRHHNABKAKU-KBQPJGBKSA-N. [12]
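
The three-part layout of the InChIKey described above can be checked against the morphine key with a small Python sketch. The function `parse_inchikey`, the regular expression, and the field names are assumptions for illustration; the sketch only inspects the layout and cannot recover the structure from the hash blocks.

```python
import re

# 14 letters, then 8 letters + flag (S or N) + version letter, then 1 letter.
KEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


def parse_inchikey(key: str) -> dict:
    """Split an InChIKey into the fields described in the text (illustration)."""
    m = KEY_RE.match(key)
    if m is None:
        raise ValueError("string does not have the InChIKey layout")
    connectivity, other, flag, version, proton = m.groups()
    return {
        "connectivity_hash": connectivity,   # main layer and /q sublayer
        "other_layers_hash": other,          # remaining layers
        "standard": flag == "S",             # S = standard, N = nonstandard
        "version_char": version,             # currently "A" for version 1
        # N means no protonation; O, P, ... count added protons and
        # M, L, ... count removed protons, so the offset from "N" is used.
        "proton_offset": ord(proton) - ord("N"),
    }


info = parse_inchikey("BQJCRHHNABKAKU-KBQPJGBKSA-N")
print(info)
```

For morphine the flag is "S" (standard), the version character is "A", and the final "N" gives a proton offset of zero.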

InChI resolvers

As the InChI cannot be reconstructed from the InChIKey, an InChIKey always needs to be linked to the original InChI to get back to the original structure. InChI resolvers act as a lookup service to make these links, and prototype services are available from the National Cancer Institute, the UniChem service at the European Bioinformatics Institute, and PubChem. ChemSpider had a resolver until July 2015, when it was decommissioned. [13]

Name

The format was originally called IChI (IUPAC Chemical Identifier), then renamed in July 2004 to INChI (IUPAC-NIST Chemical Identifier), and renamed again in November 2004 to InChI (IUPAC International Chemical Identifier), a trademark of IUPAC.

Continuing development

Scientific direction of the InChI standard is carried out by the IUPAC Division VIII Subcommittee, and funding of subgroups investigating and defining the expansion of the standard is carried out by both IUPAC and the InChI Trust. The InChI Trust funds the development, testing and documentation of the InChI. Current extensions are being defined to handle polymers and mixtures, Markush structures, reactions [14] and organometallics, and once accepted by the Division VIII Subcommittee will be added to the algorithm.

Software

The InChI Trust has developed software to generate the InChI, InChIKey and other identifiers. The release history of this software follows. [15]

Software and version | Date | License | Comments
InChI v. 1 | April 2005 | |
InChI v. 1.01 | August 2006 | |
InChI v. 1.02beta | Sep. 2007 | LGPL 2.1 | Adds InChIKey functionality.
InChI v. 1.02 | Jan. 2009 | LGPL 2.1 | Changed format for InChIKey. Introduces standard InChI.
InChI v. 1.03 | June 2010 | LGPL 2.1 |
InChI v. 1.03 source code docs | March 2011 | |
InChI v. 1.04 | Sep. 2011 | IUPAC/InChI Trust InChI Licence 1.0 | New license. Support for elements 105-112 added. CML support removed.
InChI v. 1.05 | Jan. 2017 | IUPAC/InChI Trust InChI Licence 1.0 | Support for elements 113-118 added. Experimental polymer support. Experimental large molecule support.
RInChI v. 1.00 | March 2017 | IUPAC/InChI Trust InChI Licence 1.0, and BSD-style | Computes reaction InChIs. [14]
InChI v. 1.06 | Dec. 2020 | IUPAC/InChI Trust InChI Licence 1.0 | Revised polymer support.

Adoption

The InChI has been adopted by many larger and smaller databases, including ChemSpider, ChEMBL, Golm Metabolome Database, OpenPHACTS, and PubChem. [16] However, the adoption is not straightforward, and many databases show a discrepancy between the chemical structures and the InChI they contain, which is a problem for linking databases. [17]

Notes and references

  1. "IUPAC International Chemical Identifier Project Page". IUPAC. Archived from the original on 27 May 2012. Retrieved 5 December 2012.
  2. Heller, S.; McNaught, A.; Stein, S.; Tchekhovskoi, D.; Pletnev, I. (2013). "InChI - the worldwide chemical structure identifier standard". Journal of Cheminformatics. 5 (1): 7. doi:10.1186/1758-2946-5-7. PMC 3599061. PMID 23343401.
  3. Heller, S.R.; McNaught, A.; Pletnev, I.; Stein, S.; Tchekhovskoi, D. (2015). "InChI, the IUPAC International Chemical Identifier". Journal of Cheminformatics. 7: 23. doi:10.1186/s13321-015-0068-4. PMC 4486400. PMID 26136848.
  4. "The IUPAC International Chemical Identifier (InChI)". IUPAC. 5 September 2007. Archived from the original on October 30, 2007. Retrieved 2007-09-18.
  5. E.L. Willighagen (17 September 2011). "InChIKey collision: the DIY copy/pastables". Retrieved 2012-11-06.
  6. Goodman, Jonathan M.; Pletnev, Igor; Thiessen, Paul; Bolton, Evan; Heller, Stephen R. (December 2021). "InChI version 1.06: now more than 99.99% reliable". Journal of Cheminformatics. 13 (1): 40. doi:10.1186/s13321-021-00517-z. PMC 8147039. PMID 34030732.
  7. McNaught, Alan (2006). "The IUPAC International Chemical Identifier: InChI". Chemistry International. Vol. 28, no. 6. IUPAC. Retrieved 2007-09-18.
  8. http://www.inchi-trust.org/download/104/LICENCE.pdf
  9. Heller, Stephen R.; McNaught, Alan; Pletnev, Igor; Stein, Stephen; Tchekhovskoi, Dmitrii (2015). "InChI, the IUPAC International Chemical Identifier". Journal of Cheminformatics. 7: 23. doi:10.1186/s13321-015-0068-4. PMC 4486400. PMID 26136848.
  10. Pletnev, I.; Erin, A.; McNaught, A.; Blinov, K.; Tchekhovskoi, D.; Heller, S. (2012). "InChIKey collision resistance: An experimental testing". Journal of Cheminformatics. 4 (1): 39. doi:10.1186/1758-2946-4-39. PMC 3558395. PMID 23256896.
  11. "Technical FAQ - InChI Trust". inchi-trust.org. Retrieved 8 Jan 2021.
  12. "InChI=1/C17H19NO3/c1-18..." Chemspider. Retrieved 2007-09-18.
  13. InChI Resolver, 27 July 2015
  14. Grethe, Guenter; Blanke, Gerd; Kraut, Hans; Goodman, Jonathan M. (9 May 2018). "International chemical identifier for reactions (RInChI)". Journal of Cheminformatics. 10 (1): 45. doi:10.1186/s13321-018-0277-8. PMC 4015173. PMID 24152584.
  15. Downloads of InChI Software, accessed Jan. 8, 2021.
  16. Warr, W.A. (2015). "Many InChIs and quite some feat". Journal of Computer-Aided Molecular Design. 29 (8): 681–694. Bibcode:2015JCAMD..29..681W. doi:10.1007/s10822-015-9854-3. PMID 26081259. S2CID 31786997.
  17. Akhondi, S. A.; Kors, J. A.; Muresan, S. (2012). "Consistency of systematic chemical identifiers within and between small-molecule databases". Journal of Cheminformatics. 4 (1): 35. doi:10.1186/1758-2946-4-35. PMC 3539895. PMID 23237381.
