Substructure search (SSS) is a method to retrieve from a database only those chemicals matching a pattern of atoms and bonds which a user specifies. It is an application of graph theory, specifically subgraph matching in which the query is a hydrogen-depleted molecular graph. The mathematical foundations for the method were laid in the 1870s, when it was suggested that chemical structure drawings were equivalent to graphs with atoms as vertices and bonds as edges. SSS is now a standard part of cheminformatics and is widely used by pharmaceutical chemists in drug discovery.
There are many commercial systems that provide SSS, typically having a graphical user interface and chemical drawing software. Large publicly-available databases like PubChem and ChemSpider can be searched this way, as can Wikipedia's articles describing individual chemicals.
Substructure search retrieves from a database of chemicals those that contain the pattern of atoms and bonds specified by a user. It is implemented using a specialist type of query language, and in real-world applications the search may be further constrained using logical operators on additional data held in the database. A typical combined query might be "return all carboxylic acids where a sample of >1 g is available". [1] [2] One definition of "substructure" was provided in 2008: "given two chemical structures A and B, if structure A is fully contained in structure B, then A is a substructure of B, while B is a superstructure of A." [3]
molecular graph: The graph with differently labelled (coloured) vertices (chromatic graph) which represent different kinds of atoms and differently labelled (coloured) edges related to different types of bonds. Within the topological electron distribution theory, a complete network of the bond paths for a given nuclear configuration. [4]
In this definition, the word "structure" is not synonymous with "compound". If it were, the structure for ethanol, CH3CH2OH, would not be a substructure of propanol, CH3CH2CH2OH, since the terminal CH3 of ethanol has no counterpart in propanol, where the carbon two atoms away from the OH group is a CH2. Instead, the query structure is, formally, a hydrogen-depleted molecular graph. The search is thus for substances which contain three atoms and two single bonds connected as C–C–O. Propanol is a "hit", as is diethyl ether, with C–C–O–C–C. If a user wished to limit the hits to alcohols, then the query structure would have to be drawn with an "explicit hydrogen", as C–C–O–H, and the ether would no longer match. [1] In mathematical terms, finding substructures is an application of graph theory, specifically subgraph matching. [5]
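The C–C–O example can be reproduced with a small sketch of subgraph matching. This is a naive backtracking search written for illustration, not the algorithm of any particular search system; the function names `make_graph` and `is_substructure` are hypothetical, bond orders are ignored (all bonds in the example are single), and the "explicit hydrogen" case assumes the target stores the hydroxyl hydrogen as an atom of its own.

```python
# Hydrogen-depleted molecular graphs: atoms are element symbols, bonds are index pairs.
# All names are illustrative and not taken from any cheminformatics toolkit.

def make_graph(atoms, bonds):
    """Return (atoms, adjacency), where adjacency maps atom index -> set of neighbours."""
    adjacency = {i: set() for i in range(len(atoms))}
    for a, b in bonds:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return atoms, adjacency


def is_substructure(query, target):
    """True if every query atom and bond can be mapped onto part of the target."""
    q_atoms, q_adj = query
    t_atoms, t_adj = target

    def extend(mapping):
        if len(mapping) == len(q_atoms):
            return True
        q = len(mapping)  # next query atom to place, taken in index order
        for t in range(len(t_atoms)):
            if t in mapping.values() or t_atoms[t] != q_atoms[q]:
                continue
            # Each already-mapped neighbour of q must be bonded to t in the target.
            if all(mapping[n] in t_adj[t] for n in q_adj[q] if n in mapping):
                mapping[q] = t
                if extend(mapping):
                    return True
                del mapping[q]  # backtrack and try another target atom
        return False

    return extend({})


cco = make_graph(["C", "C", "O"], [(0, 1), (1, 2)])
propanol = make_graph(["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])
diethyl_ether = make_graph(["C", "C", "O", "C", "C"], [(0, 1), (1, 2), (2, 3), (3, 4)])

# Query with an explicit hydrogen on oxygen, and propanol with its hydroxyl H stored.
ccoh = make_graph(["C", "C", "O", "H"], [(0, 1), (1, 2), (2, 3)])
propanol_with_oh = make_graph(
    ["C", "C", "C", "O", "H"], [(0, 1), (1, 2), (2, 3), (3, 4)]
)

print(is_substructure(cco, propanol))           # both propanol and the ether match C-C-O
print(is_substructure(cco, diethyl_ether))
print(is_substructure(ccoh, propanol_with_oh))  # only the alcohol matches C-C-O-H
print(is_substructure(ccoh, diethyl_ether))
```

In this sketch the C–C–O query matches both targets, while the C–C–O–H query matches only the alcohol, mirroring the behaviour described above.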
Standard conventions used when chemists draw chemical structures [6] need to be considered when implementing substructure search. Historically, the representation of tautomer [7] forms and stereochemistry [8] has posed difficulties. This can be illustrated using histidine. [9]
The top row shows the standard two-dimensional chemical drawing for (S)-histidine (the natural isomer of this amino acid), its enantiomer (R)-histidine and a drawing which conventionally indicates the racemic mixture of equal amounts of the R and S forms. [10] The bottom row shows the same three compounds with the imidazole ring drawn in its alternative tautomer form. For histidine, it has been experimentally determined by 15N NMR spectroscopy that the 1-H tautomer is preferred over the 3-H form in samples. [11] Choice of representation for storage in a database can influence substructure searches. All six drawings are hits for a propanol substructure C–C–C–O, as shown in red. However, only the top row would, apparently, be a hit for the blue substructure of 1-H imidazole-4-methyl, as this is not fully contained in the other three compounds. In fact, each vertical pair is the same chemical substance: tautomers in general cannot be isolated as separate samples. [7] In modern databases, substances are held in a single canonical form, with checks made for uniqueness. The InChIKey provides one way to do this. [9] (S)-Histidine's standard key is HNDVDQJCIGZPNO-YFKPBYRVSA-N, [12] (R)-histidine's key is HNDVDQJCIGZPNO-RXMQYKEDSA-N [13] and (RS)-histidine's is HNDVDQJCIGZPNO-UHFFFAOYSA-N. [14] The first block of 14 letters is identical for all these substances, as it encodes the molecular graph. [9]
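The point about the shared first block can be checked directly on the three keys quoted above. The following is a minimal sketch that only splits the strings on hyphens; the helper name `connectivity_block` is hypothetical, and no InChI software is involved.

```python
# InChIKeys for the three histidine forms, as quoted in the text.
keys = {
    "(S)-histidine": "HNDVDQJCIGZPNO-YFKPBYRVSA-N",
    "(R)-histidine": "HNDVDQJCIGZPNO-RXMQYKEDSA-N",
    "(RS)-histidine": "HNDVDQJCIGZPNO-UHFFFAOYSA-N",
}


def connectivity_block(inchikey):
    """Return the first hyphen-separated block, which encodes the molecular graph."""
    return inchikey.split("-")[0]


blocks = {connectivity_block(key) for key in keys.values()}

# One distinct first block means all three records share a molecular graph,
# while the full keys still distinguish the stereochemical forms.
print(blocks)
print(len(set(keys.values())))
```

Grouping records by this first block is one simple way a database could treat stereoisomers as the same skeleton for searching while still storing them as distinct substances.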
Most substructure search systems present the user with a graphical user interface with a chemical structure drawing component. Query structures may contain bonding patterns such as "single/aromatic" or "any" to provide flexibility. Similarly, the vertices which in an actual compound would be a specific atom may be replaced with an atom list in the query. Cis–trans isomerism at double bonds is catered for by giving a choice of retrieving only the E form, the Z form, or both. [1] [15]
The algorithms for searching are computationally intensive, often of O(n³) or O(n⁴) time complexity (where n is the number of atoms involved), although the underlying problem is known to be NP-complete. [16] Speedups are achieved using fragment screening as a first step. This pre-computation typically involves creation of bitstrings representing presence or absence of molecular fragments. Target compounds that do not possess the fragments present in the query cannot be hits and are eliminated. [17] [18] Atom-by-atom searching, in which a mapping of the query's atoms and bonds with the target molecule is sought, is usually done with a variant of the Ullmann algorithm. [5] [19]
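The fragment-screening step can be sketched with integer bitstrings. The fragment dictionary and the fragment sets assigned to each compound below are hypothetical and hand-written for illustration; a real system would derive them from the stored structures. Compounds that pass the screen are only candidates and would still need atom-by-atom matching.

```python
# Hypothetical fragment dictionary: one bit position per fragment.
FRAGMENTS = ["C-C", "C-O", "C=O", "C-N", "aromatic ring"]


def fingerprint(fragments_present):
    """Encode a set of fragment names as an integer bitstring."""
    bits = 0
    for position, fragment in enumerate(FRAGMENTS):
        if fragment in fragments_present:
            bits |= 1 << position
    return bits


def passes_screen(query_bits, target_bits):
    """A target can only be a hit if it has every bit that the query has."""
    return (query_bits & target_bits) == query_bits


# Pre-computed once, when compounds are registered (fragment sets are assumed here).
database = {
    "ethanol": fingerprint({"C-C", "C-O"}),
    "acetic acid": fingerprint({"C-C", "C-O", "C=O"}),
    "methylamine": fingerprint({"C-N"}),
    "benzene": fingerprint({"aromatic ring"}),
}

query_bits = fingerprint({"C-C", "C-O"})  # fragments present in a C-C-O query
candidates = sorted(
    name for name, bits in database.items() if passes_screen(query_bits, bits)
)
print(candidates)  # compounds that survive the screen and go on to full matching
```

The screen is a single bitwise AND per compound, which is why it is far cheaper than graph matching and is applied first.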
As of 2024, substructure search is a standard feature in chemical databases accessible via the web. Large databases such as PubChem, [20] [15] maintained by the National Center for Biotechnology Information, and ChemSpider, [21] maintained by the Royal Society of Chemistry, have graphical interfaces for search. The Chemical Abstracts Service, a division of the American Chemical Society, provides tools to search the chemical literature, and Reaxys, supplied by Elsevier, covers both chemicals and reaction information, including that originally held in the Beilstein database. [22] PATENTSCOPE, maintained by the World Intellectual Property Organization, makes chemical patents accessible by substructure [23] and Wikipedia's articles describing individual chemicals can also be searched that way. [24]
Suppliers of chemicals as synthesis intermediates or for high-throughput screening routinely provide search interfaces. Currently, the largest database that can be freely searched by the public is the ZINC database, which is claimed to contain over 37 billion commercially available molecules. [25] [26]
The idea that chemical structures as depicted using drawings of the type introduced by Kekulé were related to what is now called graph theory was suggested by the mathematician J. J. Sylvester in 1878. He was the first to use the word "graph" in the sense of a network. [27] [28] Arthur Cayley had already, in 1874, considered how to enumerate chemical isomers, in what was an early approach to molecular graphs, where atoms are at vertices and bonds correspond to edges. [29] [30]
structural formula: A formula which gives information about the way the atoms in a molecule are connected and arranged in space. [31]
In the 20th century, chemists developed standard ways to show structural formulae, especially for individual organic compounds that were increasingly being synthesized and tested as potential drugs or agrochemicals. [32] [6] By the 1950s, as the number of compounds made and tested grew, the first attempts to create chemical databases were made and the sub-discipline of cheminformatics was established. [33] As stated in 2012, "searching for substructures in molecules belongs to the most elementary tasks in cheminformatics and is nowadays part of virtually every cheminformatics software". [34]
The first suggested use for substructure search was in 1957, to reduce the workload of patent examiners. They have to search published literature to decide whether an invention is novel, which for chemical patents often means finding known examples within the generic claims of a Markush structure. [35] [33] Before this could become a reality, a number of developments were required. Importantly, the existing literature had to be made searchable, and a way to input a chemical structure query and return the matching results had to be devised. These requirements had been partially met as early as 1881, when Friedrich Konrad Beilstein introduced the Handbuch der organischen Chemie (Handbook of Organic Chemistry), which classified known chemicals systematically so that all examples containing a given heterocycle would be located together. [36] [37]
In 1907, the American Chemical Society set up the Chemical Abstracts Service (CAS). This weekly subscription service included a printed publication with summaries of articles in thousands of scholarly journals and claims in worldwide patents. This had a chemical substance index that, in principle, allowed searching by chemical name or formula. [38] However, it was only when the CAS records had been fully converted into machine-readable form and the internet was available to connect its database to end-users that comprehensive searching became possible. CAS provided various specialist search services from the 1980s but it was not until 2008 that its "SciFinder" system became available via the web. [39]
By the 1960s, companies synthesizing and testing new chemicals made significant progress in creating in-house databases. Imperial Chemical Industries stored chemical structures encoded as text strings, using Wiswesser line notation. Its associated CROSSBOW software allowed substructure search using key-based searches followed by more processor-intensive atom-by-atom search. [40] [41] It was recognised that research chemists wanted not only to search company collections for existing inventory but also to search third-party databases supplied by vendors of small-molecule intermediates. The latter application evolved as a collaboration involving six companies with pharmaceutical interests and their commercial suppliers. [42] [9]
By the 1980s, other line notations were used for commercially-available substructure search systems. SMILES encoding, together with its SMARTS query language, [43] and SYBYL line notation [9] [44] are examples. [45] A comprehensive survey of then-available chemical information systems was produced for NASA in 1985. [46]
The need to combine chemistry search with biological data produced by screening compounds at ever-larger scales led to implementation of systems such as MACCS. [46] : 73–77 [47] This commercial system from MDL Information Systems made use of an algorithm specifically designed for storage and search within groups of chemicals that differed only in their stereochemistry. [48] A review of the many systems available by the mid-1980s pointed out that "most in-house developed systems have been replaced with commercially available standardised software for managing chemical structure databases." [49] The MDL Molfile is now an open file format for storing single-molecule data in the form of a connection table. [50] [9]
By the 2000s, personal computers had become powerful enough that storage and search of chemistry within office software such as Microsoft Excel was possible. [51]
Subsequent developments involved the use of new techniques to allow efficient searches over very large databases and, importantly, the use of a standardised International Chemical Identifier, a type of line notation, to uniquely define a chemical substance. [9] [25] [52] [53]
Every invariant and covariant thus becomes expressible by a graph precisely identical with a Kekuléan diagram or chemicograph.