|Type of format||chemical file format|
The simplified molecular-input line-entry system (SMILES) is a specification in the form of a line notation for describing the structure of chemical species using short ASCII strings. SMILES strings can be imported by most molecule editors for conversion back into two-dimensional drawings or three-dimensional models of the molecules.
The original SMILES specification was initiated in the 1980s. It has since been modified and extended. In 2007, an open standard called OpenSMILES was developed in the open-source chemistry community.
The original SMILES specification was initiated by David Weininger at the USEPA Mid-Continent Ecology Division Laboratory in Duluth in the 1980s. Acknowledged for their parts in the early development were "Gilman Veith and Rose Russo (USEPA) and Albert Leo and Corwin Hansch (Pomona College) for supporting the work, and Arthur Weininger (Pomona; Daylight CIS) and Jeremy Scofield (Cedar River Software, Renton, WA) for assistance in programming the system." The Environmental Protection Agency funded the initial project to develop SMILES.
It has since been modified and extended by others, most notably by Daylight Chemical Information Systems. In 2007, an open standard called "OpenSMILES" was developed by the Blue Obelisk open-source chemistry community. Other 'linear' notations include the Wiswesser Line Notation (WLN), ROSDAL and SLN (Tripos Inc).
In July 2006, the IUPAC introduced the InChI as a standard for formula representation. SMILES is generally considered to have the advantage of being more human-readable than InChI; it also has a wide base of software support with extensive theoretical backing (such as graph theory).
The term SMILES refers to a line notation for encoding molecular structures and specific instances should strictly be called SMILES strings. However, the term SMILES is also commonly used to refer to both a single SMILES string and a number of SMILES strings; the exact meaning is usually apparent from the context. The terms "canonical" and "isomeric" can lead to some confusion when applied to SMILES. The terms describe different attributes of SMILES strings and are not mutually exclusive.
Typically, a number of equally valid SMILES strings can be written for a molecule. For example, CCO, OCC and C(O)C all specify the structure of ethanol. Algorithms have been developed to generate the same SMILES string for a given molecule; of the many possible strings, these algorithms choose only one. This SMILES is unique for each structure, although dependent on the canonicalization algorithm used to generate it, and is termed the canonical SMILES. These algorithms first convert the SMILES to an internal representation of the molecular structure; an algorithm then examines that structure and produces a unique SMILES string. Various algorithms for generating canonical SMILES have been developed and include those by Daylight Chemical Information Systems, OpenEye Scientific Software, MEDIT, Chemical Computing Group, MolSoft LLC, and the Chemistry Development Kit. A common application of canonical SMILES is indexing and ensuring uniqueness of molecules in a database.
The original paper that described the CANGEN algorithm claimed to generate unique SMILES strings for graphs representing molecules, but the algorithm fails for a number of simple cases (e.g. cuneane, 1,2-dicyclopropylethane) and cannot be considered a correct method for representing a graph canonically. There is currently no systematic comparison across commercial software to test if such flaws exist in those packages.
SMILES notation allows the specification of configuration at tetrahedral centers, and double bond geometry. These are structural features that cannot be specified by connectivity alone, and therefore SMILES which encode this information are termed isomeric SMILES. A notable feature of these rules is that they allow rigorous partial specification of chirality. The term isomeric SMILES is also applied to SMILES in which isomers are specified.
In terms of a graph-based computational procedure, SMILES is a string obtained by printing the symbol nodes encountered in a depth-first tree traversal of a chemical graph. The chemical graph is first trimmed to remove hydrogen atoms and cycles are broken to turn it into a spanning tree. Where cycles have been broken, numeric suffix labels are included to indicate the connected nodes. Parentheses are used to indicate points of branching on the tree.
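The traversal described above can be sketched in Python. This is a minimal illustration rather than any published algorithm: the function name write_smiles and the atoms/bonds input format are assumptions made for this example, hydrogens are taken to be already removed, every bond is treated as single, only the connected component containing the start atom is written, and no canonical ordering is attempted.

```python
def write_smiles(atoms, bonds, start=0):
    """Write a SMILES-like string for a hydrogen-suppressed graph.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) index pairs, all treated as single bonds
    start: index of the atom where the depth-first traversal begins
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    visited = set()
    seen_bonds = set()
    children = {i: [] for i in neighbours}   # spanning-tree edges
    labels = {i: [] for i in neighbours}     # ring-closure labels per atom
    ring_count = 0

    def explore(atom):
        # Pass 1: depth-first search that separates tree bonds from
        # the bonds that must be broken to open each cycle.
        nonlocal ring_count
        visited.add(atom)
        for other in neighbours[atom]:
            bond = frozenset((atom, other))
            if bond in seen_bonds:
                continue
            seen_bonds.add(bond)
            if other in visited:
                ring_count += 1
                label = str(ring_count) if ring_count < 10 else f"%{ring_count}"
                labels[other].append(label)
                labels[atom].append(label)
            else:
                children[atom].append(other)
                explore(other)

    def emit(atom):
        # Pass 2: print the tree; all but the last child go in parentheses.
        text = atoms[atom] + "".join(labels[atom])
        branches = children[atom]
        for child in branches[:-1]:
            text += "(" + emit(child) + ")"
        if branches:
            text += emit(branches[-1])
        return text

    explore(start)
    return emit(start)


ethanol_atoms = ["C", "C", "O"]
ethanol_bonds = [(0, 1), (1, 2)]
print(write_smiles(ethanol_atoms, ethanol_bonds, start=0))  # CCO
print(write_smiles(ethanol_atoms, ethanol_bonds, start=2))  # OCC
print(write_smiles(ethanol_atoms, ethanol_bonds, start=1))  # C(C)O
```

The three calls write the same ethanol graph from three different starting atoms and give three different strings, which is why a separate canonicalization step is needed when a single string per molecule is required.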
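The resultant SMILES form depends on the choices made during this procedure: which bonds are broken to open the cycles, which atom the depth-first traversal starts from, and the order in which branches are listed when encountered.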
From the viewpoint of formal language theory, SMILES is a word. A SMILES is parsable with a context-free parser. This representation has been used in the prediction of biochemical properties (including toxicity and biodegradability), based on the main principle of chemoinformatics that similar molecules have similar properties. The predictive models implemented a syntactic pattern recognition approach (which involved defining a molecular distance) as well as a more robust scheme based on statistical pattern recognition.
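Atoms are represented by the standard abbreviation of the chemical elements, in square brackets, such as [Au] for gold. Brackets may be omitted in the common case of atoms which are in the "organic subset" of B, C, N, O, P, S, F, Cl, Br, or I, have no formal charge, have the number of attached hydrogens implied by the SMILES valence model, are the normal isotopes, and are not chiral centers.
All other elements must be enclosed in brackets, and have charges and hydrogens shown explicitly. For instance, the SMILES for water may be written as either O or [OH2]. Hydrogen may also be written as a separate atom; water may also be written as [H]O[H].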
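Because an unbracketed atom such as O carries no explicit hydrogen count, software has to infer it from the bonds already present. The sketch below assumes the normal valences commonly listed for the organic subset (as in the OpenSMILES specification); the table and the function name implicit_hydrogens are illustrative choices, and aromatic atoms are not handled.

```python
# Assumed normal valences for unbracketed ("organic subset") atoms.
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5),
    "S": (2, 4, 6), "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}


def implicit_hydrogens(symbol, bond_order_sum):
    """Hydrogens implied for an unbracketed aliphatic atom.

    bond_order_sum is the total order of the bonds written explicitly
    (single = 1, double = 2, triple = 3).
    """
    for valence in NORMAL_VALENCES[symbol]:
        # Fill up to the lowest normal valence that can hold the bonds.
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    return 0  # already at or above the highest normal valence


print(implicit_hydrogens("O", 0))  # 2, so the SMILES O is water
print(implicit_hydrogens("C", 1))  # 3, a methyl carbon as in CCO
```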
When brackets are used, the symbol H is added if the atom in brackets is bonded to one or more hydrogen atoms, followed by the number of hydrogen atoms if greater than 1, then by the sign + for a positive charge or by - for a negative charge. For example, [NH4+] represents ammonium (NH4+). If there is more than one charge, it is normally written as a digit after the sign; however, it is also possible to repeat the sign as many times as the ion has charges: one may write either [Ti+4] or [Ti++++] for titanium(IV), Ti4+. Thus, the hydroxide anion (OH−) is represented by [OH-], the hydronium cation (H3O+) is [OH3+], and the cobalt(III) cation (Co3+) is either [Co+3] or [Co+++].
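The bracket-atom layout described here (symbol, optional hydrogen count, optional charge) is regular enough to be read with a single regular expression. The following is a sketch under stated assumptions: the name parse_bracket_atom and the returned dictionary keys are invented for illustration, the optional leading isotope number and the @ or @@ chirality mark described elsewhere in this article are accepted, and other bracket features such as atom classes are not.

```python
import re

BRACKET_ATOM = re.compile(
    r"\[(?P<isotope>\d+)?"                    # optional isotope mass
    r"(?P<symbol>[A-Z][a-z]?|[a-z]{1,2})"     # element (lower case = aromatic)
    r"(?P<chirality>@{1,2})?"                 # optional @ or @@
    r"(?:H(?P<hcount>\d*))?"                  # optional H, H2, H3, ...
    r"(?P<charge>\+\d+|-\d+|\++|-+)?\]"       # +, -, +3, +++, ...
)


def parse_bracket_atom(text):
    """Split one bracket atom such as '[NH4+]' into its fields."""
    match = BRACKET_ATOM.fullmatch(text)
    if match is None:
        raise ValueError(f"not a supported bracket atom: {text!r}")

    hcount = match.group("hcount")
    if hcount is None:
        hydrogens = 0            # no H written at all
    elif hcount == "":
        hydrogens = 1            # bare H means one hydrogen
    else:
        hydrogens = int(hcount)

    charge_text = match.group("charge") or ""
    if charge_text[1:].isdigit():
        charge = int(charge_text)                              # "+4", "-2"
    else:
        charge = charge_text.count("+") - charge_text.count("-")  # "++++"

    isotope = match.group("isotope")
    return {
        "isotope": int(isotope) if isotope else None,
        "symbol": match.group("symbol"),
        "chirality": match.group("chirality"),
        "hydrogens": hydrogens,
        "charge": charge,
    }


print(parse_bracket_atom("[NH4+]"))
print(parse_bracket_atom("[Ti++++]")["charge"] == parse_bracket_atom("[Ti+4]")["charge"])
```

Both ways of writing a multiple charge reduce to the same integer, which is the property a SMILES reader needs.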
A bond is represented using one of the symbols . - = # $ : / \.
Bonds between aliphatic atoms are assumed to be single unless specified otherwise and are implied by adjacency in the SMILES string. Although single bonds may be written as -, this is usually omitted. For example, the SMILES for ethanol may be written as C-C-O, CC-O or C-CO, but is usually written CCO.
Double, triple, and quadruple bonds are represented by the symbols =, # and $ respectively, as illustrated by the SMILES O=C=O (carbon dioxide, CO2), C#N (hydrogen cyanide, HCN) and [Ga+]$[As-] (gallium arsenide).
An additional type of bond is a "non-bond", written as a period (.), which shows that two parts are not bonded together. For example, aqueous sodium chloride may be written as [Na+].[Cl-] to show the dissociation.
An aromatic "one and a half" bond may be indicated with :; see § Aromaticity below.
Single bonds adjacent to double bonds may be represented using / or \ to indicate stereochemical configuration; see § Stereochemistry below.
Ring structures are written by breaking each ring at an arbitrary point (although some choices will lead to a more legible SMILES than others) to make an acyclic structure and adding numerical ring closure labels to show connectivity between non-adjacent atoms.
For example, cyclohexane and dioxane may be written as C1CCCCC1 and O1CCOCC1 respectively. For a second ring, the label will be 2. For example, decalin (decahydronaphthalene) may be written as C1CCCC2C1CCCC2.
SMILES does not require that ring numbers be used in any particular order, and permits ring number zero, although this is rarely used. Also, it is permitted to reuse ring numbers after the first ring has closed, although this usually makes formulae harder to read. For example, bicyclohexyl is usually written as C1CCCCC1C2CCCCC2, but it may also be written as C0CCCCC0C0CCCCC0.
Multiple digits after a single atom indicate multiple ring-closing bonds. For example, an alternative SMILES notation for decalin is C1CCCC2CCCCC12, where the final carbon participates in both ring-closing bonds 1 and 2. If two-digit ring numbers are required, the label is preceded by %, so C%12 is a single ring-closing bond of ring 12.
Either or both of the digits may be preceded by a bond type to indicate the type of the ring-closing bond. For example, cyclopropene is usually written C1=CC1, but if the double bond is chosen as the ring-closing bond, it may be written as C=1CC1, C1CC=1, or C=1CC=1. (The first form is preferred.) C=1CC-1 is illegal, as it explicitly specifies conflicting types for the ring-closing bond.
Ring-closing bonds may not be used to denote multiple bonds. For example, C1C1 is not a valid alternative to C=C for ethylene. However, they may be used with non-bonds; C1.C2.C12 is a peculiar but legal alternative way to write propane, more commonly written CCC.
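The ring-closure rules above (label pairing, label reuse, the % prefix, and optional bond symbols on either digit) can be illustrated with a small scanner. This is a sketch only: ring_closures and its token pattern are hypothetical names written for this example, only unbracketed organic-subset atoms and bracket atoms are recognized, and the check that the two atoms are not already bonded (the C1C1 case) is deliberately left out.

```python
import re

TOKEN = re.compile(
    r"\[[^\]]*\]|Br|Cl|[BCNOPSFI]|[bcnops]|%\d\d|\d|[-=#$:/\\.()]"
)
BOND_SYMBOLS = set("-=#$:/\\.")


def ring_closures(smiles):
    """Return (first_atom, second_atom, bond_symbol) for each ring closure.

    Atoms are numbered from 0 in the order they appear; bond_symbol is ""
    when neither end of the closure carries an explicit bond symbol.
    """
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported characters in {smiles!r}")

    open_rings = {}   # label -> (atom index, bond symbol at opening)
    closures = []
    atom = -1
    bond = ""
    for token in tokens:
        if token[0].isdigit() or token[0] == "%":
            if atom < 0:
                raise ValueError("ring label before any atom")
            label = int(token.lstrip("%"))
            if label in open_rings:
                first_atom, first_bond = open_rings.pop(label)
                if first_bond and bond and first_bond != bond:
                    raise ValueError(f"conflicting bond types for ring {label}")
                closures.append((first_atom, atom, first_bond or bond))
            else:
                open_rings[label] = (atom, bond)
            bond = ""
        elif token in BOND_SYMBOLS:
            bond = token
        elif token in ("(", ")"):
            continue
        else:
            atom += 1     # any remaining token is an atom
            bond = ""
    if open_rings:
        raise ValueError(f"unclosed ring labels: {sorted(open_rings)}")
    return closures


print(ring_closures("C1CCCC2C1CCCC2"))    # decalin, two labels
print(ring_closures("C1CCCC2CCCCC12"))    # decalin, both closures on one atom
print(ring_closures("C0CCCCC0C0CCCCC0"))  # bicyclohexyl with label 0 reused
```

Popping a label from open_rings as soon as it closes is what allows the same digit to be reused later in the string.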
Choosing a ring-break point adjacent to attached groups can lead to a simpler SMILES form by avoiding branches. For example, cyclohexane-1,2-diol is most simply written as OC1CCCCC1O; choosing a different ring-break location produces a branched structure that requires parentheses to write.
Aromatic rings such as benzene may be written in one of three forms: in Kekulé form with alternating single and double bonds, e.g. C1=CC=CC=C1; using the aromatic bond symbol :, e.g. C1:C:C:C:C:C1; or, most commonly, by writing the constituent B, C, N, O, P and S atoms in the lower-case forms b, c, n, o, p and s, respectively.
In the latter case, bonds between two aromatic atoms are assumed (if not explicitly shown) to be aromatic bonds. Thus, benzene, pyridine and furan can be represented respectively by the SMILES c1ccccc1, n1ccccc1 and o1cccc1.
Aromatic nitrogen bonded to hydrogen, as found in pyrrole, must be represented as [nH]; thus imidazole is written in SMILES notation as n1c[nH]cc1.
When aromatic atoms are singly bonded to each other, such as in biphenyl, a single bond must be shown explicitly: c1ccccc1-c2ccccc2. This is one of the few cases where the single bond symbol - is required. (In fact, most SMILES software can correctly infer that the bond between the two rings cannot be aromatic and so will accept the nonstandard form c1ccccc1c2ccccc2.)
The Daylight and OpenEye algorithms for generating canonical SMILES differ in their treatment of aromaticity.
Branches are described with parentheses, as in CCC(=O)O for propionic acid and FC(F)F for fluoroform. The first atom within the parentheses, and the first atom after the parenthesized group, are both bonded to the same branch point atom. The bond symbol must appear inside the parentheses; placing it outside (e.g. CCC=(O)O) is invalid.
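The branch rule lends itself to a stack: an opening parenthesis remembers the current branch point, and the matching closing parenthesis returns to it. The sketch below is an assumption-laden illustration, not a full SMILES reader: parse_branches is a hypothetical helper that handles only unbracketed aliphatic organic-subset atoms and the bond symbols - = # $, with no rings, aromatic atoms or bracket atoms.

```python
import re

TOKEN = re.compile(r"Br|Cl|[BCNOPSFI]|[-=#$]|[()]")


def parse_branches(smiles):
    """Return (atoms, bonds) for an acyclic SMILES with branches.

    bonds is a list of (atom_index, atom_index, bond_symbol); an
    unwritten bond is reported as the single-bond symbol "-".
    """
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported characters in {smiles!r}")

    atoms, bonds = [], []
    branch_points = []   # stack of atoms waiting for their ")"
    previous = None      # atom the next atom will bond to
    pending = None       # explicit bond symbol not yet used
    for token in tokens:
        if token == "(":
            if pending is not None:
                raise ValueError("bond symbol must be inside the parentheses")
            if previous is None:
                raise ValueError("branch without a branch point")
            branch_points.append(previous)
        elif token == ")":
            if pending is not None or not branch_points:
                raise ValueError("malformed branch")
            previous = branch_points.pop()
        elif token in "-=#$":
            pending = token
        else:
            atoms.append(token)
            index = len(atoms) - 1
            if previous is not None:
                bonds.append((previous, index, pending or "-"))
            previous = index
            pending = None
    if branch_points:
        raise ValueError("unbalanced '('")
    return atoms, bonds


print(parse_branches("CCC(=O)O"))  # propionic acid
print(parse_branches("FC(F)F"))    # fluoroform
```

In CCC(=O)O both oxygens end up bonded to the third carbon, which is the behaviour the paragraph above describes, and CCC=(O)O is rejected.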
Substituted rings can be written with the branching point in the ring, as illustrated by the SMILES COc(c1)cccc1C#N and COc(cc1)ccc1C#N, which encode the 3- and 4-cyanoanisole isomers. Writing SMILES for substituted rings in this way can make them more human-readable.
Branches may be written in any order. For example, bromochlorodifluoromethane may be written as FC(Br)(Cl)F, BrC(F)(F)Cl, C(F)(Cl)(F)Br, or the like. Generally, a SMILES form is easiest to read if the simpler branch comes first, with the final, unparenthesized portion being the most complex. The only caveats to such rearrangements are that reused ring numbers are paired according to their order of appearance in the SMILES string, so some adjustment may be needed to preserve the correct pairing, and that any specified stereochemistry must be adjusted to match; see § Stereochemistry below.
The one form of branch which does not require parentheses is the ring-closing bond. Choosing ring-closing bonds appropriately can reduce the number of parentheses required. For example, toluene is normally written as Cc1ccccc1 or c1ccccc1C, avoiding the parentheses required if written as c1cc(C)ccc1 or c1cc(ccc1)C.
SMILES permits, but does not require, specification of stereoisomers.
Configuration around double bonds is specified using the characters / and \ to show directional single bonds adjacent to a double bond. For example, F/C=C/F is one representation of trans-1,2-difluoroethylene, in which the fluorine atoms are on opposite sides of the double bond, whereas F/C=C\F is one possible representation of cis-1,2-difluoroethylene, in which the fluorines are on the same side of the double bond.
Bond direction symbols always come in groups of at least two, of which the first is arbitrary. That is, F\C=C\F is the same as F/C=C/F. When alternating single-double bonds are present, the groups are larger than two, with the middle directional symbols being adjacent to two double bonds. For example, the common form of (2,4)-hexadiene is written C/C=C/C=C/C.
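Since the first direction symbol is arbitrary, only the relation between the two symbols around a double bond carries information. A minimal sketch of that idea follows; the function double_bond_geometry is a hypothetical helper that accepts only the simple linear pattern X/C=C/Y, with the left substituent written before the first carbon, and is not a general stereochemistry reader.

```python
import re

# substituent, direction, C=C, direction, substituent
LINEAR_ALKENE = re.compile(r"[A-Za-z]+([/\\])C=C([/\\])[A-Za-z]+")


def double_bond_geometry(smiles):
    """Classify a linear X?C=C?Y SMILES as 'cis' or 'trans'."""
    match = LINEAR_ALKENE.fullmatch(smiles)
    if match is None:
        raise ValueError(f"not a simple X/C=C/Y pattern: {smiles!r}")
    first, second = match.groups()
    # Same symbol on both sides: substituents on opposite sides (trans).
    return "trans" if first == second else "cis"


print(double_bond_geometry("F/C=C/F"))     # trans
print(double_bond_geometry(r"F\C=C\F"))    # trans, same molecule
print(double_bond_geometry(r"F/C=C\F"))    # cis
```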
As a more complex example, beta-carotene has a very long backbone of alternating single and double bonds, which may be written
Configuration at tetrahedral carbon is specified by @ or @@. Consider the four bonds in the order in which they appear, left to right, in the SMILES form. Looking toward the central carbon from the perspective of the first bond, the other three are either clockwise or counter-clockwise. These cases are indicated with @@ and @, respectively (because the @ symbol itself is a counter-clockwise spiral).
For example, consider the amino acid alanine. One of its SMILES forms is NC(C)C(=O)O, more fully written as N[CH](C)C(=O)O. L-Alanine, the more common enantiomer, is written as N[C@@H](C)C(=O)O. Looking from the nitrogen–carbon bond, the hydrogen (H), methyl (C), and carboxylate (C(=O)O) groups appear clockwise. D-Alanine can be written as N[C@H](C)C(=O)O.
While the order in which branches are specified in SMILES is normally unimportant, in this case it matters; swapping any two groups requires reversing the chirality indicator. If the branches are reversed so alanine is written as NC(C(=O)O)C, then the configuration also reverses; L-alanine is written as N[C@H](C(=O)O)C. Other ways of writing it include C[C@H](N)C(=O)O, OC(=O)[C@@H](N)C and OC(=O)[C@H](C)N.
Normally, the first of the four bonds appears to the left of the carbon atom, but if the SMILES is written beginning with the chiral carbon, such as C(C)(N)C(=O)O, then all four are to the right, but the first to appear (the [CH] bond in this case) is used as the reference to order the following three: L-alanine may also be written [C@@H](C)(N)C(=O)O.
The SMILES specification includes elaborations on the @ symbol to indicate stereochemistry around more complex chiral centers, such as trigonal bipyramidal molecular geometry.
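For tetrahedral centers, the rule that swapping any two groups reverses the chirality indicator means that the indicator for a reordered SMILES depends only on the parity of the reordering. The sketch below makes that explicit; reorder_chirality and the neighbour labels are illustrative names, and the implicit hydrogen of [C@@H] is listed in the position where the H is written, immediately after any preceding atom.

```python
def reorder_chirality(symbol, old_order, new_order):
    """Return '@' or '@@' for the same center with neighbours reordered.

    symbol: '@' or '@@' as written for old_order
    old_order, new_order: the same four neighbour labels in two orders
    """
    if symbol not in ("@", "@@"):
        raise ValueError("symbol must be '@' or '@@'")
    if sorted(old_order) != sorted(new_order) or len(set(old_order)) != len(old_order):
        raise ValueError("orders must list the same distinct neighbours")

    # Parity of the permutation = parity of its number of inversions.
    positions = [old_order.index(label) for label in new_order]
    inversions = sum(
        1
        for a in range(len(positions))
        for b in range(a + 1, len(positions))
        if positions[a] > positions[b]
    )
    if inversions % 2 == 0:
        return symbol                       # even: indicator unchanged
    return "@" if symbol == "@@" else "@@"  # odd: indicator reversed


# L-alanine as N[C@@H](C)C(=O)O: neighbours in written order.
l_alanine = ["N", "H", "CH3", "COOH"]
print(reorder_chirality("@@", l_alanine, ["N", "H", "COOH", "CH3"]))  # @
print(reorder_chirality("@@", l_alanine, ["H", "CH3", "N", "COOH"]))  # @@
```

The two results correspond to the forms N[C@H](C(=O)O)C and [C@@H](C)(N)C(=O)O given above for L-alanine.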
Isotopes are specified with a number equal to the integer isotopic mass preceding the atomic symbol. Benzene in which one atom is carbon-14 is written as [14c]1ccccc1 and deuterochloroform is [2H]C(Cl)(Cl)Cl.
|Molecule|Formula or structure|
|Methyl isocyanate (MIC)|CH3−N=C=O|
|Pyrethrin II|C22H28O5|
|Aflatoxin B1|C17H12O6|
|Glucose (β-D-glucopyranose)|C6H12O6|
|Bergenin (cuscutin, a resin)|C14H16O9|
|A pheromone of the Californian scale insect||
|(2S,5R)-Chalcogran: a pheromone of the bark beetle Pityogenes chalcographus||
|Thiamine (vitamin B1)|C12H17N4OS+|
To illustrate a molecule with more than 9 rings, consider cephalostatin-1, a steroidal 13-ringed pyrazine with the empirical formula C54H74N2O10, isolated from the Indian Ocean hemichordate Cephalodiscus gilchristi:
Starting with the left-most methyl group in the figure:
% appears in front of the index of ring closure labels above 9; see § Rings above.
The SMILES notation is described extensively in the SMILES theory manual provided by Daylight Chemical Information Systems and a number of illustrative examples are presented. Daylight's depict utility provides users with the means to check their own examples of SMILES and is a valuable educational tool.
SMARTS is a line notation for specification of substructural patterns in molecules. While it uses many of the same symbols as SMILES, it also allows specification of wildcard atoms and bonds, which can be used to define substructural queries for chemical database searching. One common misconception is that SMARTS-based substructural searching involves matching of SMILES and SMARTS strings. In fact, both SMILES and SMARTS strings are first converted to internal graph representations which are searched for subgraph isomorphism.
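The point that substructure search operates on graphs rather than on strings can be shown with a brute-force subgraph match. This is only a sketch: has_substructure is a hypothetical function, molecules are given directly as atom lists and bond pairs rather than parsed from SMILES or SMARTS, bond orders are ignored, and * is used as the wildcard atom. Real toolkits use far more efficient matching algorithms.

```python
from itertools import permutations


def has_substructure(mol_atoms, mol_bonds, query_atoms, query_bonds):
    """True if the query graph can be embedded in the molecule graph."""
    mol_edges = {frozenset(bond) for bond in mol_bonds}
    # Try every assignment of distinct molecule atoms to query atoms.
    for mapping in permutations(range(len(mol_atoms)), len(query_atoms)):
        atoms_ok = all(
            q == "*" or q == mol_atoms[m]
            for q, m in zip(query_atoms, mapping)
        )
        if not atoms_ok:
            continue
        if all(frozenset((mapping[i], mapping[j])) in mol_edges
               for i, j in query_bonds):
            return True
    return False


ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
print(has_substructure(*ethanol, ["C", "O"], [(0, 1)]))   # True
print(has_substructure(*ethanol, ["*", "O"], [(0, 1)]))   # True
print(has_substructure(*ethanol, ["O", "O"], [(0, 1)]))   # False
```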
SMIRKS, a superset of "reaction SMILES" and a subset of "reaction SMARTS", is a line notation for specifying reaction transforms. The general syntax for the reaction extensions is REACTANT>AGENT>PRODUCT (without spaces), where any of the fields can either be left blank or filled with multiple molecules delimited with a dot (.), and other descriptions dependent on the base language. Atoms can additionally be identified with a number (e.g. [C:1]) for mapping.
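The REACTANT>AGENT>PRODUCT layout can be separated with plain string operations, because neither > nor the dot separator occurs inside an atom. The helper below, split_reaction, is a hypothetical name used for illustration, and the esterification string is an example chosen here, not one taken from any specification.

```python
def split_reaction(reaction):
    """Split a reaction SMILES into lists of molecule SMILES."""
    fields = reaction.split(">")
    if len(fields) != 3:
        raise ValueError("expected exactly two '>' separators")
    # A blank field yields an empty list of molecules.
    reactants, agents, products = (
        [molecule for molecule in field.split(".") if molecule]
        for field in fields
    )
    return {"reactants": reactants, "agents": agents, "products": products}


print(split_reaction("CC(=O)O.OCC>[H+]>CC(=O)OCC.O"))
print(split_reaction("C=C.[H][H]>>CC"))  # agent field left blank
```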
SMILES corresponds to discrete molecular structures. However, many materials are macromolecules, which are too large (and often stochastic) for SMILES to be generated conveniently. BigSMILES is an extension of SMILES that aims to provide an efficient representation system for macromolecules.
SMILES can be converted back to two-dimensional representations using structure diagram generation (SDG) algorithms. This conversion is not always unambiguous. Conversion to three-dimensional representation is achieved by energy-minimization approaches. There are many downloadable and web-based conversion utilities.