SMILES arbitrary target specification

SMILES arbitrary target specification (SMARTS) is a language for specifying substructural patterns in molecules. The SMARTS line notation is expressive and allows extremely precise and transparent substructural specification and atom typing.

SMARTS is related to the SMILES line notation that is used to encode molecular structures and, like SMILES, was originally developed by David Weininger and colleagues at the Pomona College Medicinal Chemistry Project (MedChem). A SMARTS search engine named GENIE was used as an additional user-specified search filter in the MedChem database searching tool MERLIN. GENIE was also used in the MedChem interpreted language GCL (GENIE Control Language), whose input was a list of structures. In GCL, a SMARTS specification could serve as an expression in control flow statements. For example, "for (SMARTS) {...}" would loop over each substructure of the currently examined structure that matched the SMARTS specification. SMARTS was developed further at Daylight Chemical Information Systems, Inc., a private company that was spun out of the software side of MedChem.

The most comprehensive descriptions of the SMARTS language can be found in Daylight's SMARTS theory manual, [1] tutorial [2] and examples. [3] OpenEye Scientific Software has developed its own version of SMARTS, which differs from the original Daylight version in how the R descriptor (see Cyclicity below) is defined.

SMARTS syntax

Atomic properties

Atoms can be specified by symbol or atomic number. Aliphatic carbon is matched by [C], aromatic carbon by [c] and any carbon by [#6] or [C,c]. The wild card symbols *, A and a match any atom, any aliphatic atom and any aromatic atom respectively. Implicit hydrogens are considered to be a characteristic of atoms and the SMARTS for an amino group can be written as [NH2]. Charge is specified by the descriptors + and - as exemplified by the SMARTS [nH+] (protonated aromatic nitrogen atom) and [O-]C(=O)c (deprotonated aromatic carboxylic acid).

Bonds

A number of bond types can be specified: - (single), = (double), # (triple), : (aromatic) and ~ (any).

Connectivity

The X and D descriptors are used to specify the total numbers of connections (including implicit hydrogen atoms) and connections to explicit atoms, respectively. Thus [CX4] matches carbon atoms with bonds to any four other atoms while [CD4] matches quaternary carbon.
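The difference between X and D can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a SMARTS engine: the `Atom` record, its field names and the hand-built neopentane graph are hypothetical and were introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Atom:
    """Hypothetical atom record: element, implicit H count, explicit neighbours."""
    element: str
    implicit_h: int
    neighbors: list = field(default_factory=list, repr=False)


def d_value(atom):
    """D: number of explicitly represented neighbouring atoms."""
    return len(atom.neighbors)


def x_value(atom):
    """X: total connections, i.e. explicit neighbours plus implicit hydrogens."""
    return len(atom.neighbors) + atom.implicit_h


def build_neopentane():
    """Return (central carbon, methyl carbons) for neopentane, C(C)(C)(C)C."""
    center = Atom("C", implicit_h=0)
    methyls = [Atom("C", implicit_h=3) for _ in range(4)]
    for methyl in methyls:
        center.neighbors.append(methyl)
        methyl.neighbors.append(center)
    return center, methyls


center, methyls = build_neopentane()
for label, atom in [("central C", center), ("methyl C", methyls[0])]:
    # Aromaticity is ignored here because every carbon in the example is aliphatic.
    is_cx4 = atom.element == "C" and x_value(atom) == 4
    is_cd4 = atom.element == "C" and d_value(atom) == 4
    print(f"{label}: X={x_value(atom)} D={d_value(atom)} [CX4]={is_cx4} [CD4]={is_cd4}")
```

Both kinds of carbon satisfy [CX4], but only the central, quaternary carbon satisfies [CD4], because the methyl carbons have a single explicit neighbour.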

Cyclicity

As originally defined by Daylight, the R descriptor is used to specify ring membership. In the Daylight model for cyclic systems, the smallest set of smallest rings (SSSR) [4] is used as a basis for ring membership. For example, indole is perceived as a 5-membered ring fused with a 6-membered ring rather than a 9-membered ring. The two carbon atoms that make up the ring fusion would match [cR2] and the other carbon atoms would match [cR1].

The SSSR model has been criticised by OpenEye [5] who, in their implementation of SMARTS, use R to denote the number of ring bonds for an atom. The two carbon atoms in the ring fusion match [cR3] and the other carbons match [cR2] in the OpenEye implementation of SMARTS. Used without a number, R specifies an atom in a ring in both implementations, for example [CR] (aliphatic carbon atom in ring).

Lower case r specifies the size of the smallest ring of which the atom is a member. The carbon atoms of the ring fusion would both match [cr5]. Bonds can be specified as cyclic, for example C@C matches directly bonded atoms in a ring.
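The two readings of R, and the meaning of r, can be compared on indole with a short Python sketch. It assumes the atom numbering of the SMILES c1ccc2c(c1)cc[nH]2 (atoms 0 to 8) and takes the two SSSR rings as given rather than perceiving them; the function names are hypothetical and are not part of any toolkit.

```python
# Indole, numbered as in the SMILES c1ccc2c(c1)cc[nH]2.
# The two rings of the SSSR are written out by hand (an assumption of this sketch).
SIX_RING = (0, 1, 2, 3, 4, 5)
FIVE_RING = (3, 4, 6, 7, 8)
SSSR = (SIX_RING, FIVE_RING)


def ring_bonds(rings):
    """Collect every bond that lies on one of the given rings."""
    bonds = set()
    for ring in rings:
        for i, atom in enumerate(ring):
            nxt = ring[(i + 1) % len(ring)]
            bonds.add(frozenset((atom, nxt)))
    return bonds


def daylight_R(atom):
    """Daylight reading: number of SSSR rings that contain the atom."""
    return sum(atom in ring for ring in SSSR)


def openeye_R(atom):
    """OpenEye reading: number of ring bonds on the atom."""
    return sum(atom in bond for bond in ring_bonds(SSSR))


def smallest_ring(atom):
    """r: size of the smallest ring that contains the atom."""
    return min(len(ring) for ring in SSSR if atom in ring)


for atom in (3, 0):  # atom 3 is a ring-fusion carbon, atom 0 is a benzo carbon
    print(atom, daylight_R(atom), openeye_R(atom), smallest_ring(atom))
```

For the ring-fusion carbon (atom 3) the sketch gives R = 2 under the Daylight reading, R = 3 under the OpenEye reading and r = 5; for a non-fusion benzo carbon (atom 0) it gives 1, 2 and 6 respectively.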

Logical operators

Four logical operators allow atom and bond descriptors to be combined. The low-priority 'and' operator ; can be used to define a protonated primary amine as [N;H3;+][C;X4]. The 'or' operator , has a higher priority than ; so [c,n;H] defines (aromatic carbon or aromatic nitrogen) with one attached hydrogen. The 'and' operator & has a higher priority than , so [c,n&H] defines aromatic carbon or (aromatic nitrogen with one attached hydrogen).

The 'not' operator ! can be used to define unsaturated aliphatic carbon as [C;!X4] and acyclic bonds as *-!@*.
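The operator priorities can be made concrete with a toy evaluator for a single atom expression. This is a minimal sketch under several assumptions: it handles only the primitives used in the examples above (one-letter element symbols, H, X and +), it takes the expression without its enclosing brackets, it does not support implicit 'and' by juxtaposition, and the atom dictionary keys are hypothetical.

```python
def primitive(token, atom):
    """Evaluate one primitive, optionally negated with '!'."""
    if token.startswith("!"):
        return not primitive(token[1:], atom)
    if token == "+":
        return atom["charge"] == 1
    if token[0] == "H":  # hydrogen count, defaulting to one
        return atom["h"] == int(token[1:] or 1)
    if token[0] == "X":  # total connections, defaulting to one
        return atom["x"] == int(token[1:] or 1)
    # One-letter element symbol: lower case means aromatic, upper case aliphatic.
    return atom["element"] == token.upper() and atom["aromatic"] == token.islower()


def matches(expr, atom):
    """Apply ';' (lowest priority), then ',', then '&' (highest priority)."""
    return all(
        any(
            all(primitive(p, atom) for p in alternative.split("&"))
            for alternative in clause.split(",")
        )
        for clause in expr.split(";")
    )


# An aromatic carbon with no attached hydrogen, e.g. a ring-fusion carbon.
fusion_c = {"element": "C", "aromatic": True, "h": 0, "x": 3, "charge": 0}
print(matches("c,n;H", fusion_c))  # (c or n) and H
print(matches("c,n&H", fusion_c))  # c or (n and H)
```

The first expression rejects the hydrogen-free aromatic carbon because the hydrogen requirement applies to both alternatives, while the second accepts it because the hydrogen requirement is bound only to the nitrogen alternative.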

Recursive SMARTS

Recursive SMARTS allow detailed specification of an atom's environment. For example, the more reactive (with respect to electrophilic aromatic substitution) ortho and para carbon atoms of phenol can be defined as [$(c1c([OH])cccc1),$(c1ccc([OH])cc1)].

Examples of SMARTS

A number of illustrative examples of SMARTS have been assembled by Daylight.

The definitions of hydrogen bond donors and acceptors used to apply Lipinski's Rule of Five [6] are easily coded in SMARTS. Donors are defined as nitrogen or oxygen atoms that have at least one directly bonded hydrogen atom:

[N,n,O;!H0] or [#7,#8;!H0] (aromatic oxygen cannot have a bonded hydrogen)

Acceptors are defined as nitrogen or oxygen:

[N,n,O,o] or [#7,#8]
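The two definitions translate directly into atom-level predicates. The sketch below counts donors and acceptors for paracetamol, CC(=O)Nc1ccc(O)cc1, whose heavy atoms are hand-encoded as (atomic number, attached hydrogen count) pairs; the encoding and the function names are assumptions made for illustration, and no SMARTS parsing is performed.

```python
# Heavy atoms of paracetamol in SMILES order, as (atomic number, attached H count).
PARACETAMOL = [
    (6, 3),  # methyl carbon
    (6, 0),  # carbonyl carbon
    (8, 0),  # carbonyl oxygen
    (7, 1),  # amide nitrogen
    (6, 0),  # aromatic carbon bonded to N
    (6, 1),
    (6, 1),
    (6, 0),  # aromatic carbon bonded to OH
    (8, 1),  # phenol oxygen
    (6, 1),
    (6, 1),
]


def is_donor(atomic_number, h_count):
    """Equivalent of [#7,#8;!H0]: N or O carrying at least one hydrogen."""
    return atomic_number in (7, 8) and h_count != 0


def is_acceptor(atomic_number, h_count):
    """Equivalent of [#7,#8]: any N or O, regardless of hydrogen count."""
    return atomic_number in (7, 8)


def count(predicate, atoms):
    """Number of atoms in the list that satisfy the predicate."""
    return sum(predicate(z, h) for z, h in atoms)


print("donors:", count(is_donor, PARACETAMOL))
print("acceptors:", count(is_acceptor, PARACETAMOL))
```

With this encoding the amide N-H and the phenol O-H count as donors, and all three heteroatoms count as acceptors.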

A simple definition of aliphatic amines that are likely to protonate at physiological pH can be written as the following recursive SMARTS:

[$([NH2][CX4]),$([NH]([CX4])[CX4]),$([NX3]([CX4])([CX4])[CX4])]

In real applications the CX4 atoms would need to be defined more precisely to prevent matching against electron withdrawing groups such as CF3 that would render the amine insufficiently basic to protonate at physiological pH.

SMARTS can be used to encode pharmacophore elements such as anionic centers. In the following example, recursive SMARTS notation is used to combine acid oxygen and tetrazole nitrogen in a definition of atoms that are likely to be anionic under normal physiological conditions:

[$([OH][C,S,P]=O),$([nH]1nnnc1)]

The SMARTS above would only match the acid hydroxyl and the tetrazole N−H. When a carboxylic acid deprotonates, the negative charge is delocalised over both oxygen atoms, and it may be desirable to designate both as anionic. This can be achieved using the following SMARTS:

[$([OH]C=O),$(O=C[OH])]

Applications of SMARTS

The precise and transparent substructural specification that SMARTS allows has been exploited in a number of applications.

Substructural filters defined in SMARTS have been used [7] to identify undesirable compounds when performing strategic pooling of compounds for high-throughput screening. The REOS (rapid elimination of swill) [8] procedure uses SMARTS to filter out reactive, toxic and otherwise undesirable moieties from databases of chemical structures.

RECAP [9] (Retrosynthetic Combinatorial Analysis Procedure) uses SMARTS to define bond types. RECAP is a molecule editor which generates fragments of structures by breaking bonds of defined types and the original link points in these are specified using isotopic labels. Searching databases of biologically active compounds for occurrences of fragments allows privileged structural motifs to be identified. The Molecular Slicer [10] is similar to RECAP and has been used to identify fragments that are commonly found in marketed oral drugs.

The Leatherface program [11] is a general purpose molecule editor which allows automated modification of a number of substructural features of molecules in databases, including protonation state, hydrogen count, formal charge, isotopic weight and bond order. The molecular editing rules used by Leatherface are defined in SMARTS. Leatherface can be used to standardise tautomeric and ionization states and to set and enumerate these in preparation of databases [12] for virtual screening. Leatherface has been used in matched molecular pair analysis, which enables the effects of structural changes (e.g. substitution of hydrogen with chlorine) to be quantified over a range of structural types. [13]

ALADDIN [14] is a pharmacophore matching program that uses SMARTS to define recognition points (e.g. neutral hydrogen bond acceptor) of pharmacophores. A key problem in pharmacophore matching is that functional groups that are likely to be ionised at physiological pH are typically registered in their neutral forms in structural databases. The ROCS shape matching program allows atom types to be defined using SMARTS. [15]

Notes and references

  1. SMARTS Theory Manual, Daylight Chemical Information Systems, Santa Fe, New Mexico.
  2. SMARTS Tutorial, Daylight Chemical Information Systems, Santa Fe, New Mexico.
  3. SMARTS Examples, Daylight Chemical Information Systems, Santa Fe, New Mexico.
  4. Downs, G.M.; Gillet, V.J.; Holliday, J.D.; Lynch, M.F. (1989). "A Review of Ring Perception Algorithms for Chemical Graphs". J. Chem. Inf. Comput. Sci. 29 (3): 172–187. doi:10.1021/ci00063a007.
  5. "Smallest Set of Smallest Rings (SSSR) considered Harmful". OEChem - C++ Manual, Version 1.5.1, OpenEye Scientific Software, Santa Fe, New Mexico. Archived from the original on October 14, 2007. Retrieved 2017-02-08.
  6. Lipinski, Christopher A.; Lombardo, Franco; Dominy, Beryl W.; Feeney, Paul J. (2001). "Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings". Advanced Drug Delivery Reviews. 46 (1–3): 3–26. doi:10.1016/S0169-409X(00)00129-0. PMID 11259830.
  7. Hann, Mike; Hudson, Brian; Lewell, Xiao; Lifely, Rob; Miller, Luke; Ramsden, Nigel (1999). "Strategic Pooling of Compounds for High-Throughput Screening". Journal of Chemical Information and Computer Sciences. 39 (5): 897–902. doi:10.1021/ci990423o. PMID 10529988.
  8. Walters, W. Patrick; Murcko, Mark A. (2002). "Prediction of 'drug-likeness'". Advanced Drug Delivery Reviews. 54 (3): 255–271. doi:10.1016/S0169-409X(02)00003-0. PMID 11922947.
  9. Lewell, Xiao Qing; Judd, Duncan B.; Watson, Stephen P.; Hann, Michael M. (1998). "RECAP–Retrosynthetic Combinatorial Analysis Procedure: A Powerful New Technique for Identifying Privileged Molecular Fragments with Useful Applications in Combinatorial Chemistry". Journal of Chemical Information and Computer Sciences. 38 (3): 511–522. doi:10.1021/ci970429i. PMID 9611787.
  10. Vieth, Michal; Siegel, Miles G.; Higgs, Richard E.; Watson, Ian A.; Robertson, Daniel H.; Savin, Kenneth A.; Durst, Gregory L.; Hipskind, Philip A. (2004). "Characteristic Physical Properties and Structural Fragments of Marketed Oral Drugs". Journal of Medicinal Chemistry. 47 (1): 224–232. doi:10.1021/jm030267j. PMID 14695836.
  11. Kenny, Peter W.; Sadowski, Jens (2005). "Structure Modification in Chemical Databases". Chemoinformatics in Drug Discovery. Methods and Principles in Medicinal Chemistry. pp. 271–285. doi:10.1002/3527603743.ch11. ISBN 9783527307531.
  12. Lyne, Paul D.; Kenny, Peter W.; Cosgrove, David A.; Deng, Chun; Zabludoff, Sonya; Wendoloski, John J.; Ashwell, Susan (2004). "Identification of Compounds with Nanomolar Binding Affinity for Checkpoint Kinase-1 Using Knowledge-Based Virtual Screening". Journal of Medicinal Chemistry. 47 (8): 1962–1968. doi:10.1021/jm030504i. PMID 15055996.
  13. Leach, Andrew G.; Jones, Huw D.; Cosgrove, David A.; Kenny, Peter W.; Ruston, Linette; MacFaul, Philip; Wood, J. Matthew; Colclough, Nicola; Law, Brian (2006). "Matched Molecular Pairs as a Guide in the Optimization of Pharmaceutical Properties; a Study of Aqueous Solubility, Plasma Protein Binding and Oral Exposure". Journal of Medicinal Chemistry. 49 (23): 6672–6682. doi:10.1021/jm0605233. PMID 17154498.
  14. Van Drie, John H.; Weininger, David; Martin, Yvonne C. (1989). "ALADDIN: An integrated tool for computer-assisted molecular design and pharmacophore recognition from geometric, steric, and substructure searching of three-dimensional molecular structures". Journal of Computer-Aided Molecular Design. 3 (3): 225–251. doi:10.1007/BF01533070. PMID 2573695. S2CID 206795998.
  15. ROCS, OpenEye Scientific Software.
