Molecule mining


Molecule mining is data mining, that is, the extraction and discovery of patterns, applied to molecules. Because molecules can be represented as molecular graphs, it is closely related to graph mining and structured data mining. The main problem is how to represent molecules so that the data instances can be discriminated from one another. One approach is to use chemical similarity metrics, which have a long tradition in cheminformatics.


Typical approaches to calculating chemical similarity use chemical fingerprints, but these lose the underlying information about molecular topology. Mining the molecular graphs directly avoids this problem, as does the inverse QSAR problem, which is preferable for vectorial mappings.
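The loss of topological information in fingerprints can be illustrated with a small sketch. It is not taken from any cheminformatics library: the molecule encoding (a list of atom labels plus a list of bonded index pairs), the helper names `path_fingerprint` and `tanimoto`, and the choice of short atom-label paths as fingerprint features are all assumptions made for illustration. Real fingerprints are usually hashed bit vectors, but the effect shown here is the same in kind.

```python
def path_fingerprint(atoms, bonds, max_atoms=3):
    """Return the set of atom-label paths containing up to `max_atoms` atoms."""
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    features = set()

    def extend(path):
        labels = [atoms[i] for i in path]
        # Store one orientation only, so "C-O" and "O-C" are the same feature.
        features.add("-".join(min(labels, labels[::-1])))
        if len(path) < max_atoms:
            for nxt in neighbors[path[-1]]:
                if nxt not in path:
                    extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return features


def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity of two feature sets."""
    union = fp_a | fp_b
    return len(fp_a & fp_b) / len(union) if union else 1.0


# Hydrogen-depleted toy graphs: (atom labels, bonds as index pairs).
propan_1_ol = (["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])
propan_2_ol = (["C", "C", "C", "O"], [(0, 1), (1, 2), (1, 3)])
ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
dimethyl_ether = (["C", "O", "C"], [(0, 1), (1, 2)])

# The two propanol isomers are different graphs but share every short path.
sim_isomers = tanimoto(path_fingerprint(*propan_1_ol), path_fingerprint(*propan_2_ol))
sim_ether = tanimoto(path_fingerprint(*ethanol), path_fingerprint(*dimethyl_ether))
print(sim_isomers, sim_ether)  # prints "1.0 0.5"
```

In this toy case the fingerprint cannot tell propan-1-ol from propan-2-ol, although their bond lists differ. Raising `max_atoms` to 4 separates them, which suggests why the choice of features matters and why methods that work on the graph itself avoid the issue.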

Coding(Molecule_i, Molecule_{j≠i})

Kernel methods
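Kernel methods compare two molecules directly, without first mapping each one to a fixed descriptor vector chosen by hand. As a minimal sketch of the idea, the following computes a simple path-count kernel on labeled molecular graphs. It is an illustrative stand-in, not one of the published marginalized or optimal-assignment kernels; the molecule encoding and the names `label_path_counts`, `path_kernel` and `normalized_kernel` are hypothetical.

```python
from collections import Counter
from math import sqrt


def label_path_counts(atoms, bonds, max_atoms=3):
    """Count directed atom-label paths containing up to `max_atoms` atoms."""
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    counts = Counter()

    def extend(path):
        counts[tuple(atoms[i] for i in path)] += 1
        if len(path) < max_atoms:
            for nxt in neighbors[path[-1]]:
                if nxt not in path:
                    extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return counts


def path_kernel(mol_a, mol_b, max_atoms=3):
    """Inner product of the two path-count vectors."""
    counts_a = label_path_counts(*mol_a, max_atoms=max_atoms)
    counts_b = label_path_counts(*mol_b, max_atoms=max_atoms)
    return sum(n * counts_b[path] for path, n in counts_a.items())


def normalized_kernel(mol_a, mol_b, max_atoms=3):
    """Cosine-style normalization, so that k(x, x) equals 1."""
    k_ab = path_kernel(mol_a, mol_b, max_atoms)
    k_aa = path_kernel(mol_a, mol_a, max_atoms)
    k_bb = path_kernel(mol_b, mol_b, max_atoms)
    return k_ab / sqrt(k_aa * k_bb)


# Hydrogen-depleted toy graphs: (atom labels, bonds as index pairs).
ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
dimethyl_ether = (["C", "O", "C"], [(0, 1), (1, 2)])

print(path_kernel(ethanol, dimethyl_ether))        # prints 9
print(normalized_kernel(ethanol, dimethyl_ether))  # roughly 0.605
```

Because the value is an inner product of explicit count vectors, the function is symmetric and positive semidefinite, so it could in principle be passed to a kernel machine such as a support vector machine. Enumerating paths explicitly only scales to small graphs and short paths.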

Maximum common graph methods

Coding(Molecule_i)

Molecular query methods

Methods based on special architectures of neural networks

See also



Further reading