Latent semantic structure indexing


Latent semantic structure indexing (LaSSI) is a technique, derived from latent semantic analysis (LSA), for calculating chemical similarity.

Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, for analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per paragraph (rows represent unique words and columns represent paragraphs) is constructed from a large piece of text, and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Paragraphs are then compared by taking the cosine of the angle between the vectors formed by any two columns. Values close to 1 represent very similar paragraphs, while values close to 0 represent very dissimilar paragraphs.
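
The LSA steps described above (count matrix, truncated SVD, cosine comparison of columns) can be sketched as follows. This is a minimal illustration that assumes NumPy is available; the vocabulary, the counts, and the helper names `lsa_paragraph_vectors` and `cosine` are made up for the example and do not come from any particular LSA implementation.

```python
import numpy as np

# Toy word-by-paragraph count matrix (rows: words, columns: paragraphs).
# The vocabulary and the counts are invented for illustration.
words = ["ship", "boat", "ocean", "tree", "wood"]
counts = np.array([
    [1, 0, 1, 0],  # ship
    [0, 1, 1, 0],  # boat
    [1, 1, 0, 0],  # ocean
    [0, 0, 0, 1],  # tree
    [0, 0, 1, 1],  # wood
], dtype=float)


def lsa_paragraph_vectors(matrix, k):
    """Return one k-dimensional latent vector per column of `matrix`."""
    # Thin SVD: matrix = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    # Keep the k largest singular values; column j of diag(s_k) @ Vt_k
    # holds the coordinates of paragraph j in the latent space.
    return (s[:k, None] * Vt[:k, :]).T


def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


vectors = lsa_paragraph_vectors(counts, k=2)
sim_sea = cosine(vectors[0], vectors[1])    # two paragraphs about the sea
sim_mixed = cosine(vectors[0], vectors[3])  # sea paragraph vs. forest paragraph
print(sim_sea, sim_mixed)
```

In this toy matrix, paragraphs 0 and 1 use different vessel words but share "ocean", and in the two-dimensional latent space they score as more similar to each other than paragraph 0 does to paragraph 3.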

LaSSI was developed at Merck & Co. and patented in 2007 [1] by Richard Hull, Eugene Fluder, Suresh Singh, Robert Sheridan, Robert Nachbar and Simon Kearsley.

Merck & Co., Inc., d.b.a. Merck Sharp & Dohme (MSD) outside the United States and Canada, is an American multinational pharmaceutical company and one of the largest pharmaceutical companies in the world.

Overview

LaSSI is similar to LSA in that it involves the construction of an occurrence matrix from a corpus of items and the application of singular value decomposition to that matrix to derive latent features. The difference is that the occurrence matrix records the frequency of two- and three-dimensional chemical descriptors (rather than natural language terms) found within a database of chemical structures. This process derives latent chemical structure concepts that can be used to calculate chemical similarities and structure–activity relationships for drug discovery.
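
As a rough illustration of the idea, the sketch below builds a descriptor-by-compound occurrence matrix, derives a latent space with SVD, and ranks database compounds against a query by cosine similarity in that space. It is not the patented Merck procedure: the descriptor labels, compound names, counts, the choice of k, and the helper functions are all hypothetical, and NumPy is assumed.

```python
import numpy as np

# Hypothetical descriptor labels and compound names; the counts are invented.
descriptors = ["aromatic_ring", "carboxylic_acid", "hydroxyl", "amine", "halogen"]
compounds = ["cmpd_A", "cmpd_B", "cmpd_C", "cmpd_D"]

# Occurrence matrix: rows are descriptors, columns are compounds.
X = np.array([
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 2],
], dtype=float)


def fit_latent_space(matrix, k):
    """Truncated SVD: return U_k, the top-k singular values, and V_k."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :].T


def fold_in(query_counts, U_k):
    """Project a descriptor-count vector into the latent space (U_k^T q)."""
    return U_k.T @ query_counts


def rank_by_similarity(query_counts, U_k, s_k, V_k):
    """Rank database compounds by latent-space cosine similarity to the query."""
    q = fold_in(query_counts, U_k)
    # Row j of V_k * s_k equals U_k^T x_j, the projection of compound j,
    # so the query and the database share the same coordinates.
    db = V_k * s_k
    sims = db @ q / (np.linalg.norm(db, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)  # descending similarity
    return [(compounds[i], float(sims[i])) for i in order]


U_k, s_k, V_k = fit_latent_space(X, k=2)
query = X[:, 0].copy()  # reuse cmpd_A's descriptor counts as the query
ranking = rank_by_similarity(query, U_k, s_k, V_k)
for name, score in ranking:
    print(name, round(score, 3))
```

Folding the query in with `U_k.T @ q` places it in the same coordinates as the database compounds, so a query that was not part of the original matrix can be scored without recomputing the decomposition.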

A chemical database is a database specifically designed to store chemical information. This information is about chemical and crystal structures, spectra, reactions and syntheses, and thermophysical data.

The structure–activity relationship (SAR) is the relationship between the chemical or 3D structure of a molecule and its biological activity. This idea was first presented by Crum-Brown and Fraser in 1865. The analysis of SAR enables the determination of the chemical group responsible for evoking a target biological effect in the organism. This allows modification of the effect or the potency of a bioactive compound by changing its chemical structure. Medicinal chemists use the techniques of chemical synthesis to insert new chemical groups into the biomedical compound and test the modifications for their biological effects.

Drug discovery

In the fields of medicine, biotechnology and pharmacology, drug discovery is the process by which new candidate medications are discovered. Historically, drugs were discovered by identifying the active ingredient of traditional remedies or by serendipitous discovery. Later, chemical libraries of synthetic small molecules, natural products or extracts were screened in intact cells or whole organisms to identify substances with a desirable therapeutic effect, a process known as classical pharmacology. Since the sequencing of the human genome, which allowed rapid cloning and synthesis of large quantities of purified proteins, it has become common practice to use high-throughput screening of large compound libraries against isolated biological targets that are hypothesized to be disease-modifying, a process known as reverse pharmacology. Hits from these screens are then tested for efficacy in cells and then in animals.

Related Research Articles

Information retrieval (IR) is the activity of obtaining information system resources relevant to an information need from a collection. Searches can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for metadata that describe data, and for databases of texts, images or sounds.

Matrix metalloproteinases (MMPs), also known as matrixins, are calcium-dependent zinc-containing endopeptidases; other family members are adamalysins, serralysins, and astacins. The MMPs belong to a larger family of proteases known as the metzincin superfamily.

Semantic memory is one of the two types of declarative or explicit memory. Semantic memory refers to general world knowledge that we have accumulated throughout our lives. This general knowledge is intertwined with experience and dependent on culture. Semantic memory is distinct from episodic memory, which is our memory of experiences and specific events that occur during our lives and that we can recreate at any given point. For instance, semantic memory might contain information about what a cat is, whereas episodic memory might contain a specific memory of petting a particular cat. We can learn about new concepts by applying knowledge learned in the past. The counterpart to declarative or explicit memory is nondeclarative or implicit memory.

Document retrieval is defined as the matching of some stated user query against a set of free-text records. These records could be any type of mainly unstructured text, such as newspaper articles, real estate records or paragraphs in a manual. User queries can range from multi-sentence full descriptions of an information need to a few words.

Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between them is based on the likeness of their meaning or semantic content, as opposed to similarity estimated from their syntactic representation. Semantic similarity measures are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained by comparing the information that supports their meaning or describes their nature. The term semantic similarity is often confused with semantic relatedness. Semantic relatedness includes any relation between two terms, while semantic similarity only includes "is a" relations. For example, "car" is similar to "bus", but is also related to "road" and "driving".

Probabilistic latent semantic analysis (PLSA), also known as probabilistic latent semantic indexing (PLSI), is a statistical technique for the analysis of two-mode and co-occurrence data. In effect, one can derive a low-dimensional representation of the observed variables in terms of their affinity to certain hidden variables, just as in latent semantic analysis, from which PLSA evolved.

In linguistics, statistical semantics applies the methods of statistics to the problem of determining the meaning of words or phrases, ideally through unsupervised learning, to a degree of precision at least sufficient for the purpose of information retrieval.

Distributional semantics is a research area that develops and studies theories and methods for quantifying and categorizing semantic similarities between linguistic items based on their distributional properties in large samples of language data. The basic idea of distributional semantics can be summed up in the so-called Distributional hypothesis: linguistic items with similar distributions have similar meanings.

ChemSpider

ChemSpider is a database of chemicals. ChemSpider is owned by the Royal Society of Chemistry.

MAP2K3

Dual specificity mitogen-activated protein kinase kinase 3 is an enzyme that in humans is encoded by the MAP2K3 gene.

LTBP1 (gene)

Latent-transforming growth factor beta-binding protein 1 is a protein that in humans is encoded by the LTBP1 gene.

LTBP2

Latent-transforming growth factor beta-binding protein 2 is a protein that in humans is encoded by the LTBP2 gene.

Chemical similarity

Chemical similarity refers to the similarity of chemical elements, molecules or chemical compounds with respect to either structural or functional qualities, i.e. the effect that the chemical compound has on reaction partners in inorganic or biological settings. Biological effects and thus also similarity of effects are usually quantified using the biological activity of a compound. In general terms, function can be related to the chemical activity of compounds.

A concept search is an automated information retrieval method that is used to search electronically stored unstructured text for information that is conceptually similar to the information provided in a search query. In other words, the ideas expressed in the information retrieved in response to a concept search query are relevant to the ideas contained in the text of the query.

In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear in documents about cats, and "the" and "is" will appear equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. The "topics" produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document's balance of topics is.

The ChemDB HIV, Opportunistic Infection and Tuberculosis Therapeutics Database is a publicly available tool developed by the National Institute of Allergy and Infectious Diseases to compile preclinical data on small molecules with potential therapeutic action against HIV/AIDS and related opportunistic infections.

In natural language processing and information retrieval, explicit semantic analysis (ESA) is a vector representation of text that uses a document corpus as a knowledge base. Specifically, in ESA, a word is represented as a column vector in the tf–idf matrix of the text corpus, and a document is represented as the centroid of the vectors representing its words. Typically, the text corpus is English Wikipedia, though other corpora, including the Open Directory Project, have been used.
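
The representation described above can be sketched with the standard library alone. The three-entry knowledge base, the tf–idf variant (term frequency times log(N/df)), and all function names are assumptions chosen for brevity; a real ESA system would use a large corpus such as Wikipedia.

```python
import math
from collections import Counter

# Tiny stand-in knowledge base (hypothetical); each entry plays the role
# of one concept article in the corpus.
concepts = {
    "Cat": "cat meow pet fur whiskers cat",
    "Dog": "dog bark pet bone fur dog",
    "Car": "car road engine wheel driving",
}


def build_word_vectors(corpus):
    """Map each word to its tf-idf weights across all concept documents."""
    tokenized = [text.split() for text in corpus.values()]
    n_docs = len(tokenized)
    # Document frequency: number of concept documents containing each word.
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))
    vectors = {}
    for word, df in doc_freq.items():
        idf = math.log(n_docs / df)
        vectors[word] = [tokens.count(word) / len(tokens) * idf for tokens in tokenized]
    return vectors


def text_vector(text, vectors, dim):
    """Represent a text as the centroid of the vectors of its known words."""
    known = [vectors[w] for w in text.split() if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


word_vectors = build_word_vectors(concepts)
dim = len(concepts)
vec_a = text_vector("cat whiskers", word_vectors, dim)
vec_b = text_vector("meow fur", word_vectors, dim)
vec_c = text_vector("engine wheel", word_vectors, dim)
print(cosine(vec_a, vec_b), cosine(vec_a, vec_c))
```

The first two texts share no words, yet they score as related because their words carry weight on the same concept, whereas the third text has no concept in common with the first.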

Semantic folding theory describes a procedure for encoding the semantics of natural language text in a semantically grounded binary representation. This approach provides a framework for modelling how language data is processed by the neocortex.

References
