Scott Deerwester

Scott Deerwester (born 1956) is one of the inventors of latent semantic analysis.[1][2] He was a member of the faculty of Colgate University, the University of Chicago, and the Hong Kong University of Science and Technology. He moved to Hong Kong in 1991, where he worked in the humanitarian sector.

Related Research Articles

Information retrieval (IR) in computing and information science is the process of obtaining information system resources that are relevant to an information need from a collection of those resources. Searches can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images or sounds.

<span class="mw-page-title-main">Cognition</span> Act or process of knowing

Cognition refers to "the mental action or process of acquiring knowledge and understanding through thought, experience, and the senses". It encompasses all aspects of intellectual functions and processes, such as perception, attention, thought, intelligence, the formation of knowledge, memory and working memory, judgment and evaluation, reasoning and computation, problem solving and decision making, and the comprehension and production of language. Imagination is also a cognitive process because it involves thinking about possibilities. Cognitive processes use existing knowledge and discover new knowledge.

Semantic memory refers to general world knowledge that humans have accumulated throughout their lives. This general knowledge is intertwined with experience and dependent on culture. We can learn about new concepts by applying knowledge learned from things in the past.

Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per document is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by cosine similarity between any two columns. Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents.
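
As a rough illustration, the Python sketch below builds a small term-document count matrix, reduces it with a truncated SVD, and compares documents by cosine similarity in the latent space. The toy corpus and the choice of k = 2 dimensions are illustrative assumptions, not the procedure from the original paper.

```python
# Minimal LSA sketch: term-document counts -> truncated SVD -> cosine
# similarity between documents in a k-dimensional latent space.
# The corpus and k are illustrative assumptions.
import numpy as np

docs = [
    "human machine interface for computer applications",
    "a survey of user opinion of computer system response time",
    "the generation of random binary trees",
    "the intersection graph of paths in trees",
]

# Term-document matrix: one row per term, one column per document.
vocab = sorted({w for d in docs for w in d.split()})
row = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        X[row[w], j] += 1

# Truncated SVD: keep k latent dimensions, preserving the dominant
# similarity structure among the columns (documents).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dimensional vector per document

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(doc_vecs[0], doc_vecs[1]))  # computer-related pair: tends toward 1
print(cosine(doc_vecs[0], doc_vecs[2]))  # unrelated pair: tends toward 0
```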

A document-term matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. This matrix is a specific instance of a document-feature matrix where "features" may refer to other properties of a document besides terms. It is also common to encounter the transpose, or term-document matrix where documents are the columns and terms are the rows. They are useful in the field of natural language processing and computational text analysis.
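
For example, a document-term matrix can be built with scikit-learn's CountVectorizer; the three-document corpus below is an invented illustration, and the transpose at the end gives the corresponding term-document matrix.

```python
# A tiny document-term matrix (rows = documents, columns = terms),
# built with scikit-learn's CountVectorizer for illustration.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog sat", "the dog barked"]
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)       # sparse matrix: one row per document

print(vectorizer.get_feature_names_out())  # column labels (terms)
print(dtm.toarray())                       # term frequencies per document
print(dtm.T.toarray())                     # transpose: a term-document matrix
```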

Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity. These are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained according to the comparison of information supporting their meaning or describing their nature. The term semantic similarity is often confused with semantic relatedness. Semantic relatedness includes any relation between two terms, while semantic similarity only includes "is a" relations. For example, "car" is similar to "bus", but is also related to "road" and "driving".
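
A hedged sketch of the distinction: NLTK's WordNet interface exposes path_similarity, a taxonomy-based ("is a") measure, which tends to score the car/bus pair from the example above higher than the car/road pair. The specific synset names are assumptions about the intended word senses.

```python
# WordNet path similarity, a taxonomy-based ("is a") measure, via NLTK.
# Assumes nltk is installed and the 'wordnet' corpus has been downloaded
# (nltk.download('wordnet')); the chosen synsets are assumed senses.
from nltk.corpus import wordnet as wn

car = wn.synset("car.n.01")
bus = wn.synset("bus.n.01")
road = wn.synset("road.n.01")

print(car.path_similarity(bus))   # taxonomic neighbours: relatively high
print(car.path_similarity(road))  # related, but not an "is a" relation: lower
```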

<span class="mw-page-title-main">Susan Dumais</span> American computer scientist

Susan Dumais is an American computer scientist who is a leader in the field of information retrieval, and has been a significant contributor to Microsoft's search technologies. According to Mary Jane Irwin, who heads the Athena Lecture awards committee, "Her sustained contributions have shaped the thinking and direction of human-computer interaction and information retrieval."

George William Furnas is an American academic, Professor and Associate Dean for Academic Strategy at the School of Information of the University of Michigan, known for his work on semantic analysis and on human-system communication.

Dr. Thomas K. Landauer was a Professor Emeritus in the Department of Psychology at the University of Colorado. He received his doctorate in 1960 from Harvard University and also held academic appointments at Harvard, Dartmouth College, Stanford University, and Princeton University. During his 25-year tenure as a Distinguished Member of Technical Staff at Bell Labs and its successors, where he managed an information science and human-computer interaction research group, he was one of the pioneers of latent semantic analysis.

In statistics, latent variables are variables that cannot be directly observed or measured, but can be inferred indirectly, through a mathematical model, from other variables that are observable. Such latent variable models are used in many disciplines, including political science, demography, engineering, medicine, ecology, physics, machine learning/artificial intelligence, bioinformatics, chemometrics, natural language processing, management, and the social sciences.

In linguistics, statistical semantics applies the methods of statistics to the problem of determining the meaning of words or phrases, ideally through unsupervised learning, to a degree of precision at least sufficient for the purpose of information retrieval.

Distributional semantics is a research area that develops and studies theories and methods for quantifying and categorizing semantic similarities between linguistic items based on their distributional properties in large samples of language data. The basic idea of distributional semantics can be summed up in the so-called distributional hypothesis: linguistic items with similar distributions have similar meanings.
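
A minimal sketch of the distributional hypothesis, assuming an invented toy corpus and a two-word context window: each word is represented by the counts of the words that appear near it, and words are then compared by cosine similarity.

```python
# Distributional-hypothesis sketch: represent each word by the counts of
# words seen within a two-word context window, then compare words by
# cosine similarity. Corpus and window size are illustrative assumptions.
import numpy as np

corpus = ("the cat drinks milk . the dog drinks water . "
          "the cat chases the dog").split()
window = 2
vocab = sorted(set(corpus))
index = {w: i for i, w in enumerate(vocab)}

co = np.zeros((len(vocab), len(vocab)))
for i, w in enumerate(corpus):
    lo, hi = max(0, i - window), min(len(corpus), i + window + 1)
    for j in range(lo, hi):
        if j != i:
            co[index[w], index[corpus[j]]] += 1

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# "cat" and "dog" occur in similar contexts, so their vectors come out
# more similar than those of "cat" and "milk".
print(cosine(co[index["cat"]], co[index["dog"]]))
print(cosine(co[index["cat"]], co[index["milk"]]))
```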

Latent semantic mapping (LSM) is a data-driven framework to model globally meaningful relationships implicit in large volumes of data. It is a generalization of latent semantic analysis. In information retrieval, LSA enables retrieval on the basis of conceptual content, instead of merely matching words between queries and documents.

In machine learning, semantic analysis of a corpus is the task of building structures that approximate concepts from a large set of documents. It generally does not involve prior semantic understanding of the documents. A metalanguage based on predicate logic can analyze the speech of humans. Another strategy for understanding the semantics of a text is symbol grounding: if language is grounded, it can be mapped to a machine-readable meaning. For the restricted domain of spatial analysis, a computer-based language understanding system has been demonstrated.

A concept search is an automated information retrieval method that is used to search electronically stored unstructured text for information that is conceptually similar to the information provided in a search query. In other words, the ideas expressed in the information retrieved in response to a concept search query are relevant to the ideas contained in the text of the query.

In statistics and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear in documents about cats, and "the" and "is" will appear approximately equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. The "topics" produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document's balance of topics is.
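
As a hedged sketch, scikit-learn's LatentDirichletAllocation can recover such topic mixtures from a count matrix; the tiny dog/cat corpus and the choice of two topics below are illustrative assumptions.

```python
# Topic-model sketch with scikit-learn's LatentDirichletAllocation.
# The toy corpus and the choice of two topics are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "dog bone dog park dog bone",
    "cat meow cat whiskers cat",
    "dog bone dog cat meow",      # mixed document: mostly dog, some cat
]
X = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # rows: each document's mixture of topics
print(doc_topics)                  # proportions, e.g. mostly-dog vs mostly-cat
```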

<span class="mw-page-title-main">HKUST Library</span> Library of the Hong Kong University of Science and Technology

The Hong Kong University of Science and Technology Library is housed in the Lee Shau Kee Library, located at the Hong Kong University of Science and Technology. It holds over 1 million books (728,426 printed volumes and 754,146 in electronic format), as well as tens of thousands of e-journals and streaming audio and video collections. A good part of its special collections, such as its Antique Maps of China Collection, has been digitized.

Semantic Scholar is an artificial intelligence–powered research tool for scientific literature developed at the Allen Institute for AI and publicly released in November 2015. It uses advances in natural language processing to provide summaries of scholarly papers. The Semantic Scholar team actively researches the use of artificial intelligence in natural language processing, machine learning, human-computer interaction, and information retrieval.

Semantic folding theory describes a procedure for encoding the semantics of natural language text in a semantically grounded binary representation. This approach provides a framework for modelling how language data is processed by the neocortex.

Semantic spaces in the natural language domain aim to create representations of natural language that are capable of capturing meaning. The original motivation for semantic spaces stems from two core challenges of natural language: vocabulary mismatch and the ambiguity of natural language.

References

  1. White, Martin (2007). Making Search Work: Implementing Web, Intranet, and Enterprise Search. Information Today, Inc. pp. 24–. ISBN 978-1-57387-305-5. Retrieved 14 May 2011.
  2. Deerwester, Scott; Dumais, Susan T.; Furnas, George W.; Landauer, Thomas K.; Harshman, Richard (September 1990). "Indexing by Latent Semantic Analysis". Journal of the American Society for Information Science. 41 (6): 391–407. CiteSeerX 10.1.1.33.2447. doi:10.1002/(sici)1097-4571(199009)41:6<391::aid-asi1>3.0.co;2-9.