Scott Deerwester

Scott Craig Deerwester is an American computer scientist and computer engineer who co-created the mathematical and natural language processing (NLP) technique known as Latent Semantic Analysis (LSA). [1] [2]

Early life

Deerwester was born in Rossville, Indiana, United States, in January 1956. [3] He is the son of Kenneth F. Deerwester (July 8, 1927 – March 3, 2013) and Donna Stone.[ citation needed ]

Scientific career

Deerwester began his academic career in the United States, contributing to the development of LSA [4] during his time at Colgate University and the University of Chicago. [5] He published his first research paper, The Retrieval Expert Model of Information Retrieval, at Purdue University in 1984. [6] [7]

Publications and research work

Deerwester co-authored a research paper on LSA in 1988. [8] This paper helped improve how information retrieval systems process textual information by finding latent associations between keywords in documents, even when they lack common words. This method aimed to address issues related to polysemy (words with multiple meanings) and synonymy (different words with similar meanings). [9]

According to Deerwester's seminal 1990 paper, "Indexing by latent semantic analysis", LSA enabled search engines to retrieve relevant documents even when they did not contain the exact keywords, which led to a more user-friendly and contextual retrieval mechanism. [1] His research laid groundwork for later probabilistic models such as Latent Dirichlet Allocation (LDA), which are widely used in topic modeling and semantic analysis. [2]

LSA is used in natural language processing applications, including chatbots and automatic translation services, and can emulate some human abilities such as word sorting and category assessment. [10] Deerwester's work has found applications in data mining, recommender systems, and business intelligence tools. [2]

Related Research Articles

Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an information need. The information need can be specified in the form of a search query. In the case of document retrieval, queries can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images or sounds.

In information science and information retrieval, relevance denotes how well a retrieved document or set of documents meets the information need of the user. Relevance may include concerns such as timeliness, authority or novelty of the result.

Semantic memory refers to general world knowledge that humans have accumulated throughout their lives. This general knowledge is intertwined in experience and dependent on culture. New concepts are learned by applying knowledge learned from things in the past.

Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per document is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by cosine similarity between any two columns. Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents.
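The pipeline described above (term-document counts, truncated SVD, cosine comparison) can be sketched with NumPy on a toy matrix; the vocabulary and counts here are illustrative, not from any real corpus:

```python
import numpy as np

# Toy term-document count matrix: rows = terms, columns = documents.
# Documents 0 and 1 share automotive vocabulary; document 2 is about cooking.
terms = ["car", "engine", "wheel", "recipe", "oven"]
X = np.array([
    [2, 1, 0],   # car
    [1, 2, 0],   # engine
    [1, 1, 0],   # wheel
    [0, 0, 3],   # recipe
    [0, 0, 2],   # oven
], dtype=float)

# Truncated SVD: keep k latent dimensions, reducing the row space while
# preserving the similarity structure among the document columns.
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dimensional vector per document

def cosine(a, b):
    """Cosine similarity between two document vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(doc_vecs[0], doc_vecs[1]))  # close to 1: very similar documents
print(cosine(doc_vecs[0], doc_vecs[2]))  # close to 0: very dissimilar documents
```

Because the toy matrix is block-structured, the two automotive documents collapse onto essentially the same latent direction, while the cooking document lands on an orthogonal one.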

Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity. These are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained according to the comparison of information supporting their meaning or describing their nature. The term semantic similarity is often confused with semantic relatedness. Semantic relatedness includes any relation between two terms, while semantic similarity only includes "is a" relations. For example, "car" is similar to "bus", but is also related to "road" and "driving".

Automatic image annotation is the process by which a computer system automatically assigns metadata in the form of captioning or keywords to a digital image. This application of computer vision techniques is used in image retrieval systems to organize and locate images of interest from a database.

Probabilistic latent semantic analysis (PLSA), also known as probabilistic latent semantic indexing, is a statistical technique for the analysis of two-mode and co-occurrence data. In effect, one can derive a low-dimensional representation of the observed variables in terms of their affinity to certain hidden variables, just as in latent semantic analysis, from which PLSA evolved.

Susan Dumais is an American computer scientist who is a leader in the field of information retrieval, and has been a significant contributor to Microsoft's search technologies. According to Mary Jane Irwin, who heads the Athena Lecture awards committee, "Her sustained contributions have shaped the thinking and direction of human-computer interaction and information retrieval."

George William Furnas is an American academic, Professor and Associate Dean for Academic Strategy at the School of Information of the University of Michigan, known for his work on semantic analysis and on human-system communication.

In linguistics, statistical semantics applies the methods of statistics to the problem of determining the meaning of words or phrases, ideally through unsupervised learning, to a degree of precision at least sufficient for the purpose of information retrieval.

Distributional semantics is a research area that develops and studies theories and methods for quantifying and categorizing semantic similarities between linguistic items based on their distributional properties in large samples of language data. The basic idea of distributional semantics can be summed up in the so-called distributional hypothesis: linguistic items with similar distributions have similar meanings.

Latent semantic mapping (LSM) is a data-driven framework to model globally meaningful relationships implicit in large volumes of data. It is a generalization of latent semantic analysis (LSA), which, in information retrieval, enables retrieval on the basis of conceptual content instead of merely matching words between queries and documents.

A concept search is an automated information retrieval method that is used to search electronically stored unstructured text for information that is conceptually similar to the information provided in a search query. In other words, the ideas expressed in the information retrieved in response to a concept search query are relevant to the ideas contained in the text of the query.

Vector space model or term vector model is an algebraic model for representing text documents as vectors such that the distance between vectors represents the relevance between the documents. It is used in information filtering, information retrieval, indexing and relevancy rankings. Its first use was in the SMART Information Retrieval System.
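A minimal term-vector model can be sketched in a few lines of Python; the documents and query below are invented for illustration, and the weighting is raw term frequency only (production systems such as SMART also apply idf weighting and length normalization):

```python
import math
from collections import Counter

# Toy document collection; each document becomes a term-count vector.
docs = [
    "the cat sat on the mat",
    "dogs chase cats in the yard",
    "stock prices rose sharply today",
]

def vec(text):
    """Turn whitespace-separated text into a term-frequency vector."""
    return Counter(text.split())

def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Rank documents by similarity to the query vector.
query = vec("cat on mat")
ranked = sorted(docs, key=lambda d: cosine(query, vec(d)), reverse=True)
print(ranked[0])  # prints "the cat sat on the mat"
```

Note that the second document, about "cats", scores zero against the query for "cat": exact-term matching is precisely the vocabulary-mismatch problem that LSA and related latent techniques were designed to address.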

Collaborative tagging, also known as social tagging or folksonomy, allows users to apply public tags to online items, typically to make those items easier for themselves or others to find later. It has been argued that these tagging systems can provide navigational cues or "way-finders" for other users to explore information. The notion is that given that social tags are labels users create to represent topics extracted from online documents, the interpretation of these tags should allow other users to predict the contents of different documents efficiently. Social tags are arguably more important in exploratory search, in which the users may engage in iterative cycles of goal refinement and exploration of new information, and interpretation of information contents by others will provide useful cues for people to discover topics that are relevant.

Multimedia information retrieval is a research discipline of computer science that aims at extracting semantic information from multimedia data sources. Data sources include directly perceivable media such as audio, image and video, indirectly perceivable sources such as text, semantic descriptions, biosignals as well as not perceivable sources such as bioinformation, stock prices, etc. The methodology of MMIR can be organized in three groups:

  1. Methods for the summarization of media content (feature extraction); the result of feature extraction is a description.
  2. Methods for the filtering of media descriptions.
  3. Methods for the categorization of media descriptions into classes.

In natural language processing and information retrieval, explicit semantic analysis (ESA) is a vectoral representation of text that uses a document corpus as a knowledge base. Specifically, in ESA, a word is represented as a column vector in the tf–idf matrix of the text corpus and a document is represented as the centroid of the vectors representing its words. Typically, the text corpus is English Wikipedia, though other corpora including the Open Directory Project have been used.
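The ESA representation described above can be illustrated on a tiny hand-built "knowledge base"; the vocabulary, concept names, and weights below are invented stand-ins for the tf-idf matrix a real ESA system would build from Wikipedia:

```python
import numpy as np

# Toy knowledge base: each column is one concept (a stand-in for a Wikipedia
# article); rows are vocabulary words. Real ESA uses tf-idf weights computed
# from the corpus; these numbers are illustrative.
vocab = {"piano": 0, "violin": 1, "goal": 2, "referee": 3}
concepts = ["Music", "Football"]
weights = np.array([
    [0.9, 0.0],  # piano
    [0.8, 0.0],  # violin
    [0.0, 0.7],  # goal
    [0.0, 0.6],  # referee
])

def esa_vector(text):
    """Represent a document as the centroid of its words' concept vectors."""
    rows = [weights[vocab[w]] for w in text.split() if w in vocab]
    if not rows:
        return np.zeros(len(concepts))
    return np.mean(rows, axis=0)

v = esa_vector("piano violin goal")
print(concepts[int(np.argmax(v))])  # prints "Music": the dominant concept
```

Two documents can then be compared by the cosine of their concept-space vectors, even when they share no surface vocabulary.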

Vocabulary mismatch is a common phenomenon in the usage of natural languages, occurring when different people name the same thing or concept differently.

Semantic folding theory describes a procedure for encoding the semantics of natural language text in a semantically grounded binary representation. This approach provides a framework for modelling how language data is processed by the neocortex.

Semantic spaces in the natural language domain aim to create representations of natural language that are capable of capturing meaning. The original motivation for semantic spaces stems from two core challenges of natural language: Vocabulary mismatch and ambiguity of natural language.

References

  1. Deerwester, Scott; Dumais, Susan T.; Furnas, George W.; Landauer, Thomas K.; Harshman, Richard (September 1990). "Indexing by latent semantic analysis". Journal of the American Society for Information Science. 41 (6): 391–407. doi:10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9. ISSN 0002-8231.
  2. Dumais, S. T.; Furnas, G. W.; Landauer, T. K.; Deerwester, S.; Harshman, R. (1988-05-01). "Using latent semantic analysis to improve access to textual information". Proceedings of the SIGCHI Conference on Human Factors in Computing Systems – CHI '88. New York, NY, USA: Association for Computing Machinery. pp. 281–285. doi:10.1145/57167.57214. ISBN 978-0-201-14237-2.
  3. "Scott Craig DEERWESTER personal appointments – Find and update company information – GOV.UK". find-and-update.company-information.service.gov.uk. Retrieved 2024-12-26.
  4. Deerwester, Scott; Dumais, Susan T.; Furnas, George W.; Landauer, Thomas K.; Harshman, Richard (1990). "Indexing by latent semantic analysis". Journal of the American Society for Information Science. 41 (6): 391–407. doi:10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9. ISSN 1097-4571.
  5. Deerwester, Scott. "Scott Deerwester | LinkedIn". LinkedIn.
  6. "The Retrieval Expert Model of Information Retrieval – ProQuest". www.proquest.com. Retrieved 2024-12-26.
  7. Deerwester, Scott (1984). "The retrieval expert model of information retrieval". Google Scholar. Retrieved 18 October 2024.
  8. Dumais, S. T.; Furnas, G. W.; Landauer, T. K.; Deerwester, S.; Harshman, R. (1988). "Using latent semantic analysis to improve access to textual information". Proceedings of the SIGCHI Conference on Human Factors in Computing Systems – CHI '88. Washington, D.C., United States: ACM Press. pp. 281–285. doi:10.1145/57167.57214. ISBN 978-0-201-14237-2.
  9. Hurtado, Jose L.; Agarwal, Ankur; Zhu, Xingquan (14 April 2016). "Topic discovery and future trend forecasting for texts". Journal of Big Data. 3. doi:10.1186/s40537-016-0039-2.
  10. Foltz, Peter W. (1996-06-01). "Latent semantic analysis for text-based research". Behavior Research Methods, Instruments, & Computers. 28 (2): 197–202. doi:10.3758/BF03204765. ISSN 1532-5970.