Scott Deerwester

Scott Craig Deerwester is a highly cited computer scientist and patent holder who co-developed latent semantic analysis (LSA), [1] [2] a significant method in the field of natural language processing. His expertise encompasses information and data science, software systems architecture, and data modeling, reflecting his commitment to applying technology to complex societal challenges.

History

Deerwester began his academic career in the United States, contributing to LSA's development during his tenure at Colgate University and the University of Chicago.

Early life

Deerwester was born in Rossville, Indiana, United States.

Deerwester is the son of Kenneth F. Deerwester (July 8, 1927 – March 3, 2013), who played an important role in his upbringing and in shaping his life as a researcher. Kenneth was a US Army veteran and a graduate of Ripon College, where he met his wife, Donna Stone. [3]

Scientific career

Deerwester started his scientific career by publishing his first research paper, "The Retrieval Expert Model of Information Retrieval," while at Purdue University in 1984. [4]

Data Science and AI Contributions and Impact

Deerwester's pioneering work on latent semantic analysis (LSA) laid the foundation for latent semantic indexing (LSI), which has become essential to recommendation and search engines. Its ability to identify related concepts and themes significantly enhances the relevance and accuracy of search results, enabling users to find information more quickly.

Importance and how LSI works

LSI is a powerful tool utilized across various industries, especially in content marketing and search engine optimization (SEO). By leveraging LSI, companies can identify relevant keywords, improve search engine rankings, and ultimately enhance the user experience, making it a valuable asset for any business aiming for online success.

Natural language processing and AI

Natural language processing (NLP) is the sub-field of artificial intelligence that represents and analyzes human language automatically. NLP has been employed in many applications, such as information retrieval, text processing, and automated answer ranking. Semantic analysis focuses on understanding the meaning of text. Among the proposed approaches, latent semantic analysis (LSA) is a widely used corpus-based method that evaluates the similarity of texts based on the semantic relations among words. LSA has been applied successfully in diverse language systems for calculating the semantic similarity of texts. [5]

Publications and Research Work

Deerwester co-authored a groundbreaking research paper on latent semantic analysis (LSA) in 1988. The paper transformed how information retrieval systems process textual information by finding latent associations between the terms in documents, allowing documents to be matched even when they share no common words. This novel method addressed polysemy (a single word carrying multiple meanings) and synonymy (different words with similar meanings), two significant problems in language processing. His work laid a foundation for contemporary search engines and remains essential to widely used natural language processing (NLP) systems. [6]

According to Deerwester's seminal 1988 work, latent semantic analysis converts textual data into a term-document matrix and calculates associations between words based on the contexts in which they occur. Through singular value decomposition, the algorithm reduces this matrix to a lower-dimensional space, mapping terms and documents onto a shared conceptual space and revealing hidden patterns in the data. LSA enabled search engines to retrieve relevant documents even when they did not contain the exact keywords, resulting in a more user-friendly and contextual retrieval mechanism. [7]
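A minimal sketch in Python illustrates this pipeline; the toy corpus, the whitespace tokenization, and the rank k = 2 are illustrative choices, not details taken from the paper itself.

```python
# A minimal LSA sketch with NumPy: build a term-document count matrix,
# reduce it with a truncated SVD, and compare documents in the latent space.
import numpy as np

docs = [
    "human machine interface for computer applications",
    "a survey of user opinion of computer system response time",
    "relation of user perceived response time to error measurement",
    "the generation of random binary unordered trees",
    "the intersection graph of paths in trees",
]

# Term-document matrix: one row per term, one column per document.
vocab = sorted({w for d in docs for w in d.split()})
A = np.array([[d.split().count(t) for d in docs] for t in vocab], dtype=float)

# Truncated SVD keeps only the k strongest latent dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T   # one k-dimensional vector per document

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Documents 0 and 1 share the word "computer"; documents 0 and 3 share none.
print(cosine(doc_vecs[0], doc_vecs[1]))
print(cosine(doc_vecs[0], doc_vecs[3]))
```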

Deerwester and his colleagues' work on LSA is widely regarded as a fundamental precursor of contemporary machine learning algorithms and information retrieval models. His work, widely praised and heavily cited, influenced the advancement of subsequent technologies, such as latent Dirichlet allocation (LDA) and probabilistic models, and their uses in topic modeling and semantic similarity of texts. [8]

Deerwester's contributions continue to resonate within the fields of natural language processing (NLP) and machine learning (ML). Beyond keyword-based searches, his discoveries gave machines a more meaningful approach to understanding human language by creating a model that mimics human cognitive processes in language classification. LSA remains a crucial tool for AI applications, ranging from chatbots to automatic translation services, as it has demonstrated the ability to emulate human abilities like word sorting and category assessment.

In interviews in the late 1990s, Deerwester discussed how his work on "latent meanings" in data found increasing applications in academic settings and in corporations trying to extract value from massive unstructured data sets. He believed that the real strength of analytics lay in "finding meaning where none appears to exist"; the increasing application of LSA in market research, business analytics, and other areas reflects this outlook. [9]

Though Deerwester's name may not be well known outside academic circles, his work, which contributed to the search technologies and text analytics tools that define today's information age, remains extremely important. As search engines like Google evolved, the principles laid out by Deerwester and his colleagues continued to guide the algorithms used to improve the relevance, precision, and accuracy of search results.

Moreover, the concept of uncovering "hidden relationships" in vast datasets, a central theme of Deerwester's work, extends beyond search engines. It has found applications in data mining, recommender systems, and business intelligence tools. His work has been referenced and built upon in various academic and technical publications, ensuring that his influence will endure as the field of artificial intelligence evolves. [10]

Deerwester's pioneering efforts in Latent Semantic Analysis have earned him a prominent place in information retrieval and machine learning history. His research provided a bridge between mathematical modeling and linguistics, allowing machines to extract and interpret hidden meanings in text data. This capability is now indispensable across industries.

Patents

Deerwester has three internationally recognized patents. The first (US4839853A) is titled Computer Information Retrieval Using Latent Semantic Structure. The second (US5778362A) is titled Method and System for Revealing Information Structures in Collections of Data Items. The third (WO1997049045A1) covers an Apparatus and Method for Generating Optimal Search Queries.

The First Patent US4839853A

The patent situates the computer in the context of information retrieval: how users interact with and locate text stored in files. With the increase in computer storage and processing power, data that was previously hard to come by can now be accessed from a computer with relative ease. However, locating specific pieces of information within these crowded data collections remains a difficult task, because the methods employed to retrieve information are based primarily on keywords, with their attendant limitations.

However practical any search or query system may seem, it has shortcomings. Chief among them are synonymy (the use of different words to describe the same concept) and polysemy (a single word having numerous, far-flung meanings). When these occur, searches yield errors or irrelevant data. [6]

To address this, the invention describes a method for forming a "semantic space" useful in information retrieval; computer information retrieval using latent semantic structure is based on a statistical approach. The process extracts hidden relationships that detail why specific words or groups of words mean what they do, attaining greater latitude in text retrieval. As a result, relevant information can be found even when none of the words being sought appear in the retrieved documents. [6]
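In practice, such retrieval is often implemented by projecting a query into a latent space of this kind. Continuing the NumPy sketch above; the query text and the least-squares projection are illustrative assumptions, not the patent's exact procedure.

```python
# Fold a hypothetical query into the latent space built in the earlier sketch
# (reuses vocab, U, k, docs, doc_vecs, and cosine from that example).
def fold_in(query_text, vocab, U, k):
    q = np.array([query_text.split().count(t) for t in vocab], dtype=float)
    return U[:, :k].T @ q        # least-squares projection into the k-dim space

q_vec = fold_in("user response time", vocab, U, k)
scores = [cosine(q_vec, dv) for dv in doc_vecs]
print(sorted(range(len(docs)), key=lambda i: -scores[i]))  # best matches first
```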

The Second Patent US5778362A

The system for revealing information structures in collections of data items is also cited in later patents in the data science field. This invention provides a method for analyzing a data collection by treating the data as a two-dimensional map. To retrieve meaningful information, a query is made, and its elements are compared with the map to create a result vector. This result is then refined using another profile vector, which helps measure how closely the query matches the data. In short, it is a system for uncovering relationships and patterns in large data sets. [2]

The invention shows how to identify hidden structures within data sets, cross-correlate between different data sets, and find similarities between items. It also calculates distance and similarity measures between data points. The flexible system allows experts to modify the method while staying true to its core purpose of effectively analyzing complex data sets. [2]
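A loose, hypothetical reading of the flow described above can be sketched in NumPy; the map, query, and profile values below are invented purely for illustration and are not the patent's actual data.

```python
# Hypothetical sketch: a data-item by feature "map", a query that produces a
# raw result vector, and a profile vector that refines the scores.
import numpy as np

data_map = np.array([[1.0, 0.0, 2.0],   # one row per data item,
                     [0.0, 1.0, 1.0],   # one column per feature
                     [2.0, 1.0, 0.0]])
query   = np.array([1.0, 0.0, 1.0])     # feature weights of the query
profile = np.array([0.5, 1.0, 1.5])     # per-item refinement weights

result = data_map @ query               # raw match score per data item
refined = result * profile              # profile-adjusted scores
print(refined.argsort()[::-1])          # data items ranked best-first
```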

The Third Patent WO1997049045A1

The third patent was originally filed in French and describes an apparatus and method for generating optimal search queries. A computer generates a data structure that illustrates the connections between words found within a collection of documents. Utilizing this data structure, which encompasses a similarity matrix, the computer formulates a search query to locate documents related to the subject matter of a document containing relevant information. [10]
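One plausible way to realize this idea is sketched below; the toy corpus, the cosine normalization, and the OR-query format are assumptions for illustration, not details specified by the patent.

```python
# Sketch: build a word-by-word similarity matrix from co-occurrence counts,
# then expand a seed term into a search query from its nearest neighbours.
import numpy as np

docs = ["solar panel energy", "solar energy storage", "wind energy turbine"]
vocab = sorted({w for d in docs for w in d.split()})
X = np.array([[d.split().count(t) for d in docs] for t in vocab], dtype=float)

cooc = X @ X.T                                   # word co-occurrence counts
norms = np.linalg.norm(X, axis=1, keepdims=True)
sim = cooc / (norms @ norms.T)                   # cosine similarity matrix

seed = vocab.index("solar")
neighbours = np.argsort(sim[seed])[::-1][:3]     # seed plus its two neighbours
print(" OR ".join(vocab[i] for i in neighbours)) # e.g. "solar OR energy OR panel"
```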

Related Research Articles

Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an information need. The information need can be specified in the form of a search query. In the case of document retrieval, queries can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images or sounds.

Music information retrieval (MIR) is the interdisciplinary science of retrieving information from music. Those involved in MIR may have a background in academic musicology, psychoacoustics, psychology, signal processing, informatics, machine learning, optical music recognition, computational intelligence, or some combination of these.

An image retrieval system is a computer system used for browsing, searching and retrieving images from a large database of digital images. Most traditional and common methods of image retrieval utilize some method of adding metadata such as captioning, keywords, title or descriptions to the images so that retrieval can be performed over the annotation words. Manual image annotation is time-consuming, laborious and expensive; to address this, there has been a large amount of research done on automatic image annotation. Additionally, the increase in social web applications and the semantic web have inspired the development of several web-based image annotation tools.

Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per document is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by cosine similarity between any two columns. Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents.

Document retrieval is defined as the matching of some stated user query against a set of free-text records. These records could be any type of mainly unstructured text, such as newspaper articles, real estate records or paragraphs in a manual. User queries can range from multi-sentence full descriptions of an information need to a few words.

Content-based image retrieval, also known as query by image content and content-based visual information retrieval (CBVIR), is the application of computer vision techniques to the image retrieval problem, that is, the problem of searching for digital images in large databases. Content-based image retrieval is opposed to traditional concept-based approaches.

A document-term matrix is a mathematical matrix that describes the frequency of terms that occur in each document in a collection. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. This matrix is a specific instance of a document-feature matrix where "features" may refer to other properties of a document besides terms. It is also common to encounter the transpose, or term-document matrix where documents are the columns and terms are the rows. They are useful in the field of natural language processing and computational text analysis.
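For illustration, a minimal sketch using scikit-learn's CountVectorizer (assuming that library is available) builds such a matrix with documents as rows and terms as columns:

```python
# Build a small document-term matrix with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat ran", "the dog ran"]
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)             # sparse document-term matrix

print(vectorizer.get_feature_names_out())        # the column (term) labels
print(dtm.toarray())                             # counts, one row per document
```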

Automatic image annotation is the process by which a computer system automatically assigns metadata in the form of captioning or keywords to a digital image. This application of computer vision techniques is used in image retrieval systems to organize and locate images of interest from a database.

In linguistics, statistical semantics applies the methods of statistics to the problem of determining the meaning of words or phrases, ideally through unsupervised learning, to a degree of precision at least sufficient for the purpose of information retrieval.

Distributional semantics is a research area that develops and studies theories and methods for quantifying and categorizing semantic similarities between linguistic items based on their distributional properties in large samples of language data. The basic idea of distributional semantics can be summed up in the so-called distributional hypothesis: linguistic items with similar distributions have similar meanings.

Search engine indexing is the collecting, parsing, and storing of data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, and computer science. An alternate name for the process, in the context of search engines designed to find web pages on the Internet, is web indexing.
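A core data structure behind this process is the inverted index; a minimal Python sketch (with a toy corpus) maps each term to the documents that contain it:

```python
# Minimal inverted index: term -> set of document ids containing that term.
from collections import defaultdict

docs = {0: "the cat sat", 1: "the cat ran", 2: "the dog ran"}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# Lookup: documents matching both query terms.
print(index["cat"] & index["ran"])   # {1}
```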

In computer vision, the problem of object categorization from image search is the problem of training a classifier to recognize categories of objects, using only the images retrieved automatically with an Internet search engine. Ideally, automatic image collection would allow classifiers to be trained with nothing but the category names as input. This problem is closely related to that of content-based image retrieval (CBIR), where the goal is to return better image search results rather than training a classifier for image recognition.

A concept search is an automated information retrieval method that is used to search electronically stored unstructured text for information that is conceptually similar to the information provided in a search query. In other words, the ideas expressed in the information retrieved in response to a concept search query are relevant to the ideas contained in the text of the query.

Ranking of query results is one of the fundamental problems in information retrieval (IR), the scientific/engineering discipline behind search engines. Given a query q and a collection D of documents that match the query, the problem is to rank, that is, sort, the documents in D according to some criterion so that the "best" results appear early in the result list displayed to the user. Ranking is an important concept in computer science and is used in many different applications, such as search engine queries and recommender systems. A majority of search engines use ranking algorithms to provide users with accurate and relevant results.

Vector space model or term vector model is an algebraic model for representing text documents as vectors such that the distance between vectors represents the relevance between the documents. It is used in information filtering, information retrieval, indexing and relevancy rankings. Its first use was in the SMART Information Retrieval System.
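A small sketch with scikit-learn (an assumed tooling choice, not part of the original SMART system) represents documents as tf-idf vectors and scores relevance by cosine similarity:

```python
# Vector space model sketch: tf-idf document vectors plus cosine relevance.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["information retrieval systems", "retrieval of stored information",
        "graph theory and trees"]
tfidf = TfidfVectorizer()
doc_vectors = tfidf.fit_transform(docs)

query_vector = tfidf.transform(["information retrieval"])
print(cosine_similarity(query_vector, doc_vectors))  # relevance per document
```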

In natural language processing and information retrieval, explicit semantic analysis (ESA) is a vectoral representation of text that uses a document corpus as a knowledge base. Specifically, in ESA, a word is represented as a column vector in the tf–idf matrix of the text corpus and a document is represented as the centroid of the vectors representing its words. Typically, the text corpus is English Wikipedia, though other corpora including the Open Directory Project have been used.
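A stripped-down sketch of the idea follows, with a three-document toy corpus standing in for Wikipedia, which ESA normally uses:

```python
# ESA sketch: a word is a column of the tf-idf matrix over a "concept" corpus;
# a new document is the centroid of its words' column vectors.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

concepts = ["solar energy and panels", "graph theory and trees",
            "information retrieval systems"]
tfidf = TfidfVectorizer()
M = tfidf.fit_transform(concepts).toarray()      # rows: concepts, cols: words
vocab = tfidf.vocabulary_

def word_vector(w):
    return M[:, vocab[w]]                        # word = column over concepts

doc_words = ["solar", "panels"]
doc_vector = np.mean([word_vector(w) for w in doc_words], axis=0)
print(doc_vector)                                # strongest weight on concept 0
```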

Gensim is an open-source library for unsupervised topic modeling, document indexing, retrieval by similarity, and other natural language processing functionalities, using modern statistical machine learning.
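A minimal Gensim LSI example, following the library's standard Dictionary, bag-of-words, LsiModel pipeline (the corpus and topic count are illustrative):

```python
# Index a toy corpus with LSI and score a query against each document.
from gensim import corpora, models, similarities

texts = [["human", "interface", "computer"],
         ["survey", "user", "computer", "system", "response"],
         ["graph", "trees", "paths"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
index = similarities.MatrixSimilarity(lsi[corpus])

query = dictionary.doc2bow("human computer interaction".split())
print(index[lsi[query]])   # similarity of the query to each document
```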

An outline of natural-language processing provides an overview of and topical guide to the field.

Semantic folding theory describes a procedure for encoding the semantics of natural language text in a semantically grounded binary representation. This approach provides a framework for modelling how language data is processed by the neocortex.

Semantic spaces in the natural language domain aim to create representations of natural language that are capable of capturing meaning. The original motivation for semantic spaces stems from two core challenges of natural language: Vocabulary mismatch and ambiguity of natural language.

References

  1. Deerwester, Scott; Dumais, Susan T.; Furnas, George W.; Landauer, Thomas K.; Harshman, Richard (September 1990). "Indexing by latent semantic analysis". Journal of the American Society for Information Science. 41 (6): 391–407. doi:10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9. ISSN 0002-8231.
  2. Dumais, S. T.; Furnas, G. W.; Landauer, T. K.; Deerwester, S.; Harshman, R. (1988). "Using latent semantic analysis to improve access to textual information". Proceedings of the SIGCHI Conference on Human Factors in Computing Systems – CHI '88. New York, NY, USA: Association for Computing Machinery. pp. 281–285. doi:10.1145/57167.57214. ISBN 978-0-201-14237-2.
  3. "Kenneth F. Deerwester Obituary". www.gundersonfh.com. Retrieved 18 October 2024.
  4. Deerwester, Scott (1984). "The retrieval expert model of information retrieval". Google Scholar. Retrieved 18 October 2024.
  5. Suleman, Raja Muhammad; Korkontzelos, Ioannis (March 2021). "Extending latent semantic analysis to manage its syntactic blindness". Expert Systems with Applications. 165: 114130. doi:10.1016/j.eswa.2020.114130. ISSN 0957-4174.
  6. Hurtado, Jose L.; Agarwal, Ankur; Zhu, Xingquan (14 April 2016). "Topic discovery and future trend forecasting for texts". Journal of Big Data. 3. doi:10.1186/s40537-016-0039-2.
  7. Deerwester, Scott (September 1990). "Indexing by latent semantic analysis". Journal of the American Society for Information Science. 41 (6): 391–407. doi:10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9. Retrieved 11 October 2024.
  8. Deerwester, Scott; Dumais, S. (May 1988). Using latent semantic analysis to improve access to textual information. ACM Digital Library. pp. 281–285. doi:10.1145/57167.57214. Retrieved 11 October 2024.
  9. Hu, Xiangen (January 2007). "Strengths, Limitations, and Extensions of LSA". ResearchGate. Retrieved 11 October 2024.
  10. Furnas, George W.; Deerwester, Scott C. (August 2017). "Information Retrieval using a Singular Value Decomposition Model of Latent Semantic Structure". ACM SIGIR Forum. 51 (2): 90–105. doi:10.1145/3130348.3130358. Retrieved 11 October 2024.