Document-term matrix

A document-term matrix is a mathematical matrix that describes the frequency of terms that occur in each document in a collection. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. This matrix is a specific instance of a document-feature matrix, where "features" may refer to other properties of a document besides terms. [1] It is also common to encounter the transpose, or term-document matrix, where documents are the columns and terms are the rows. Such matrices are useful in the field of natural language processing and computational text analysis. [2]

While the value of the cells is commonly the raw count of a given term, there are various schemes for weighting the raw counts, such as row normalization (i.e., relative frequency/proportions) and tf-idf.

Terms are commonly single words separated by whitespace or punctuation on either side (a.k.a. unigrams). In such a case, this is also referred to as a "bag of words" representation, because the counts of individual words are retained, but not the order of the words in the document.
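For example, the following short Python sketch (purely illustrative; the function name is not from any particular library) reduces a document to a bag of unigram counts, discarding word order:

    import re
    from collections import Counter

    def bag_of_words(text):
        # Lowercase and keep runs of letters/digits, so whitespace and
        # punctuation on either side both act as term delimiters.
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        return Counter(tokens)

    print(bag_of_words("I like databases. Databases, I like!"))
    # Counter({'i': 2, 'like': 2, 'databases': 2})  -- order is lost, counts remain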

General concept

When creating a dataset of terms that appear in a corpus of documents, the document-term matrix contains rows corresponding to the documents and columns corresponding to the terms. Each (i, j) cell, then, is the number of times term j occurs in document i. As such, each row is a vector of term counts that represents the content of the document corresponding to that row. For instance, if one has the following two (short) documents:

D1 = "I like databases"
D2 = "I dislike databases"

then the document-term matrix would be:

        I    like    dislike    databases
D1      1    1       0          1
D2      1    0       1          1

which shows which documents contain which terms and how many times they appear. Note that, unlike representing a document as just a token-count list, the document-term matrix includes all terms in the corpus (i.e. the corpus vocabulary), which is why there are zero-counts for terms in the corpus which do not also occur in a specific document. For this reason, document-term matrices are usually stored in a sparse matrix format.
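As a minimal sketch of this construction in Python (illustrative only; libraries such as scikit-learn or quanteda provide ready-made equivalents), the matrix for the two example documents can be built over the corpus vocabulary and stored sparsely:

    import re
    from scipy.sparse import csr_matrix

    docs = ["I like databases", "I dislike databases"]

    # Tokenize into unigrams and assign one column per vocabulary term.
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    col = {term: j for j, term in enumerate(vocab)}

    # Collect (row, column, count) triples; zero cells are simply never stored.
    rows, cols, data = [], [], []
    for i, doc in enumerate(tokenized):
        for term in set(doc):
            rows.append(i)
            cols.append(col[term])
            data.append(doc.count(term))

    dtm = csr_matrix((data, (rows, cols)), shape=(len(docs), len(vocab)))
    print(vocab)          # ['databases', 'dislike', 'i', 'like']
    print(dtm.toarray())  # [[1 0 1 1]
                          #  [1 1 0 1]]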

As a result of the power-law distribution of tokens in nearly every corpus (see Zipf's law), it is common to weight the counts. This can be as simple as dividing counts by the total number of tokens in a document (called relative frequency or proportions), dividing by the maximum frequency in each document (called prop max), or taking the log of frequencies (called log count). If one desires to weight the words most unique to an individual document as compared to the corpus as a whole, it is common to use tf-idf, which scales the term frequency by the inverse document frequency, typically the logarithm of the number of documents divided by the term's document frequency.
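The following NumPy sketch applies these weightings to a raw count matrix (variable names are illustrative, and the tf-idf shown is only one of several common variants):

    import numpy as np

    # Raw document-term counts: rows are documents, columns are terms.
    counts = np.array([[1, 0, 1, 1],
                       [1, 1, 0, 1]], dtype=float)

    rel_freq = counts / counts.sum(axis=1, keepdims=True)   # relative frequency / proportions
    prop_max = counts / counts.max(axis=1, keepdims=True)   # divide by each document's maximum count
    log_count = np.log1p(counts)                            # log of frequencies (log(1 + n) avoids log 0)

    # A simple tf-idf: term frequency times log(number of documents / document frequency).
    n_docs = counts.shape[0]
    doc_freq = (counts > 0).sum(axis=0)                     # documents containing each term
    tfidf = counts * np.log(n_docs / doc_freq)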

History of the concept

The document-term matrix emerged in the earliest years of the computerization of text. The increasing capacity for storing documents created the problem of retrieving a given document in an efficient manner. While previously the work of classifying and indexing was accomplished by hand, researchers explored the possibility of doing this automatically using word frequency information.

One of the first published document-term matrices appeared in Harold Borko's 1962 article "The construction of an empirically based mathematically derived classification system" (page 282; see also his 1965 article [3]). Borko references two computer programs: "FEAT," which stood for "Frequency of Every Allowable Term" and was written by John C. Olney of the System Development Corporation, and the Descriptor Word Index Program, written by Eileen Stone, also of the System Development Corporation:

Having selected the documents which were to make up the experimental library, the next step consisted of keypunching the entire body of text preparatory to computer processing. The program used for this analysis was FEAT (Frequency of Every Allowable Term). It was written by John C. Olney of the System Development Corporation and is designed to perform frequency and summary counts of individual words and of word pairs. The output of this program is an alphabetical listing, by frequency of occurrence, of all word types which appeared in the text. Certain function words such as and, the, at, a, etc., were placed in a "forbidden word list" table, and the frequency of these words was recorded in a separate listing... A special computer program, called the Descriptor Word Index Program, was written to provide this information and to prepare a document-term matrix in a form suitable for input to the Factor Analysis Program. The Descriptor Word Index program was prepared by Eileen Stone of the System Development Corporation. [4]

Shortly thereafter, Gerard Salton published "Some hierarchical models for automatic document retrieval" in 1963 which also included a visual depiction of a document-term matrix. [5] Salton was at Harvard University at the time and his work was supported by the Air Force Cambridge Research Laboratories and Sylvania Electric Products, Inc. In this paper, Salton introduces the document-term matrix by comparison to a kind of term-context matrix used to measure similarities between words:

If it is desired to generate document associations or document clusters instead of word associations, the same procedures can be used with slight modifications. Instead of starting with a word-sentence matrix C,... it is now convenient to construct a word-document matrix F, listing frequency of occurrence of word Wi in Document Dj... Document similarities can now be computed as before by comparing pairs of rows and by obtaining similarity coefficients based on the frequency of co-occurrences of the content words included in the given document. This procedure produces a document-document similarity matrix which can in turn be used for the generation of document clusters... [5]

In addition to Borko and Salton, F.W. Lancaster published a comprehensive review of automated indexing and retrieval in 1964. Although the work was published while he was at Herner and Company in Washington, D.C., the paper was written while he was "employed in research work at Aslib, on the Aslib Cranfield Project." [6] Lancaster credits Borko with the document-term matrix:

Harold Borko, of the System Development Corporation, has carried this operation a little further. A significant group of clue words is chosen from the vocabulary of an experimental collection. These are arranged in a document/term matrix to show the frequency of occurrence of each term in each document.... A correlation coefficient for each word pair is then computed, based on their co-occurrence in the document set. The resulting term/term matrix... is then factor analysed and a series of factors are isolated. These factors, when interpreted and named on the basis of the terms with high loadings which appear in each of the factors, become the classes of an empirical classification. The terms with high loadings in each factor are the clue words or predictors of the categories.

Choice of terms

One point of view on the matrix is that each row represents a document. In the vectorial semantic model, which is normally the one used to compute a document-term matrix, the goal is to represent the topic of a document by the frequency of semantically significant terms. The terms are semantic units of the documents. It is often assumed, for Indo-European languages, that nouns, verbs and adjectives are the more significant categories, and that words from those categories should be kept as terms. Adding collocations as terms improves the quality of the vectors, especially when computing similarities between documents.
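For instance, adjacent word pairs (bigrams, the simplest collocation candidates) can be counted alongside unigrams and added as extra columns of the matrix; the Python sketch below is purely illustrative and applies no part-of-speech filtering:

    import re
    from collections import Counter

    def terms_with_bigrams(text):
        # Unigrams plus adjacent word pairs as candidate collocation terms.
        tokens = re.findall(r"[a-z]+", text.lower())
        bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
        return Counter(tokens) + Counter(bigrams)

    print(terms_with_bigrams("relational databases store relational data"))
    # counts include 'relational': 2 as well as 'relational databases': 1, 'store relational': 1, ...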

Applications

Improving search results

Latent semantic analysis (LSA, performing singular-value decomposition on the document-term matrix) can improve search results by disambiguating polysemous words and searching for synonyms of the query. However, searching in the high-dimensional continuous space is much slower than searching the standard trie data structure of search engines.
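A minimal NumPy sketch of the idea (the tiny matrix, the rank k, and the fold-in of the query are all illustrative; production systems use far larger, sparse matrices):

    import numpy as np

    # Toy document-term matrix: rows are documents, columns are terms.
    X = np.array([[2, 1, 0, 0],
                  [1, 2, 0, 0],
                  [0, 0, 1, 2],
                  [0, 0, 2, 1]], dtype=float)

    # Truncated SVD: keep k latent dimensions.
    k = 2
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    doc_vecs = U[:, :k] * s[:k]          # documents projected into the latent space

    def fold_in(query_counts):
        # Project a query's term-count vector into the same latent space.
        return query_counts @ Vt[:k].T

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    q = fold_in(np.array([1.0, 0.0, 0.0, 0.0]))        # query containing only the first term
    print([round(cosine(q, d), 2) for d in doc_vecs])  # documents 0 and 1 score highest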

Finding topics

Multivariate analysis of the document-term matrix can reveal topics/themes of the corpus. Specifically, latent semantic analysis and data clustering can be used; more recently, probabilistic latent semantic analysis, its generalization latent Dirichlet allocation, and non-negative matrix factorization have been found to perform well for this task.
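A hedged sketch with scikit-learn's NMF (assuming scikit-learn is installed; the toy matrix, term list and choice of two topics are illustrative) factors the document-term matrix into document-topic and topic-term parts:

    import numpy as np
    from sklearn.decomposition import NMF

    # Toy document-term matrix: two "database" documents and two "cooking" documents.
    X = np.array([[3, 2, 0, 0],
                  [2, 3, 1, 0],
                  [0, 0, 3, 2],
                  [0, 1, 2, 3]], dtype=float)
    terms = ["database", "query", "recipe", "oven"]

    model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
    W = model.fit_transform(X)   # documents x topics: topic mixture of each document
    H = model.components_        # topics x terms: terms that characterize each topic

    for topic_idx, row in enumerate(H):
        top_terms = [terms[i] for i in row.argsort()[::-1][:2]]
        print(f"topic {topic_idx}: {top_terms}")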

References

  1. "Document-feature matrix :: Tutorials for quanteda". tutorials.quanteda.io. Retrieved 2021-01-02.
  2. "15 Ways to Create a Document-Term Matrix in R". Dustin S. Stoltz. Retrieved 2021-01-02.
  3. Borko, Harold (1965). "A Factor Analytically Derived Classification System for Psychological Reports". Perceptual and Motor Skills. 20 (2): 393–406. doi:10.2466/pms.1965.20.2.393. ISSN   0031-5125. PMID   14279310. S2CID   34230652.
  4. Borko, Harold (1962). "The construction of an empirically based mathematically derived classification system". Proceedings of the May 1-3, 1962, spring joint computer conference on - AIEE-IRE '62 (Spring). AIEE-IRE '62 (Spring). New York, New York, USA: ACM Press. pp. 279–289. doi: 10.1145/1460833.1460865 . ISBN   9781450378758. S2CID   6483337.
  5. 1 2 Salton, Gerard (July 1963). "Some hierarchical models for automatic document retrieval". American Documentation. 14 (3): 213–222. doi:10.1002/asi.5090140307. ISSN   0096-946X.
  6. LANCASTER, F.W. (1964-01-01). "MECHANIZED DOCUMENT CONTROL: A Review of Some Recent Research". ASLIB Proceedings. 16 (4): 132–152. doi:10.1108/eb049960. ISSN   0001-253X.