Catpac

Catpac is a computer program that analyzes text samples to identify the key concepts they contain. It was conceived chiefly by Richard Holmes, a Michigan State computer programmer, and Dr. Joseph Woelfel, a University at Albany and University at Buffalo sociologist, for the analysis of attitude formation and change in a sociological context. Rob Zimmelman, an undergraduate and graduate student at the University at Albany from 1981 to 1984, working on the Univac 1100 mainframe, incorporated the CATPAC software into the Galileo*Telegal system and contributed text labeling and the porting of CATPAC output to the Galileo system of paired-comparison conceptual visualization. Contributions by other students at the university shaped the software further; CATPAC and the Galileo system remain in commercial use today and continue to grow with recent additions for data capture and visualization. Catpac takes text files as input and produces output such as word frequencies, listed by rank and alphabetically, as well as various types of cluster analysis. [1]
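The frequency output described above can be sketched in a few lines. The following is a minimal illustration, not the actual Catpac code; the tokenizer, the `word_frequencies` helper name, and the `top_n` cutoff are assumptions made for the example.

```python
from collections import Counter
import re

def word_frequencies(text, top_n=5):
    """Count word occurrences in a text sample and report them two ways,
    mimicking the rank-ordered and alphabetical listings the article
    describes (hypothetical helper, not the actual program)."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    by_freq = counts.most_common(top_n)   # most frequent words first
    by_alpha = sorted(counts.items())     # same counts, alphabetical order
    return by_freq, by_alpha

by_freq, by_alpha = word_frequencies("the cat sat on the mat the cat slept")
print(by_freq[:2])   # [('the', 3), ('cat', 2)]
```

A real run would read the text from a file rather than a string literal, matching Catpac's text-file input.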

Design

Catpac is a self-organizing (i.e. unsupervised) interactive activation and competition (IAC) artificial neural network used for text analysis. [2] [3] The program produces a multidimensional scaling output that organizes the words in a text by building a weighted word-by-word matrix and establishing the eigenvector centralities of concepts. [4] Each entry in the word-by-word matrix represents how often one word occurs together with another. [5] Catpac identifies important words and patterns from the organization of the text itself. [2] This process mimics the connections between neurons in a human brain: connections are strengthened through conditioning, generating a pattern of similarities among all words in a body of text. [2]
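The word-by-word matrix and eigenvector-centrality step can be illustrated with a small sketch. This is a hypothetical reconstruction, not Catpac's actual algorithm: the sliding co-occurrence window, the symmetric counting, and the plain power iteration are all assumptions made for illustration.

```python
from collections import defaultdict

def cooccurrence_matrix(words, window=2):
    """Build a word-by-word co-occurrence matrix: entry (i, j) counts how
    often word i appears within `window` positions of word j.
    (Sketch only; Catpac's window size and weighting may differ.)"""
    vocab = sorted(set(words))
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    m = [[0.0] * n for _ in range(n)]
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                m[index[w]][index[words[j]]] += 1.0
    return vocab, m

def eigenvector_centrality(m, iters=100):
    """Power iteration: repeatedly multiply by the matrix and normalize,
    converging to the dominant eigenvector, whose entries score each
    word's centrality in the co-occurrence network."""
    n = len(m)
    v = [1.0 / n] * n
    for _ in range(iters):
        nxt = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in nxt) ** 0.5 or 1.0
        v = [x / norm for x in nxt]
    return v

words = "the cat chased the mouse and the cat slept".split()
vocab, m = cooccurrence_matrix(words)
scores = dict(zip(vocab, eigenvector_centrality(m)))
# Frequently recurring words such as 'the' receive the highest centrality
```

In this toy example the most connected word dominates the eigenvector, which is the sense in which centrality identifies "important" concepts in the text.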

Use

Catpac has been used in commercial studies, in academic scholarship to investigate massive textual data sets, [6] [7] as a semantic network analysis tool, [4] [5] [8] for longitudinal analyses, [4] [8] [9] [10] [11] for multilingual analyses, [12] [13] as a predictor of media usage [14] and as a content analysis tool. [15] [16]

Availability

Catpac, conceived more than 30 years ago as an improvement on simple word-count software, is currently available in a 32-bit Windows version. [2]

References

  1. "Quantitative Text Analysis Programs". Archived from the original on 2012-07-01. Retrieved 2010-11-26.
  2. Woelfel, Joseph. "Catpac II User's Guide" (PDF) (Version 2.0 ed.). The Galileo Company.
  3. http://www.galileoco.com/literature/Wolfpak10a.pdf
  4. Egnoto, M.; Nam, Y.; Vishwanath, A. (November 2010). A longitudinal analysis of the newspaper coverage of cell phones. National Communication Association Conference. San Francisco, CA.
  5. Doerfel, M. L.; Barnett, G. A. (1999). "A semantic network analysis of the International Communication Association". Human Communication Research. 25 (4): 589–603. CiteSeerX 10.1.1.531.2227. doi:10.1111/j.1468-2958.1999.tb00463.x.
  6. Chen, H.; Evans, C.; Battleson, B.; Zubrow, E.; Woelfel, J. (10 October 2011). "Procedures for the precise analysis of massive textual datasets". Communication & Science Journal.
  7. Doerfel, M. L.; Barnett, G. A. (1996). "The use of CATPAC for textual analysis". Field Methods. 8 (2): 4–7. doi:10.1177/1525822x960080020501.
  8. Ortega, C.R.; Egnoto, M.J. (2011). Longitudinal analysis of press coverage of violent video games: Assessing agenda-setting via semantic and LIWC analyses. NYSCA conference.
  9. Kim, J.H.; Su, T-Y.; Hong, J. (2007). "The influence of geopolitics and foreign policy on the U.S. and Canadian media: An analysis of newspaper coverage of Sudan's Darfur conflict". The Harvard International Journal of Press/Politics. 12 (3): 87–95. doi:10.1177/1081180x07302972.
  10. Murphy, P.; Maynard, M. (2000). "Framing the genetic testing issue: Discourse and cultural clashes among policy communities". Science Communication. 22 (2): 133–153. doi:10.1177/1075547000022002002.
  11. Rosen, D.; Woelfel, J.; Krikorian, D.; Barnett, G.A. (2003). "Procedures for analyses of online communities". Journal of Computer-Mediated Communication. 8 (4).
  12. Evans, C.; Chen, H.; Battleson, B.; Wölfel, J.K.; Woelfel, J. (2008). Neural networks for pattern recognition in multilingual text. International Network for Social Network Analysis (INSNA) Sunbelt conference. St. Pete Beach, FL.
  13. Evans, C.; Chen, H.; Battleson, B.; Wölfel, J.K.; Woelfel, J. (2010). Unsupervised artificial neural networks for pattern recognition in multilingual text. Amherst, NY: RAH Press.
  14. Cheong, P.; Hwang, J.; Elbirt, B.; Chen, H.; Evans, C.; Woelfel, J. (2010). "Media use as a function of identity: The role of the self concept in media usage". In Hinner, M. (ed.). Freiberger beiträge zur interkulturellen und wirtschaftskommunikation [A forum for general and intercultural business communication]. The interrelationship of business and communication. Vol. 6. Berlin: Peter Lang. pp. 365–381.
  15. Krippendorff, K. (2004). Content analysis: An introduction to its methodology (2nd ed.). Thousand Oaks, CA: SAGE Publications.
  16. Neuendorf, K. "Quantitative text analysis programs". The Content Analysis Guidebook Online. Archived from the original on 1 July 2012. Retrieved 26 November 2010.