Online content analysis

Last updated

Online content analysis or online textual analysis refers to a collection of research techniques used to describe and make inferences about online material through systematic coding and interpretation. Online content analysis is a form of content analysis for analysis of Internet-based communication.

Contents

History and definition

Content analysis as a systematic examination and interpretation of communication dates back to at least the 17th century. However, it was not until the rise of the newspaper in the early 20th century that the mass production of printed material created a demand for quantitative analysis of printed words. [1]

Berelson’s (1952) definition provides an underlying basis for textual analysis as a "research technique for the objective, systematic and quantitative description of the manifest content of communication." [2] Content analysis consists of categorizing units of texts (i.e. sentences, quasi-sentences, paragraphs, documents, web pages, etc.) according to their substantive characteristics in order to construct a dataset that allows the analyst to interpret texts and draw inferences. While content analysis is often quantitative, researchers conceptualize the technique as inherently mixed methods because textual coding requires a high degree of qualitative interpretation. [3] Social scientists have used this technique to investigate research questions concerning mass media, [1] media effects [4] and agenda setting. [5]

With the rise of online communication, content analysis techniques have been adapted and applied to internet research. As with the rise of newspapers, the proliferation of online content provides an expanded opportunity for researchers interested in content analysis. While the use of online sources presents new research problems and opportunities, the basic research procedure of online content analysis outlined by McMillan (2000) is virtually indistinguishable from content analysis using offline sources:

  1. Formulate a research question with a focus on identifying testable hypotheses that may lead to theoretical advancements.
  2. Define a sampling frame that a sample will be drawn from, and construct a sample (often called a ‘corpus’) of content to be analyzed.
  3. Develop and implement a coding scheme that can be used to categorize content in order to answer the question identified in step 1. This necessitates specifying a time period, a context unit in which content is embedded, and a coding unit which categorizes the content.
  4. Train coders to consistently implement the coding scheme and verify reliability among coders. This is a key step in ensuring replicability of the analysis.
  5. Analyze and interpret the data. Test hypotheses advanced in step 1 and draw conclusions about the content represented in the dataset.

Content analysis in internet research

Since the rise of online communication, scholars have discussed how to adapt textual analysis techniques to study web-based content. The nature of online sources necessitates particular care in many of the steps of a content analysis compared to offline sources.

While offline content such as printed text remains static once produced, online content can frequently change. The dynamic nature of online material combined with the large and increasing volume of online content can make it challenging to construct a sampling frame from which to draw a random sample. The content of a site may also differ across users, requiring careful specification of the sampling frame. Some researchers have used search engines to construct sampling frames. This technique has disadvantages because search engine results are unsystematic and non-random making them unreliable for obtaining an unbiased sample. The sampling frame issue can be circumvented by using an entire population of interest, such as tweets by particular Twitter users [6] or online archived content of certain newspapers as the sampling frame. [7] Changes to online material can make categorizing content (step 3) more challenging. Because online content can change frequently it is particularly important to note the time period over which the sample is collected. A useful step is to archive the sample content in order to prevent changes from being made.

Online content is also non-linear. Printed text has clearly delineated boundaries that can be used to identify context units (e.g., a newspaper article). The bounds of online content to be used in a sample are less easily defined. Early online content analysts often specified a ‘Web site’ as a context unit, without a clear definition of what they meant. [2] Researchers recommend clearly and consistently defining what a ‘web page’ consists of, or reducing the size of the context unit to a feature on a website. [2] [3] Researchers have also made use of more discrete units of online communication such as web comments [8] or tweets. [6]

King (2008) used an ontology of terms trained from many thousands of pre-classified documents to analyse the subject matter of a number of search engines. [9]

Automatic content analysis

The rise of online content has dramatically increased the amount of digital text that can be used in research. The quantity of text available has motivated methodological innovations in order to make sense of textual datasets that are too large to be practically hand-coded as had been the conventional methodological practice. [3] [7] Advances in methodology together with the increasing capacity and decreasing expense of computation has allowed researchers to use techniques that were previously unavailable to analyze large sets of textual content.

Automatic content analysis represents a slight departure from McMillan's online content analysis procedure in that human coders are being supplemented by a computational method, and some of these methods do not require categories to be defined in advanced. Quantitative textual analysis models often employ 'bag of words' methods that remove word ordering, delete words that are very common and very uncommon, and simplify words through lemmatisation or stemming that reduces the dimensionality of the text by reducing complex words to their root word. [10] While these methods are fundamentally reductionist in the way they interpret text, they can be very useful if they are correctly applied and validated.

Grimmer and Stewart (2013) identify two main categories of automatic textual analysis: supervised and unsupervised methods. Supervised methods involve creating a coding scheme and manually coding a sub-sample of the documents that the researcher wants to analyze. Ideally, the sub-sample, called a 'training set' is representative of the sample as a whole. The coded training set is then used to 'teach' an algorithm the how the words in the documents correspond to each coding category. The algorithm can be applied to automatically analyze the remained of the documents in the corpus. [10]

Unsupervised methods can be used when a set of categories for coding cannot be well-defined prior to analysis. Unlike supervised methods, human coders are not required to train the algorithm. One key choice for researchers when applying unsupervised methods is selecting the number of categories to sort documents into rather than defining what the categories are in advance.

Validation

Results of supervised methods can be validated by drawing a distinct sub-sample of the corpus, called a 'validation set'. Documents in the validation set can be hand-coded and compared to the automatic coding output to evaluate how well the algorithm replicated human coding. This comparison can take the form of inter-coder reliability scores like those used to validate the consistency of human coders in traditional textual analysis.

Validation of unsupervised methods can be carried out in several ways.

Challenges in online textual analysis

Despite the continuous evolution of text-analysis in the social science, there are still some unsolved methodological concerns. This is a (non-exclusive) list with some of this concerns:

See also

Related Research Articles

<span class="mw-page-title-main">Semantic network</span> Knowledge base that represents semantic relations between concepts in a network

A semantic network, or frame network is a knowledge base that represents semantic relations between concepts in a network. This is often used as a form of knowledge representation. It is a directed or undirected graph consisting of vertices, which represent concepts, and edges, which represent semantic relations between concepts, mapping or connecting semantic fields. A semantic network may be instantiated as, for example, a graph database or a concept map. Typical standardized semantic networks are expressed as semantic triples.

Word-sense disambiguation (WSD) is the process of identifying which sense of a word is meant in a sentence or other segment of context. In human language processing and cognition, it is usually subconscious/automatic but can often come to conscious attention when ambiguity impairs clarity of communication, given the pervasive polysemy in natural language. In computational linguistics, it is an open problem that affects other computer-related writing, such as discourse, improving relevance of search engines, anaphora resolution, coherence, and inference.

<span class="mw-page-title-main">Unsupervised learning</span> Machine learning task

Unsupervised learning is a type of algorithm that learns patterns from untagged data. The goal is that through mimicry, which is an important mode of learning in people, the machine is forced to build a concise representation of its world and then generate imaginative content from it.

Text mining, text data mining (TDM) or text analytics is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources." Written resources may include websites, books, emails, reviews, and articles. High-quality information is typically obtained by devising patterns and trends by means such as statistical pattern learning. According to Hotho et al. (2005) we can distinguish between three different perspectives of text mining: information extraction, data mining, and a knowledge discovery in databases (KDD) process. Text mining usually involves the process of structuring the input text, deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interest. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling.

<span class="mw-page-title-main">Content analysis</span> Research method for studying documents and communication artifacts

Content analysis is the study of documents and communication artifacts, which might be texts of various formats, pictures, audio or video. Social scientists use content analysis to examine patterns in communication in a replicable and systematic manner. One of the key advantages of using content analysis to analyse social phenomena is its non-invasive nature, in contrast to simulating social experiences or collecting survey answers.

Automatic summarization is the process of shortening a set of data computationally, to create a subset that represents the most important or relevant information within the original content. Artificial intelligence algorithms are commonly developed and employed to achieve this, specialized for different types of data.

Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per document is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by cosine similarity between any two columns. Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents.

In corpus linguistics, part-of-speech tagging, also called grammatical tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is mainly in information science and computer science. The problems are overlapping, however, and there is therefore interdisciplinary research on document classification.

Biclustering, block clustering, Co-clustering or two-mode clustering is a data mining technique which allows simultaneous clustering of the rows and columns of a matrix. The term was first introduced by Boris Mirkin to name a technique introduced many years earlier, in 1972, by J. A. Hartigan.

Unstructured data is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well. This results in irregularities and ambiguities that make it difficult to understand using traditional programs as compared to data stored in fielded form in databases or annotated in documents.

Sentiment analysis is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine. With the rise of deep language models, such as RoBERTa, also more difficult data domains can be analyzed, e.g., news texts where authors typically express their opinion/sentiment less explicitly.

Document clustering is the application of cluster analysis to textual documents. It has applications in automatic document organization, topic extraction and fast information retrieval or filtering.

Determining the number of clusters in a data set, a quantity often labelled k as in the k-means algorithm, is a frequent problem in data clustering, and is a distinct issue from the process of actually solving the clustering problem.

Fraud represents a significant problem for governments and businesses and specialized analysis techniques for discovering fraud using them are required. Some of these methods include knowledge discovery in databases (KDD), data mining, machine learning and statistics. They offer applicable and successful solutions in different areas of electronic fraud crimes.

In statistics and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear in documents about cats, and "the" and "is" will appear approximately equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. The "topics" produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document's balance of topics is.

The following outline is provided as an overview of and topical guide to natural-language processing:

In geographic information systems, toponym resolution is the relationship process between a toponym, i.e. the mention of a place, and an unambiguous spatial footprint of the same place.

<span class="mw-page-title-main">Word2vec</span> Models used to produce word embeddings

Word2vec is a technique for natural language processing (NLP) published in 2013. The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. As the name implies, word2vec represents each distinct word with a particular list of numbers called a vector. The vectors are chosen carefully such that they capture the semantic and syntactic qualities of words; as such, a simple mathematical function can indicate the level of semantic similarity between the words represented by those vectors.

<span class="mw-page-title-main">Outline of machine learning</span> Overview of and topical guide to machine learning

The following outline is provided as an overview of and topical guide to machine learning. Machine learning is a subfield of soft computing within computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. In 1959, Arthur Samuel defined machine learning as a "field of study that gives computers the ability to learn without being explicitly programmed". Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. Such algorithms operate by building a model from an example training set of input observations in order to make data-driven predictions or decisions expressed as outputs, rather than following strictly static program instructions.

References

  1. 1 2 Krippendorff, Klaus (2012). Content Analysis: An introduction to its methodology. Thousand Oaks, CA: Sage.
  2. 1 2 3 McMillan, Sally J. (March 2000). "The Microscope and the Moving Target: The Challenge of Applying Content Analysis to the World Wide Web". Journalism and Mass Communication Quarterly. 77 (1): 80–98. doi:10.1177/107769900007700107. S2CID   143760798.
  3. 1 2 3 van Selm, Martine; Jankowski, Nick (2005). Content Analysis of Internet-Based Documents. Unpublished Manuscript.
  4. Riffe, Daniel; Lacy, Stephen; Fico, Frederick (1998). Analyzing Media Messages: Using Quantitative Content Analysis in Research. Mahwah, New Jersey, London: Lawrence Erlbaum.
  5. Baumgartner, Frank; Jones, Bryan (1993). Agendas and Instability in American Politics. Chicago. University of Chicao Press. ISBN   9780226039534.
  6. 1 2 3 Barberá, Pablo; Bonneau, Richard; Egan, Patrick; Jost, John; Nagler, Jonathan; Tucker, Joshua (2014). "Leaders or Followers? Measuring Political Responsiveness in the U.S. Congress Using Social Media Data". Prepared for Delivery at the Annual Meeting of the American Political Science Association.
  7. 1 2 3 DiMaggio, Paul; Nag, Manish; Blei, David (December 2013). "Exploiting affinities between topic modeling and the sociological perspective on culture: Application to newspaper coverage of U.S. government arts funding". Poetics. 41 (6): 570–606. doi:10.1016/j.poetic.2013.08.004.
  8. Mishne, Gilad; Glance, Natalie (2006). "Leave a reply: An analysis of weblog comments". Third Annual Conference on the Weblogging Ecosystem.
  9. King, John D. (2008). Search Engine Content Analysis (PhD). Queensland University of Techbology.
  10. 1 2 3 4 Grimmer, Justin; Stewart, Brandon (2013). "Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts". Political Analysis. 21 (3): 267–297. doi: 10.1093/pan/mps028 .
  11. Collingwood, Loren and John Wilkerson. (2011). Tradeoffs in Accuracy and Efficiency in supervised Learning Methods, in The Journal of Information Technology and Politics, Paper 4.
  12. Gerber, Elisabeth; Lewis, Jeff (2004). "Beyond the median: Voter preferences, district heterogeneity, and political representation" (PDF). Journal of Political Economy. 112 (6): 1364–83. CiteSeerX   10.1.1.320.8707 . doi:10.1086/424737. S2CID   16695697. Archived from the original (PDF) on October 1, 2015.
  13. Slapin, Jonathan, and Sven-Oliver Proksch. 2008. A scaling model for estimating time-series party positions from texts. American Journal of Political Science 52(3):705–22.
  14. King, Gary, Robert O. Keohane, & Sidney Verba. (1994). Designing Social Inquiry: Scientific Inference in Qualitative Research. Princeton: Prince University Press.
  15. Herring, Susan C. (2009). "Web Content Analysis: Expanding the Paradigm". In Hunsinger, Jeremy (ed.). International Handbook of Internet Research. Springer Netherlands. pp. 233–249. CiteSeerX   10.1.1.476.6090 . doi:10.1007/978-1-4020-9789-8_14. ISBN   978-1-4020-9788-1.
  16. Saldana Johnny. (2009). The Coding Manual for Qualitative Research. London: SAGE Publication Ltd.
  17. Chuang, Jason, John D. Wilkerson, Rebecca Weiss, Dustin Tingley, Brandon M. Stewart, Margaret E. Roberts, Forough Poursabzi-Sangdeh, Justin Grimmer, Leah Findlater, Jordan Boyd-Graber, and Jeffrey Heer. (2014). Computer-Assisted Content Analysis: Topic Models for Exploring Multiple Subjective Interpretations. Paper presented at the Conference on Neural Information Processing Systems (NIPS). Workshop on HumanPropelled Machine Learning. Montreal, Canada.