Stylometry is the application of the study of linguistic style, usually to written language. [1] It has also been applied successfully to music, [2] paintings, [3] and chess. [4]
Stylometry is often used to attribute authorship to anonymous or disputed documents. [5] It has legal as well as academic and literary applications, ranging from the question of the authorship of Shakespeare's works to forensic linguistics and has methodological similarities with the analysis of text readability.
Stylometry may be used to unmask pseudonymous or anonymous authors, or to reveal some information about the author short of a full identification. Authors may use adversarial stylometry to resist this identification by eliminating their own stylistic characteristics without changing the meaningful content of their communications. Adversarial stylometry can defeat analyses that do not account for its possibility, but the ultimate effectiveness of stylometry in an adversarial setting is uncertain: stylometric identification may not be reliable, yet neither can non-identification be guaranteed, and the practice of adversarial stylometry may itself be detectable.
Stylometry grew out of earlier techniques of analyzing texts for evidence of authenticity, author identity, and other questions.
The modern practice of the discipline received publicity from the study of authorship problems in English Renaissance drama. Researchers and readers observed that some playwrights of the era had distinctive patterns of language preferences, and attempted to use those patterns to identify authors of uncertain or collaborative works. Early efforts were not always successful: in 1901, one researcher attempted to use John Fletcher's preference for "'em", the contracted form of "them", as a marker to distinguish between Fletcher and Philip Massinger in their collaborations, but he mistakenly employed an edition of Massinger's works in which the editor had expanded all instances of "'em" to "them". [6]
The basics of stylometry were established by Polish philosopher Wincenty Lutosławski in Principes de stylométrie (1890). Lutosławski used this method to develop a chronology of Plato's Dialogues. [7]
The development of computers and their capacities for analyzing large quantities of data enhanced this type of effort by orders of magnitude. The great capacity of computers for data analysis, however, did not guarantee good quality output. During the early 1960s, Rev. A. Q. Morton produced a computer analysis of the fourteen Epistles of the New Testament attributed to St. Paul, which indicated that six different authors had written that body of work. A check of his method, applied to the works of James Joyce, gave the result that Ulysses, Joyce's multi-perspective, multi-style novel, was composed by five separate individuals, none of whom apparently had any part in the crafting of Joyce's first novel, A Portrait of the Artist as a Young Man. [8]
In time, however, and with practice, researchers and scholars have refined their methods, to yield better results. One notable early success was the resolution of disputed authorship of twelve of The Federalist Papers by Frederick Mosteller and David Wallace. [9] While there are still questions concerning initial assumptions and methods (and, perhaps, always will be), few now dispute the basic premise that linguistic analysis of written texts can produce valuable information and insight. (Indeed, this was apparent even before the advent of computers: the successful application of a textual/linguistic analysis to the Fletcher canon by Cyrus Hoy and others yielded clear results during the late 1950s and early 1960s.)
Applications of stylometry include literary studies, historical studies, social studies, information retrieval, and many forensic cases and studies. [10] [11] Recently, long-standing debates about anonymous medieval Icelandic sagas have been advanced through its utilisation. [12] [13] [14] It can also be applied to computer code [15] and to intrinsic plagiarism detection, that is, detecting plagiarism based on changes of writing style within a document. [16] Stylometry can also be used to predict whether someone is a native or non-native English speaker from their typing speed. [17]
Stylometry as a method is vulnerable to the distortion of text during revision. [18] There is also the problem of an author adopting different styles over the course of a career, as demonstrated in the case of Plato, whose stylistic choices differ between, for example, the early and middle dialogues, a point that bears on the Socratic problem. [19]
Textual features of interest for authorship attribution include, on the one hand, counts of idiosyncratic expressions or constructions (e.g. how the author uses punctuation or how often the author uses agentless passive constructions) and, on the other hand, features similar to those used for readability analysis, such as measures of lexical and syntactic variation. [20] Since authors often have preferences for certain topics, research experiments in authorship attribution mostly remove content words such as nouns, adjectives, and verbs from the feature set, retaining only structural elements of the text to avoid overfitting their models to topic rather than author characteristics. [21] [22] Stylistic features are often computed as averages over a text or over the entire collected works of an author, yielding measures such as average word length or average sentence length. This enables a model to identify authors who have a clear preference for wordy or terse sentences, but it hides variation: an author with a mix of long and short sentences will have the same average as an author with consistently mid-length sentences. To capture such variation, some experiments use sequences or patterns of observations rather than average observed frequencies, noting e.g. that an author shows a preference for a certain stress or emphasis pattern, [23] [24] or that an author tends to follow a sequence of long sentences with a short one. [25] [26]
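A minimal sketch of how such surface-level features might be computed is given below; the particular function-word list, feature names, and helper function are illustrative assumptions rather than details taken from the cited studies.

```python
import re
from collections import Counter

# Illustrative list of function words; real studies use much larger lists.
FUNCTION_WORDS = {"the", "of", "and", "to", "in", "that", "it", "with", "as", "but"}

def stylistic_features(text):
    """Compute a few simple stylometric features for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1
    return {
        "avg_word_length": sum(len(w) for w in words) / total,
        "avg_sentence_length": total / (len(sentences) or 1),
        # Relative frequency of each function word, per 1,000 words.
        **{f"freq_{w}": 1000 * counts[w] / total for w in FUNCTION_WORDS},
        # Lexical variation: ratio of distinct word forms to total words.
        "type_token_ratio": len(counts) / total,
    }

print(stylistic_features("Call me Ishmael. Some years ago, never mind how long "
                         "precisely, I thought I would sail about a little."))
```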
One of the first approaches to authorship identification, by Mendenhall, can be said to aggregate its observations without averaging them. [27]
More recent authorship attribution models use vector space models to automatically capture what is specific to an author's style, but they also rely on judicious feature engineering for the same reasons as more traditional models. [28] [29]
Adversarial stylometry is the practice of altering writing style to reduce the potential for stylometry to discover the author's identity or their characteristics. [30] This task is also known as authorship obfuscation or authorship anonymisation. Stylometry poses a significant privacy challenge in its ability to unmask anonymous authors or to link pseudonyms to an author's other identities, [31] which, for example, creates difficulties for whistleblowers, [32] activists, [33] and hoaxers and fraudsters. [34] The privacy risk is expected to grow as machine learning techniques and text corpora develop. [35]
All adversarial stylometry shares the core idea of faithfully paraphrasing the source text so that the meaning is unchanged but the stylistic signals are obscured. [36] [37] Such a faithful paraphrase is an adversarial example for a stylometric classifier. [38] Several broad approaches to this exist, with some overlap: imitation, substituting the author's own style for another's; translation, applying machine translation with the hope that this eliminates characteristic style in the source text; and obfuscation, deliberately modifying a text's style to make it not resemble the author's own. [36]
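As an illustration of the translation approach only, the following sketch round-trips a text through pivot languages in the hope of washing out stylistic markers; the translate function is a hypothetical stand-in, since no particular machine translation system or API is specified in the sources.

```python
# Sketch of round-trip machine translation as an obfuscation step.
# `translate` is a hypothetical callable, e.g. wrapping any available MT system.
def obfuscate_by_round_trip(text, translate, pivots=("de", "fr")):
    result = text
    for pivot in pivots:
        result = translate(result, target=pivot)  # e.g. English -> pivot language
        result = translate(result, target="en")   # back to English
    # The author would still need to verify that the meaning is preserved.
    return result
```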
Manually obscuring style is possible, but laborious; [39] in some circumstances, it is preferable or necessary. [40] Automated tooling, either semi- or fully-automatic, could assist an author. [39] How best to perform the task and the design of such tools is an open research question. [41] [35] While some approaches have been shown to be able to defeat particular stylometric analyses, [42] particularly those that do not account for the potential of adversariality, [43] establishing safety in the face of unknown analyses is an issue. [44] Ensuring the faithfulness of the paraphrase is a critical challenge for automated tools. [35]
It is uncertain if the practice of adversarial stylometry is detectable in itself. Some studies have found that particular methods produced signals in the output text, but a stylometrist who is uncertain of what methods may have been used may not be able to reliably detect them. [35]
Modern stylometry uses computers for statistical analysis and artificial intelligence, together with access to the growing corpus of texts available via the Internet. [45] Software systems such as Signature [46] (freeware produced by Peter Millican of Oxford University), JGAAP [47] (the Java Graphical Authorship Attribution Program—freeware produced by Dr Patrick Juola of Duquesne University), stylo [48] [49] (an open-source R package for a variety of stylometric analyses, including authorship attribution, developed by Maciej Eder, Jan Rybicki and Mike Kestemont) and Stylene [50] for Dutch (online freeware by Prof Walter Daelemans of University of Antwerp and Dr Véronique Hoste of University of Ghent) make its use increasingly practicable, even for the non-expert.
Stylometric methods are used for several academic topics, as an application of linguistics, lexicography, or literary study, [1] in conjunction with natural language processing and machine learning, and applied to plagiarism detection, authorship analysis, or information retrieval. [45]
The International Association of Forensic Linguists (IAFL) organises the Biennial Conference of the International Association of Forensic Linguists (13th edition in 2016 in Porto) and publishes The International Journal of Speech, Language and the Law with forensic stylistics as one of its central topics.
The Association for the Advancement of Artificial Intelligence (AAAI) has hosted several events on subjective and stylistic analysis of text. [51] [52] [53]
PAN workshops (originally "plagiarism analysis, authorship identification, and near-duplicate detection", later more generally a workshop on uncovering plagiarism, authorship, and social software misuse) have been organised since 2007, mainly in conjunction with information access conferences such as ACM SIGIR, FIRE, and CLEF. PAN formulates shared challenge tasks for plagiarism detection, [54] authorship identification, [55] author gender identification, [56] author profiling, [57] vandalism detection, [58] and other related text analysis tasks, many of which hinge on stylometry.
Since stylometry has both descriptive use cases, characterising the content of a collection, and identificatory use cases, e.g. identifying authors or categories of texts, the methods used to analyse the data and features above range from those built to classify items into discrete sets to those that distribute items in a space of feature variation. Most methods are statistical in nature, such as cluster analysis and discriminant analysis, are typically based on philological data and features, and form a fruitful application domain for modern machine learning methods.
Whereas in the past, stylometry emphasized the rarest or most striking elements of a text, contemporary techniques can isolate identifying patterns even in common parts of speech. Most systems are based on lexical statistics, i.e. using the frequencies of words and terms in the text to characterise the text (or its author). In this context, unlike for information retrieval, the observed occurrence patterns of the most common words are more interesting than the topical terms which are less frequent. [91] [92]
The primary stylometric method is the writer invariant: a property held in common by all texts, or at least all texts long enough to admit of analysis yielding statistically significant results, written by a given author. An example of a writer invariant is the frequency of function words used by the writer.
In one such method, the text is analyzed to find the 50 most common words. The text is then divided into 5,000 word chunks and each of the chunks is analyzed to find the frequency of those 50 words in that chunk. This generates a unique 50-number identifier for each chunk. These numbers place each chunk of text into a point in a 50-dimensional space. This 50-dimensional space is flattened into a plane using principal components analysis (PCA). This results in a display of points that correspond to an author's style. If two literary works are placed on the same plane, the resulting pattern may show if both works were by the same author or different authors.
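A rough sketch of this procedure follows; the chunk size and number of top words come from the description above, while the implementation details and helper names are illustrative.

```python
import re
from collections import Counter
import numpy as np

def chunk_profiles(text, chunk_size=5000, top_n=50):
    """Split a text into fixed-size word chunks and return, for each chunk,
    the relative frequencies of the text's top_n most common words."""
    words = re.findall(r"[a-z']+", text.lower())
    top = [w for w, _ in Counter(words).most_common(top_n)]
    chunks = [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]
    rows = []
    for chunk in chunks:
        c = Counter(chunk)
        rows.append([c[w] / len(chunk) for w in top])
    return np.array(rows)

def pca_2d(matrix):
    """Flatten the 50-dimensional chunk profiles onto their first two
    principal components via a singular value decomposition."""
    centered = matrix - matrix.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Chunks from two works projected into the same plane can then be plotted;
# whether the points cluster together or apart may suggest shared or
# distinct authorship.
```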
Stylometric data are distributed according to the Zipf–Mandelbrot law. The distribution is extremely spiky and leptokurtic, which is why researchers long could not use standard statistics to solve, for example, authorship attribution problems. Nevertheless, the use of Gaussian statistics is perfectly possible after applying a suitable data transformation. [93]
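As a hedged illustration of such a transformation (the cited work may use a different one), a logarithmic transform compresses spiky, heavy-tailed frequency counts into a shape more amenable to Gaussian-based methods:

```python
import numpy as np

# Hypothetical word-frequency counts: a few very common words and a long tail
# of rare ones, as in a Zipf-Mandelbrot-like distribution.
counts = np.array([1200, 800, 450, 60, 22, 9, 5, 3, 2, 1, 1, 1])

# A log transform is one common choice (not necessarily the one used in the
# cited study) for reducing the skew before applying Gaussian statistics.
log_counts = np.log1p(counts)
print(counts.std() / counts.mean(), log_counts.std() / log_counts.mean())
```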
Neural networks, a special case of statistical machine learning methods, have been used to analyze the authorship of texts. Texts of undisputed authorship are used to train a neural network by processes such as backpropagation, in which the training error is calculated and used to update the network to increase its accuracy. Through a process akin to non-linear regression, the network learns to generalize to new texts it has not yet been exposed to, classifying them with a stated degree of confidence. Such techniques were applied to the long-standing claims of collaboration of Shakespeare with his contemporaries John Fletcher and Christopher Marlowe, [94] [95] and confirmed the opinion, based on more conventional scholarship, that such collaboration had indeed occurred.
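A minimal sketch of this kind of setup uses a small off-the-shelf feed-forward network trained by backpropagation; the feature vectors and author labels below are invented placeholders, not data from the cited studies.

```python
from sklearn.neural_network import MLPClassifier

# Each row is a stylometric feature vector (e.g. function-word frequencies per
# 1,000 words) for a text sample of known authorship; labels name the author.
X_train = [[4.1, 2.3, 0.9], [3.8, 2.6, 1.1], [1.2, 5.0, 2.2], [1.0, 4.7, 2.5]]
y_train = ["Author A", "Author A", "Author B", "Author B"]

# A small feed-forward network trained by backpropagation.
clf = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
clf.fit(X_train, y_train)

# predict_proba gives the stated degree of confidence for a disputed sample.
print(clf.predict_proba([[3.9, 2.4, 1.0]]))
```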
A 1999 study showed that a neural network program reached 70% accuracy in determining the authorship of poems it had not yet analyzed. This study from Vrije Universiteit examined identification of poems by three Dutch authors using only letter sequences such as "den". [96]
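Letter-sequence features of the kind reported in that study can be extracted in a few lines; the choice of trigrams and the cut-off of twenty sequences here are illustrative assumptions, not details of the original experiment.

```python
from collections import Counter

def letter_ngram_freqs(text, n=3):
    """Relative frequencies of letter sequences (e.g. the trigram 'den')."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalpha() or ch == " ")
    grams = [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values()) or 1
    # Return the twenty most common sequences as relative frequencies.
    return {g: c / total for g, c in counts.most_common(20)}
```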
One study used deep belief networks (DBN) for an authorship verification model applicable to continuous authentication (CA). [97]
One problem with this method of analysis is that the network can become biased by its training set, possibly favoring authors the network has analyzed more often. [96]
The genetic algorithm is another machine learning technique used for stylometry. The method starts with a set of rules; an example rule might be, "If but appears more than 1.7 times in every thousand words, then the text is by author X". The program is presented with text and uses the rules to determine authorship. The rules are tested against a set of known texts and each rule is given a fitness score. The 50 rules with the lowest scores are discarded. The remaining 50 rules are given small changes and 50 new rules are introduced. This is repeated until the evolved rules attribute the texts correctly.
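An illustrative sketch of such a rule-evolving loop is given below; the candidate words, thresholds, author labels, and mutation step are invented for the example rather than drawn from a specific study.

```python
import random

# A rule is (word, threshold, author): "if `word` occurs more than `threshold`
# times per 1,000 words, attribute the text to `author`".
def make_rule():
    return (random.choice(["but", "the", "upon", "whilst"]),
            round(random.uniform(0.1, 5.0), 2),
            random.choice(["X", "Y"]))

def rule_matches(rule, freqs):
    word, threshold, author = rule
    return author if freqs.get(word, 0.0) > threshold else None

def fitness(rule, known_texts):
    """Score a rule by how many known texts it attributes correctly."""
    return sum(rule_matches(rule, freqs) == author
               for freqs, author in known_texts)

def evolve(known_texts, generations=100, population=100):
    # known_texts: list of (per-1,000-word frequency dict, author) pairs, e.g.
    # [({"but": 2.1, "upon": 0.4}, "X"), ({"but": 0.9, "whilst": 1.3}, "Y")]
    rules = [make_rule() for _ in range(population)]
    for _ in range(generations):
        rules.sort(key=lambda r: fitness(r, known_texts), reverse=True)
        best = rules[:population // 2]                 # drop the weakest half
        mutated = [(w, round(t + random.uniform(-0.2, 0.2), 2), a)
                   for w, t, a in best]                # small changes to survivors
        fresh = [make_rule() for _ in range(population // 2)]  # new rules
        rules = mutated + fresh
    return rules[:10]
```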
One method for identifying style is termed "rare pairs" and relies upon individual habits of collocation. The use of certain words may, for a particular author, be associated idiosyncratically with the use of other, predictable words.[ citation needed ]
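Since the description of the method is brief, the following is only an illustrative sketch of how candidate collocations might be counted, after which pairs that are frequent for one author but rare in general use could be flagged; the window size and helper name are assumptions.

```python
import re
from collections import Counter

def collocation_counts(text, window=5):
    """Count how often word pairs co-occur within a small window of words."""
    words = re.findall(r"[a-z']+", text.lower())
    pairs = Counter()
    for i in range(len(words)):
        for j in range(i + 1, min(i + window, len(words))):
            pairs[tuple(sorted((words[i], words[j])))] += 1
    return pairs
```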
The diffusion of the Internet has shifted attention in authorship attribution towards online texts (web pages, blogs, etc.), electronic messages (e-mails, tweets, posts, etc.), and other types of written information that are far shorter than an average book, much less formal, and more diverse in terms of expressive elements such as colors, layout, fonts, graphics, emoticons, etc. Efforts to take such aspects into account at the level of both structure and syntax have been reported. [98] In addition, content-specific and idiosyncratic cues (e.g., topic models and grammar checking tools) were introduced to unveil deliberate stylistic choices. [99]
Standard stylometric features have been employed to categorize the content of a chat by instant messaging, [100] or the behavior of the participants, [101] but attempts at identifying chat participants are still few and preliminary. Furthermore, the similarity between spoken conversations and chat interactions has been neglected, despite being a major difference between chat data and other types of written information.
Corpus linguistics is an empirical method for the study of language by way of a text corpus. Corpora are balanced, often stratified collections of authentic, "real world" texts of speech or writing that aim to represent a given linguistic variety. Today, corpora are generally machine-readable data collections.
Cognitive linguistics is an interdisciplinary branch of linguistics, combining knowledge and research from cognitive science, cognitive psychology, neuropsychology and linguistics. Models and theoretical accounts of cognitive linguistics are considered to be psychologically real, and research in cognitive linguistics aims to help understand cognition in general and is seen as a road into the human mind.
Text mining, text data mining (TDM) or text analytics is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources." Written resources may include websites, books, emails, reviews, and articles. High-quality information is typically obtained by devising patterns and trends by means such as statistical pattern learning. According to Hotho et al. (2005) we can distinguish between three different perspectives of text mining: information extraction, data mining, and a knowledge discovery in databases (KDD) process. Text mining usually involves the process of structuring the input text, deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interest. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling.
Forensic linguistics, legal linguistics, or language and the law is the application of linguistic knowledge, methods, and insights to the forensic context of law, language, crime investigation, trial, and judicial procedure. It is a branch of applied linguistics.
A paraphrase or rephrase is the rendering of the same text in different words without losing the meaning of the text itself. More often than not, a paraphrased text can convey its meaning better than the original words. In other words, it is a copy of the text in meaning, but which is different from the original. For example, when someone tells a story they heard in their own words, they paraphrase, with the meaning being the same. The term itself is derived via Latin paraphrasis, from Ancient Greek παράφρασις (paráphrasis) 'additional manner of expression'. The act of paraphrasing is also called paraphrasis.
Sentiment analysis is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine. With the rise of deep language models, such as RoBERTa, also more difficult data domains can be analyzed, e.g., news texts where authors typically express their opinion/sentiment less explicitly.
Plagiarism detection or content similarity detection is the process of locating instances of plagiarism or copyright infringement within a work or document. The widespread use of computers and the advent of the Internet have made it easier to plagiarize the work of others.
Writeprint is a method in forensic linguistics of establishing author identification over the internet, likened to a digital fingerprint. Identity is established through a comparison of distinguishing stylometric characteristics of an unknown written text with known samples of the suspected author. Even without a suspect, writeprint provides potential background characteristics of the author, such as nationality and education.
Linguistics is the scientific study of language. Linguistics is based on a theoretical as well as a descriptive study of language and is also interlinked with the applied fields of language studies and language learning, which entails the study of specific languages. Before the 20th century, linguistics evolved in conjunction with literary study and did not employ scientific methods. Modern-day linguistics is considered a science because it entails a comprehensive, systematic, objective, and precise analysis of all aspects of language – i.e., the cognitive, the social, the cultural, the psychological, the environmental, the biological, the literary, the grammatical, the paleographical, and the structural.
Carole Elisabeth Chaski is a forensic linguist who is considered one of the leading experts in the field. Her research has led to improvements in the methodology and reliability of stylometric analysis and inspired further research on the use of this approach for authorship identification. Her contributions have served as expert testimony in several federal and state court cases in the United States and Canada. She is president of ALIAS Technology and executive director of the Institute for Linguistic Evidence, a non-profit research organization devoted to linguistic evidence.
Shakespeare attribution studies is the scholarly attempt to determine the authorial boundaries of the William Shakespeare canon, the extent of his possible collaborative works, and the identity of his collaborators. The studies, which began in the late 17th century, are based on the axiom that every writer has a unique, measurable style that can be discriminated from that of other writers using techniques of textual criticism originally developed for biblical and classical studies. The studies include the assessment of different types of evidence, generally classified as internal, external, and stylistic, of which all are further categorised as traditional and non-traditional.
In natural language processing, textual entailment (TE), also known as natural language inference (NLI), is a directional relation between text fragments. The relation holds whenever the truth of one text fragment follows from another text.
In natural language processing (NLP), a word embedding is a representation of a word. The embedding is used in text analysis. Typically, the representation is a real-valued vector that encodes the meaning of the word in such a way that the words that are closer in the vector space are expected to be similar in meaning. Word embeddings can be obtained using language modeling and feature learning techniques, where words or phrases from the vocabulary are mapped to vectors of real numbers.
Native-language identification (NLI) is the task of determining an author's native language (L1) based only on their writings in a second language (L2). NLI works through identifying language-usage patterns that are common to specific L1 groups and then applying this knowledge to predict the native language of previously unseen texts. This is motivated in part by applications in second-language acquisition, language teaching and forensic linguistics, amongst others.
Jussi Karlgren is a Swedish computational linguist, research scientist at Spotify, and co-founder of text analytics company Gavagai AB. He holds a PhD in computational linguistics from Stockholm University, and the title of docent of language technology at Helsinki University.
Author profiling is the analysis of a given set of texts in an attempt to uncover various characteristics of the author based on stylistic- and content-based features, or to identify the author. Characteristics analysed commonly include age and gender, though more recent studies have looked at other characteristics like personality traits and occupation.
Paraphrase or paraphrasing in computational linguistics is the natural language processing task of detecting and generating paraphrases. Applications of paraphrasing are varied including information retrieval, question answering, text summarization, and plagiarism detection. Paraphrasing is also useful in the evaluation of machine translation, as well as semantic parsing and generation of new samples to expand existing corpora.
Code stylometry is the application of stylometry to computer code to attribute authorship to anonymous binary or source code. It often involves breaking down and examining the distinctive patterns and characteristics of the programming code and then comparing them to computer code whose authorship is known. Unlike software forensics, code stylometry attributes authorship for purposes other than intellectual property infringement, including plagiarism detection, copyright investigation, and authorship verification.
Shlomo Argamon is an American/Israeli computer scientist and forensic linguist. He is the associate provost for artificial intelligence and professor of computer science at Touro University.
See also the academic journal Literary and Linguistic Computing, now Digital Scholarship in the Humanities (published by the University of Oxford), and the Language Resources and Evaluation journal (previously Computers and the Humanities).