Writer invariant

A writer invariant, also called an authorial invariant or author's invariant, is a property of a text that is characteristic of its author: it is similar across all texts by a given author and differs between texts by different authors. It can be used to detect plagiarism or to identify the real author of an anonymously published text. In handwritten text recognition, the term also refers to an author's characteristic pattern of writing a letter. [1]

While it is generally recognised that writer invariants exist, [2] there is no agreement on which properties of a text should be used. [2] [3] Among the first to be used was the distribution of word lengths; [2] other proposed invariants include average sentence length, [2] [3] average word length, [2] [3] frequency of noun, verb or adjective usage, [3] vocabulary richness, [2] and the frequency of function words in general [2] [3] or of specific function words. [3]
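As a rough illustration of how several of these candidate invariants might be computed, the following Python sketch measures them for a single text. The function name `candidate_invariants`, the small `FUNCTION_WORDS` list, the regular-expression tokenisation, and the use of a type-token ratio for vocabulary richness are all assumptions made for this example; the cited studies may define these quantities differently.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny list of English function words (an assumption,
# not the list used in any cited study).
FUNCTION_WORDS = {"the", "a", "an", "and", "but", "or", "of", "in", "on", "to", "with", "that", "it"}


def candidate_invariants(text):
    """Return a dict of simple stylistic measures for one text."""
    # Naive sentence split on runs of terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lower-cased alphabetic tokens, allowing one internal apostrophe ("don't").
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not words:
        raise ValueError("text must contain at least one word")

    total = len(words)
    length_counts = Counter(len(w) for w in words)
    return {
        # Share of words having each length.
        "word_length_distribution": {n: c / total for n, c in sorted(length_counts.items())},
        # Mean number of words per sentence.
        "avg_sentence_length": total / len(sentences),
        # Mean number of characters per word.
        "avg_word_length": sum(len(w) for w in words) / total,
        # Type-token ratio, used here as a stand-in for vocabulary richness.
        "vocabulary_richness": len(set(words)) / total,
        # Share of all words that are in the function-word list.
        "function_word_frequency": sum(w in FUNCTION_WORDS for w in words) / total,
    }


if __name__ == "__main__":
    for name, value in candidate_invariants("The cat sat. The dog ran on the mat.").items():
        print(name, value)
```

Measures such as these are only meaningful on long samples; the one-line input above is there purely to show the call.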

Of these, average sentence length can be very similar in works by different authors [2] [3] or vary significantly even within a single work; [3] average word length likewise turns out to be very similar in works by different authors. [3] Analysis of function words shows promise because authors use them unconsciously. [2] [3] [4]
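One way a function-word analysis could be set up is sketched below in Python: each text is reduced to a vector of relative frequencies of selected function words, and an unattributed text is assigned to the known author whose profile is nearest. The word list, the Manhattan distance, and the names `function_word_profile`, `profile_distance` and `closest_author` are assumptions chosen for illustration, not the procedure of any cited study.

```python
import re
from collections import Counter

# Hypothetical selection of function words (an assumption for this sketch).
FUNCTION_WORDS = ("the", "and", "of", "to", "in", "a", "that", "it", "but", "with")


def function_word_profile(text):
    """Relative frequency of each selected function word in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def profile_distance(p, q):
    """Manhattan distance between two profiles (smaller means more similar)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def closest_author(unknown_text, known_texts):
    """Return the key of known_texts whose profile is nearest to unknown_text."""
    target = function_word_profile(unknown_text)
    return min(
        known_texts,
        key=lambda name: profile_distance(target, function_word_profile(known_texts[name])),
    )


if __name__ == "__main__":
    samples = {
        "A": "the cat and the dog and the bird",
        "B": "it is that it was but it went",
    }
    print(closest_author("the fox and the hen", samples))
```

A nearest-profile rule like this always names some author, so in practice it would need long samples and a check that the nearest distance is small enough to be meaningful.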

References

  1. Nosary, Ali; Heutte, Laurent; Paquet, Thierry; Lecourtier, Yves. "A Step Towards the Use of Writer's Properties for Text Recognition" (PDF). Laboratoire Perception, Systèmes, Information (PSI), Université de Rouen. Retrieved 2007-09-06.
  2. Peng, Roger D.; Hengartner, Nicolas W. (August 2002). "Quantitative analysis of literary styles. (General)" (PDF). The American Statistician. 56 (3): 175–185. CiteSeerX 10.1.1.365.202. doi:10.1198/000313002100. S2CID 16333538. Retrieved 2009-05-27.
  3. Fomenko, A. T.; Fomenko, V. P.; Fomenko, T. G. (2005). "The authorial invariant in Russian literary texts. Its application: who was the real author of the 'Quiet Don'?". History: Fiction or Science?. Bellevue, WA: Delamere. pp. 425–444. ISBN 978-2-913621-06-0.
  4. Buckland, Warren. "Forensic Semiotics". The Semiotic Review of Books. 10 (3). Retrieved 2007-09-06.