Author profiling

Thomas Corwin Mendenhall, American physicist, 1841–1924

Author profiling is the analysis of a given set of texts in an attempt to uncover various characteristics of the author based on stylistic and content-based features, or to identify the author. Characteristics commonly analysed include age and gender, though more recent studies have looked at other characteristics such as personality traits and occupation. [1]

Author profiling is one of the three major fields in automatic authorship identification (AAI), the other two being authorship attribution and authorship identification. AAI emerged at the end of the 19th century. Thomas Corwin Mendenhall, an American autodidact physicist and meteorologist, was the first to apply this process to the works of Francis Bacon, William Shakespeare, and Christopher Marlowe. Mendenhall sought to uncover quantitative stylistic differences among these three authors by comparing the lengths of the words they used. [2]
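The idea of comparing authors by word length can be illustrated with a short sketch. This is only a minimal illustration, not a reconstruction of Mendenhall's procedure; the function name `word_length_profile` and the sample sentence are assumptions made for the example.

```python
import re
from collections import Counter

def word_length_profile(text: str) -> dict:
    """Return the relative frequency of each word length found in `text`."""
    # Treat runs of ASCII letters as words (a simplification for English text).
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return {}
    counts = Counter(len(w) for w in words)
    total = len(words)
    # Sort by length so two profiles can be compared side by side.
    return {length: counts[length] / total for length in sorted(counts)}

sample = "To be or not to be that is the question"
profile = word_length_profile(sample)
for length, share in profile.items():
    print(f"{length}-letter words: {share:.0%}")
```

Profiles computed for two large bodies of text could then be compared length by length; a single sentence, as here, is far too short to say anything about an author.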

Although much progress has been made in the 21st century, author profiling remains an unsolved problem.

Techniques

Various author profiling techniques can be applied to a text to predict information about its author. For example, function words and part-of-speech analysis can be used to infer the author's gender and to assess the truthfulness of a text. [3]
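As a rough illustration of function-word features, the sketch below turns a text into a vector of relative function-word frequencies. The word list is a tiny hypothetical sample chosen for this example (published studies use much longer lists), and `function_word_features` is an assumed name, not part of any cited system.

```python
import re
from collections import Counter

# Hypothetical, deliberately short list of English function words.
FUNCTION_WORDS = ("the", "a", "of", "and", "to", "in", "i", "you", "she", "he", "with", "for")

def function_word_features(text: str) -> list:
    """Relative frequency of each function word, in FUNCTION_WORDS order."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in FUNCTION_WORDS]

features = function_word_features("She went to the market and I stayed in the house")
print(dict(zip(FUNCTION_WORDS, features)))
```

A vector like this would normally be one input among several to a classifier, rather than a predictor on its own.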

The process of author profiling usually involves the following steps: [4]

  1. Identifying specific features to be extracted from the text
  2. Building a standard representation (e.g. a bag-of-words model) for the target profile
  3. Building a classification model using a standard classifier (e.g. Support Vector Machines) for the target profile
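The three steps above can be sketched end to end in plain Python. This is a toy illustration under stated assumptions: the training texts and age-group labels are invented, and a simple perceptron stands in for the Support Vector Machine named in step 3 so that the example needs no external library. All function names are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    # Step 1: the extracted features are simply lower-cased word tokens.
    return re.findall(r"[a-z']+", text.lower())

def build_vocabulary(texts):
    vocab = sorted({tok for t in texts for tok in tokenize(t)})
    return {tok: i for i, tok in enumerate(vocab)}

def bag_of_words(text, vocab):
    # Step 2: represent a text as a vector of word counts.
    vec = [0] * len(vocab)
    for tok, n in Counter(tokenize(text)).items():
        if tok in vocab:
            vec[vocab[tok]] = n
    return vec

def train_perceptron(vectors, labels, epochs=20):
    # Step 3: a linear classifier (perceptron stand-in for an SVM).
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):  # y is -1 or +1
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * score <= 0:  # misclassified: move the boundary
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
    return weights, bias

def predict(text, vocab, weights, bias):
    x = bag_of_words(text, vocab)
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else -1

# Invented toy data: -1 = younger age group, +1 = older age group.
train_texts = [
    "homework exam school friends lol",
    "school party lol exam tomorrow",
    "mortgage meeting office taxes",
    "office taxes pension meeting today",
]
train_labels = [-1, -1, 1, 1]

vocab = build_vocabulary(train_texts)
vectors = [bag_of_words(t, vocab) for t in train_texts]
weights, bias = train_perceptron(vectors, train_labels)
print(predict("exam at school lol", vocab, weights, bias))
```

With real data the same structure applies, but the feature set, representation and classifier would each be chosen and evaluated on held-out texts.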

Machine learning algorithms for author profiling have become increasingly complex over time.

In the past, author profiling was limited to physical documents, often books and newspaper articles. Different combinations of the authors' textual attributes, including lexical and syntactic features, were identified and analyzed. [4] Pioneering research in author profiling focused mostly on a single genre until the shift towards author profiling on social media and the Internet. [9] While attributes such as content words and POS tags are effective in predicting author profiles from physical documents, their effectiveness on digital texts varies with the type of online content being analyzed. [4]

With the advances in technology, author profiling on the Internet has become increasingly common. Digital texts, such as social media posts, blog posts and emails, are now being used. [4] This has sparked greater research efforts because of the advantages analysing digital texts can bring to sectors like marketing and business. [8] Author profiling on digital texts has also enabled predictions of a wider range of author characteristics such as personality, [8] income and occupation. [10]

The most effective attributes for author profiling on digital texts involve a combination of stylistic and content features. [4] Author profiling on digital texts focuses on cross-genre profiling, whereby one genre is used for training data and another for testing data, though the two genres need to be relatively similar for good results. [9]

Applying author profiling techniques to online texts poses some problems. [4]

Author profiling and the Internet

The rise of the Internet in the late 20th and early 21st centuries catalysed an increase in author profiling research, since data could be mined from the web, including social media platforms, emails and blogs. Content from the web has been analysed in author profiling tasks to identify the age, gender, geographic origins, nationality and psychometric traits of web users. The information obtained has served various applications, including marketing and forensics.

Social media

The increasing integration of social media into people's daily lives has made it a rich source of textual data for author profiling, mainly because users frequently upload and share content for purposes including self-expression, socialisation, and personal business. Social bots are also a frequent feature of social media platforms, especially Twitter, and generate content that may be analysed for author profiling. [11] While different platforms contain similar data, they may also contain different features depending on the format and structure of the particular platform.

Social media still has limitations as a data source for author profiling, because the data obtained may not always be reliable or accurate. Users sometimes provide false information about themselves or withhold information. [12] As a result, the training of author profiling algorithms may be impeded by inaccurate data. Another limitation is the irregularity of text on social media. Such text often deviates from standard linguistic norms through spelling errors, unstandardised transliteration (such as the substitution of letters with numbers), shorthand, and user-created abbreviations for phrases, all of which may pose a challenge to author profiling. [13] Researchers have adopted methods to overcome these limitations in training their algorithms. [13]
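One simple way to reduce this kind of irregularity before feature extraction is rule-based normalisation, sketched below. This is an assumption-laden illustration and not the method of the cited studies: the substitution table, the abbreviation dictionary and the function name `normalise` are all invented for the example.

```python
import re

# Hypothetical lookup tables, invented for illustration only.
LEET_MAP = str.maketrans({"0": "o", "1": "l", "3": "e", "4": "a", "5": "s", "7": "t"})
ABBREVIATIONS = {"u": "you", "gr8": "great", "brb": "be right back", "idk": "i do not know"}

def normalise(text: str) -> str:
    """Lower-case, expand known abbreviations, and undo digit-for-letter swaps."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    out = []
    for tok in tokens:
        if tok in ABBREVIATIONS:
            out.append(ABBREVIATIONS[tok])
        elif tok.isdigit():
            out.append(tok)  # leave genuine numbers such as years untouched
        else:
            out.append(tok.translate(LEET_MAP))
    # Collapse runs of three or more repeated characters ("soooo" -> "soo").
    return re.sub(r"(.)\1{2,}", r"\1\1", " ".join(out))

print(normalise("idk, u r gr8 and soooo h4ppy"))
```

Normalisation is a trade-off: the irregular spellings themselves can be informative about an author, so some systems may prefer to keep them as features rather than remove them.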

Facebook

Facebook is useful for author profiling studies because it is a social networking service in which a social network may be built, expanded, and used for social action. [14] In the process, users share personal content that may be used for author profiling studies. Textual data is obtained from users' personal posts, such as 'status updates'. [15] These are compiled into a corpus in the selected language(s) to create either a bilingual or multilingual database of content words, [15] [16] which may then be used for author profiling.

In the context of Facebook, author profiling mainly involves English textual data, but it has also used other languages, including Roman Urdu, Arabic, Brazilian Portuguese and Spanish. [16] [11] While author profiling studies on Facebook have predominantly targeted gender and age-group identification, there have also been attempts to predict religiosity, the IT background of users, and even basic emotions (as defined by Paul Ekman), among others. [15] [17]

Weibo

Sina Weibo is one of the few Asian social media platforms with texts in Asian languages to have been analysed for author profiling. The primary features analysed in Weibo content include classical Chinese characters, hashtags, emoticons, kaomoji, homogeneous punctuation, Latin sequences (due to the multilingualism of the text) and even poetic formats. Particularly popular Chinese expressions, POS tags and word types are also tracked. [18]

Author profiling for Weibo content requires algorithms different from those used for other social media platforms, mainly because of the linguistic differences between Mandarin Chinese and Western languages. For example, Chinese emoticons consist of Chinese characters in brackets that describe a gesture or facial expression, such as [哈哈] 'laughter', [泪] 'tears', [偷笑] 'giggle', [爱你] 'love' and [心] 'heart'. [18] This differs from the use of punctuation symbols for emoticons in Western languages, and from the common use of Unicode emojis on other platforms such as Facebook and Instagram. Further, while there are around 161 Western emoticons, around 2900 emoticons are regularly used in mainland China for web content such as Weibo posts. [19] To handle these differences, author profiling algorithms have been trained on Chinese emoticons and linguistic features. For example, algorithms have been designed to detect Chinese stylistic expressions of formality and sentiment, in place of English linguistic features such as capital letters. [19]
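Bracketed emoticons of the kind described above can be pulled out of a post with a regular expression and counted as features. The sketch below is a minimal illustration; the pattern (including its limit of eight characters inside the brackets), the function name and the sample post are assumptions, not taken from the cited studies.

```python
import re
from collections import Counter

# Assumed pattern: 1 to 8 non-space, non-bracket characters inside square brackets.
BRACKET_EMOTICON = re.compile(r"\[([^\[\]\s]{1,8})\]")

def emoticon_counts(post: str) -> Counter:
    """Count each bracketed emoticon label that appears in a post."""
    return Counter(BRACKET_EMOTICON.findall(post))

# Invented sample post containing two emoticon types.
post = "今天考试结束了[哈哈][哈哈] 想家[泪]"
counts = emoticon_counts(post)
print(counts.most_common())
```

The resulting counts could be appended to other features, such as character or word frequencies, before training a classifier.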

Compared with more popular, globalised platforms, texts on Weibo are less commonly used for author profiling. This is likely because Weibo's user base is concentrated in mainland China, limiting its usage predominantly to Chinese nationals. Studies of this platform have used bots and machine learning algorithms to identify authors' age and gender. Data is collected from the Weibo microblog posts of willing participants, analysed, and used to train algorithms that build concept-based profiles of users to a certain accuracy. [18]

Chat logs

Chat logs have been studied for author profiling because they contain much textual discourse, the analysis of which has contributed to applied studies in areas including social trends and forensic science. Sources of chat log data for author profiling include platforms such as Yahoo!, AIM and WhatsApp. [20] Computational systems have been devised to produce concept-based profiles listing the chat topics discussed in a single chat room or by individual users. [21]

Blogs

Author profiling can be used to identify characteristics of blog writers, such as their age, gender and geographical location, based on their different writing styles. [22] This is especially useful when it comes to anonymous blogs. The choice of content words, style-based features and topic-based features are analyzed to discover characteristics of the author. [23]

In general, features that frequently occur in blogs include a high proportion of verbs per text and a relatively high use of pronouns. The frequencies of verbs, pronouns and other word classes are used to profile and classify emotions in authors' writing, as well as their gender and age. [24] Classification models that were used for author profiling on physical documents in the past, such as Support Vector Machines, have also been tested on blogs. However, they have been found unsuitable for blogs because of their low performance. [22]

Certain machine learning algorithms, however, work well for author profiling on blogs. [22]

Email

Email has been a consistent focus for author profiling because of the rich textual data found in the various sections of a typical email platform. These sections include the sent, inbox, spam, trash, and archived folders. [25] Multilingual approaches to author profiling for emails have used English, Spanish, and Arabic emails as data sources, among others. [25] [12] Through author profiling, details of email users may be identified, such as their age, gender, geographical origin, level of education, nationality and even psychometric personality traits from the Big Five model, including neuroticism, agreeableness, conscientiousness, and extraversion versus introversion. [citation needed]

In author profiling for email, content is processed for important textual data, while unimportant features such as metadata and redundant HyperText Markup Language (HTML) markup are excluded. The parts of the Multipurpose Internet Mail Extensions (MIME) structure that contain the content of the emails are also included in the analysis. The data obtained is often parsed into sections of content, including author text, signature text, advertisements, quoted text, and reply lines. [25] Further analysis of email text in author profiling tasks involves extracting tone of voice, sentiment, semantics and other linguistic features for processing.
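The separation of author text from quoted text, reply lines and signatures can be sketched with Python's standard `email` package. This is a simplified illustration under stated assumptions, not the pipeline of the cited work: it relies on common conventions (a `>` prefix for quoted text, a `-- ` line before the signature, and an "On ... wrote:" reply line), and the function name and sample message are invented.

```python
from email import message_from_string

def author_text(raw_email: str) -> str:
    """Return the plain-text lines that appear to be written by the sender."""
    msg = message_from_string(raw_email)
    kept = []
    for part in msg.walk():  # visits the message itself and any MIME sub-parts
        if part.get_content_type() != "text/plain":
            continue  # skip HTML alternatives, attachments and containers
        payload = part.get_payload(decode=True)
        charset = part.get_content_charset() or "utf-8"
        text = payload.decode(charset, errors="replace")
        for line in text.splitlines():
            if line.rstrip() == "--":  # conventional signature delimiter
                break
            if line.startswith(">"):  # quoted text from an earlier message
                continue
            if line.startswith("On ") and line.rstrip().endswith("wrote:"):
                continue  # reply line introducing the quoted text
            kept.append(line)
    return "\n".join(kept).strip()

# Invented sample message.
raw = (
    "From: alice@example.com\n"
    "Subject: Re: lunch\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "Sounds good, see you at noon.\n"
    "\n"
    "On Monday, Bob wrote:\n"
    "> Lunch tomorrow?\n"
    "-- \n"
    "Alice\n"
)
print(author_text(raw))
```

Real mailboxes are messier than these conventions suggest, so heuristics like these would need checking against the data being analysed.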

Applications

Author profiling has applications in various fields where there is a need to identify specific characteristics of an author of a text, with a growing importance in fields like forensics and marketing. [26] Depending on its application, the task of author profiling can vary in terms of the characteristics to be identified, number of authors studied and number of texts available for analysis.

Although its applications have traditionally been limited to written texts such as literary works, they have extended to online texts with the advancement of computers and the Internet.

Forensic linguistics

In the context of forensic linguistics, author profiling is used to identify characteristics of the author of anonymous, pseudonymous or forged text, based on the author's use of the language. Through linguistic analysis, forensic linguists seek to identify the suspect's motivation and ideology, along with other class features, such as the suspect's ethnicity or profession. While this does not always lead to decisive author identification, such information can help law enforcement narrow the pool of suspects. [27]

In most cases, author profiling in forensic linguistics involves a single-text problem, in which few or no comparison texts are available and no external evidence points to the author. [28] Examples of texts analysed by forensic linguists include blackmail letters, confessions, testaments, suicide letters and plagiarised writing. [29] With the increasing number of cybercrimes committed on the Internet, [30] this has also extended to online texts, such as sexually explicit online chat logs between middle-aged men and underage girls. [28]

One of the earliest and best-known examples of the use of author profiling is by Roger Shuy, who was asked to examine a ransom note linked to a notorious kidnapping case in 1979. Based on his analysis of the kidnapper's idiolect, Shuy identified crucial elements of the kidnapper's identity from his misspellings and a dialect item: the kidnapper was well-educated and from Akron, Ohio. [31] This eventually led to a successful arrest and confession by the suspect.

However, author profiling methods have been criticised as lacking objectivity, since they rely on a forensic linguist's subjective identification of crucial sociolinguistic markers. Such methods, including those adopted by literary critic Donald Wayne Foster, are said to be speculative and based entirely on subjective experience, and therefore cannot be tested empirically. [32]

Bot detection

Author profiling is used in the identification of social bots, the most common being Twitter bots. Social bots have been deemed a threat because of their commercial, political and ideological influence, as in the 2016 United States presidential election, during which they polarised political conversations and spread misinformation and unverified information. In the context of marketing, social bots can artificially inflate the popularity of a product by posting positive reviews, and undermine the reputation of competing products with unfavourable reviews. [33] Therefore, bot detection from an author profiling perspective is a task of high importance. [33] [34]

Although made to appear as human accounts, bots can mostly be identified from information on their profiles, such as their username, profile photo and time of posting. [34] However, identifying bots solely from textual data (i.e. without metadata) is significantly more challenging and requires author profiling techniques. [34] This usually involves a classification task based on semantic and syntactic features. [35] [36]

Bot and gender profiling was one of four shared tasks in the 2019 edition of PAN, a series of scientific events and shared tasks on digital text forensics and stylometry. [33] Participating teams achieved much success, with the best results for bot detection in English and Spanish tweets at 95.95% and 93.33% respectively. [35]

Marketing

Author profiling is also useful from a marketing viewpoint, as it allows businesses to identify the demographics of people that like or dislike their products based on an analysis of blogs, online product reviews and social media content. [26] This is important since most individuals post their reviews on products anonymously. Author profiling techniques are helpful to business experts in making better informed strategic decisions based on the demographics of their target group. [37] In addition, businesses can target their marketing campaigns at groups of consumers who match the demographics and profile of current customers. [38]

Author identification and influence tracing

Crucifix, Rosary and Holy Bible with Apocrypha NRSV

Author profiling techniques are used to study traditional media and literature in order to identify the writing styles of various authors and the topics they write about. Author profiling for literature has also been done to deduce the social networks of authors and their literary influence from bibliographic records of co-authorship. For anonymous or pseudepigraphic works, the technique has sometimes been used to attempt to identify the author or authors, or to determine which works were written by the same person.

Author profiling studies have been carried out on a range of literature and traditional media. [39] [40]

Library cataloguing

Another application of author profiling is in devising strategies for cataloguing library resources based on standard attributes. [42] In this approach, author profiling techniques may improve the efficiency of library cataloguing in which library resources are automatically classified based on the authors' bibliographic records. This was a significant issue in the early 21st century when much of library cataloguing was still done manually.

In using author profiling for library cataloguing, researchers have applied machine learning, such as Support Vector Machine (SVM) algorithms, to automate library processes. With SVMs, the bibliographic records of authors within existing databases may be identified, tracked, and updated so that an author can be identified from the topics of literary content and expertise indicated in his or her bibliographic records. In this case, author profiling uses the social structures of authors, which may be derived from physical copies of published media, to catalogue library resources. [42]

Author profiling has been featured in popular culture. The 2017 Discovery Channel mini-series Manhunt: Unabomber is a fictionalised account of the FBI investigation surrounding the Unabomber. It features a criminal profiler who identifies defining characteristics of the Unabomber's identity based on his analysis of the Unabomber's idiolect in his published manifesto and letters. The show highlighted the importance of author profiling in criminal forensics, as it was critical in the capture of the real Unabomber in 1996. [43]

References

  1. Wiegmann, M., Stein, B. & Potthast, M. (2019). "Overview of the Celebrity Profiling Task at PAN 2019." CLEF.
  2. Mikros, G.K., & Perifanos, K. (2013). "Authorship attribution in Greek tweets using author's multilevel n-gram profiles." 2013 AAAI Spring Symposium Series.
  3. Koppel, M., Argamon, S., & Shimoni, A.R. (2002). "Automatically categorizing written texts by author gender." Literary and Linguistic Computing, 17, pg 401–412.
  4. López-Monroy, A. P., Montes-y-Gómez, M., Escalante, H. J., Villaseñor-Pineda, L. & Stamatatos, E. (2015). "Discriminative subprofile-specific representations for author profiling in social media." In: Knowledge-Based Systems, 89, 134–147.
  5. Lundeqvist, E. & Svensson, M. (2017). "Author profiling: A machine learning approach towards detecting gender, age and native language of users in social media." In: Department of Information Technology.
  6. Franco-Salvador, M., Plotnikova, N., Pawar, N., & Benajiba, Y. (2017). "Subword-based deep averaging networks for author profiling in social media." CLEF.
  7. Kurita, K. (2018). "Paper dissected: Deep unordered composition rivals syntactic methods for text classification explained." Machine Learning Explained.
  8. Bsi, B. & Zrigui, M. (2018). "Deep learning techniques for author profiling in social media content." In: 31st IBIMA Conference.
  9. Bilan, I. & Zhekova, D. (2016). "CAPS: A cross-genre author profiling system." CLEF.
  10. Schler, J., Koppel, M., Argamon, S., & Pennebaker, J.W. (2005). "Effects of Age and Gender on Blogging." AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs.
  11. Rangel, F., & Rosso, P. (2019). "Overview of the 7th author profiling task at PAN 2019: Bots and gender profiling in Twitter." CLEF.
  12. Rosso, P., Rangel, F., Farías, I. H., Cagnina, L., Zaghouani, W., & Charfi, A. (2018). "A survey on author profiling, deception, and irony detection for the Arabic language." Language and Linguistics Compass, 12(4).
  13. Gómez-Adorno, H., Markov, I., Sidorov, G., Posadas-Durán, J.-P., Sanchez-Perez, M. A., & Chanona-Hernandez, L. (2016). "Improving Feature Representation Based on a Neural Network for Author Profiling in Social Media Texts". In: Computational Intelligence and Neuroscience, pg 1–13.
  14. Dam, J. W. V., & Velden, M. V. D. (2015). "Online profiling and clustering of Facebook users". In: Decision Support Systems, 70, 60–72.
  15. Hsieh, F.C., Sandroni, R.F., & Paraboni, I. (2018). "Author Profiling from Facebook Corpora". LREC.
  16. Fatima, M., Hasan, K., Anwar, S., & Nawab, R. M. A. (2017). "Multilingual author profiling on Facebook". In: Information Processing & Management, 53(4), 886–904.
  17. Rangel, F., & Rosso, P. (2013). "Use of Language and Author Profiling: Identification of Gender and Age."
  18. Zhang, W., Caines, A., Alikaniotis, D., & Buttery, P. (2015). "Predicting author age from Weibo microblog posts." LREC.
  19. Chen, L., Qian, T., Wang, F., You, Z., Peng, Q., & Zhong, M. (2015). "Age Detection for Chinese Users in Weibo." WAIM 2015, LNCS 9098, 83–95.
  20. Lin, J. (2007). "Automatic Author Profiling of Online Chat Logs"
  21. Bengel J., Gauch S., Mittur E., Vijayaraghavan R. (2004) ChatTrack: "Chat Room Topic Detection Using Classification." In: Chen H., Moore R., Zeng D.D., Leavitt J. (eds) Intelligence and Security Informatics. ISI 2004. Lecture Notes in Computer Science, 3073. Springer, Berlin, Heidelberg
  22. Pham, D.D., Tran, G.B., & Pham, S.B. (2009). Author Profiling for Vietnamese Blogs. 2009 International Conference on Asian Language Processing, 190–194.
  23. Santosh, K., Bansal, R., Shekhar, M. & Varma, V. (2013). Author Profiling: Predicting Age and Gender from Blogs Notebook for PAN at CLEF 2013. CLEF.
  24. Rangel, F. & Rosso, P. (2013). Use of Language and Author Profiling: Identification of Gender and Age. Natural Language Processing and Cognitive Science 2013.
  25. Estival, D., Gaustad, T., Pham, S. B., Radford, W., & Hutchinson, B. (2007). Author Profiling for English Emails.
  26. Author Profiling 2018. (n.d.).
  27. Foster, D. (2000). Author Unknown: On the Trail of Anonymous. Henry Holt and Company
  28. Grant, T. D. (2008). "Approaching questions in forensic authorship analysis." In Gibbons, J. & Turell, M. T. (eds.). Dimensions of Forensic Linguistics. John Benjamins.
  29. Kotzé, E. F. (2010). "Author identification from opposing perspectives in forensic linguistics". South African Linguistics and Applied Language Studies. 28(2). 185–197
  30. Yang, M. & Chow, K. P. (2014) "Authorship Attribution for Forensic Investigation with Thousands of Authors." In: Cuppens-Boulahia N., Cuppens F., Jajodia S., Abou El Kalam A., Sans T. (eds) ICT Systems Security and Privacy Protection. SEC 2014. IFIP Advances in Information and Communication Technology, vol 428. Springer, Berlin, Heidelberg.
  31. Leonard, R. A. (2005). "Applying the Scientific Principles of Language Analysis to Issues of the Law." International Journal of Humanities. 3. 1–9
  32. Chaski, C. E. (2001). "Empirical evaluations of language-based author identification techniques." Forensic Linguistics, 8, 1–65.
  33. "Bots and Gender Profiling 2019". (n.d.).
  34. Goubin, Régis & Lefeuvre, Dorian & Alhamzeh, Alaa & Mitrović, Jelena & Egyed-Zsigmond, Előd & Fossi, Leopold. (2019). "Bots and Gender Profiling using a Multi-layer Architecture Notebook for PAN at CLEF 2019".
  35. Daelemans W. et al. (2019) "Overview of PAN 2019: Bots and Gender Profiling, Celebrity Profiling, Cross-Domain Authorship Attribution and Style Change Detection." In: Crestani F. et al. (eds) Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2019. Lecture Notes in Computer Science, vol 11696. Springer, Cham.
  36. Kovács, G., Balogh, V., Mehta, P., Shridhar, K., Alonso, P., & Liwicki, M. (2019). "Author Profiling using Semantic and Syntactic Features: Notebook for PAN at CLEF 2019."
  37. Raghunadha Reddy T., Lakshminarayana M., Vishnu Vardhan B., Sai Prasad K., Amarnath Reddy E. (2019) "A New Document Representation Approach for Gender Prediction Using Author Profiles." In: Bapi R., Rao K., Prasad M. (eds) First International Conference on Artificial Intelligence and Cognitive Computing. Advances in Intelligent Systems and Computing, vol 815. Springer, Singapore
  38. Maharjan, Suraj & Shrestha, Prasha & Solorio, Thamar & Hasan, Ragib. (2014). "A Straightforward Author Profiling Approach in MapReduce." LNCS (LNAI).
  39. Company, J. S., & Wanner, L. (2017). "On the Relevance of Syntactic and Discourse Features for Author Profiling and Identification." Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, 2, 681–687.
  40. Dzikiene, J. K., Utka, A., & Šarkute, L. (2015). "Authorship Attribution and Author Profiling of Lithuanian Literary Texts", 96–105.
  41. Ledger, G. (1994). "Shakespeare, Fletcher, and the Two Noble Kinsmen." Literary and Linguistic Computing, 9(3), 235–247.
  42. Nomoto, T. (2009). "Classifying library catalogues by author profiling." In: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval – SIGIR 09.
  43. Davies, D. (2017, August 22). "FBI Profiler Says Linguistic Work Was Pivotal In Capture Of Unabomber."