
Author profiling is the analysis of a given set of texts in an attempt to uncover various characteristics of the author based on stylistic- and content-based features, or to identify the author. Characteristics analysed commonly include age and gender, though more recent studies have looked at other characteristics such as personality traits and occupation. Author profiling is one of the three major fields in Automatic Authorship Identification (AAI), the other two being authorship attribution and authorship identification. The process of AAI emerged at the end of the 19th century. Thomas Corwin Mendenhall, an American autodidact physicist and meteorologist, was the first to apply this process to the works of Francis Bacon, William Shakespeare, and Christopher Marlowe. Mendenhall sought to uncover quantitative stylistic differences among these three historic figures by inspecting word lengths. Although much progress has been made in the 21st century, author profiling remains a difficult, unsolved problem.
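
A comparison of word lengths of this kind can be illustrated with a minimal sketch. The function name `word_length_profile`, the simple tokeniser and the two sample sentences are assumptions made for illustration and are not taken from Mendenhall's study; the code only computes the relative frequency of each word length in a text.

```python
import re
from collections import Counter


def word_length_profile(text):
    """Return {word_length: relative_frequency} for the words in text."""
    # Assumed tokeniser: runs of letters and apostrophes count as words.
    words = re.findall(r"[A-Za-z']+", text)
    counts = Counter(len(w) for w in words)
    total = sum(counts.values())
    return {length: n / total for length, n in sorted(counts.items())}


# Hypothetical sample sentences standing in for two authors' works.
sample_a = "To be or not to be that is the question"
sample_b = "Knowledge itself is power and nature is commanded by obedience"

print(word_length_profile(sample_a))
print(word_length_profile(sample_b))
```

Two such profiles can then be compared length by length, for instance by plotting them or by summing the absolute differences.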


Techniques

Through the analysis of texts, various author profiling techniques can be applied to predict information about the author. For example, function words and part-of-speech analysis can be used to infer the author's gender and the truthfulness of a text. The process of author profiling usually involves the following steps: [López-Monroy, A. P., Montes-y-Gómez, M., Escalante, H. J., Villaseñor-Pineda, L. & Stamatatos, E. (2015). "Discriminative subprofile-specific representations for author profiling in social media." Knowledge-Based Systems, 89, 134–147.]
# Identifying specific features to be extracted from the text
# Building an adopted, standard representation (e.g. a bag-of-words model) for the target profile
# Building a classification model using a standard classifier (e.g. Support Vector Machines) for the target profile
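
The three steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the cited study: a small multinomial Naive Bayes classifier is swapped in for the Support Vector Machine so that the example needs only the standard library, and the class name `NaiveBayesProfiler`, the age-group labels and the training posts are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def extract_features(text):
    """Step 1: choose features; here, lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(text):
    """Step 2: represent the text as a bag (multiset) of its words."""
    return Counter(extract_features(text))


class NaiveBayesProfiler:
    """Step 3: a multinomial Naive Bayes classifier with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(bag_of_words(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        bag = bag_of_words(text)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the smoothed log likelihood of every word.
            score = math.log(n_docs / total_docs)
            for word, n in bag.items():
                score += n * math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data: two age groups with made-up example posts.
train_texts = [
    "omg lol that exam was so hard",
    "lol my homework is due tomorrow",
    "the mortgage meeting is on monday",
    "my pension and mortgage paperwork",
]
train_labels = ["younger", "younger", "older", "older"]

model = NaiveBayesProfiler().fit(train_texts, train_labels)
print(model.predict("lol homework again"))
print(model.predict("mortgage paperwork on monday"))
```

In practice the feature extractor and the classifier are the parts most often exchanged, while the intermediate representation stays the same.
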
Machine learning algorithms for author profiling have become increasingly complex over time. Algorithms used in author profiling include:
* Support Vector Machines [Lundeqvist, E. & Svensson, M. (2017). "Author profiling: A machine learning approach towards detecting gender, age and native language of users in social media." Department of Information Technology.]
* Naive Bayes classifiers
* Deep averaging networks, which feed the mean of the word embeddings within a text through several neural network layers
* Long short-term memory (LSTM) networks [Bsi, B. & Zrigui, M. (2018). "Deep learning techniques for author profiling in social media content." 31st IBIMA Conference.]
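
As a rough illustration of the deep averaging idea, the sketch below averages the word embeddings of a text and passes the result through one hidden layer and one output layer. Everything specific in it is an assumption: the two-dimensional embeddings, the layer sizes and the weights are made up, whereas a real network would learn them from labelled data.

```python
import math

# Hypothetical 2-dimensional word embeddings; real systems use learned vectors.
EMBEDDINGS = {
    "great": [1.0, 0.0],
    "match": [0.5, 0.5],
    "today": [0.0, 1.0],
}


def average_embedding(tokens):
    """Mean of the embeddings of the known tokens in a text."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        return [0.0, 0.0]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def dense(vector, weights, biases):
    """One fully connected layer; weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, biases)]


def relu(vector):
    return [max(0.0, x) for x in vector]


def softmax(vector):
    peak = max(vector)
    exps = [math.exp(x - peak) for x in vector]
    return [e / sum(exps) for e in exps]


def deep_averaging_network(tokens):
    """Average the embeddings, then apply a hidden layer and an output layer."""
    # Made-up weights for illustration; in practice they are learned.
    averaged = average_embedding(tokens)
    hidden = relu(dense(averaged, [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]))
    return softmax(dense(hidden, [[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0]))


# Probabilities for two hypothetical author classes.
print(deep_averaging_network(["great", "match"]))
```
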
In the past, author profiling was limited to physical documents, often in the form of books and newspaper articles. Different combinations of textual attributes belonging to the authors were identified and analyzed, including lexical and syntactic features. Pioneering research in author profiling focused mostly on a single genre until the shift towards author profiling on social media and the Internet. [Bilan, I. & Zhekova, D. (2016). "CAPS: A cross-genre author profiling system." CLEF.] While attributes such as content words and POS tags are effective in author profile predictions on physical documents, their effectiveness on digital texts varies with the type of online content being analyzed.

With advances in technology, author profiling on the Internet has become increasingly common. Digital texts, such as social media posts, blog posts and emails, are now being used. This has sparked greater research efforts because of the advantages that analysing digital texts can bring to sectors like marketing and business. Author profiling on digital texts has also enabled predictions of a wider range of author characteristics, such as personality, income and occupation. The most effective attributes for author profiling on digital texts involve a combination of stylistic and content features. Author profiling on digital texts focuses on cross-genre author profiling, whereby one genre is used for training data and another genre is used for testing data, though both need to be relatively similar for good results. Problems that arise when applying author profiling techniques to online texts include:
* Wide variation in the lengths of texts used
* Class imbalance in data
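
One common way of handling class imbalance, which the text above does not itself prescribe, is to weight each class inversely to its frequency so that rare classes count for more during training. The sketch below assumes that heuristic; the function name `balanced_class_weights` and the label counts are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Hypothetical, imbalanced gender labels for a training set of eight texts.
labels = ["female"] * 6 + ["male"] * 2
print(balanced_class_weights(labels))
```

The under-represented class receives the larger weight, and the weights average to one across the samples.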


Author profiling and the Internet

The rise of the Internet in the late 20th and early 21st centuries catalysed an increase in author profiling research, since data could be mined from the web, including social media platforms, emails and blogs. Content from the web has been analysed in author profiling tasks to identify the age, gender, geographic origins, nationality and psychometric traits of web users. The information obtained has been used to serve various applications, including marketing and forensics.


Social media

The increased integration of social media in people's daily lives has made it a rich source of textual data for author profiling. This is mainly because users frequently upload and share content for various purposes including self-expression, socialisation, and personal businesses. Social bots are also a frequent feature of social media platforms, especially Twitter, generating content that may be analysed for author profiling. [Rangel, F., & Rosso, P. (2019). Overview of the 7th author profiling task at PAN 2019: Bots and gender profiling in Twitter. CLEF.] While different platforms contain similar data, they may also contain different features depending on the format and structure of the particular platform. There are still limitations in using social media as data sources for author profiling, because the data obtained may not always be reliable or accurate. Users sometimes provide false information about themselves or withhold information. [Rosso, P., Rangel, F., Farías, I. H., Cagnina, L., Zaghouani, W., & Charfi, A. (2018). A survey on author profiling, deception, and irony detection for the Arabic language. Language and Linguistics Compass, 12(4).] As a result, the training of algorithms for author profiling may be impeded by less accurate data. Another limitation is the irregularity of text in social media. Such irregularities include deviations from normal linguistic standards, such as spelling errors, unstandardised transliteration (for example, the substitution of letters with numbers), shorthand and user-created abbreviations for phrases, which may pose a challenge to author profiling. [Gómez-Adorno, H., Markov, I., Sidorov, G., Posadas-Durán, J.-P., Sanchez-Perez, M. A., & Chanona-Hernandez, L. (2016). "Improving Feature Representation Based on a Neural Network for Author Profiling in Social Media Texts." Computational Intelligence and Neuroscience, pp. 1–13.] Researchers have adopted methods to overcome these limitations in training their algorithms for author profiling.
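
One simple way to reduce such irregularity before feature extraction is to normalise tokens with lookup tables. The sketch below is an assumed preprocessing step, not the method of the cited studies; the tables `DIGIT_TO_LETTER` and `ABBREVIATIONS` are hypothetical and far smaller than anything usable in practice.

```python
# Hypothetical lookup tables; a real system would need far larger ones.
DIGIT_TO_LETTER = {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t"}
ABBREVIATIONS = {"u": "you", "gr8": "great", "brb": "be right back"}


def normalise_token(token):
    token = token.lower()
    if token in ABBREVIATIONS:
        return ABBREVIATIONS[token]
    # Only undo digit substitutions inside tokens that also contain letters,
    # so that genuine numbers such as "2019" are left alone.
    has_letter = any(c.isalpha() for c in token)
    has_digit = any(c.isdigit() for c in token)
    if has_letter and has_digit:
        token = "".join(DIGIT_TO_LETTER.get(c, c) for c in token)
    return token


def normalise(text):
    return " ".join(normalise_token(t) for t in text.split())


print(normalise("u r gr8 at th1s in 2019"))
```

Normalisation of this kind trades away some stylistic signal, since the irregular spellings may themselves be informative about the author.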


Facebook

Facebook is useful for author profiling studies as a social networking service, because a social network may be built, expanded, and used for social action on the site. In such processes, users share personal content that may be used for author profiling studies. Textual data for author profiling is obtained from users' personal posts, such as 'status updates'. [Hsieh, F.C., Sandroni, R.F., & Paraboni, I. (2018). Author Profiling from Facebook Corpora. LREC.] These are acquired to produce a corpus in the selected language(s), to create either a bilingual or multilingual database of content words, [Fatima, M., Hasan, K., Anwar, S., & Nawab, R. M. A. (2017). "Multilingual author profiling on Facebook." Information Processing & Management, 53(4), 886–904.] which may then be used for author profiling. In the context of Facebook, author profiling mainly involves English textual data, but also uses non-English languages including Roman Urdu, Arabic, Brazilian Portuguese and Spanish. While author profiling studies on Facebook have been predominantly for gender and age-group identification, there have been attempts to derive attributes to predict religiosity, the IT background of users, and even basic emotions (as defined by Paul Ekman), among others. [Rangel, F., & Rosso, P. (2013). Use of Language and Author Profiling: Identification of Gender and Age.]


Weibo

Sina Weibo is one of the few Asian social media platforms containing texts in Asian languages to have been analysed for author profiling. The primary content of focus for author profiling on Weibo includes classical Chinese characters, hashtags, emoticons, kaomoji, homogeneous punctuation, Latin sequences (due to the multilingualism of text) and even poetic formats. Particularly popular Chinese expressions, POS tags and word types are also tracked for author profiling. [Zhang, W., Caines, A., Alikaniotis, D., & Buttery, P. (2015). "Predicting author age from Weibo microblog posts." LREC.]

Author profiling for Weibo content requires algorithms different from those used for other social media platforms, mainly due to the linguistic differences between Mandarin Chinese and Western languages. For example, Chinese emoticons involve Chinese characters describing the gesture or facial expression in brackets, such as 'laughter', 'tears', 'giggle', 'love' and 'heart'. This differs from the use of punctuation symbols for emoticons in Western languages, or the common use of Unicode emojis on other platforms such as Facebook and Instagram. Further, while there are around 161 Western emoticons, there are around 2900 emoticons regularly used in Mainland China for web content such as Weibo. [Chen, L., Qian, T., Wang, F., You, Z., Peng, Q., & Zhong, M. (2015). "Age Detection for Chinese Users in Weibo." WAIM 2015, LNCS 9098, 83–95.] To tackle these differences, author profiling algorithms have been trained on Chinese emoticons and linguistic features. For example, author profiling algorithms have been designed to detect Chinese stylistic expressions expressing formality and sentiment, in place of algorithms detecting English linguistic features such as capital letters.

Compared to other more popular, globalised platforms, texts on Weibo are not as commonly used in the task of author profiling. This is likely because Weibo's user base is concentrated in Mainland China, limiting its usage predominantly to Chinese nationals. Studies done for this platform have used bots and machine learning algorithms to identify authors' age and gender. Data is acquired from the Weibo microblog posts of willing participants to be analysed, and used to train algorithms that build concept-based profiles of users to a certain accuracy.
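
Bracketed emoticons of the kind described above can be turned into features with a regular expression. The sketch assumes that emoticons appear in square brackets, as in `[哈哈]`, and that they are not nested and contain no whitespace; the function name `emoticon_counts` and the sample post are made up for illustration.

```python
import re
from collections import Counter

# Matches text enclosed in square brackets, e.g. "[哈哈]"; assumes emoticons
# are not nested and contain no whitespace.
BRACKET_EMOTICON = re.compile(r"\[([^\[\]\s]+)\]")


def emoticon_counts(post):
    """Count each bracketed emoticon name that occurs in a post."""
    return Counter(BRACKET_EMOTICON.findall(post))


# Hypothetical post containing two different bracketed emoticons.
post = "今天真开心[哈哈][哈哈] 想你[心]"
print(emoticon_counts(post))
```

The resulting counts can be added to a bag-of-words representation alongside ordinary word counts.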


Chat logs

Chat logs have been studied for author profiling as they include much textual discourse, the analysis of which has contributed to applied studies including social trends and forensic science. Sources of data for author profiling from chat logs include platforms such as Yahoo!, AIM and WhatsApp. Computational systems have been devised to produce concept-based profiles listing chat topics discussed in a single chat room or by independent users.


Blogs

Author profiling can be used to identify characteristics of blog writers, such as their age, gender and geographical location, based on their different writing styles. [Pham, D.D., Tran, G.B., & Pham, S.B. (2009). Author Profiling for Vietnamese Blogs. 2009 International Conference on Asian Language Processing, 190–194.] This is especially useful when it comes to anonymous blogs. The choice of content words, style-based features and topic-based features is analyzed to discover characteristics of the author. In general, features that frequently occur in blogs include a high distribution of verbs per writing and a relatively high use of pronouns. The frequencies of verbs, pronouns and other word classes are used to profile and classify emotions in the writings of authors, as well as their gender and age. Classification models that were used on physical documents in the past, such as Support Vector Machines, have also been tested on blogs. However, they have been found to be unsuitable for the latter due to low performance. The machine learning algorithms that work well for author profiling on blogs include:
* Instance-based learning
* Random decision forests
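
A word-class frequency of the kind mentioned above can be computed without a part-of-speech tagger when the class is closed, as with pronouns. The sketch below is a simplified assumption: the `PRONOUNS` set is a small, non-exhaustive list of English personal pronouns, and counting verbs would require a tagger that is not shown here.

```python
import re

# A small, assumed list of English personal pronouns; not exhaustive.
PRONOUNS = {
    "i", "me", "my", "we", "us", "our", "you", "your", "he", "him",
    "his", "she", "her", "they", "them", "their", "it", "its",
}


def pronoun_rate(text):
    """Fraction of word tokens in text that are personal pronouns."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in PRONOUNS for t in tokens) / len(tokens)


# Hypothetical blog sentence: 4 of its 9 tokens are pronouns.
print(pronoun_rate("I told my sister that we would visit her"))
```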


Email

Email has been a consistent focus for author profiling due to the rich textual data found in various sections of a typical emailing platform. These sections include the sent, inbox, spam, trash, and archived folders. [Estival, D., Gaustad, T., Pham, S. B., Radford, W., & Hutchinson, B. (2007). Author Profiling for English Emails.] Multilingual approaches to author profiling for emails have included English, Spanish, and Arabic emails as data sources, among others. Through author profiling, details of email users may be identified, such as their age, gender, geographical origin, level of education, nationality and even psychometric personality traits, which include neuroticism, agreeableness, conscientiousness, and extraversion and introversion from the Big Five personality traits. In author profiling for email, content is processed for important textual data, while unimportant features such as metadata and HyperText Markup Language (HTML) redundancies are excluded. Important parts of the Multipurpose Internet Mail Extensions (MIME) that contain the content of the emails are also included in the analysis. The data obtained is often parsed into various sections of content, including author text, signature text, advertisement, quoted text, and reply lines. Further analysis of email textual content in author profiling tasks involves the extraction of tone of voice, sentiment, semantics and other linguistic features to be processed.
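
The extraction and sectioning steps can be sketched with Python's standard `email` package. The sectioning rules are assumptions based on common conventions (quoted lines start with `>`, a reply line ends in `wrote:`, and a line of two dashes opens the signature) rather than the rules used in the cited study; the message, the function names and the omission of an advertisement section are likewise choices made for this illustration.

```python
from email import message_from_string

# Hypothetical message used only for illustration.
RAW_EMAIL = """\
From: alice@example.com
To: bob@example.com
Subject: Lunch
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Sounds good, see you at noon.

On Monday, Bob wrote:
> Are you free for lunch?

--
Alice
"""


def plain_text_body(raw):
    """Return the text/plain content, ignoring headers and HTML parts."""
    message = message_from_string(raw)
    parts = []
    for part in message.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            parts.append(payload.decode(charset, errors="replace"))
    return "\n".join(parts)


def split_sections(body):
    """Sort body lines into author text, quoted text, reply lines, signature."""
    sections = {"author": [], "quoted": [], "reply": [], "signature": []}
    in_signature = False
    for line in body.splitlines():
        if line.rstrip() == "--":
            in_signature = True
        elif in_signature:
            sections["signature"].append(line)
        elif line.startswith(">"):
            sections["quoted"].append(line.lstrip("> "))
        elif line.rstrip().endswith("wrote:"):
            sections["reply"].append(line)
        elif line.strip():
            sections["author"].append(line)
    return sections


print(split_sections(plain_text_body(RAW_EMAIL)))
```

Only the lines under `author` would normally be passed on to feature extraction, since the other sections were not written by the sender at the time of sending.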


Applications

Author profiling has applications in various fields where there is a need to identify specific characteristics of an author of a text, with a growing importance in fields like forensics and marketing. [Author Profiling 2018 (n.d.).] Depending on its application, the task of author profiling can vary in terms of the characteristics to be identified, the number of authors studied and the number of texts available for analysis. Although its applications have traditionally been limited to written texts, such as literary works, they have extended to online texts with the advancement of the computer and the Internet.


Forensic linguistics

In the context of forensic linguistics, author profiling is used to identify characteristics of the author of anonymous, pseudonymous or forged text, based on the author's use of the language. Through linguistic analysis, forensic linguists seek to identify the suspect's motivation and ideology, along with other class features, such as the suspect's ethnicity or profession. While this does not always lead to decisive author identification, such information can help law enforcement narrow the pool of suspects. In most cases, author profiling in the context of forensic linguistics involves a single-text problem, in which there are few or no comparison texts available and no external evidence that points to the author. [Grant, T. D. (2008). "Approaching questions in forensic authorship analysis." In Gibbons, J. & Turell, M. T. (eds.), Dimensions of Forensic Linguistics. John Benjamins.] Examples of text analysed by forensic linguists include blackmail letters, confessions, testaments, suicide letters and plagiarised writing. With the increasing number of cybercrimes committed on the Internet, this has also extended to online texts, such as sexually explicit online chat logs between middle-aged men and underaged girls.

One of the earliest and best-known examples of the use of author profiling is by Roger Shuy, who was asked to examine a ransom note linked to a notorious kidnapping case in 1979. Based on his analysis of the kidnapper's idiolect, Shuy was able to identify crucial elements of the kidnapper's identity from his misspellings and a dialect item: the kidnapper was well-educated and from Akron, Ohio. This eventually led to a successful arrest and confession by the suspect. However, there are criticisms that author profiling methods lack objectivity, since these methods are reliant on a forensic linguist's subjective identification of crucial sociolinguistic markers. These methods, such as those adopted by literary critic Donald Wayne Foster, are said to be speculative and based entirely on one's subjective experience, and therefore cannot be tested empirically.


Bot detection

Author profiling is used to identify social bots, the most common being Twitter bots. Social bots are regarded as a threat because of their commercial, political and ideological influence, as in the 2016 United States presidential election, during which they polarised political conversations and spread misinformation and unverified information. In marketing, social bots can artificially inflate the popularity of a product by posting positive reviews, and undermine the reputation of competing products with unfavourable reviews (Bots and Gender Profiling 2019, n.d.). Bot detection from an author profiling perspective is therefore a task of high importance (Goubin, R., Lefeuvre, D., Alhamzeh, A., Mitrović, J., Egyed-Zsigmond, E., & Fossi, L., 2019, Bots and Gender Profiling using a Multi-layer Architecture, Notebook for PAN at CLEF 2019).
Made to appear as human accounts, bots can mostly be identified from information on their profiles, such as their username, profile photo and time of posting. However, identifying bots solely from textual data (i.e. without metadata) is significantly more challenging and requires author profiling techniques. This usually involves a classification task based on semantic and syntactic features (Daelemans, W. et al., 2019, Overview of PAN 2019: Bots and Gender Profiling, Celebrity Profiling, Cross-Domain Authorship Attribution and Style Change Detection, in Crestani, F. et al. (eds), Experimental IR Meets Multilinguality, Multimodality, and Interaction, CLEF 2019, Lecture Notes in Computer Science, vol. 11696, Springer, Cham). Bot and gender profiling was one of four shared tasks in the 2019 edition of PAN, which organises a series of scientific events and shared tasks on digital text forensics and stylometry. Participating teams achieved strong results, with the best bot detection results for English and Spanish tweets at 95.95% and 93.33% respectively.
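
To make the idea of text-only bot classification concrete, the following is a minimal sketch and not the method of any PAN participant. It assumes character trigrams as a simple stand-in for the semantic and syntactic features mentioned above, and uses a small multinomial Naive Bayes classifier. The class name, the training tweets and the labels are all hypothetical and invented for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes over character n-grams with add-one smoothing."""

    def fit(self, texts, labels):
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.doc_counts = Counter(labels)         # label -> number of documents
        self.vocab = set()
        for text, label in zip(texts, labels):
            grams = char_ngrams(text)
            self.ngram_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.ngram_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of every n-gram in the text.
            score = math.log(n_docs / total_docs)
            for gram in char_ngrams(text):
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data, invented for illustration only.
train_texts = [
    "Click here to win a free prize now!!! http://example.com",
    "Win free followers now, click the link http://example.com",
    "Free prize! Click now to claim your free gift",
    "Had a lovely walk by the river with my sister today",
    "Can't believe how good that film was, still thinking about it",
    "Making soup tonight because the weather turned cold",
]
train_labels = ["bot", "bot", "bot", "human", "human", "human"]

model = NaiveBayesTextClassifier().fit(train_texts, train_labels)
prediction = model.predict("click now to win a free prize")
print(prediction)
```

A realistic system would train on thousands of labelled accounts and combine several feature types; six sentences are only enough to show the shape of the pipeline.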


Marketing

Author profiling is also useful from a marketing viewpoint, as it allows businesses to identify the demographics of people who like or dislike their products based on an analysis of blogs, online product reviews and social media content. This is important since most individuals post their product reviews anonymously. Author profiling techniques help business experts make better-informed strategic decisions based on the demographics of their target group. In addition, businesses can target their marketing campaigns at groups of consumers who match the demographics and profile of current customers.


Author identification and influence tracing

Author profiling techniques are used to study traditional media and literature to identify the writing style of various authors as well as the topics they write about. Author profiling for literature has also been done to deduce the social networks of authors and their literary influence based on their bibliographic records of co-authorship. In cases of anonymous or pseudepigraphic works, the technique has sometimes been used to attempt to identify the author or authors, or to determine which works were written by the same person. Some examples of author profiling studies on literature and traditional media include studies on the following (Dzikiene, J. K., Utka, A., & Šarkute, L., 2015, Authorship Attribution and Author Profiling of Lithuanian Literary Texts, 96–105):
* The Bible
* Gospels of the New Testament
* Shakespeare's works
* The Federalist Papers, in the 1960s and 1990s
* Lithuanian literary texts
* Primary Colors, a 1996 novel whose author was for a time anonymous
* A Warning, a 2019 political book whose author was for a time anonymous
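
As a rough illustration of how writing style can be compared across candidate authors, the sketch below builds a profile of function-word frequencies for each text and picks the known author whose profile is closest to a disputed text by cosine similarity. This is an assumed, simplified approach and not the procedure of any study listed above; the word list, the function names and all sample sentences are hypothetical.

```python
import math
import re

# A small, hypothetical list of function words; real studies use far longer lists.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "upon", "while", "on"]


def profile(text):
    """Relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens) or 1
    return [tokens.count(word) / total for word in FUNCTION_WORDS]


def cosine_similarity(u, v):
    """Cosine of the angle between two frequency vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def closest_author(disputed_text, known_texts):
    """Return the candidate whose profile is most similar to the disputed text."""
    target = profile(disputed_text)
    return max(
        known_texts,
        key=lambda name: cosine_similarity(target, profile(known_texts[name])),
    )


# Hypothetical writing samples, invented for illustration only.
known = {
    "author_a": "Upon reflection, the matter rests upon the will of the people and upon the law.",
    "author_b": "While the matter is debated, the people rely on the law and on the courts while they wait.",
}
disputed = "The decision rests upon the judgment of the court and upon the evidence."

print(closest_author(disputed, known))
```

Function words are used here because they are largely independent of topic, so a difference in their frequencies is more likely to reflect habit than subject matter. Samples this short would not support any real attribution.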


Library cataloguing

Another application of author profiling is in devising strategies for cataloguing library resources based on standard attributes (Nomoto, T., 2009, Classifying library catalogues by author profiling, in Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 09). In this approach, author profiling techniques may improve the efficiency of library cataloguing by automatically classifying library resources based on the authors' bibliographic records. This was a significant issue in the early 21st century, when much of library cataloguing was still done manually. Researchers have used machine learning algorithms such as support vector machines (SVMs) to automate these processes. With SVMs, bibliographic records of authors within existing databases may be identified, tracked and updated, so that an author can be identified from the topics and expertise indicated in their bibliographic records. In this case, author profiling uses the social structures of authors, which may be derived from physical copies of published media, to catalogue library resources.
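
The sketch below shows, under stated assumptions, what an SVM-based classifier over bibliographic records might look like. It is not the method of the cited study: the bag-of-words features, the subgradient-descent training loop, the hyperparameters and the four sample titles are all hypothetical, chosen only to keep the example self-contained.

```python
def featurize(text, vocab):
    """Bag-of-words count vector over a fixed vocabulary; unknown words are ignored."""
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in vocab]


def train_linear_svm(X, y, epochs=100, lr=0.1, reg=0.01):
    """Minimise L2-regularised hinge loss by subgradient descent; labels are +1/-1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Inside the margin: step along the hinge subgradient and shrink w.
                w = [wj - lr * (reg * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Margin satisfied: only the regulariser acts on w.
                w = [wj - lr * reg * wj for wj in w]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical catalogue titles: +1 = physics, -1 = poetry.
records = [
    ("quantum mechanics and particle physics", 1),
    ("introduction to particle physics", 1),
    ("collected poems and sonnets", -1),
    ("modern poems of love", -1),
]
vocab = sorted({token for text, _ in records for token in text.lower().split()})
X = [featurize(text, vocab) for text, _ in records]
y = [label for _, label in records]

w, b = train_linear_svm(X, y)
print(predict(w, b, featurize("particle physics lectures", vocab)))
print(predict(w, b, featurize("love poems", vocab)))
```

In practice a library such as scikit-learn would replace the hand-written training loop, and the features would come from full bibliographic records instead of short titles.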


In popular culture

Author profiling has been featured in popular culture. The 2017 Discovery Channel mini-series Manhunt: Unabomber is a fictionalised account of the FBI investigation surrounding the Unabomber. It features a criminal profiler who identifies defining characteristics of the Unabomber's identity based on his analysis of the Unabomber's idiolect in his published manifesto and letters. The show highlighted the importance of author profiling in criminal forensics, as it was critical in the capture of the real Unabomber in 1996 (Davies, D., 2017, August 22, FBI Profiler Says Linguistic Work Was Pivotal In Capture Of Unabomber).


See also

Related subjects
* Computational linguistics
* Forensic linguistics
* Native-language identification
* Social bot
* Stylometry

