Text mining, also referred to as ''text data mining'', similar to text analytics, is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources." Written resources may include websites, books, emails, reviews, and articles. High-quality information is typically obtained by devising patterns and trends by means such as statistical pattern learning. According to Hotho et al. (2005), we can distinguish between three different perspectives of text mining: information extraction, data mining, and a KDD (Knowledge Discovery in Databases) process. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interest. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (''i.e.'', learning relations between named entities).
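As a minimal illustration of structuring input text and then deriving patterns from the structured representation, the following Python sketch converts a small document collection into a TF-IDF term-document matrix and reads off the highest-weighted terms of each document. It is only a sketch: it assumes the scikit-learn library is available, and the three sample "documents" are invented for the example.

<syntaxhighlight lang="python">
# Minimal sketch: structure a small text collection into a TF-IDF
# term-document matrix, then derive a simple "pattern" (the top-weighted
# terms of each document). Assumes scikit-learn is installed; the three
# documents are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The drug improved survival in the treated patients.",
    "Patients reported mild side effects after treatment.",
    "The new phone offers a larger screen and faster chip.",
]

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(documents)          # rows: documents, columns: terms
terms = vectorizer.get_feature_names_out()

for i, row in enumerate(matrix.toarray()):
    top = sorted(zip(row, terms), reverse=True)[:3]   # three heaviest terms per document
    print(f"doc {i}:", [term for weight, term in top if weight > 0])
</syntaxhighlight>

Real systems replace the toy collection with a full corpus and feed the resulting matrix into clustering, classification, or visualization steps.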
Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is, essentially, to turn text into data for analysis, via application of natural language processing (NLP), different types of algorithms and analytical methods. An important phase of this process is the interpretation of the gathered information.
A typical application is to scan a set of documents written in a natural language and either model the document set for predictive classification purposes or populate a database or search index with the information extracted. The document is the basic element when starting with text mining. Here, we define a document as a unit of textual data, which normally exists in many types of collections.
Text analytics
The term text analytics describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation. The term is roughly synonymous with text mining; indeed, Ronen Feldman modified a 2000 description of "text mining" in 2004 to describe "text analytics". The latter term is now used more frequently in business settings while "text mining" is used in some of the earliest application areas, dating to the 1980s, notably life-sciences research and government intelligence.
The term text analytics also describes the application of text analytics to respond to business problems, whether independently or in conjunction with query and analysis of fielded, numerical data. It is a truism that 80 percent of business-relevant information originates in unstructured form, primarily text.
These techniques and processes discover and present knowledge – facts, business rules, and relationships – that is otherwise locked in textual form, impenetrable to automated processing.
Text analysis processes
Subtasks—components of a larger text-analytics effort—typically include:
* Dimensionality reduction is an important technique for pre-processing data. It is used to identify the root forms of words and to reduce the size of the text data.
* Information retrieval or identification of a corpus is a preparatory step: collecting or identifying a set of textual materials, on the Web or held in a file system, database, or content corpus manager, for analysis.
* Although some text analytics systems apply exclusively advanced statistical methods, many others apply more extensive natural language processing, such as part-of-speech tagging, syntactic parsing, and other types of linguistic analysis.
* Named entity recognition is the use of gazetteers or statistical techniques to identify named text features: people, organizations, place names, stock ticker symbols, certain abbreviations, and so on.
* Disambiguation—the use of contextual clues—may be required to decide whether, for instance, "Ford" refers to a former U.S. president, a vehicle manufacturer, a movie star, a river crossing, or some other entity.
* Recognition of Pattern Identified Entities: Features such as telephone numbers, e-mail addresses, quantities (with units) can be discerned via regular expression or other pattern matches.
*Document clustering: identification of sets of similar text documents.
* Coreference: identification of noun phrases and other terms that refer to the same object.
* Relationship, fact, and event extraction: identification of associations among entities and other information in text.
* Sentiment analysis involves discerning subjective (as opposed to factual) material and extracting various forms of attitudinal information: sentiment, opinion, mood, and emotion. Text analytics techniques are helpful in analyzing sentiment at the entity, concept, or topic level and in distinguishing opinion holder and opinion object.
* Quantitative text analysis is a set of techniques stemming from the social sciences where either a human judge or a computer extracts semantic or grammatical relationships between words in order to find out the meaning or stylistic patterns of, usually, a casual personal text for the purpose of psychological profiling etc.
* Pre-processing usually involves tasks such as tokenization, filtering and stemming; a minimal code sketch of several of these subtasks follows this list.
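The Python sketch below illustrates how a few of the subtasks above might be combined in practice: tokenization, stop-word filtering, and stemming for pre-processing, plus regex-based recognition of pattern-identified entities. It is a minimal illustration rather than production code; the sample sentence, the tiny stop-word list, and the simple regular expressions are assumptions chosen for the example, and the NLTK library is assumed to be installed for the Porter stemmer.

<syntaxhighlight lang="python">
# Minimal sketch of several text-analysis subtasks, not a canonical implementation.
import re
from nltk.stem import PorterStemmer  # assumes NLTK is installed

text = "Contact support@example.com or call 555-0199 about the delayed shipments."

# Pre-processing: tokenize, filter stop words, stem.
tokens = re.findall(r"[a-z]+", text.lower())
stop_words = {"the", "or", "about", "and", "a", "of"}   # toy stop-word list
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens if t not in stop_words]
print(stems)  # e.g. ['contact', 'support', 'exampl', 'com', 'call', 'delay', 'shipment']

# Pattern-identified entities: e-mail addresses and phone-like numbers
# discerned with regular expressions.
emails = re.findall(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", text)
phones = re.findall(r"\b\d{3}-\d{4}\b", text)
print(emails, phones)  # ['support@example.com'] ['555-0199']
</syntaxhighlight>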
Applications
Text mining technology is now broadly applied to a wide variety of government, research, and business needs. All these groups may use text mining for records management and searching documents relevant to their daily activities. Legal professionals may use text mining for e-discovery, for example. Governments and military groups use text mining for national security and intelligence purposes. Scientific researchers incorporate text mining approaches into efforts to organize large sets of text data (i.e., addressing the problem of unstructured data), to determine ideas communicated through text (e.g., sentiment analysis in social media) and to support scientific discovery in fields such as the life sciences and bioinformatics. In business, applications are used to support competitive intelligence and automated ad placement, among numerous other activities.
Security applications
Many text mining software packages are marketed for security applications, especially monitoring and analysis of online plain text sources such as Internet news, blogs, etc. for national security purposes. It is also involved in the study of text encryption/decryption.
Biomedical applications
A range of text mining applications in the biomedical literature has been described, including computational approaches to assist with studies in protein docking, protein interactions, and protein-disease associations. In addition, with large patient textual datasets in the clinical field, datasets of demographic information in population studies and adverse event reports, text mining can facilitate clinical studies and precision medicine. Text mining algorithms can facilitate the stratification and indexing of specific clinical events in large patient textual datasets of symptoms, side effects, and comorbidities from electronic health records, event reports, and reports from specific diagnostic tests. One online text mining application in the biomedical literature is PubGene, a publicly accessible search engine that combines biomedical text mining with network visualization. GoPubMed is a knowledge-based search engine for biomedical texts. Text mining techniques also enable us to extract unknown knowledge from unstructured documents in the clinical domain.
Software applications
Text mining methods and software are also being researched and developed by major firms, including IBM and Microsoft, to further automate the mining and analysis processes, and by different firms working in the area of search and indexing in general as a way to improve their results. Within the public sector, much effort has been concentrated on creating software for tracking and monitoring terrorist activities. For study purposes, Weka is one of the most popular options in the scientific world, acting as an excellent entry point for beginners. For Python programmers, there is an excellent toolkit called NLTK for more general purposes. For more advanced programmers, there is also the Gensim library, which focuses on word embedding-based text representations.
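As a brief, hedged illustration of the kind of entry-level usage these toolkits support, the following Python sketch tokenizes a toy corpus with NLTK and trains a small word-embedding model with Gensim. It assumes both libraries are installed and that NLTK's tokenizer data can be downloaded; the two-sentence corpus is invented and far too small for meaningful embeddings.

<syntaxhighlight lang="python">
# Illustrative sketch only: tokenize with NLTK and learn word embeddings
# with Gensim's Word2Vec. Assumes both libraries are installed; the toy
# corpus is far too small for useful embeddings.
import nltk
from gensim.models import Word2Vec

nltk.download("punkt", quiet=True)  # tokenizer data, downloaded if missing

corpus = [
    "Text mining turns unstructured text into structured data.",
    "Structured data supports search, clustering and classification.",
]
sentences = [nltk.word_tokenize(doc.lower()) for doc in corpus]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=20)
print(model.wv.most_similar("text", topn=3))  # nearest neighbours in the toy space
</syntaxhighlight>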
Online media applications
Text mining is being used by large media companies, such as the Tribune Company, to clarify information and to provide readers with greater search experiences, which in turn increases site "stickiness" and revenue. Additionally, on the back end, editors are benefiting by being able to share, associate and package news across properties, significantly increasing opportunities to monetize content.
Business and marketing applications
Text analytics is being used in business, particularly in marketing, such as in customer relationship management. Coussement and Van den Poel (2008) apply it to improve predictive analytics models for customer churn (customer attrition).
Text mining is also being applied in stock returns prediction.
Sentiment analysis
Sentiment analysis may involve analysis of movie reviews for estimating how favorable a review is for a movie. Such an analysis may need a labeled data set or labeling of the affectivity of words.
Resources for affectivity of words and concepts have been made for WordNet and ConceptNet, respectively.
Text has been used to detect emotions in the related area of affective computing. Text-based approaches to affective computing have been used on multiple corpora such as student evaluations, children's stories and news stories.
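A minimal sketch of the supervised approach described above follows; it trains a bag-of-words classifier on labeled movie reviews and scores a new one. The handful of labeled "reviews" is invented for illustration, scikit-learn is assumed to be installed, and a real system would need a far larger labeled data set.

<syntaxhighlight lang="python">
# Minimal supervised sentiment-analysis sketch: bag-of-words features with
# logistic regression. Assumes scikit-learn is installed; the six labelled
# reviews are invented and far too few for a real classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = [
    "A wonderful, moving film with superb acting.",
    "Dull plot and terrible pacing, a waste of time.",
    "I loved every minute of this brilliant movie.",
    "Boring, predictable and badly written.",
    "An absolute delight from start to finish.",
    "Awful dialogue, I walked out halfway through.",
]
labels = ["pos", "neg", "pos", "neg", "pos", "neg"]

classifier = make_pipeline(CountVectorizer(), LogisticRegression())
classifier.fit(reviews, labels)
print(classifier.predict(["A brilliant and moving delight."]))  # likely ['pos']
</syntaxhighlight>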
Scientific literature mining and academic applications
The issue of text mining is of importance to publishers who hold large databases of information needing indexing for retrieval. This is especially true in scientific disciplines, in which highly specific information is often contained within the written text. Therefore, initiatives have been taken such as Nature's proposal for an Open Text Mining Interface (OTMI) and the National Institutes of Health's common Journal Publishing Document Type Definition (DTD) that would provide semantic cues to machines to answer specific queries contained within the text without removing publisher barriers to public access.
Academic institutions have also become involved in the text mining initiative:
* The National Centre for Text Mining (NaCTeM) is the first publicly funded text mining centre in the world. NaCTeM is operated by the University of Manchester in close collaboration with the Tsujii Lab, University of Tokyo. NaCTeM provides customised tools, research facilities and offers advice to the academic community. They are funded by the Joint Information Systems Committee (JISC) and two of the UK research councils (EPSRC & BBSRC). With an initial focus on text mining in the biological and biomedical sciences, research has since expanded into the areas of social sciences.
* In the United States, the School of Information at University of California, Berkeley is developing a program called BioText to assist biology researchers in text mining and analysis.
* The Text Analysis Portal for Research (TAPoR), currently housed at the University of Alberta, is a scholarly project to catalogue text analysis applications and create a gateway for researchers new to the practice.
Methods for scientific literature mining
Computational methods have been developed to assist with information retrieval from scientific literature. Published approaches include methods for searching, determining novelty, and clarifying homonyms among technical reports.
Digital humanities and computational sociology
The automatic analysis of vast textual corpora has created the possibility for scholars to analyze millions of documents in multiple languages with very limited manual intervention. Key enabling technologies have been parsing, machine translation, topic categorization, and machine learning.
The automatic parsing of textual corpora has enabled the extraction of actors and their relational networks on a vast scale, turning textual data into network data. The resulting networks, which can contain thousands of nodes, are then analyzed by using tools from network theory to identify the key actors, the key communities or parties, and general properties such as robustness or structural stability of the overall network, or centrality of certain nodes. This automates the approach introduced by quantitative narrative analysis, whereby
subject-verb-object triplets are identified with pairs of actors linked by an action, or pairs formed by actor-object.
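The following Python sketch suggests how such subject-verb-object triplets might be pulled from a dependency parse and assembled into a network. It assumes the spaCy and networkx libraries and spaCy's small English model are installed, and the dependency-label heuristic is deliberately simplistic (it ignores passives, conjunctions, and pronouns).

<syntaxhighlight lang="python">
# Illustrative sketch: extract naive subject-verb-object triplets from a
# dependency parse and turn them into a directed actor network. Assumes
# spaCy (with en_core_web_sm) and networkx are installed.
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")
text = "The senate approved the budget. The president vetoed the budget."

graph = nx.DiGraph()
for sentence in nlp(text).sents:
    for token in sentence:
        if token.pos_ == "VERB":
            subjects = [c for c in token.children if c.dep_ == "nsubj"]
            objects = [c for c in token.children if c.dep_ in ("dobj", "obj")]
            for subj in subjects:
                for obj in objects:
                    graph.add_edge(subj.lemma_, obj.lemma_, action=token.lemma_)

print(list(graph.edges(data=True)))
# e.g. [('senate', 'budget', {'action': 'approve'}), ('president', 'budget', {'action': 'veto'})]
</syntaxhighlight>

The resulting graph can then be analyzed with standard network-theory tools (centrality, community detection, robustness), as described above.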
Content analysis has been a traditional part of social sciences and media studies for a long time. The automation of content analysis has allowed a "big data" revolution to take place in that field, with studies in social media and newspaper content that include millions of news items.
Gender bias, readability, content similarity, reader preferences, and even mood have been analyzed based on text mining methods over millions of documents. The analysis of readability, gender bias and topic bias was demonstrated in Flaounas et al., showing how different topics have different gender biases and levels of readability; the possibility to detect mood patterns in a vast population by analyzing Twitter content was demonstrated as well.
Software
Text mining computer programs are available from many commercial and open source companies and sources. See List of text mining software.
Intellectual property law
Situation in Europe
Under
European copyright and
database laws, the mining of in-copyright works (such as by
web mining) without the permission of the copyright owner is illegal. In the UK in 2014, on the recommendation of the
Hargreaves review, the government amended copyright law to allow text mining as a
limitation and exception. It was the second country in the world to do so, following
Japan
, which introduced a mining-specific exception in 2009. However, owing to the restriction of the
Information Society Directive
(2001), the UK exception only allows content mining for non-commercial purposes. UK copyright law does not allow this provision to be overridden by contractual terms and conditions.
The
European Commission
facilitated stakeholder discussion on text and
data mining in 2013, under the title of Licenses for Europe. The fact that the focus on the solution to this legal issue was licenses, and not limitations and exceptions to copyright law, led representatives of universities, researchers, libraries, civil society groups and
open access
publishers to leave the stakeholder dialogue in May 2013.
Situation in the United States
US copyright law
, and in particular its
fair use
provisions, means that text mining in America, as well as other fair use countries such as Israel, Taiwan and South Korea, is viewed as being legal. As text mining is transformative, meaning that it does not supplant the original work, it is viewed as being lawful under fair use. For example, as part of the
Google Book settlement
the presiding judge on the case ruled that Google's digitization project of in-copyright books was lawful, in part because of the transformative uses that the digitization project displayed—one such use being text and data mining.
Implications
Until recently, websites most often used text-based searches, which only found documents containing specific user-defined words or phrases. Now, through use of a semantic web, text mining can find content based on meaning and context (rather than just by a specific word). Additionally, text mining software can be used to build large dossiers of information about specific people and events. For example, large datasets based on data extracted from news reports can be built to facilitate social networks analysis or counter-intelligence. In effect, the text mining software may act in a capacity similar to an intelligence analyst or research librarian, albeit with a more limited scope of analysis. Text mining is also used in some email spam filters as a way of determining the characteristics of messages that are likely to be advertisements or other unwanted material. Text mining plays an important role in determining financial market sentiment.
Future
Increasing interest is being paid to multilingual data mining: the ability to gain information across languages and cluster similar items from different linguistic sources according to their meaning.
The challenge of exploiting the large proportion of enterprise information that originates in "unstructured" form has been recognized for decades. It is recognized in the earliest definition of
business intelligence
(BI), in an October 1958 IBM Journal article by H.P. Luhn, A Business Intelligence System, which describes a system that will:
"...utilize data-processing machines for auto-abstracting and auto-encoding of documents and for creating interest profiles for each of the 'action points' in an organization. Both incoming and internally generated documents are automatically abstracted, characterized by a word pattern, and sent automatically to appropriate action points."
Yet as management information systems developed starting in the 1960s, and as BI emerged in the '80s and '90s as a software category and field of practice, the emphasis was on numerical data stored in relational databases. This is not surprising: text in "unstructured" documents is hard to process. The emergence of text analytics in its current form stems from a refocusing of research in the late 1990s from algorithm development to application, as described by Prof.
Marti A. Hearst in the paper Untangling Text Data Mining:
For almost a decade the computational linguistics community has viewed large text collections as a resource to be tapped in order to produce better text analysis algorithms. In this paper, I have attempted to suggest a new emphasis: the use of large online text collections to discover new facts and trends about the world itself. I suggest that to make progress we do not need fully artificial intelligent text analysis; rather, a mixture of computationally-driven and user-guided analysis may open the door to exciting new results.
Hearst's 1999 statement of need fairly well describes the state of text analytics technology and practice a decade later.
See also
* Concept mining
* Document processing
* Full text search
* List of text mining software
* Market sentiment
* Name resolution (semantics and text extraction)
* Named entity recognition
* News analytics
* Ontology learning
* Record linkage
* Sequential pattern mining (string and sequence mining)
* w-shingling
* Web mining, a task that may involve text mining (e.g. first find appropriate web pages by classifying crawled web pages, then extract the desired information from the text content of these pages considered relevant)
References
Citations
Sources
* Ananiadou, S. and McNaught, J. (Editors) (2006). ''Text Mining for Biology and Biomedicine''. Artech House Books.
* Bilisoly, R. (2008). ''Practical Text Mining with Perl''. New York: John Wiley & Sons.
* Feldman, R., and Sanger, J. (2006). ''The Text Mining Handbook''. New York: Cambridge University Press.
* Hotho, A., Nürnberger, A. and Paaß, G. (2005). "A brief survey of text mining". In Ldv Forum, Vol. 20(1), p. 19-62
* Indurkhya, N., and Damerau, F. (2010). ''Handbook Of Natural Language Processing'', 2nd Edition. Boca Raton, FL: CRC Press.
* Kao, A., and Poteet, S. (Editors). ''Natural Language Processing and Text Mining''. Springer.
* Konchady, M. ''Text Mining Application Programming (Programming Series)''. Charles River Media.
* Manning, C., and Schutze, H. (1999). ''Foundations of Statistical Natural Language Processing''. Cambridge, MA: MIT Press.
* Miner, G., Elder, J., Hill. T, Nisbet, R., Delen, D. and Fast, A. (2012). ''Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications''. Elsevier Academic Press.
* McKnight, W. (2005). "Building business intelligence: Text data mining in business intelligence". ''DM Review'', 21-22.
* Srivastava, A., and Sahami. M. (2009). ''Text Mining: Classification, Clustering, and Applications''. Boca Raton, FL: CRC Press.
* Zanasi, A. (Editor) (2007). ''Text Mining and its Applications to Intelligence, CRM and Knowledge Management''. WIT Press.
External links
Marti Hearst: What Is Text Mining? (October 2003)
Automatic Content Extraction, Linguistic Data Consortium
Automatic Content Extraction, NIST
{{Authority control
Applications of artificial intelligence
Applied data mining
Computational linguistics
Natural language processing
Statistical natural language processing
Text