Topic Modeling
   HOME

TheInfoList



OR:

In
statistics Statistics (from German language, German: ''wikt:Statistik#German, Statistik'', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of ...
and
natural language processing Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to pro ...
, a topic model is a type of
statistical model A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of Sample (statistics), sample data (and similar data from a larger Statistical population, population). A statistical model repres ...
for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear in documents about cats, and "the" and "is" will appear approximately equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. The "topics" produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document's balance of topics is. Topic models are also referred to as probabilistic topic models, which refers to statistical algorithms for discovering the latent semantic structures of an extensive text body. In the age of information, the amount of the written material we encounter each day is simply beyond our processing capacity. Topic models can help to organize and offer insights for us to understand large collections of unstructured text bodies. Originally developed as a text-mining tool, topic models have been used to detect instructive structures in data such as genetic information, images, and networks. They also have applications in other fields such as
bioinformatics Bioinformatics () is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combi ...
and
computer vision Computer vision is an interdisciplinary scientific field that deals with how computers can gain high-level understanding from digital images or videos. From the perspective of engineering, it seeks to understand and automate tasks that the hum ...
.


History

An early topic model was described by Papadimitriou, Raghavan, Tamaki and Vempala in 1998. Another one, called
probabilistic latent semantic analysis Probabilistic latent semantic analysis (PLSA), also known as probabilistic latent semantic indexing (PLSI, especially in information retrieval circles) is a statistical technique for the analysis of two-mode and co-occurrence data. In effect, one c ...
(PLSA), was created by Thomas Hofmann in 1999.
Latent Dirichlet allocation In natural language processing, Latent Dirichlet Allocation (LDA) is a generative statistical model that explains a set of observations through unobserved groups, and each group explains why some parts of the data are similar. The LDA is an exa ...
(LDA), perhaps the most common topic model currently in use, is a generalization of PLSA. Developed by
David Blei David Meir Blei is a professor in the Statistics and Computer Science departments at Columbia University. Prior to fall 2014 he was an associate professor in the Department of Computer Science at Princeton University. His work is primarily in mach ...
,
Andrew Ng Andrew Yan-Tak Ng (; born 1976) is a British-born American computer scientist and technology entrepreneur focusing on machine learning and AI. Ng was a co-founder and head of Google Brain and was the former Chief Scientist at Baidu, building ...
, and
Michael I. Jordan Michael Irwin Jordan (born February 25, 1956) is an American scientist, professor at the University of California, Berkeley and researcher in machine learning, statistics, and artificial intelligence. Jordan was elected a member of the Nation ...
in 2002, LDA introduces sparse Dirichlet prior distributions over document-topic and topic-word distributions, encoding the intuition that documents cover a small number of topics and that topics often use a small number of words. Other topic models are generally extensions on LDA, such as
Pachinko allocation In machine learning and natural language processing, the pachinko allocation model (PAM) is a topic model. Topic models are a suite of algorithms to uncover the hidden thematic structure of a collection of documents. The algorithm improves upon ...
, which improves on LDA by modeling correlations between topics in addition to the word correlations which constitute topics. Hierarchical latent tree analysis
HLTA
is an alternative to LDA, which models word co-occurrence using a tree of latent variables and the states of the latent variables, which correspond to soft clusters of documents, are interpreted as topics.


Topic models for context information

Approaches for temporal information include Block and Newman's determination of the temporal dynamics of topics in the ''
Pennsylvania Gazette ''The Pennsylvania Gazette'' was one of the United States' most prominent newspapers from 1728 until 1800. In the several years leading up to the American Revolution the paper served as a voice for colonial opposition to British colonial rule, ...
'' during 1728–1800. Griffiths & Steyvers used topic modeling on abstracts from the journal ''
PNAS ''Proceedings of the National Academy of Sciences of the United States of America'' (often abbreviated ''PNAS'' or ''PNAS USA'') is a peer-reviewed multidisciplinary scientific journal. It is the official journal of the National Academy of Scien ...
'' to identify topics that rose or fell in popularity from 1991 to 2001 whereas Lamba & Madhusushan used topic modeling on full-text research articles retrieved from DJLIT journal from 1981–2018. In the field of library and information science, Lamba & Madhusudhan applied topic modeling on different Indian resources like journal articles and electronic theses and resources (ETDs). Nelson has been analyzing change in topics over time in the ''
Richmond Times-Dispatch The ''Richmond Times-Dispatch'' (''RTD'' or ''TD'' for short) is the primary daily newspaper in Richmond, Virginia, Richmond, the capital of Virginia, and the primary newspaper of record for the state of Virginia. Circulation The ''Times-Dispatc ...
'' to understand social and political changes and continuities in Richmond during the
American Civil War The American Civil War (April 12, 1861 – May 26, 1865; also known by other names) was a civil war in the United States. It was fought between the Union ("the North") and the Confederacy ("the South"), the latter formed by states th ...
. Yang, Torget and Mihalcea applied topic modeling methods to newspapers from 1829–2008. Mimno used topic modelling with 24 journals on classical philology and archaeology spanning 150 years to look at how topics in the journals change over time and how the journals become more different or similar over time. Yin et al. introduced a topic model for geographically distributed documents, where document positions are explained by latent regions which are detected during inference. Chang and Blei included network information between linked documents in the relational topic model, to model the links between websites. The author-topic model by Rosen-Zvi et al. models the topics associated with authors of documents to improve the topic detection for documents with authorship information. HLTA was applied to a collection of recent research papers published at major AI and Machine Learning venues. The resulting model is calle
The AI Tree
The resulting topics are used to index the papers a
aipano.cse.ust.hk
to help researcher
track research trends and identify papers to read
and help conference organizers and journal editor
identify reviewers for submissions
To improve the qualitative aspects and coherency of generated topics, some researchers have explored the efficacy of "coherence scores", or otherwise how computer-extracted clusters (i.e. topics) align with a human benchmark. Coherence scores are metrics for optimising the number of topics to extract from a document corpus.


Algorithms

In practice, researchers attempt to fit appropriate model parameters to the data corpus using one of several heuristics for maximum likelihood fit. A recent survey by Blei describes this suite of algorithms. Several groups of researchers starting with Papadimitriou et al. have attempted to design algorithms with probable guarantees. Assuming that the data were actually generated by the model in question, they try to design algorithms that probably find the model that was used to create the data. Techniques used here include
singular value decomposition In linear algebra, the singular value decomposition (SVD) is a factorization of a real or complex matrix. It generalizes the eigendecomposition of a square normal matrix with an orthonormal eigenbasis to any \ m \times n\ matrix. It is related ...
(SVD) and the method of moments. In 2012 an algorithm based upon
non-negative matrix factorization Non-negative matrix factorization (NMF or NNMF), also non-negative matrix approximation is a group of algorithms in multivariate analysis and linear algebra where a matrix is factorized into (usually) two matrices and , with the property that ...
(NMF) was introduced that also generalizes to topic models with correlations among topics. In 2018 a new approach to topic models was proposed: it is based on stochastic block model


Topic models for quantitative biomedicine

Topic models are being used also in other contexts. For examples uses of topic models in biology and bioinformatics research emerged. Recently topic models has been used to extract information from dataset of cancers' genomic samples. In this case topics are biological latent variables to be inferred.


See also

*
Explicit semantic analysis In natural language processing and information retrieval, explicit semantic analysis (ESA) is a vectoral representation of text (individual words or entire documents) that uses a document corpus as a knowledge base. Specifically, in ESA, a word i ...
*
Latent semantic analysis Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the do ...
*
Latent Dirichlet allocation In natural language processing, Latent Dirichlet Allocation (LDA) is a generative statistical model that explains a set of observations through unobserved groups, and each group explains why some parts of the data are similar. The LDA is an exa ...
*
Hierarchical Dirichlet process In statistics and machine learning, the hierarchical Dirichlet process (HDP) is a nonparametric Bayesian approach to clustering grouped data. It uses a Dirichlet process for each group of data, with the Dirichlet processes for all groups sharing ...
*
Non-negative matrix factorization Non-negative matrix factorization (NMF or NNMF), also non-negative matrix approximation is a group of algorithms in multivariate analysis and linear algebra where a matrix is factorized into (usually) two matrices and , with the property that ...
*
Statistical classification In statistics, classification is the problem of identifying which of a set of categories (sub-populations) an observation (or observations) belongs to. Examples are assigning a given email to the "spam" or "non-spam" class, and assigning a diagno ...
*
Unsupervised learning Unsupervised learning is a type of algorithm that learns patterns from untagged data. The hope is that through mimicry, which is an important mode of learning in people, the machine is forced to build a concise representation of its world and t ...
*
Mallet (software project) MALLET is a Java "Machine Learning for Language Toolkit". Description MALLET is an integrated collection of Java code useful for statistical natural language processing, document classification, cluster analysis, information extraction, topic ...
*
Gensim Gensim is an open-source library for unsupervised topic modeling, document indexing, retrieval by similarity, and other natural language processing functionalities, using modern statistical machine learning. Gensim is implemented in Python and ...


References


Further reading

* * * * * *Jockers, M. 201
Who's your DH Blog Mate: Match-Making the Day of DH Bloggers with Topic Modeling
Matthew L. Jockers, posted 19 March 2010 *Drouin, J. 201
Foray Into Topic Modeling
Ecclesiastical Proust Archive. posted 17 March 2011 *Templeton, C. 201
Topic Modeling in the Humanities: An Overview
Maryland Institute for Technology in the Humanities Blog. posted 1 August 2011 * *Yang, T., A Torget and R. Mihalcea (2011) Topic Modeling on Historical Newspapers
Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities
The Association for Computational Linguistics, Madison, WI. pages 96–104. * *


External links

* *
Topic Models Applied to Online News and Reviews
Video of a Google Tech Talk presentation by Alice Oh on topic modeling with
LDA LDA may refer to: Aviation *Localizer type directional aid, an instrument approach to an airport *Landing distance available, the length of runway that is available for the ground run of an airplane landing Law *Legal document assistant, a non-la ...

Modeling Science: Dynamic Topic Models of Scholarly Research
Video of a Google Tech Talk presentation by David M. Blei
Automated Topic Models in Political Science
Video of a presentation by Brandon Stewart at th
Tools for Text Workshop
14 June 2010 *Shawn Graham, Ian Milligan, and Scott Weingart *Blei, David M


codedemo
- example of using LDA for topic modelling {{Natural Language Processing Statistical natural language processing Latent variable models Corpus linguistics