HOME

TheInfoList



OR:

Co-occurrence network, sometimes referred to as a
semantic network A semantic network, or frame network is a knowledge base that represents semantic relations between concepts in a network. This is often used as a form of knowledge representation. It is a directed or undirected graph consisting of vertices, ...
, is a method to analyze text that includes a graphic
visualization Visualization or visualisation may refer to: * Visualization (graphics), the physical or imagining creation of images, diagrams, or animations to communicate a message * Data visualization, the graphic representation of data * Information visuali ...
of potential relationships between people, organizations, concepts, biological organisms like bacteria or other entities represented within written material. The generation and visualization of
co-occurrence In linguistics, co-occurrence or cooccurrence is an above-chance frequency of occurrence of two terms (also known as coincidence or concurrence) from a text corpus alongside each other in a certain order. Co-occurrence in this linguistic sense ca ...
networks has become practical with the advent of electronically stored text compliant to
text mining Text mining, also referred to as ''text data mining'', similar to text analytics, is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extract ...
. By way of definition, co-occurrence networks are the collective
interconnection In telecommunications, interconnection is the physical linking of a carrier's network with equipment or facilities not belonging to that network. The term may refer to a connection between a carrier's facilities and the equipment belonging to ...
of terms based on their paired presence within a specified unit of text. Networks are generated by connecting pairs of terms using a set of criteria defining co-occurrence. For example, terms A and B may be said to “co-occur” if they both appear in a particular article. Another article may contain terms B and C. Linking A to B and B to C creates a co-occurrence network of these three terms. Rules to define co-occurrence within a
text corpus In linguistics, a corpus (plural ''corpora'') or text corpus is a language resource consisting of a large and structured set of texts (nowadays usually electronically stored and processed). In corpus linguistics, they are used to do statistical ...
can be set according to desired criteria. For example, a more stringent criteria for co-occurrence may require a pair of terms to appear in the same sentence. Co-occurrence networks were found to be particularly useful to analyze large text and big data, when identifying the main themes and topics (such as in a large number of social media posts), revealing biases in the text (such as biases in news coverage), or even mapping an entire research field.


Methods and development

The process of constructing co-occurrence networks includes identifying keywords in the text, calculating the frequencies of co-occurrences, and analyzing the networks to find central words and clusters of themes in the network. Co-occurrence networks can be created for any given list of terms (any dictionary) in relation to any collection of texts (any
text corpus In linguistics, a corpus (plural ''corpora'') or text corpus is a language resource consisting of a large and structured set of texts (nowadays usually electronically stored and processed). In corpus linguistics, they are used to do statistical ...
). Co-occurring pairs of terms can be called “neighbors” and these often group into “neighborhoods” based on their interconnections. Individual terms may have several neighbors. Neighborhoods may connect to one another through at least one individual term or may remain unconnected. Individual terms are, within the context of text mining, symbolically represented as
text string In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed, or it may be fixed (after creation). ...
s. In the real world, the entity identified by a term normally has several symbolic representations. It is therefore useful to consider terms as being represented by one primary symbol and up to several synonymous alternative symbols. Occurrence of an individual term is established by searching for each known symbolic representations of the term. The process can be augmented through NLP ( natural language processing) algorithms that interrogate segments of text for possible alternatives such as
word order In linguistics, word order (also known as linear order) is the order of the syntactic constituents of a language. Word order typology studies it from a cross-linguistic perspective, and examines how different languages employ different orders. C ...
, spacing and
hyphen The hyphen is a punctuation mark used to join words and to separate syllables of a single word. The use of hyphens is called hyphenation. ''Son-in-law'' is an example of a hyphenated word. The hyphen is sometimes confused with dashes ( figure ...
ation. NLP can also be used to identify sentence structure and categorize text strings according to grammar (for example, categorizing a string of text as a
noun A noun () is a word that generally functions as the name of a specific object or set of objects, such as living creatures, places, actions, qualities, states of existence, or ideas.Example nouns for: * Living creatures (including people, alive, ...
based on a preceding string of text known to be an article). Graphic representation of co-occurrence networks allow them to be visualized and inferences drawn regarding relationships between entities in the domain represented by the dictionary of terms applied to the text corpus. Meaningful visualization normally requires simplifications of the network. For example, networks may be drawn such that the number of neighbors connecting to each term is limited. The criteria for limiting neighbors might be based on the absolute number of co-occurrences or more subtle criteria such as “probability” of co-occurrence or the presence of an intervening descriptive term. Quantitative aspects of the underlying structure of a co-occurrence network might also be informative, such as the overall number of connections between entities, clustering of entities representing sub-domains, detecting synonyms, etc.


Applications and use

Some working applications of the co-occurrence approach are available to the public through the
internet The Internet (or internet) is the global system of interconnected computer networks that uses the Internet protocol suite (TCP/IP) to communicate between networks and devices. It is a '' network of networks'' that consists of private, pub ...
.
PubGene PubGene AS is a bioinformatics company located in Oslo, Norway and is the daughter company of PubGene Inc. In 2001, PubGene founders demonstrated one of the first applications of text mining to research in biomedicine (i.e., biomedical text mini ...
is an example of an application that addresses the interests of biomedical community by presenting networks based on the co-occurrence of genetics related terms as these appear in
MEDLINE MEDLINE (Medical Literature Analysis and Retrieval System Online, or MEDLARS Online) is a bibliographic database of life sciences and biomedical information. It includes bibliographic information for articles from academic journals covering medic ...
records. The website
NameBase NameBase is a web-based cross-indexed database of names that focuses on individuals involved in the international intelligence community, U.S. foreign policy, crime, and business. The focus is on the post-World War II era and on left of center, ...
is an example of how human relationships can be inferred by examining networks constructed from the co-occurrence of personal names in newspapers and other texts (as in Ozgur et al.). Networks of information are also used to facilitate efforts to organize and focus publicly available information for law enforcement and intelligence purposes (so called "
open source intelligence Open-source intelligence (OSINT) is the collection and analysis of data gathered from open sources (covert and publicly available sources) to produce actionable intelligence. OSINT is primarily used in national security, law enforcement, and busi ...
" or OSINT). Related techniques include co-citation networks as well as the analysis of hyperlink and content structure on the internet (such as in the analysis of web sites connected to terrorism).


See also

*
Topic spotting Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" (or "intellectually") ...
*
Social network analysis Social network analysis (SNA) is the process of investigating social structures through the use of networks and graph theory. It characterizes networked structures in terms of ''nodes'' (individual actors, people, or things within the network) ...


References

* {{commons category, Co-occurrence networks Biological databases Computational linguistics Data mining Medical search engines Intelligence gathering disciplines Open-source intelligence