Biological network inference is the process of making inferences and predictions about biological networks. By using these networks to analyze patterns in biological systems, such as food webs, we can visualize the nature and strength of the interactions between species, DNA, proteins, and more. The analysis of biological networks with respect to diseases has led to the development of the field of network medicine. Recent examples of the application of network theory in biology include applications to understanding the cell cycle as well as a quantitative framework for developmental processes.

Good network inference requires proper planning and execution of an experiment, thereby ensuring quality data acquisition. Optimal experimental design refers, in principle, to the use of statistical and/or mathematical concepts to plan data acquisition. This must be done in such a way that the information content of the data is enriched and a sufficient amount of data is collected, with enough technical and biological replicates where necessary.

Steps

The general cycle for modeling biological networks is as follows:
# Prior knowledge
#* Involves a thorough literature and database search or seeking an expert's opinion.
# Model selection
#* A formalism to model your system, usually an ordinary differential equation, a Boolean network, or linear regression models (e.g. least-angle regression), a Bayesian network, or an approach based on information theory. It can also be done by applying a correlation-based inference algorithm, as will be discussed below, an approach that is having increased success as the size of the available microarray sets keeps increasing.
# Hypothesis/assumptions
# Experimental design
# Data acquisition
#* Ensure that high-quality data is collected with all the required variables being measured.
# Network inference
#* This process is mathematically rigorous and computationally costly.
# Model refinement
#* Cross-check how well the results meet the expectations. The process is terminated upon obtaining a good model fit to the data; otherwise, the model needs to be re-adjusted.

Biological networks

A network is a set of nodes and a set of directed or undirected edges between the nodes. Many types of biological networks exist, including transcriptional, signalling and metabolic. Few such networks are known in anything approaching their complete structure, even in the simplest bacteria. Still less is known about the parameters governing the behavior of such networks over time, how the networks at different levels in a cell interact, and how to predict the complete state description of a eukaryotic cell or bacterial organism at a given point in the future. Systems biology, in this sense, is still in its infancy.

There is great interest in network medicine for the modelling of biological systems. This article focuses on inference of biological network structure using the growing sets of high-throughput expression data for genes, proteins, and metabolites. Briefly, methods using high-throughput data for inference of regulatory networks rely on searching for patterns of partial correlation or conditional probabilities that indicate causal influence. Such patterns of partial correlations found in the high-throughput data, possibly combined with other supplemental data on the genes or proteins in the proposed networks, or combined with other information on the organism, form the basis upon which such algorithms work. Such algorithms can be of use in inferring the topology of any network where the change in state of one node can affect the state of other nodes.
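To make the idea of partial correlation concrete, the following is a minimal sketch rather than any particular published inference method. It uses the standard first-order formula for the correlation of x and y given z; the function names and the toy expression values are assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Hypothetical toy data: z drives both x and y, plus small perturbations.
z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = [1.5, 1.5, 3.5, 3.5, 5.5, 5.5]
y = [1.5, 1.5, 2.5, 4.5, 5.5, 5.5]

r_xy = pearson(x, y)                # strong marginal correlation
r_xy_given_z = partial_corr(x, y, z)  # much weaker once z is accounted for
```

In this toy case x and y look strongly correlated, but most of that association disappears after conditioning on z, which is the kind of pattern used to argue against a direct edge between x and y.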

Transcriptional regulatory networks

Genes are the nodes and the edges are directed. A gene serves as the source of a direct regulatory edge to a target gene by producing an RNA or protein molecule that functions as a transcriptional activator or inhibitor of the target gene. If the gene is an activator, then it is the source of a positive regulatory connection; if an inhibitor, then it is the source of a negative regulatory connection. Computational algorithms take as primary input data measurements of mRNA expression levels of the genes under consideration for inclusion in the network, returning an estimate of the network topology. Such algorithms are typically based on linearity, independence or normality assumptions, which must be verified on a case-by-case basis. Clustering or some form of statistical classification is typically employed to perform an initial organization of the high-throughput mRNA expression values derived from microarray experiments, in particular to select sets of genes as candidates for network nodes. The question then arises: how can the clustering or classification results be connected to the underlying biology? Such results can be useful for pattern classification, for example to classify subtypes of cancer or to predict differential responses to a drug (pharmacogenomics). But to understand the relationships between the genes, that is, to more precisely define the influence of each gene on the others, the scientist typically attempts to reconstruct the transcriptional regulatory network.

Gene co-expression networks

A gene co-expression network is an undirected graph, where each node corresponds to a gene, and a pair of nodes is connected with an edge if there is a significant co-expression relationship between them.
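As a rough illustration of this definition, the sketch below connects two genes when the absolute Pearson correlation of their expression profiles reaches a cutoff. The fixed cutoff is an assumption standing in for a proper significance test, and the gene names and expression values are invented.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def coexpression_edges(expression, cutoff=0.8):
    """Return undirected edges (as sorted name pairs) with |r| >= cutoff."""
    edges = set()
    for g1, g2 in combinations(sorted(expression), 2):
        if abs(pearson(expression[g1], expression[g2])) >= cutoff:
            edges.add((g1, g2))
    return edges

# Hypothetical expression profiles across four samples.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 1.0, 3.0, 2.0],
}
edges = coexpression_edges(expression)
```

Only geneA and geneB are linked here, since geneC is weakly (and negatively) correlated with both.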

Signal transduction

Signal transduction networks use proteins for the nodes and directed edges to represent interactions in which the biochemical conformation of the child is modified by the action of the parent (e.g. mediated by phosphorylation, ubiquitylation, methylation, etc.). Primary input into the inference algorithm would be data from a set of experiments measuring protein activation/inactivation (e.g., phosphorylation/dephosphorylation) across a set of proteins. Inference for such signalling networks is complicated by the fact that total concentrations of signalling proteins fluctuate over time due to transcriptional and translational regulation. Such variation can lead to statistical confounding. Accordingly, more sophisticated statistical techniques must be applied to analyse such datasets, which is very important in the biology of cancer.

Metabolic network

Metabolite networks use nodes to represent chemical reactions and directed edges for the metabolic pathways and regulatory interactions that guide these reactions. Primary input into an algorithm would be data from a set of experiments measuring metabolite levels.

Protein-protein interaction networks

One of the most intensely studied networks in biology, protein-protein interaction networks (PINs) visualize the physical relationships between proteins inside a cell. In a PIN, proteins are the nodes and their interactions are the undirected edges. PINs can be discovered with a variety of methods, including two-hybrid screening and in vitro techniques such as co-immunoprecipitation and blue native gel electrophoresis, among others.

Neuronal network

A neuronal network represents neurons as nodes and synapses as edges, which are typically weighted and directed. The weights of the edges are usually adjusted by the activation of the connected nodes. The network is usually organized into input layers, hidden layers, and output layers.

Food webs

A food web is an interconnected directed graph of what eats what in an ecosystem. The members of the ecosystem are the nodes, and if one member eats another, there is a directed edge between those two nodes.

Within species and between species interaction networks

These networks are defined by a set of pairwise interactions between and within species, which are used to understand the structure and function of larger ecological networks. By using network analysis we can discover and understand how these interactions link together within the system's network. It also allows us to quantify associations between individuals, which makes it possible to infer details about the network as a whole at the species and/or population level.

DNA-DNA chromatin networks

DNA-DNA chromatin networks are used to clarify the activation or suppression of genes via the relative location of strands of chromatin. These interactions can be understood by analyzing commonalities amongst different loci, where a locus is a fixed position on a chromosome at which a particular gene or genetic marker is located. Network analysis can provide vital support in understanding relationships among different areas of the genome.

Gene regulatory networks

A gene regulatory network is a set of molecular regulators that interact with each other and with other substances in the cell. The regulator can be DNA, RNA, protein, or complexes of these. Gene regulatory networks can be modeled in numerous ways, including coupled ordinary differential equations, Boolean networks, continuous networks, and stochastic gene networks.
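Of the modelling formalisms listed, the Boolean network is the simplest to sketch. The example below is illustrative only: the three genes, their update rules, and the choice of synchronous updating are all assumptions, not a model of any real regulatory circuit.

```python
def step(state, rules):
    """Synchronously update every gene from the current state."""
    return {gene: rule(state) for gene, rule in rules.items()}

def simulate(state, rules, n_steps):
    """Return the list of states visited, starting with the initial state."""
    trajectory = [state]
    for _ in range(n_steps):
        state = step(state, rules)
        trajectory.append(state)
    return trajectory

# Hypothetical ring: C represses A, A activates B, B activates C.
rules = {
    "A": lambda s: not s["C"],
    "B": lambda s: s["A"],
    "C": lambda s: s["B"],
}
initial = {"A": True, "B": False, "C": False}
trajectory = simulate(initial, rules, 6)
```

With these rules the system never settles into a fixed point; it cycles through six states and returns to the initial one, a simple example of an attractor cycle.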

Network attributes

Data sources

The initial data used to make the inference can have a huge impact on the accuracy of the final inference. Network data is inherently noisy and incomplete, sometimes because evidence from multiple sources does not overlap or is contradictory. Data can be sourced in multiple ways, including manual curation of the scientific literature into databases, high-throughput datasets, computational predictions, and text mining of old scholarly articles from before the digital era.

Network diameter

A network's diameter is the maximum number of steps separating any two nodes. It can be used to determine how connected a graph is in topology analysis and clustering analysis.
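A minimal sketch of computing the diameter of an unweighted, undirected network with breadth-first search follows. It assumes the graph is connected and is stored as an adjacency dictionary; the example graph is made up.

```python
from collections import deque

def bfs_distances(adj, source):
    """Number of steps from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter(adj):
    """Largest shortest-path length over all node pairs (connected graph assumed)."""
    return max(max(bfs_distances(adj, s).values()) for s in adj)

# Hypothetical path-shaped network: a - b - c - d
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
d = diameter(adj)  # 3 steps separate a and d
```

For a disconnected graph this sketch silently ignores unreachable pairs, so a real analysis would need to decide how to treat separate components.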

Transitivity

The transitivity or clustering coefficient of a network is a measure of the tendency of the nodes to cluster together. High transitivity means that the network contains communities or groups of nodes that are densely connected internally. In biological networks, finding these communities is very important, because they can reflect functional modules and protein complexes. The uncertainty about the connectivity may distort the results and should be taken into account when the transitivity and other topological descriptors are computed for inferred networks.
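One common way to quantify this is the global transitivity, the fraction of connected node triples that are closed into triangles. The sketch below assumes an undirected graph stored as a dictionary of neighbour sets; the example network is invented.

```python
from itertools import combinations

def transitivity(adj):
    """Closed connected triples divided by all connected triples."""
    closed = 0
    triples = 0
    for node, neighbours in adj.items():
        for u, v in combinations(sorted(neighbours), 2):
            triples += 1          # u - node - v is a connected triple
            if v in adj[u]:
                closed += 1       # u and v are also linked: a triangle
    return closed / triples if triples else 0.0

# Hypothetical network: triangle a-b-c with a pendant node d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
t = transitivity(adj)
```

Here three of the five connected triples are closed, giving a transitivity of 0.6. This sketch treats every edge as certain, so it does not address the uncertainty in inferred edges mentioned above.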

Network confidence

Network confidence is a way to measure how sure one can be that the network represents a real biological interaction. This can be done via contextual biological information, by counting the number of times an interaction is reported in the literature, or by grouping different strategies into a single score. The MIscore method for assessing the reliability of protein-protein interaction data is based on the use of standards. MIscore gives an estimation of confidence weighting on all available evidence for an interacting pair of proteins. The method allows weighting of evidence provided by different sources, provided the data is represented following the standards created by the IMEx consortium. The weights are the number of publications, the detection method, and the interaction evidence type.

Closeness

Closeness, a.k.a. closeness centrality, is a measure of centrality in a network and is calculated as the reciprocal of the sum of the length of the shortest paths between the node and all other nodes in the graph. This measure can be used to make inferences in all graph types and analysis methods.
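The definition above can be sketched directly for an unweighted, connected, undirected graph. The adjacency-dictionary representation and the example graph are assumptions for illustration.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def closeness(adj, node):
    """Reciprocal of the summed shortest-path lengths from node to all others."""
    total = sum(bfs_distances(adj, node).values())
    return 1.0 / total if total else 0.0

# Hypothetical path-shaped network: a - b - c
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
scores = {v: closeness(adj, v) for v in adj}
```

The middle node b scores 1/2 while the end nodes score 1/3, reflecting that b is closer on average to the rest of the network. Some software normalizes this value by the number of other nodes; that variant is not shown here.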

Betweenness

Betweenness, a.k.a. betweenness centrality, is a measure of centrality in a graph based on shortest paths. The betweenness of a node is the number of shortest paths between other pairs of nodes that pass through that node.
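A sketch of betweenness for an unweighted, undirected graph is given below, using Brandes' accumulation scheme. Note one assumption that goes slightly beyond the text: when several shortest paths connect a pair of nodes, each path contributes a fractional share, which is the usual convention. The example graph is invented.

```python
from collections import deque

def betweenness(adj):
    """Betweenness centrality of every node (unweighted, undirected graph)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adj}  # predecessors on shortest paths
        order = []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        # Back-propagate path dependencies from the farthest nodes inward.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: b / 2 for v, b in bc.items()}

# Hypothetical star network: hub h connected to leaves x, y and z.
adj = {"h": ["x", "y", "z"], "x": ["h"], "y": ["h"], "z": ["h"]}
bc = betweenness(adj)
```

The hub lies on the shortest path between all three pairs of leaves, so it scores 3, while every leaf scores 0.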

Network analysis methods

For our purposes, network analysis is closely related to graph theory. By measuring the attributes in the previous section, we can utilize many different techniques to create accurate inferences based on biological data.

Topology analysis

Topology analysis examines the topology of a network to identify relevant participants and substructures that may be of biological significance. The term encompasses an entire class of techniques such as network motif search, centrality analysis, topological clustering, and shortest paths. These are but a few examples; each of these techniques uses the general idea of focusing on the topology of a network to make inferences.

Network Motif Search

A motif is defined as a frequent and unique sub-graph. By counting all the possible instances, listing all patterns, and testing isomorphisms, we can derive crucial information about a network. Motifs have been suggested to be the basic building blocks of complex biological networks. Computational research has focused on improving existing motif detection tools to assist biological investigations and allow larger networks to be analyzed. Several different algorithms have been proposed so far.
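As a small illustration of counting instances of one pattern, the sketch below counts feed-forward loops (a regulates b, b regulates c, and a also regulates c) in a directed network. The choice of this particular motif, the edge-list representation, and the example edges are assumptions; a full motif search would also compare the count against randomized networks, which is not shown.

```python
def count_feed_forward_loops(edges):
    """Count node triples (a, b, c) with edges a->b, b->c and a->c."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
        succ.setdefault(v, set())
    count = 0
    for a in succ:
        for b in succ[a]:
            if b == a:
                continue  # ignore self-loops
            for c in succ[b]:
                if c != a and c != b and c in succ[a]:
                    count += 1
    return count

# Hypothetical regulatory edges: X -> Y -> Z with a shortcut X -> Z, plus Z -> W.
edges = [("X", "Y"), ("Y", "Z"), ("X", "Z"), ("Z", "W")]
n_ffl = count_feed_forward_loops(edges)
```

This brute-force enumeration is only practical for small networks, which is part of why dedicated motif detection tools exist.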

Centrality Analysis

Centrality gives an estimation of how important a node or edge is for the connectivity or the information flow of the network. It is a useful parameter in signalling networks and is often used when trying to find drug targets. It is most commonly used in PINs to determine important proteins and their functions. Centrality can be measured in different ways depending on the graph and the question that needs answering. Options include the degree of a node (the number of edges connected to it), global centrality measures, and random walks, which are used by the Google PageRank algorithm to assign a weight to each webpage. Centrality measures may be affected by errors due to measurement noise and other causes. Therefore, the topological descriptors should be defined as random variables, with associated probability distributions encoding the uncertainty in their values.
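To illustrate the random-walk view of centrality, here is a minimal power-iteration sketch of PageRank. The damping factor of 0.85, the fixed iteration count, and the toy link structure are assumptions; this is a textbook formulation, not a description of any production implementation.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Approximate PageRank scores by repeated redistribution of rank."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for u in nodes:
            for v in out_links[u]:
                new_rank[v] += damping * rank[u] / len(out_links[u])
        rank = new_rank
    return rank

# Hypothetical directed network; every link target is also a key.
out_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = pagerank(out_links)
top = max(rank, key=rank.get)
```

Node C receives links from both A and B and ends up with the highest score. Degree centrality, by contrast, would simply count the edges at each node.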

Topological Clustering

Topological clustering, or topological data analysis (TDA), provides a general framework to analyze high-dimensional, incomplete, and noisy data in a way that reduces dimensionality and is robust to noise. The idea is that the shape of data sets contains relevant information. When this information is a homology group, there is a mathematical interpretation that assumes that features that persist for a wide range of parameters are "true" features and features persisting for only a narrow range of parameters are noise, although the theoretical justification for this is unclear. This technique has been used for progression analysis of disease, viral evolution, propagation of contagions on networks, bacteria classification using molecular spectroscopy, and much more in and outside of biology.

Shortest paths

The shortest path problem is a common problem in graph theory that tries to find the path between two vertices (or nodes) in a graph such that the sum of the weights of its constituent edges is minimized. This method can be used to determine the network diameter or redundancy in a network. There are many algorithms for this, including Dijkstra's algorithm, the Bellman–Ford algorithm, and the Floyd–Warshall algorithm, to name a few.
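As an example of one of these algorithms, the following is a compact sketch of Dijkstra's algorithm using a binary heap. It assumes non-negative edge weights and a graph stored as nested dictionaries; the node names and weights are invented.

```python
import heapq

def dijkstra(graph, source):
    """Minimum total edge weight from source to every reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter route was already found
        for v, weight in graph[u].items():
            candidate = d + weight
            if candidate < dist.get(v, float("inf")):
                dist[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return dist

# Hypothetical weighted directed graph.
graph = {
    "A": {"B": 1, "C": 4},
    "B": {"C": 2, "D": 5},
    "C": {"D": 1},
    "D": {},
}
distances = dijkstra(graph, "A")
```

The cheapest route from A to D goes through B and C for a total weight of 4, rather than taking the heavier direct edges. Graphs with negative weights would need Bellman–Ford instead.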

Clustering analysis

Cluster analysis groups objects (nodes) such that objects in the same cluster are more similar to each other than to those in other clusters. This can be used to perform pattern recognition, image analysis, information retrieval, statistical data analysis, and much more. It has applications in plant and animal ecology, sequence analysis, antimicrobial activity analysis, and many other fields. Cluster analysis algorithms also come in many forms, such as hierarchical clustering, k-means clustering, distribution-based clustering, density-based clustering, and grid-based clustering.
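Of the algorithm families listed, k-means is perhaps the easiest to sketch. The version below is a bare-bones illustration: the initial centroids are passed in explicitly (real implementations usually choose them randomly or with a seeding heuristic), and the two-dimensional points are invented.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centroids, iterations=20):
    """Alternate between assigning points and recomputing centroids."""
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(pt, centroids[k]))
            for pt in points
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for k in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[k])  # keep an empty cluster in place
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids

# Hypothetical measurements forming two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
```

With these starting centroids the first three points end up in one cluster and the last three in the other. Results on less cleanly separated data can depend heavily on the initial centroids.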

Annotation enrichment analysis

Gene annotation databases are commonly used to evaluate the functional properties of experimentally derived gene sets. Annotation Enrichment Analysis (AEA) is used to overcome biases introduced by the overlap-based statistical methods used to assess these associations. It does this by using gene/protein annotations to infer which annotations are over-represented in a list of genes/proteins taken from a network.
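For context, the sketch below shows the conventional overlap statistic that enrichment methods are usually compared against: a one-sided hypergeometric test for over-representation. It is not an implementation of AEA itself, and the function name and the counts are invented for illustration.

```python
from math import comb

def overrepresentation_pvalue(population, annotated, selected, overlap):
    """P(at least `overlap` annotated genes among `selected` drawn at random).

    population: total number of genes considered
    annotated:  genes in the population carrying the annotation
    selected:   size of the gene list taken from the network
    overlap:    genes in the list that carry the annotation
    """
    upper = min(annotated, selected)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, selected - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(population, selected)

# Hypothetical counts: 20 genes, 5 annotated, a list of 4 genes, 3 of them annotated.
p_value = overrepresentation_pvalue(20, 5, 4, 3)
```

For these counts the p-value is 155/4845, roughly 0.032. In practice such tests are run for many annotations at once, so multiple-testing correction would also be needed.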
