Biclustering, block clustering, Co-clustering or two-mode clustering is a data mining technique which allows simultaneous clustering of the rows and columns of a

matrix Matrix most commonly refers to: * ''The Matrix'' (franchise), an American media franchise ** ''The Matrix'', a 1999 science-fiction action film ** "The Matrix", a fictional setting, a virtual reality environment, within ''The Matrix'' (franchis ...

. The term was first introduced by Boris Mirkin to name a technique introduced many years earlier, in 1972, by J. A. Hartigan. Given a set of

m

samples represented by an

n

-dimensional feature vector, the entire dataset can be represented as

m

rows in

n

columns (i.e., an

m \times n

matrix). The Biclustering algorithm generates Biclusters, a subset of rows which exhibit similar behavior across a subset of columns, or vice versa.

Development

Biclustering was originally introduced by J. A. Hartigan in 1972. The term "Biclustering" was then later used and refined by Mirkin. This algorithm was not generalized until 2000, when Y. Cheng and G. M. Church proposed a Biclustering algorithm based on variance and applied it to biological gene expression data. In 2001 and 2003, I.S. Dhillon published two algorithms applying Biclustering to files and words. One version was based on bipartite spectral graph partitioning. The other was based on information theory. Dhillon assumed the loss of

mutual information In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the " amount of information" (in units such ...

during Biclustering was equal to the Kullback–Leibler-distance (KL-distance) between P and Q. P represents the distribution of files and feature words before Biclustering, while Q is the distribution after Biclustering. KL-distance is for measuring the difference between two random distributions. KL = 0 when the two distributions are the same and KL increases as the difference increases. Thus, the aim of the algorithm was to find the minimum KL-distance between P and Q. In 2004, Arindam Banerjee used a weighted-Bregman distance instead of KL-distance to design a Biclustering algorithm that was suitable for any kind of matrix, unlike the KL-distance algorithm. To cluster more than two types of objects, in 2005, Bekkerman expanded the mutual information in Dhillon's theorem from a single pair into multiple pairs.

Complexity

The complexity of the Biclustering problem depends on the exact problem formulation, and particularly on the merit function used to evaluate the quality of a given Bicluster. However, the most interesting variants of this problem are

NP-complete In computational complexity theory, a problem is NP-complete when: # it is a problem for which the correctness of each solution can be verified quickly (namely, in polynomial time) and a brute-force search algorithm can find a solution by tryi ...

. NP-complete has two conditions. In the simple case that there is an only element ''a''_{(''i'',''j'')} either 0 or 1 in the binary matrix A, a Bicluster is equal to a biclique in the corresponding

bipartite graph In the mathematical field of graph theory, a bipartite graph (or bigraph) is a graph whose vertices can be divided into two disjoint and independent sets U and V, that is every edge connects a vertex in U to one in V. Vertex sets U and V are ...

. The maximum size Bicluster is equivalent to the maximum edge biclique in the bipartite graph. In the complex case, the element in matrix A is used to compute the quality of a given Bicluster and solve the more restricted version of the problem. It requires either large

computational Computation is any type of arithmetic or non-arithmetic calculation that follows a well-defined model (e.g., an algorithm). Mechanical or electronic devices (or, historically, people) that perform computations are known as ''computers''. An espe ...

effort or the use of lossy

heuristics A heuristic (; ), or heuristic technique, is any approach to problem solving or self-discovery that employs a practical method that is not guaranteed to be optimal, perfect, or rational, but is nevertheless sufficient for reaching an immediate, ...

to short-circuit the calculation.

Types of Biclusters

Bicluster with constant values (a) When a Biclustering algorithm tries to find a constant-value Bicluster, it reorders the rows and columns of the matrix to group together similar rows and columns, eventually grouping Biclusters with similar values. This method is sufficient when the data is normalized. A ''perfect constant Bicluster'' is a matrix(I,J) in which all values ''a(i,j)'' are equal to a given constant μ. In tangible data, these entries ''a(i,j)'' may be represented with the form ''n(i,j) + μ'' where ''n(i,j)'' denotes the

noise Noise is unwanted sound considered unpleasant, loud or disruptive to hearing. From a physics standpoint, there is no distinction between noise and desired sound, as both are vibrations through a medium, such as air or water. The difference arise ...

. According to Hartigan's algorithm, by splitting the original data matrix into a set of Biclusters,

variance In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its population mean or sample mean. Variance is a measure of dispersion, meaning it is a measure of how far a set of numbers ...

is used to compute constant Biclusters. Hence, a perfect Bicluster may be equivalently defined as a matrix with a variance of zero. In order to prevent the partitioning of the data matrix into Biclusters with the only one row and one column; Hartigan assumes that there are, for example, ''K'' Biclusters within the data matrix. When the data matrix is partitioned into ''K'' Biclusters, the algorithm ends. Bicluster with constant values on rows (b) or columns (c) Unlike the constant-value Biclusters, these types of Biclusters cannot be evaluated solely based on the variance of their values. To finish the identification, the columns and the rows should be normalized first. There are, however, other algorithms, without the normalization step, that can find Biclusters which have rows and columns with different approaches. Bicluster with coherent values (d, e) For Biclusters with coherent values on rows and columns, an overall improvement over the algorithms for Biclusters with constant values on rows or on columns should be considered. This algorithm may contain analysis of variance between groups, using co-variance between both rows and columns. In Cheng and Church's theorem, a Bicluster is defined as a subset of rows and columns with almost the same score. The similarity score is used to measure the coherence of rows and columns. The relationship between these cluster models and other types of clustering such as

correlation clustering Clustering is the problem of partitioning data points into groups based on their similarity. Correlation clustering provides a method for clustering a set of objects into the optimum number of clusters without specifying that number in advance. De ...

is discussed in.

Algorithms

There are many Biclustering

algorithms In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific problems or to perform a computation. Algorithms are used as specifications for performing c ...

developed for

bioinformatics Bioinformatics () is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combi ...

, including: block clustering, CTWC (Coupled Two-Way Clustering), ITWC (Interrelated Two-Way Clustering), δ-bicluster, δ-pCluster, δ-pattern, FLOC, OPC, Plaid Model, OPSMs (Order-preserving submatrixes), Gibbs, SAMBA (Statistical-Algorithmic Method for Bicluster Analysis), Robust Biclustering Algorithm (RoBA), Crossing Minimization, cMonkey, PRMs, DCC, LEB (Localize and Extract Biclusters), QUBIC (QUalitative BIClustering), BCCA (Bi-Correlation Clustering Algorithm) BIMAX, ISA and FABIA (

Factor analysis Factor analysis is a statistical method used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors. For example, it is possible that variations in six observed ...

for Bicluster Acquisition), runibic, and recently proposed hybrid method EBIC (evolutionary-based Biclustering), which was shown to detect multiple patterns with very high accuracy. More recently, IMMD-CC is proposed that is developed based on the iterative complexity reduction concept. IMMD-CC is able to identify co-cluster centroids from highly sparse transformation obtained by iterative multi-mode discretization. Biclustering algorithms have also been proposed and used in other application fields under the names co-clustering, bi-dimensional clustering, and subspace clustering. Given the known importance of discovering local patterns in

time-series data In mathematics, a time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data. Exa ...

. Recent proposals have addressed the Biclustering problem in the specific case of time-series

gene expression Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product that enables it to produce end products, protein or non-coding RNA, and ultimately affect a phenotype, as the final effect. The ...

data. In this case, the interesting Biclusters can be restricted to those with

contiguous Contiguity or contiguous may refer to: *Contiguous data storage, in computer science *Contiguity (probability theory) *Contiguity (psychology) *Contiguous distribution of species, in biogeography *Geographic contiguity of territorial land *Contigu ...

columns. This restriction leads to a

tractable problem In theoretical computer science and mathematics, computational complexity theory focuses on classifying computational problems according to their resource usage, and relating these classes to each other. A computational problem is a task solved by ...

and enables the development of efficient exhaustive

enumeration An enumeration is a complete, ordered listing of all the items in a collection. The term is commonly used in mathematics and computer science to refer to a listing of all of the elements of a set. The precise requirements for an enumeration (fo ...

algorithms such as CCC-Biclustering and ''e''-CCC-Biclustering. The approximate patterns in CCC-Biclustering algorithms allow a given number of errors, per gene, relatively to an expression profile representing the expression pattern in the Bicluster. The e-CCC-Biclustering algorithm uses approximate expressions to find and report all maximal CCC-Bicluster's by a discretized matrix A and efficient string processing techniques. These

algorithm In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific Computational problem, problems or to perform a computation. Algorithms are used as specificat ...

s find and report all maximal Biclusters with coherent and contiguous columns with perfect/approximate expression patterns, in time linear/

polynomial In mathematics, a polynomial is an expression consisting of indeterminates (also called variables) and coefficients, that involves only the operations of addition, subtraction, multiplication, and positive-integer powers of variables. An exa ...

which is obtained by manipulating a discretized version of original expression matrix in the size of the time-series gene expression

using efficient

string processing String functions are used in computer programming languages to manipulate a string or query information about a string (some do both). Most programming languages that have a string datatype will have some string functions although there may ...

techniques based on

suffix tree In computer science, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particu ...

s. These algorithms are also applied to solve problems and sketch the analysis of computational complexity. Some recent algorithms have attempted to include additional support for Biclustering rectangular matrices in the form of other

datatype In computer science and computer programming, a data type (or simply type) is a set of possible values and a set of allowed operations on it. A data type tells the compiler or interpreter how the programmer intends to use the data. Most progra ...

s, including cMonkey. There is an ongoing debate about how to judge the results of these methods, as Biclustering allows overlap between clusters and some

allow the exclusion of hard-to-reconcile columns/conditions. Not all of the available algorithms are deterministic and the analyst must pay attention to the degree to which results represent stable

minima In mathematical analysis, the maxima and minima (the respective plurals of maximum and minimum) of a function, known collectively as extrema (the plural of extremum), are the largest and smallest value of the function, either within a given ran ...

. Because this is an

unsupervised classification Unsupervised learning is a type of algorithm that learns patterns from untagged data. The hope is that through mimicry, which is an important mode of learning in people, the machine is forced to build a concise representation of its world and t ...

problem, the lack of a

gold standard A gold standard is a monetary system in which the standard economic unit of account is based on a fixed quantity of gold. The gold standard was the basis for the international monetary system from the 1870s to the early 1920s, and from the la ...

makes it difficult to spot errors in the results. One approach is to utilize multiple Biclustering algorithms, with the majority or

super-majority A supermajority, supra-majority, qualified majority, or special majority is a requirement for a proposal to gain a specified level of support which is greater than the threshold of more than one-half used for a simple majority. Supermajority ru ...

voting amongst them to decide the best result. Another way is to analyze the quality of shifting and scaling patterns in Biclusters. Biclustering has been used in the domain of

text mining Text mining, also referred to as ''text data mining'', similar to text analytics, is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extract ...

(or classification) which is popularly known as co-clustering. Text corpora are represented in a vectoral form as a

D whose rows denote the documents and whose columns denote the words in the dictionary. Matrix elements D_ij denote occurrence of word j in document i. Co-clustering algorithms are then applied to discover blocks in D that correspond to a group of documents (rows) characterized by a group of words(columns). Text clustering can solve the high-dimensional sparse problem, which means clustering text and words at the same time. When clustering text, we need to think about not only the words information, but also the information of words clusters that was composed by words. Then, according to similarity of feature words in the text, will eventually cluster the feature words. This is called co-clustering. There are two advantages of co-clustering: one is clustering the test based on words clusters can extremely decrease the dimension of clustering, it can also appropriate to measure the distance between the tests. Second is mining more useful information and can get the corresponding information in test clusters and words clusters. This corresponding information can be used to describe the type of texts and words, at the same time, the result of words clustering can be also used to text mining and information retrieval. Several approaches have been proposed based on the information contents of the resulting blocks: matrix-based approaches such as

SVD ''Svenska Dagbladet'' (, "The Swedish Daily News"), abbreviated SvD, is a daily newspaper published in Stockholm, Sweden. History and profile The first issue of ''Svenska Dagbladet'' appeared on 18 December 1884. During the beginning of the ...

and BVD, and graph-based approaches.

Information-theoretic Information theory is the scientific study of the quantification, storage, and communication of information. The field was originally established by the works of Harry Nyquist and Ralph Hartley, in the 1920s, and Claude Shannon in the 1940s. T ...

algorithms

iterative Iteration is the repetition of a process in order to generate a (possibly unbounded) sequence of outcomes. Each repetition of the process is a single iteration, and the outcome of each iteration is then the starting point of the next iteration. ...

ly assign each row to a cluster of documents and each column to a cluster of words such that the mutual information is maximized. Matrix-based methods focus on the decomposition of matrices into blocks such that the error between the original matrix and the regenerated matrices from the decomposition is minimized. Graph-based methods tend to minimize the cuts between the clusters. Given two groups of documents d₁ and d₂, the number of cuts can be measured as the number of words that occur in documents of groups d₁ and d₂. More recently (Bisson and Hussain) have proposed a new approach of using the similarity between words and the similarity between documents to co-cluster the matrix. Their method (known as χ-Sim, for cross similarity) is based on finding document-document similarity and word-word similarity, and then using classical clustering methods such as

hierarchical clustering In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis that seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into ...

. Instead of explicitly clustering rows and columns alternately, they consider higher-order occurrences of words, inherently taking into account the documents in which they occur. Thus, the similarity between two words is calculated based on the documents in which they occur and also the documents in which "similar" words occur. The idea here is that two documents about the same topic do not necessarily use the same set of words to describe it, but a subset of the words and other similar words that are characteristic of that topic. This approach of taking higher-order similarities takes the latent semantic structure of the whole corpus into consideration with the result of generating a better clustering of the documents and words. In text databases, for a document collection defined by a document by term D matrix (of size m by n, m: number of documents, n: number of terms) the cover-coefficient based clustering methodology yields the same number of clusters both for documents and terms (words) using a double-stage probability experiment. According to the cover coefficient concept number of clusters can also be roughly estimated by the following formula

(m \times n) / t

where t is the number of non-zero entries in D. Note that in D each row and each column must contain at least one non-zero element. In contrast to other approaches, FABIA is a multiplicative model that assumes realistic non-Gaussian signal distributions with

heavy tails In probability theory, heavy-tailed distributions are probability distributions whose tails are not exponentially bounded: that is, they have heavier tails than the exponential distribution. In many applications it is the right tail of the distrib ...

. FABIA utilizes well understood model selection techniques like variational approaches and applies the

Bayesian Thomas Bayes (/beɪz/; c. 1701 – 1761) was an English statistician, philosopher, and Presbyterian minister. Bayesian () refers either to a range of concepts and approaches that relate to statistical methods based on Bayes' theorem, or a follower ...

framework. The generative framework allows FABIA to determine the

information content In information theory, the information content, self-information, surprisal, or Shannon information is a basic quantity derived from the probability of a particular event occurring from a random variable. It can be thought of as an alternative wa ...

of each Bicluster to separate spurious Biclusters from true Biclusters.

References

Others

* N.K. Verma, S. Bajpai, A. Singh, A. Nagrare, S. Meena, Yan Cui, "A Comparison of Biclustering Algorithms" in International conference on Systems in Medicine and Biology (ICSMB 2010)in IIT Kharagpur India, pp. 90–97, Dec. 16–18. * J. Gupta, S. Singh and N.K. Verma "MTBA: MATLAB Toolbox for Biclustering Analysis", IEEE Workshop on Computational Intelligence: Theories, Applications and Future Directions", IIT Kanpur India, pp. 148–152, Jul. 2013. * A. Tanay. R. Sharan, and R. Shamir, "Biclustering Algorithms: A Survey", In ''Handbook of Computational Molecular Biology'', Edited by

Srinivas Aluru Srinivas Aluru is a professor in the School of Computational Science and Engineering at Georgia Institute of Technology, and co-Executive Director for the Georgia Tech Interdisciplinary Research Institute in Data Engineering and Science. His main ...

, Chapman (2004) * {{cite journal , vauthors=Kluger Y, Basri R, Chang JT, Gerstein MB , year = 2003 , title = Spectral Biclustering of Microarray Data: Coclustering Genes and Conditions , journal = Genome Research , volume = 13 , issue = 4, pages = 703–716 , doi = 10.1101/gr.648603 , pmid = 12671006 , pmc = 430175 * Adetayo Kasim, Ziv Shkedy, Sebastian Kaiser, Sepp Hochreiter, Willem Talloen (2016), Applied Biclustering Methods for Big and High-Dimensional Data Using R, Chapman & Hall/CRC Press * Orzechowski, P., Sipper, M., Huang, X., & Moore, J. H. (2018). EBIC: an evolutionary-based parallel biclustering algorithm for pattern discovery. ''Bioinformatics''.

External links

FABIA: Factor Analysis for Bicluster Acquisition, an R package
—software Cluster analysis Bioinformatics