Multifactor dimensionality reduction (MDR) is a statistical approach, also used in

machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...

automatic approaches, for detecting and characterizing combinations of

attribute Attribute may refer to: * Attribute (philosophy), an extrinsic property of an object * Attribute (research), a characteristic of an object * Grammatical modifier, in natural languages * Attribute (computing), a specification that defines a prope ...

s or independent variables that interact to influence a dependent or class variable. MDR was designed specifically to identify nonadditive

interaction Interaction is action that occurs between two or more objects, with broad use in philosophy and the sciences. It may refer to: Science * Interaction hypothesis, a theory of second language acquisition * Interaction (statistics) * Interactions o ...

s among

discrete Discrete may refer to: *Discrete particle or quantum in physics, for example in quantum theory *Discrete device, an electronic component with just one circuit element, either passive or active, other than an integrated circuit *Discrete group, a g ...

variables that influence a

binary Binary may refer to: Science and technology Mathematics * Binary number, a representation of numbers using only two digits (0 and 1) * Binary function, a function that takes two arguments * Binary operation, a mathematical operation that ta ...

outcome and is considered a

nonparametric Nonparametric statistics is the branch of statistics that is not based solely on Statistical parameter, parametrized families of probability distributions (common examples of parameters are the mean and variance). Nonparametric statistics is based ...

and model-free alternative to traditional statistical methods such as

logistic regression In statistics, the logistic model (or logit model) is a statistical model that models the probability of an event taking place by having the log-odds for the event be a linear combination of one or more independent variables. In regression a ...

. The basis of the MDR method is a constructive induction or

feature engineering Feature engineering or feature extraction or feature discovery is the process of using domain knowledge to extract features (characteristics, properties, attributes) from raw data. The motivation is to use these extra features to improve the qua ...

algorithm that converts two or more variables or attributes to a single attribute. This process of constructing a new attribute changes the representation space of the data. The end goal is to create or discover a representation that facilitates the detection of

nonlinear In mathematics and science, a nonlinear system is a system in which the change of the output is not proportional to the change of the input. Nonlinear problems are of interest to engineers, biologists, physicists, mathematicians, and many othe ...

or nonadditive interactions among the attributes such that prediction of the class variable is improved over that of the original representation of the data.

Illustrative example

Consider the following simple example using the

exclusive OR Exclusive or or exclusive disjunction is a logical operation that is true if and only if its arguments differ (one is true, the other is false). It is symbolized by the prefix operator J and by the infix operators XOR ( or ), EOR, EXOR, , ...

(XOR) function. XOR is a

logical operator In logic, a logical connective (also called a logical operator, sentential connective, or sentential operator) is a logical constant. They can be used to connect logical formulas. For instance in the syntax of propositional logic, the binary ...

that is commonly used in data mining and

as an example of a function that is not linearly separable. The table below represents a simple dataset where the relationship between the attributes (X1 and X2) and the class variable (Y) is defined by the XOR function such that Y = X1 XOR X2. Table 1 A

algorithm would need to discover or approximate the XOR function in order to accurately predict Y using information about X1 and X2. An alternative strategy would be to first change the representation of the data using constructive induction to facilitate predictive modeling. The MDR algorithm would change the representation of the data (X1 and X2) in the following manner. MDR starts by selecting two attributes. In this simple example, X1 and X2 are selected. Each combination of values for X1 and X2 are examined and the number of times Y=1 and/or Y=0 is counted. In this simple example, Y=1 occurs zero times and Y=0 occurs once for the combination of X1=0 and X2=0. With MDR, the ratio of these counts is computed and compared to a fixed threshold. Here, the ratio of counts is 0/1 which is less than our fixed threshold of 1. Since 0/1 < 1 we encode a new attribute (Z) as a 0. When the ratio is greater than one we encode Z as a 1. This process is repeated for all unique combinations of values for X1 and X2. Table 2 illustrates our new transformation of the data. Table 2 The machine learning algorithm now has much less work to do to find a good predictive function. In fact, in this very simple example, the function Y = Z has a classification accuracy of 1. A nice feature of constructive induction methods such as MDR is the ability to use any data mining or machine learning method to analyze the new representation of the data.

Decision trees A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains condit ...

, neural networks, or a naive Bayes classifier could be used in combination with measures of model quality such as balanced accuracy and mutual information.

Machine learning with MDR

As illustrated above, the basic constructive induction algorithm in MDR is very simple. However, its implementation for mining patterns from real data can be computationally complex. As with any machine learning algorithm there is always concern about overfitting. That is, machine learning algorithms are good at finding patterns in completely random data. It is often difficult to determine whether a reported pattern is an important signal or just chance. One approach is to estimate the generalizability of a model to independent datasets using methods such as cross-validation. Models that describe random data typically don't generalize. Another approach is to generate many random permutations of the data to see what the data mining algorithm finds when given the chance to overfit. Permutation testing makes it possible to generate an empirical p-value for the result. Replication in independent data may also provide evidence for an MDR model but can be sensitive to difference in the data sets. These approaches have all been shown to be useful for choosing and evaluating MDR models. An important step in a machine learning exercise is interpretation. Several approaches have been used with MDR including entropy analysis and pathway analysis. Tips and approaches for using MDR to model gene-gene interactions have been reviewed.

Extensions to MDR

Numerous extensions to MDR have been introduced. These include family-based methods, fuzzy methods, covariate adjustment, odds ratios, risk scores, survival methods, robust methods, methods for quantitative traits, and many others.

Applications of MDR

MDR has mostly been applied to detecting gene-gene interactions or epistasis in genetic studies of common human diseases such as atrial fibrillation, autism,

bladder cancer Bladder cancer is any of several types of cancer arising from the tissues of the urinary bladder. Symptoms include blood in the urine, pain with urination, and low back pain. It is caused when epithelial cells that line the bladder become ma ...

breast cancer Breast cancer is cancer that develops from breast tissue. Signs of breast cancer may include a lump in the breast, a change in breast shape, dimpling of the skin, milk rejection, fluid coming from the nipple, a newly inverted nipple, or a r ...

, cardiovascular disease, hypertension,

obesity Obesity is a medical condition, sometimes considered a disease, in which excess body fat has accumulated to such an extent that it may negatively affect health. People are classified as obese when their body mass index (BMI)—a person's ...

, pancreatic cancer, prostate cancer and

tuberculosis Tuberculosis (TB) is an infectious disease usually caused by '' Mycobacterium tuberculosis'' (MTB) bacteria. Tuberculosis generally affects the lungs, but it can also affect other parts of the body. Most infections show no symptoms, i ...

. It has also been applied to other biomedical problems such as the genetic analysis of pharmacology outcomes. A central challenge is the scaling of MDR to big data such as that from

genome-wide association studies In genomics, a genome-wide association study (GWA study, or GWAS), also known as whole genome association study (WGA study, or WGAS), is an observational study of a genome-wide set of genetic variants in different individuals to see if any varian ...

(GWAS). Several approaches have been used. One approach is to filter the features prior to MDR analysis. This can be done using biological knowledge through tools such as BioFilter. It can also be done using computational tools such as ReliefF. Another approach is to use

stochastic search Stochastic optimization (SO) methods are optimization methods that generate and use random variables. For stochastic problems, the random variables appear in the formulation of the optimization problem itself, which involves random objective functi ...

algorithms such as

genetic programming In artificial intelligence, genetic programming (GP) is a technique of evolving programs, starting from a population of unfit (usually random) programs, fit for a particular task by applying operations analogous to natural genetic processes to t ...

to explore the search space of feature combinations. Yet another approach is a brute-force search using

high-performance computing High-performance computing (HPC) uses supercomputers and computer clusters to solve advanced computation problems. Overview HPC integrates systems administration (including network and security knowledge) and parallel programming into a mult ...

Implementations

www.epistasis.org
provides an open-source and freely-available MDR software package. * A
R package
for MDR. * An sklearn-compatibl
Python implementation
* A

for Model-Based MDR. * MDR in Weka.
Generalized MDR

Illustrative example

Machine learning with MDR

Extensions to MDR

Applications of MDR

Implementations

See also

References

Further reading