HOME

TheInfoList



OR:

In
machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...
and
statistics Statistics (from German language, German: ''wikt:Statistik#German, Statistik'', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of ...
, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant
features Feature may refer to: Computing * Feature (CAD), could be a hole, pocket, or notch * Feature (computer vision), could be an edge, corner or blob * Feature (software design) is an intentional distinguishing characteristic of a software item ...
(variables, predictors) for use in model construction. Feature selection techniques are used for several reasons: :* simplification of models to make them easier to interpret by researchers/users, :* shorter training times, :* to avoid the
curse of dimensionality The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience. The ...
, :*improve data's compatibility with a learning model class, :*encode inherent
symmetries Symmetry (from grc, συμμετρία "agreement in dimensions, due proportion, arrangement") in everyday language refers to a sense of harmonious and beautiful proportion and balance. In mathematics, "symmetry" has a more precise definiti ...
present in the input space. The central premise when using a feature selection technique is that the data contains some features that are either ''redundant'' or ''irrelevant'', and can thus be removed without incurring much loss of information. ''Redundant'' and ''irrelevant'' are two distinct notions, since one relevant feature may be redundant in the presence of another relevant feature with which it is strongly correlated. Feature selection techniques should be distinguished from
feature extraction In machine learning, pattern recognition, and image processing, feature extraction starts from an initial set of measured data and builds derived values (features) intended to be informative and non-redundant, facilitating the subsequent learning a ...
. Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the features. Feature selection techniques are often used in domains where there are many features and comparatively few samples (or data points). Archetypal cases for the application of feature selection include the analysis of written texts and
DNA microarray A DNA microarray (also commonly known as DNA chip or biochip) is a collection of microscopic DNA spots attached to a solid surface. Scientists use DNA microarrays to measure the expression levels of large numbers of genes simultaneously or to ...
data, where there are many thousands of features, and a few tens to hundreds of samples.


Introduction

A feature selection algorithm can be seen as the combination of a search technique for proposing new feature subsets, along with an evaluation measure which scores the different feature subsets. The simplest algorithm is to test each possible subset of features finding the one which minimizes the error rate. This is an exhaustive search of the space, and is computationally intractable for all but the smallest of feature sets. The choice of evaluation metric heavily influences the algorithm, and it is these evaluation metrics which distinguish between the three main categories of feature selection algorithms: wrappers, filters and embedded methods. * Wrapper methods use a predictive model to score feature subsets. Each new subset is used to train a model, which is tested on a hold-out set. Counting the number of mistakes made on that hold-out set (the error rate of the model) gives the score for that subset. As wrapper methods train a new model for each subset, they are very computationally intensive, but usually provide the best performing feature set for that particular type of model or typical problem. * Filter methods use a proxy measure instead of the error rate to score a feature subset. This measure is chosen to be fast to compute, while still capturing the usefulness of the feature set. Common measures include the
mutual information In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the " amount of information" (in units such ...
, the
pointwise mutual information In statistics, probability theory and information theory, pointwise mutual information (PMI), or point mutual information, is a measure of association. It compares the probability of two events occurring together to what this probability would be i ...
,
Pearson product-moment correlation coefficient In statistics, the Pearson correlation coefficient (PCC, pronounced ) ― also known as Pearson's ''r'', the Pearson product-moment correlation coefficient (PPMCC), the bivariate correlation, or colloquially simply as the correlation coefficient ...
, Relief-based algorithms, and inter/intra class distance or the scores of significance tests for each class/feature combinations. Filters are usually less computationally intensive than wrappers, but they produce a feature set which is not tuned to a specific type of predictive model. This lack of tuning means a feature set from a filter is more general than the set from a wrapper, usually giving lower prediction performance than a wrapper. However the feature set doesn't contain the assumptions of a prediction model, and so is more useful for exposing the relationships between the features. Many filters provide a feature ranking rather than an explicit best feature subset, and the cut off point in the ranking is chosen via cross-validation. Filter methods have also been used as a preprocessing step for wrapper methods, allowing a wrapper to be used on larger problems. One other popular approach is the Recursive Feature Elimination algorithm, commonly used with
Support Vector Machines In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratorie ...
to repeatedly construct a model and remove features with low weights. * Embedded methods are a catch-all group of techniques which perform feature selection as part of the model construction process. The exemplar of this approach is the
LASSO A lasso ( or ), also called lariat, riata, or reata (all from Castilian, la reata 're-tied rope'), is a loop of rope designed as a restraint to be thrown around a target and tightened when pulled. It is a well-known tool of the Spanish an ...
method for constructing a linear model, which penalizes the regression coefficients with an L1 penalty, shrinking many of them to zero. Any features which have non-zero regression coefficients are 'selected' by the LASSO algorithm. Improvements to the LASSO include Bolasso which bootstraps samples;
Elastic net regularization In statistics and, in particular, in the fitting of linear or logistic regression models, the elastic net is a regularized regression method that linearly combines the L1 and L2 penalties of the lasso and ridge methods. Specification The elas ...
, which combines the L1 penalty of LASSO with the L2 penalty of
ridge regression Ridge regression is a method of estimating the coefficients of multiple-regression models in scenarios where the independent variables are highly correlated. It has been used in many fields including econometrics, chemistry, and engineering. Also ...
; and FeaLect which scores all the features based on combinatorial analysis of regression coefficients. AEFS further extends LASSO to nonlinear scenario with autoencoders. These approaches tend to be between filters and wrappers in terms of computational complexity. In traditional
regression analysis In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome' or 'response' variable, or a 'label' in machine learning parlance) and one ...
, the most popular form of feature selection is
stepwise regression In statistics, stepwise regression is a method of fitting regression models in which the choice of predictive variables is carried out by an automatic procedure. In each step, a variable is considered for addition to or subtraction from the set of ...
, which is a wrapper technique. It is a
greedy algorithm A greedy algorithm is any algorithm that follows the problem-solving heuristic of making the locally optimal choice at each stage. In many problems, a greedy strategy does not produce an optimal solution, but a greedy heuristic can yield locally ...
that adds the best feature (or deletes the worst feature) at each round. The main control issue is deciding when to stop the algorithm. In machine learning, this is typically done by cross-validation. In statistics, some criteria are optimized. This leads to the inherent problem of nesting. More robust methods have been explored, such as
branch and bound Branch and bound (BB, B&B, or BnB) is an algorithm design paradigm for discrete and combinatorial optimization problems, as well as mathematical optimization. A branch-and-bound algorithm consists of a systematic enumeration of candidate soluti ...
and piecewise linear network.


Subset selection

Subset selection evaluates a subset of features as a group for suitability. Subset selection algorithms can be broken up into wrappers, filters, and embedded methods. Wrappers use a
search algorithm In computer science, a search algorithm is an algorithm designed to solve a search problem. Search algorithms work to retrieve information stored within particular data structure, or calculated in the search space of a problem domain, with eith ...
to search through the space of possible features and evaluate each subset by running a model on the subset. Wrappers can be computationally expensive and have a risk of over fitting to the model. Filters are similar to wrappers in the search approach, but instead of evaluating against a model, a simpler filter is evaluated. Embedded techniques are embedded in, and specific to, a model. Many popular search approaches use greedy
hill climbing numerical analysis, hill climbing is a mathematical optimization technique which belongs to the family of local search. It is an iterative algorithm that starts with an arbitrary solution to a problem, then attempts to find a better solution ...
, which iteratively evaluates a candidate subset of features, then modifies the subset and evaluates if the new subset is an improvement over the old. Evaluation of the subsets requires a scoring
metric Metric or metrical may refer to: * Metric system, an internationally adopted decimal system of measurement * An adjective indicating relation to measurement in general, or a noun describing a specific type of measurement Mathematics In mathema ...
that grades a subset of features. Exhaustive search is generally impractical, so at some implementor (or operator) defined stopping point, the subset of features with the highest score discovered up to that point is selected as the satisfactory feature subset. The stopping criterion varies by algorithm; possible criteria include: a subset score exceeds a threshold, a program's maximum allowed run time has been surpassed, etc. Alternative search-based techniques are based on
targeted projection pursuit Targeted projection pursuit is a type of statistical technique used for exploratory data analysis, information visualization, and feature selection. It allows the user to interactively explore very complex data (typically having tens to hundreds o ...
which finds low-dimensional projections of the data that score highly: the features that have the largest projections in the lower-dimensional space are then selected. Search approaches include: * Exhaustive * Best first *
Simulated annealing Simulated annealing (SA) is a probabilistic technique for approximating the global optimum of a given function. Specifically, it is a metaheuristic to approximate global optimization in a large search space for an optimization problem. It ...
*
Genetic algorithm In computer science and operations research, a genetic algorithm (GA) is a metaheuristic inspired by the process of natural selection that belongs to the larger class of evolutionary algorithms (EA). Genetic algorithms are commonly used to gene ...
* Greedy forward selection * Greedy backward elimination * Particle swarm optimization *
Targeted projection pursuit Targeted projection pursuit is a type of statistical technique used for exploratory data analysis, information visualization, and feature selection. It allows the user to interactively explore very complex data (typically having tens to hundreds o ...
* Scatter search *
Variable neighborhood search Variable neighborhood search (VNS), proposed by Mladenović & Hansen in 1997, is a metaheuristic method for solving a set of combinatorial optimization and global optimization problems. It explores distant neighborhoods of the current incumbent sol ...
Two popular filter metrics for classification problems are
correlation In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data. Although in the broadest sense, "correlation" may indicate any type of association, in statistics ...
and
mutual information In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the " amount of information" (in units such ...
, although neither are true
metrics Metric or metrical may refer to: * Metric system, an internationally adopted decimal system of measurement * An adjective indicating relation to measurement in general, or a noun describing a specific type of measurement Mathematics In mathema ...
or 'distance measures' in the mathematical sense, since they fail to obey the
triangle inequality In mathematics, the triangle inequality states that for any triangle, the sum of the lengths of any two sides must be greater than or equal to the length of the remaining side. This statement permits the inclusion of degenerate triangles, but ...
and thus do not compute any actual 'distance' – they should rather be regarded as 'scores'. These scores are computed between a candidate feature (or set of features) and the desired output category. There are, however, true metrics that are a simple function of the mutual information; see
here Here is an adverb that means "in, on, or at this place". It may also refer to: Software * Here Technologies, a mapping company * Here WeGo (formerly Here Maps), a mobile app and map website by Here Television * Here TV (formerly "here!"), a TV ...
. Other available filter metrics include: * Class separability ** Error probability ** Inter-class distance ** Probabilistic distance **
Entropy Entropy is a scientific concept, as well as a measurable physical property, that is most commonly associated with a state of disorder, randomness, or uncertainty. The term and the concept are used in diverse fields, from classical thermodynam ...
* Consistency-based feature selection * Correlation-based feature selection


Optimality criteria

The choice of optimality criteria is difficult as there are multiple objectives in a feature selection task. Many common criteria incorporate a measure of accuracy, penalised by the number of features selected. Examples include
Akaike information criterion The Akaike information criterion (AIC) is an estimator of prediction error and thereby relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model, relative to e ...
(AIC) and Mallows's ''Cp'', which have a penalty of 2 for each added feature. AIC is based on
information theory Information theory is the scientific study of the quantification (science), quantification, computer data storage, storage, and telecommunication, communication of information. The field was originally established by the works of Harry Nyquist a ...
, and is effectively derived via the
maximum entropy principle The principle of maximum entropy states that the probability distribution which best represents the current state of knowledge about a system is the one with largest entropy, in the context of precisely stated prior data (such as a proposition ...
. Other criteria are
Bayesian information criterion In statistics, the Bayesian information criterion (BIC) or Schwarz information criterion (also SIC, SBC, SBIC) is a criterion for model selection among a finite set of models; models with lower BIC are generally preferred. It is based, in part, on ...
(BIC), which uses a penalty of \sqrt for each added feature,
minimum description length Minimum Description Length (MDL) is a model selection principle where the shortest description of the data is the best model. MDL methods learn through a data compression perspective and are sometimes described as mathematical applications of Occam ...
(MDL) which asymptotically uses \sqrt,
Bonferroni Carlo Emilio Bonferroni (28 January 1892 – 18 August 1960) was an Italian mathematician who worked on probability theory. Biography Bonferroni studied piano and conducting in Turin Conservatory and at University of Turin under Giuseppe Peano ...
/ RIC which use \sqrt, maximum dependency feature selection, and a variety of new criteria that are motivated by
false discovery rate In statistics, the false discovery rate (FDR) is a method of conceptualizing the rate of type I errors in null hypothesis testing when conducting multiple comparisons. FDR-controlling procedures are designed to control the FDR, which is the expe ...
(FDR), which use something close to \sqrt. A maximum
entropy rate In the mathematical theory of probability, the entropy rate or source information rate of a stochastic process is, informally, the time density of the average information in a stochastic process. For stochastic processes with a countable index, the ...
criterion may also be used to select the most relevant subset of features.


Structure learning

Filter feature selection is a specific case of a more general paradigm called structure learning. Feature selection finds the relevant feature set for a specific target variable whereas structure learning finds the relationships between all the variables, usually by expressing these relationships as a graph. The most common structure learning algorithms assume the data is generated by a
Bayesian Network A Bayesian network (also known as a Bayes network, Bayes net, belief network, or decision network) is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). Bay ...
, and so the structure is a
directed Director may refer to: Literature * ''Director'' (magazine), a British magazine * ''The Director'' (novel), a 1971 novel by Henry Denker * ''The Director'' (play), a 2000 play by Nancy Hasty Music * Director (band), an Irish rock band * ''Di ...
graphical model A graphical model or probabilistic graphical model (PGM) or structured probabilistic model is a probabilistic model for which a Graph (discrete mathematics), graph expresses the conditional dependence structure between random variables. They are ...
. The optimal solution to the filter feature selection problem is the
Markov blanket In statistics and machine learning, when one wants to infer a random variable with a set of variables, usually a subset is enough, and other variables are useless. Such a subset that contains all the useful information is called a Markov blanket. ...
of the target node, and in a Bayesian Network, there is a unique Markov Blanket for each node.


Information Theory Based Feature Selection Mechanisms

There are different Feature Selection mechanisms around that utilize
mutual information In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the " amount of information" (in units such ...
for scoring the different features. They usually use all the same algorithm: # Calculate the
mutual information In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the " amount of information" (in units such ...
as score for between all features ( f_ \in F ) and the target class () # Select the feature with the largest score (e.g. \underset\operatorname(I(f_,c))) and add it to the set of selected features () # Calculate the score which might be derived from the
mutual information In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the " amount of information" (in units such ...
# Select the feature with the largest score and add it to the set of select features (e.g. \underset\operatorname(I_(f_,c))) # Repeat 3. and 4. until a certain number of features is selected (e.g. , S, =l) The simplest approach uses the
mutual information In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the " amount of information" (in units such ...
as the "derived" score.
/ref> However, there are different approaches, that try to reduce the redundancy between features.


Minimum-redundancy-maximum-relevance (mRMR) feature selection

Peng ''et al.'' proposed a feature selection method that can use either mutual information, correlation, or distance/similarity scores to select features. The aim is to penalise a feature's relevancy by its redundancy in the presence of the other selected features. The relevance of a feature set for the class is defined by the average value of all mutual information values between the individual feature and the class as follows: : D(S,c) = \frac\sum_I(f_;c) . The redundancy of all features in the set is the average value of all mutual information values between the feature and the feature : : R(S) = \frac\sum_I(f_;f_) The mRMR criterion is a combination of two measures given above and is defined as follows: :\mathrm= \max_ \left frac\sum_I(f_;c) - \frac\sum_I(f_;f_)\right Suppose that there are full-set features. Let be the set membership
indicator function In mathematics, an indicator function or a characteristic function of a subset of a set is a function that maps elements of the subset to one, and all other elements to zero. That is, if is a subset of some set , one has \mathbf_(x)=1 if x\i ...
for feature , so that indicates presence and indicates absence of the feature in the globally optimal feature set. Let c_i=I(f_i;c) and a_=I(f_i;f_j). The above may then be written as an optimization problem: :\mathrm= \max_ \left frac - \frac \right The mRMR algorithm is an approximation of the theoretically optimal maximum-dependency feature selection algorithm that maximizes the mutual information between the joint distribution of the selected features and the classification variable. As mRMR approximates the combinatorial estimation problem with a series of much smaller problems, each of which only involves two variables, it thus uses pairwise joint probabilities which are more robust. In certain situations the algorithm may underestimate the usefulness of features as it has no way to measure interactions between features which can increase relevancy. This can lead to poor performance when the features are individually useless, but are useful when combined (a pathological case is found when the class is a
parity function In Boolean algebra, a parity function is a Boolean function whose value is one if and only if the input vector has an odd number of ones. The parity function of two inputs is also known as the XOR function. The parity function is notable for its ...
of the features). Overall the algorithm is more efficient (in terms of the amount of data required) than the theoretically optimal max-dependency selection, yet produces a feature set with little pairwise redundancy. mRMR is an instance of a large class of filter methods which trade off between relevancy and redundancy in different ways.Nguyen, H., Franke, K., Petrovic, S. (2010). "Towards a Generic Feature-Selection Measure for Intrusion Detection", In Proc. International Conference on Pattern Recognition (ICPR), Istanbul, Turkey

/ref>


Quadratic programming feature selection

mRMR is a typical example of an incremental greedy strategy for feature selection: once a feature has been selected, it cannot be deselected at a later stage. While mRMR could be optimized using floating search to reduce some features, it might also be reformulated as a global
quadratic programming Quadratic programming (QP) is the process of solving certain mathematical optimization problems involving quadratic functions. Specifically, one seeks to optimize (minimize or maximize) a multivariate quadratic function subject to linear constra ...
optimization problem as follows: : \mathrm: \min_\mathbf \left\ \quad \mbox \ \sum_^n x_i=1, x_i\geq 0 where F_= (f_1;c),\ldots, I(f_n;c)T is the vector of feature relevancy assuming there are features in total, H_= (f_i;f_j) is the matrix of feature pairwise redundancy, and \mathbf_ represents relative feature weights. QPFS is solved via quadratic programming. It is recently shown that QFPS is biased towards features with smaller entropy, due to its placement of the feature self redundancy term I(f_i;f_i) on the diagonal of .


Conditional mutual information

Another score derived for the mutual information is based on the conditional relevancy:Nguyen X. Vinh, Jeffrey Chan, Simone Romano and James Bailey, "Effective Global Approaches for Mutual Information based Feature Selection". Proceedings of the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'14), August 24–27, New York City, 2014.

: \mathrm: \max_ \left\ \quad \mbox\ \, \mathbf\, =1, x_i\geq 0 where Q_=I(f_i;c) and Q_=I(f_i;c, f_j), i\ne j. An advantage of is that it can be solved simply via finding the dominant eigenvector of , thus is very scalable. also handles second-order feature interaction.


Joint mutual information

In a study of different scores Brown et al. recommended the joint mutual information as a good score for feature selection. The score tries to find the feature, that adds the most new information to the already selected features, in order to avoid redundancy. The score is formulated as follows: : \begin JMI(f_i) &= \sum_ (I(f_i;c) + I(f_i;c, f_j)) \\ &= \sum_ \bigl c)\bigr)\bigr\end The score uses the
conditional mutual information In probability theory, particularly information theory, the conditional mutual information is, in its most basic form, the expected value of the mutual information of two random variables given the value of a third. Definition For random var ...
and the
mutual information In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the " amount of information" (in units such ...
to estimate the redundancy between the already selected features ( f_j \in S ) and the feature under investigation (f_i).


Hilbert-Schmidt Independence Criterion Lasso based feature selection

For high-dimensional and small sample data (e.g., dimensionality > and the number of samples < ), the Hilbert-Schmidt Independence Criterion Lasso (HSIC Lasso) is useful. HSIC Lasso optimization problem is given as : \mathrm: \min_ \frac\sum_^n x_k x_l (f_k,f_l) - \sum_^n x_k (f_k,c) + \lambda \, \mathbf\, _1, \quad \mbox \ x_1,\ldots, x_n \geq 0, where (f_k,c) =\mbox(\bar^ \bar) is a kernel-based independence measure called the (empirical) Hilbert-Schmidt independence criterion (HSIC), \mbox(\cdot) denotes the
trace Trace may refer to: Arts and entertainment Music * Trace (Son Volt album), ''Trace'' (Son Volt album), 1995 * Trace (Died Pretty album), ''Trace'' (Died Pretty album), 1993 * Trace (band), a Dutch progressive rock band * The Trace (album), ''The ...
, \lambda is the regularization parameter, \bar^ = \mathbf \mathbf^ \mathbf and \bar = \mathbf \mathbf \mathbf are input and output centered Gram matrices, K^_ = K(u_,u_) and L_ = L(c_i,c_j) are Gram matrices, K(u,u') and L(c,c') are kernel functions, \mathbf = \mathbf_m - \frac\mathbf_m \mathbf_m^T is the centering matrix, \mathbf_m is the -dimensional
identity matrix In linear algebra, the identity matrix of size n is the n\times n square matrix with ones on the main diagonal and zeros elsewhere. Terminology and notation The identity matrix is often denoted by I_n, or simply by I if the size is immaterial o ...
(: the number of samples), \mathbf_m is the -dimensional vector with all ones, and \, \cdot\, _ is the \ell_1-norm. HSIC always takes a non-negative value, and is zero if and only if two random variables are statistically independent when a universal reproducing kernel such as the Gaussian kernel is used. The HSIC Lasso can be written as : \mathrm: \min_ \frac\left\, \bar - \sum_^ x_k \bar^ \right\, ^2_ + \lambda \, \mathbf\, _1, \quad \mbox \ x_1,\ldots,x_n \geq 0, where \, \cdot\, _ is the
Frobenius norm In mathematics, a matrix norm is a vector norm in a vector space whose elements (vectors) are matrices (of given dimensions). Preliminaries Given a field K of either real or complex numbers, let K^ be the -vector space of matrices with m rows ...
. The optimization problem is a Lasso problem, and thus it can be efficiently solved with a state-of-the-art Lasso solver such as the dual
augmented Lagrangian method Augmented Lagrangian methods are a certain class of algorithms for solving constrained optimization problems. They have similarities to penalty methods in that they replace a constrained optimization problem by a series of unconstrained problems a ...
.


Correlation feature selection

The correlation feature selection (CFS) measure evaluates subsets of features on the basis of the following hypothesis: "Good feature subsets contain features highly correlated with the classification, yet uncorrelated to each other". The following equation gives the merit of a feature subset ''S'' consisting of ''k'' features: : \mathrm_ = \frac. Here, \overline is the average value of all feature-classification correlations, and \overline is the average value of all feature-feature correlations. The CFS criterion is defined as follows: :\mathrm = \max_ \left frac \right The r_ and r_ variables are referred to as correlations, but are not necessarily
Pearson's correlation coefficient In statistics, the Pearson correlation coefficient (PCC, pronounced ) ― also known as Pearson's ''r'', the Pearson product-moment correlation coefficient (PPMCC), the bivariate correlation, or colloquially simply as the correlation coefficient ...
or Spearman's ρ. Hall's dissertation uses neither of these, but uses three different measures of relatedness,
minimum description length Minimum Description Length (MDL) is a model selection principle where the shortest description of the data is the best model. MDL methods learn through a data compression perspective and are sometimes described as mathematical applications of Occam ...
(MDL), symmetrical uncertainty, and
relief Relief is a sculptural method in which the sculpted pieces are bonded to a solid background of the same material. The term ''relief'' is from the Latin verb ''relevo'', to raise. To create a sculpture in relief is to give the impression that the ...
. Let ''xi'' be the set membership
indicator function In mathematics, an indicator function or a characteristic function of a subset of a set is a function that maps elements of the subset to one, and all other elements to zero. That is, if is a subset of some set , one has \mathbf_(x)=1 if x\i ...
for feature ''fi''; then the above can be rewritten as an optimization problem: :\mathrm = \max_ \left frac \right The combinatorial problems above are, in fact, mixed 0–1
linear programming Linear programming (LP), also called linear optimization, is a method to achieve the best outcome (such as maximum profit or lowest cost) in a mathematical model whose requirements are represented by linear function#As a polynomial function, li ...
problems that can be solved by using branch-and-bound algorithms.


Regularized trees

The features from a
decision tree A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains condit ...
or a tree
ensemble Ensemble may refer to: Art * Architectural ensemble * ''Ensemble'' (album), Kendji Girac 2015 album * Ensemble (band), a project of Olivier Alary * Ensemble cast (drama, comedy) * Ensemble (musical theatre), also known as the chorus * ''En ...
are shown to be redundant. A recent method called regularized treeH. Deng, G. Runger,
Feature Selection via Regularized Trees
, Proceedings of the 2012 International Joint Conference on Neural Networks (IJCNN), IEEE, 2012
can be used for feature subset selection. Regularized trees penalize using a variable similar to the variables selected at previous tree nodes for splitting the current node. Regularized trees only need build one tree model (or one tree ensemble model) and thus are computationally efficient. Regularized trees naturally handle numerical and categorical features, interactions and nonlinearities. They are invariant to attribute scales (units) and insensitive to
outlier In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to a variability in the measurement, an indication of novel data, or it may be the result of experimental error; the latter are ...
s, and thus, require little
data preprocessing Data preprocessing can refer to manipulation or dropping of data before it is used in order to ensure or enhance performance, and is an important step in the data mining process. The phrase "garbage in, garbage out" is particularly applicable to ...
such as
normalization Normalization or normalisation refers to a process that makes something more normal or regular. Most commonly it refers to: * Normalization (sociology) or social normalization, the process through which ideas and behaviors that may fall outside of ...
. Regularized random forest (RRF)RRF: Regularized Random Forest
R package on CRAN
is one type of regularized trees. The guided RRF is an enhanced RRF which is guided by the importance scores from an ordinary random forest.


Overview on metaheuristics methods

A
metaheuristic In computer science and mathematical optimization, a metaheuristic is a higher-level procedure or heuristic designed to find, generate, or select a heuristic (partial search algorithm) that may provide a sufficiently good solution to an optimizati ...
is a general description of an algorithm dedicated to solve difficult (typically
NP-hard In computational complexity theory, NP-hardness ( non-deterministic polynomial-time hardness) is the defining property of a class of problems that are informally "at least as hard as the hardest problems in NP". A simple example of an NP-hard pr ...
problem) optimization problems for which there is no classical solving methods. Generally, a metaheuristic is a stochastic algorithm tending to reach a global optimum. There are many metaheuristics, from a simple local search to a complex global search algorithm.


Main principles

The feature selection methods are typically presented in three classes based on how they combine the selection algorithm and the model building.


Filter method

Filter type methods select variables regardless of the model. They are based only on general features like the correlation with the variable to predict. Filter methods suppress the least interesting variables. The other variables will be part of a classification or a regression model used to classify or to predict data. These methods are particularly effective in computation time and robust to overfitting. Filter methods tend to select redundant variables when they do not consider the relationships between variables. However, more elaborate features try to minimize this problem by removing variables highly correlated to each other, such as the Fast Correlation Based Filter (FCBF) algorithm.


Wrapper method

Wrapper methods evaluate subsets of variables which allows, unlike filter approaches, to detect the possible interactions amongst variables.T. M. Phuong, Z. Lin et R. B. Altman
Choosing SNPs using feature selection.
Proceedings / IEEE Computational Systems Bioinformatics Conference, CSB. IEEE Computational Systems Bioinformatics Conference, pages 301-309, 2005. .
The two main disadvantages of these methods are: * The increasing overfitting risk when the number of observations is insufficient. * The significant computation time when the number of variables is large.


Embedded method

Embedded methods have been recently proposed that try to combine the advantages of both previous methods. A learning algorithm takes advantage of its own variable selection process and performs feature selection and classification simultaneously, such as the FRMT algorithm.


Application of feature selection metaheuristics

This is a survey of the application of feature selection metaheuristics lately used in the literature. This survey was realized by J. Hammon in her 2013 thesis.


Feature selection embedded in learning algorithms

Some learning algorithms perform feature selection as part of their overall operation. These include: * -regularization techniques, such as sparse regression, LASSO, and -SVM * Regularized trees, e.g. regularized random forest implemented in the RRF package *
Decision tree A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains condit ...
*
Memetic algorithm A memetic algorithm (MA) in computer science and operations research, is an extension of the traditional genetic algorithm. It may provide a sufficiently good solution to an optimization problem. It uses a local search technique to reduce the like ...
*
Random multinomial logit Random forests or random decision forests is an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of th ...
(RMNL) * Auto-encoding networks with a bottleneck-layer *
Submodular In mathematics, a submodular set function (also known as a submodular function) is a set function whose value, informally, has the property that the difference in the incremental value of the function that a single element makes when added to an ...
feature selection * Local learning based feature selection. Compared with traditional methods, it does not involve any heuristic search, can easily handle multi-class problems, and works for both linear and nonlinear problems. It is also supported by a strong theoretical foundation. Numeric experiments showed that the method can achieve a close-to-optimal solution even when data contains >1M irrelevant features. * Recommender system based on feature selection.D.H. Wang, Y.C. Liang, D.Xu, X.Y. Feng, R.C. Guan(2018),
A content-based recommender system for computer science publications
, ''
Knowledge-Based Systems A knowledge-based system (KBS) is a computer program that reasons and uses a knowledge base to solve complex problems. The term is broad and refers to many different kinds of systems. The one common theme that unites all knowledge based systems is ...
'', 157: 1-9
The feature selection methods are introduced into recommender system research.


See also

*
Cluster analysis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of ...
* Data mining *
Dimensionality reduction Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally ...
*
Feature extraction In machine learning, pattern recognition, and image processing, feature extraction starts from an initial set of measured data and builds derived values (features) intended to be informative and non-redundant, facilitating the subsequent learning a ...
*
Hyperparameter optimization In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process. By contrast, the va ...
*
Model selection Model selection is the task of selecting a statistical model from a set of candidate models, given data. In the simplest cases, a pre-existing set of data is considered. However, the task can also involve the design of experiments such that the ...
*
Relief (feature selection) Relief is an algorithm developed by Kira and Rendell in 1992 that takes a filter-method approach to feature selection that is notably sensitive to feature interactions. It was originally designed for application to binary classification problems w ...


References


Further reading

* * * * {{cite journal , first1=Huan , last1=Liu , first2=Lei , last2=Yu , title=Toward Integrating Feature Selection Algorithms for Classification and Clustering , journal=IEEE Transactions on Knowledge and Data Engineering , volume=17 , issue=4 , year=2005 , pages=491–502 , doi=10.1109/TKDE.2005.66 , s2cid=1607600


External links


Feature Selection Package, Arizona State University (Matlab Code)

NIPS challenge 2003
(see also NIPS)
Naive Bayes implementation with feature selection in Visual Basic
(includes executable and source code)

* ttp://mloss.org/software/view/386/ FEAST(Open source Feature Selection algorithms in C and MATLAB) Model selection Dimension reduction