In
machine learning
Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence.
Machine ...
and
statistics
Statistics (from German language, German: ''wikt:Statistik#German, Statistik'', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of ...
, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant
features
Feature may refer to:
Computing
* Feature (CAD), could be a hole, pocket, or notch
* Feature (computer vision), could be an edge, corner or blob
* Feature (software design) is an intentional distinguishing characteristic of a software item ...
(variables, predictors) for use in model construction. Feature selection techniques are used for several reasons:
:* simplification of models to make them easier to interpret by researchers/users,
:* shorter training times,
:* to avoid the
curse of dimensionality
The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience. The ...
,
:*improve data's compatibility with a learning model class,
:*encode inherent
symmetries
Symmetry (from grc, συμμετρία "agreement in dimensions, due proportion, arrangement") in everyday language refers to a sense of harmonious and beautiful proportion and balance. In mathematics, "symmetry" has a more precise definiti ...
present in the input space.
The central premise when using a feature selection technique is that the data contains some features that are either ''redundant'' or ''irrelevant'', and can thus be removed without incurring much loss of information.
''Redundant'' and ''irrelevant'' are two distinct notions, since one relevant feature may be redundant in the presence of another relevant feature with which it is strongly correlated.
Feature selection techniques should be distinguished from
feature extraction
In machine learning, pattern recognition, and image processing, feature extraction starts from an initial set of measured data and builds derived values (features) intended to be informative and non-redundant, facilitating the subsequent learning a ...
. Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the features. Feature selection techniques are often used in domains where there are many features and comparatively few samples (or data points). Archetypal cases for the application of feature selection include the analysis of
written texts and
DNA microarray
A DNA microarray (also commonly known as DNA chip or biochip) is a collection of microscopic DNA spots attached to a solid surface. Scientists use DNA microarrays to measure the expression levels of large numbers of genes simultaneously or to ...
data, where there are many thousands of features, and a few tens to hundreds of samples.
Introduction
A feature selection algorithm can be seen as the combination of a search technique for proposing new feature subsets, along with an evaluation measure which scores the different feature subsets. The simplest algorithm is to test each possible subset of features finding the one which minimizes the error rate. This is an exhaustive search of the space, and is computationally intractable for all but the smallest of feature sets. The choice of evaluation metric heavily influences the algorithm, and it is these evaluation metrics which distinguish between the three main categories of feature selection algorithms: wrappers, filters and embedded methods.
* Wrapper methods use a predictive model to score feature subsets. Each new subset is used to train a model, which is tested on a hold-out set. Counting the number of mistakes made on that hold-out set (the error rate of the model) gives the score for that subset. As wrapper methods train a new model for each subset, they are very computationally intensive, but usually provide the best performing feature set for that particular type of model or typical problem.
* Filter methods use a proxy measure instead of the error rate to score a feature subset. This measure is chosen to be fast to compute, while still capturing the usefulness of the feature set. Common measures include the
mutual information
In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the " amount of information" (in units such ...
,
the
pointwise mutual information
In statistics, probability theory and information theory, pointwise mutual information (PMI), or point mutual information, is a measure of association. It compares the probability of two events occurring together to what this probability would be i ...
,
Pearson product-moment correlation coefficient
In statistics, the Pearson correlation coefficient (PCC, pronounced ) ― also known as Pearson's ''r'', the Pearson product-moment correlation coefficient (PPMCC), the bivariate correlation, or colloquially simply as the correlation coefficient ...
,
Relief-based algorithms, and inter/intra class distance or the scores of
significance tests for each class/feature combinations.
Filters are usually less computationally intensive than wrappers, but they produce a feature set which is not tuned to a specific type of predictive model. This lack of tuning means a feature set from a filter is more general than the set from a wrapper, usually giving lower prediction performance than a wrapper. However the feature set doesn't contain the assumptions of a prediction model, and so is more useful for exposing the relationships between the features. Many filters provide a feature ranking rather than an explicit best feature subset, and the cut off point in the ranking is chosen via
cross-validation. Filter methods have also been used as a preprocessing step for wrapper methods, allowing a wrapper to be used on larger problems. One other popular approach is the Recursive Feature Elimination algorithm, commonly used with
Support Vector Machines
In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratorie ...
to repeatedly construct a model and remove features with low weights.
* Embedded methods are a catch-all group of techniques which perform feature selection as part of the model construction process. The exemplar of this approach is the
LASSO
A lasso ( or ), also called lariat, riata, or reata (all from Castilian, la reata 're-tied rope'), is a loop of rope designed as a restraint to be thrown around a target and tightened when pulled. It is a well-known tool of the Spanish an ...
method for constructing a linear model, which penalizes the regression coefficients with an L1 penalty, shrinking many of them to zero. Any features which have non-zero regression coefficients are 'selected' by the LASSO algorithm. Improvements to the LASSO include Bolasso which bootstraps samples;
Elastic net regularization
In statistics and, in particular, in the fitting of linear or logistic regression models, the elastic net is a regularized regression method that linearly combines the L1 and L2 penalties of the lasso and ridge methods.
Specification
The elas ...
, which combines the L1 penalty of LASSO with the L2 penalty of
ridge regression
Ridge regression is a method of estimating the coefficients of multiple-regression models in scenarios where the independent variables are highly correlated. It has been used in many fields including econometrics, chemistry, and engineering. Also ...
; and FeaLect which scores all the features based on combinatorial analysis of regression coefficients.
AEFS further extends LASSO to nonlinear scenario with autoencoders. These approaches tend to be between filters and wrappers in terms of computational complexity.
In traditional
regression analysis
In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome' or 'response' variable, or a 'label' in machine learning parlance) and one ...
, the most popular form of feature selection is
stepwise regression
In statistics, stepwise regression is a method of fitting regression models in which the choice of predictive variables is carried out by an automatic procedure. In each step, a variable is considered for addition to or subtraction from the set of ...
, which is a wrapper technique. It is a
greedy algorithm
A greedy algorithm is any algorithm that follows the problem-solving heuristic of making the locally optimal choice at each stage. In many problems, a greedy strategy does not produce an optimal solution, but a greedy heuristic can yield locally ...
that adds the best feature (or deletes the worst feature) at each round. The main control issue is deciding when to stop the algorithm. In machine learning, this is typically done by
cross-validation. In statistics, some criteria are optimized. This leads to the inherent problem of nesting. More robust methods have been explored, such as
branch and bound
Branch and bound (BB, B&B, or BnB) is an algorithm design paradigm for discrete and combinatorial optimization problems, as well as mathematical optimization. A branch-and-bound algorithm consists of a systematic enumeration of candidate soluti ...
and piecewise linear network.
Subset selection
Subset selection evaluates a subset of features as a group for suitability. Subset selection algorithms can be broken up into wrappers, filters, and embedded methods. Wrappers use a
search algorithm
In computer science, a search algorithm is an algorithm designed to solve a search problem. Search algorithms work to retrieve information stored within particular data structure, or calculated in the search space of a problem domain, with eith ...
to search through the space of possible features and evaluate each subset by running a model on the subset. Wrappers can be computationally expensive and have a risk of over fitting to the model. Filters are similar to wrappers in the search approach, but instead of evaluating against a model, a simpler filter is evaluated. Embedded techniques are embedded in, and specific to, a model.
Many popular search approaches use
greedy hill climbing
numerical analysis, hill climbing is a mathematical optimization technique which belongs to the family of local search. It is an iterative algorithm that starts with an arbitrary solution to a problem, then attempts to find a better solution ...
, which iteratively evaluates a candidate subset of features, then modifies the subset and evaluates if the new subset is an improvement over the old. Evaluation of the subsets requires a scoring
metric
Metric or metrical may refer to:
* Metric system, an internationally adopted decimal system of measurement
* An adjective indicating relation to measurement in general, or a noun describing a specific type of measurement
Mathematics
In mathema ...
that grades a subset of features. Exhaustive search is generally impractical, so at some implementor (or operator) defined stopping point, the subset of features with the highest score discovered up to that point is selected as the satisfactory feature subset. The stopping criterion varies by algorithm; possible criteria include: a subset score exceeds a threshold, a program's maximum allowed run time has been surpassed, etc.
Alternative search-based techniques are based on
targeted projection pursuit
Targeted projection pursuit is a type of statistical technique used for exploratory data analysis, information visualization, and feature selection. It allows the user to interactively explore very complex data (typically having tens to hundreds o ...
which finds low-dimensional projections of the data that score highly: the features that have the largest projections in the lower-dimensional space are then selected.
Search approaches include:
* Exhaustive
*
Best first
*
Simulated annealing
Simulated annealing (SA) is a probabilistic technique for approximating the global optimum of a given function. Specifically, it is a metaheuristic to approximate global optimization in a large search space for an optimization problem. It ...
*
Genetic algorithm
In computer science and operations research, a genetic algorithm (GA) is a metaheuristic inspired by the process of natural selection that belongs to the larger class of evolutionary algorithms (EA). Genetic algorithms are commonly used to gene ...
*
Greedy forward selection
* Greedy backward elimination
*
Particle swarm optimization
*
Targeted projection pursuit
Targeted projection pursuit is a type of statistical technique used for exploratory data analysis, information visualization, and feature selection. It allows the user to interactively explore very complex data (typically having tens to hundreds o ...
* Scatter search
*
Variable neighborhood search Variable neighborhood search (VNS), proposed by Mladenović & Hansen in 1997,
is a metaheuristic method for solving a set of combinatorial optimization and global optimization problems.
It explores distant neighborhoods of the current incumbent sol ...
Two popular filter metrics for classification problems are
correlation
In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data. Although in the broadest sense, "correlation" may indicate any type of association, in statistics ...
and
mutual information
In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the " amount of information" (in units such ...
, although neither are true
metrics
Metric or metrical may refer to:
* Metric system, an internationally adopted decimal system of measurement
* An adjective indicating relation to measurement in general, or a noun describing a specific type of measurement
Mathematics
In mathema ...
or 'distance measures' in the mathematical sense, since they fail to obey the
triangle inequality
In mathematics, the triangle inequality states that for any triangle, the sum of the lengths of any two sides must be greater than or equal to the length of the remaining side.
This statement permits the inclusion of degenerate triangles, but ...
and thus do not compute any actual 'distance' – they should rather be regarded as 'scores'. These scores are computed between a candidate feature (or set of features) and the desired output category. There are, however, true metrics that are a simple function of the mutual information; see
here
Here is an adverb that means "in, on, or at this place". It may also refer to:
Software
* Here Technologies, a mapping company
* Here WeGo (formerly Here Maps), a mobile app and map website by Here
Television
* Here TV (formerly "here!"), a TV ...
.
Other available filter metrics include:
* Class separability
** Error probability
** Inter-class distance
** Probabilistic distance
**
Entropy
Entropy is a scientific concept, as well as a measurable physical property, that is most commonly associated with a state of disorder, randomness, or uncertainty. The term and the concept are used in diverse fields, from classical thermodynam ...
* Consistency-based feature selection
* Correlation-based feature selection
Optimality criteria
The choice of optimality criteria is difficult as there are multiple objectives in a feature selection task. Many common criteria incorporate a measure of accuracy, penalised by the number of features selected. Examples include
Akaike information criterion
The Akaike information criterion (AIC) is an estimator of prediction error and thereby relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model, relative to e ...
(AIC) and
Mallows's ''Cp'', which have a penalty of 2 for each added feature. AIC is based on
information theory
Information theory is the scientific study of the quantification (science), quantification, computer data storage, storage, and telecommunication, communication of information. The field was originally established by the works of Harry Nyquist a ...
, and is effectively derived via the
maximum entropy principle
The principle of maximum entropy states that the probability distribution which best represents the current state of knowledge about a system is the one with largest entropy, in the context of precisely stated prior data (such as a proposition ...
.
Other criteria are
Bayesian information criterion
In statistics, the Bayesian information criterion (BIC) or Schwarz information criterion (also SIC, SBC, SBIC) is a criterion for model selection among a finite set of models; models with lower BIC are generally preferred. It is based, in part, on ...
(BIC), which uses a penalty of
for each added feature,
minimum description length
Minimum Description Length (MDL) is a model selection principle where the shortest description of the data is the best model. MDL methods learn through a data compression perspective and are sometimes described as mathematical applications of Occam ...
(MDL) which asymptotically uses
,
Bonferroni
Carlo Emilio Bonferroni (28 January 1892 – 18 August 1960) was an Italian mathematician who worked on probability theory.
Biography
Bonferroni studied piano and conducting in Turin Conservatory and at University of Turin under Giuseppe Peano ...
/ RIC which use
, maximum dependency feature selection, and a variety of new criteria that are motivated by
false discovery rate
In statistics, the false discovery rate (FDR) is a method of conceptualizing the rate of type I errors in null hypothesis testing when conducting multiple comparisons. FDR-controlling procedures are designed to control the FDR, which is the expe ...
(FDR), which use something close to
. A maximum
entropy rate
In the mathematical theory of probability, the entropy rate or source information rate of a stochastic process is, informally, the time density of the average information in a stochastic process. For stochastic processes with a countable index, the ...
criterion may also be used to select the most relevant subset of features.
Structure learning
Filter feature selection is a specific case of a more general paradigm called
structure learning. Feature selection finds the relevant feature set for a specific target variable whereas structure learning finds the relationships between all the variables, usually by expressing these relationships as a graph. The most common structure learning algorithms assume the data is generated by a
Bayesian Network
A Bayesian network (also known as a Bayes network, Bayes net, belief network, or decision network) is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). Bay ...
, and so the structure is a
directed
Director may refer to:
Literature
* ''Director'' (magazine), a British magazine
* ''The Director'' (novel), a 1971 novel by Henry Denker
* ''The Director'' (play), a 2000 play by Nancy Hasty
Music
* Director (band), an Irish rock band
* ''Di ...
graphical model
A graphical model or probabilistic graphical model (PGM) or structured probabilistic model is a probabilistic model for which a Graph (discrete mathematics), graph expresses the conditional dependence structure between random variables. They are ...
. The optimal solution to the filter feature selection problem is the
Markov blanket
In statistics and machine learning, when one wants to infer a random variable with a set of variables, usually a subset is enough, and other variables are useless. Such a subset that contains all the useful information is called a Markov blanket. ...
of the target node, and in a Bayesian Network, there is a unique Markov Blanket for each node.
Information Theory Based Feature Selection Mechanisms
There are different Feature Selection mechanisms around that utilize
mutual information
In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the " amount of information" (in units such ...
for scoring the different features. They usually use all the same algorithm:
# Calculate the
mutual information
In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the " amount of information" (in units such ...
as score for between all features (
) and the target class ()
# Select the feature with the largest score (e.g.
) and add it to the set of selected features ()
# Calculate the score which might be derived from the
mutual information
In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the " amount of information" (in units such ...
# Select the feature with the largest score and add it to the set of select features (e.g.
)
# Repeat 3. and 4. until a certain number of features is selected (e.g.
)
The simplest approach uses the
mutual information
In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the " amount of information" (in units such ...
as the "derived" score.
/ref>
However, there are different approaches, that try to reduce the redundancy between features.
Minimum-redundancy-maximum-relevance (mRMR) feature selection
Peng ''et al.'' proposed a feature selection method that can use either mutual information, correlation, or distance/similarity scores to select features. The aim is to penalise a feature's relevancy by its redundancy in the presence of the other selected features. The relevance of a feature set for the class is defined by the average value of all mutual information values between the individual feature and the class as follows:
:.
The redundancy of all features in the set is the average value of all mutual information values between the feature and the feature :
:
The mRMR criterion is a combination of two measures given above and is defined as follows:
:
Suppose that there are full-set features. Let be the set membership indicator function
In mathematics, an indicator function or a characteristic function of a subset of a set is a function that maps elements of the subset to one, and all other elements to zero. That is, if is a subset of some set , one has \mathbf_(x)=1 if x\i ...
for feature , so that indicates presence and indicates absence of the feature in the globally optimal feature set. Let and . The above may then be written as an optimization problem:
:
The mRMR algorithm is an approximation of the theoretically optimal maximum-dependency feature selection algorithm that maximizes the mutual information between the joint distribution of the selected features and the classification variable. As mRMR approximates the combinatorial estimation problem with a series of much smaller problems, each of which only involves two variables, it thus uses pairwise joint probabilities which are more robust. In certain situations the algorithm may underestimate the usefulness of features as it has no way to measure interactions between features which can increase relevancy. This can lead to poor performance when the features are individually useless, but are useful when combined (a pathological case is found when the class is a parity function In Boolean algebra, a parity function is a Boolean function whose value is one if and only if the input vector has an odd number of ones. The parity function of two inputs is also known as the XOR function.
The parity function is notable for its ...
of the features). Overall the algorithm is more efficient (in terms of the amount of data required) than the theoretically optimal max-dependency selection, yet produces a feature set with little pairwise redundancy.
mRMR is an instance of a large class of filter methods which trade off between relevancy and redundancy in different ways.[Nguyen, H., Franke, K., Petrovic, S. (2010). "Towards a Generic Feature-Selection Measure for Intrusion Detection", In Proc. International Conference on Pattern Recognition (ICPR), Istanbul, Turkey]
/ref>
Quadratic programming feature selection
mRMR is a typical example of an incremental greedy strategy for feature selection: once a feature has been selected, it cannot be deselected at a later stage. While mRMR could be optimized using floating search to reduce some features, it might also be reformulated as a global quadratic programming
Quadratic programming (QP) is the process of solving certain mathematical optimization problems involving quadratic functions. Specifically, one seeks to optimize (minimize or maximize) a multivariate quadratic function subject to linear constra ...
optimization problem as follows:
:
where is the vector of feature relevancy assuming there are features in total, is the matrix of feature pairwise redundancy, and represents relative feature weights. QPFS is solved via quadratic programming. It is recently shown that QFPS is biased towards features with smaller entropy, due to its placement of the feature self redundancy term on the diagonal of .
Conditional mutual information
Another score derived for the mutual information is based on the conditional relevancy:[Nguyen X. Vinh, Jeffrey Chan, Simone Romano and James Bailey, "Effective Global Approaches for Mutual Information based Feature Selection". Proceedings of the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'14), August 24–27, New York City, 2014.]
:
where and .
An advantage of is that it can be solved simply via finding the dominant eigenvector of , thus is very scalable. also handles second-order feature interaction.
Joint mutual information
In a study of different scores Brown et al. recommended the joint mutual information as a good score for feature selection. The score tries to find the feature, that adds the most new information to the already selected features, in order to avoid redundancy. The score is formulated as follows:
:
The score uses the conditional mutual information
In probability theory, particularly information theory, the conditional mutual information is, in its most basic form, the expected value of the mutual information of two random variables given the value of a third.
Definition
For random var ...
and the mutual information
In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the " amount of information" (in units such ...
to estimate the redundancy between the already selected features () and the feature under investigation ().
Hilbert-Schmidt Independence Criterion Lasso based feature selection
For high-dimensional and small sample data (e.g., dimensionality > and the number of samples < ), the Hilbert-Schmidt Independence Criterion Lasso (HSIC Lasso) is useful. HSIC Lasso optimization problem is given as
:
where is a kernel-based independence measure called the (empirical) Hilbert-Schmidt independence criterion (HSIC), denotes the trace
Trace may refer to:
Arts and entertainment Music
* Trace (Son Volt album), ''Trace'' (Son Volt album), 1995
* Trace (Died Pretty album), ''Trace'' (Died Pretty album), 1993
* Trace (band), a Dutch progressive rock band
* The Trace (album), ''The ...
, is the regularization parameter, and are input and output centered Gram matrices, and are Gram matrices, and are kernel functions, is the centering matrix, is the -dimensional identity matrix
In linear algebra, the identity matrix of size n is the n\times n square matrix with ones on the main diagonal and zeros elsewhere.
Terminology and notation
The identity matrix is often denoted by I_n, or simply by I if the size is immaterial o ...
(: the number of samples), is the -dimensional vector with all ones, and is the -norm. HSIC always takes a non-negative value, and is zero if and only if two random variables are statistically independent when a universal reproducing kernel such as the Gaussian kernel is used.
The HSIC Lasso can be written as
:
where is the Frobenius norm
In mathematics, a matrix norm is a vector norm in a vector space whose elements (vectors) are matrices (of given dimensions).
Preliminaries
Given a field K of either real or complex numbers, let K^ be the -vector space of matrices with m rows ...
. The optimization problem is a Lasso problem, and thus it can be efficiently solved with a state-of-the-art Lasso solver such as the dual augmented Lagrangian method Augmented Lagrangian methods are a certain class of algorithms for solving constrained optimization problems. They have similarities to penalty methods in that they replace a constrained optimization problem by a series of unconstrained problems a ...
.
Correlation feature selection
The correlation feature selection (CFS) measure evaluates subsets of features on the basis of the following hypothesis: "Good feature subsets contain features highly correlated with the classification, yet uncorrelated to each other". The following equation gives the merit of a feature subset ''S'' consisting of ''k'' features:
:
Here, is the average value of all feature-classification correlations, and is the average value of all feature-feature correlations. The CFS criterion is defined as follows:
:
The and variables are referred to as correlations, but are not necessarily Pearson's correlation coefficient
In statistics, the Pearson correlation coefficient (PCC, pronounced ) ― also known as Pearson's ''r'', the Pearson product-moment correlation coefficient (PPMCC), the bivariate correlation, or colloquially simply as the correlation coefficient ...
or Spearman's ρ. Hall's dissertation uses neither of these, but uses three different measures of relatedness, minimum description length
Minimum Description Length (MDL) is a model selection principle where the shortest description of the data is the best model. MDL methods learn through a data compression perspective and are sometimes described as mathematical applications of Occam ...
(MDL), symmetrical uncertainty, and relief
Relief is a sculptural method in which the sculpted pieces are bonded to a solid background of the same material. The term ''relief'' is from the Latin verb ''relevo'', to raise. To create a sculpture in relief is to give the impression that the ...
.
Let ''xi'' be the set membership indicator function
In mathematics, an indicator function or a characteristic function of a subset of a set is a function that maps elements of the subset to one, and all other elements to zero. That is, if is a subset of some set , one has \mathbf_(x)=1 if x\i ...
for feature ''fi''; then the above can be rewritten as an optimization problem:
:
The combinatorial problems above are, in fact, mixed 0–1 linear programming
Linear programming (LP), also called linear optimization, is a method to achieve the best outcome (such as maximum profit or lowest cost) in a mathematical model whose requirements are represented by linear function#As a polynomial function, li ...
problems that can be solved by using branch-and-bound algorithms.
Regularized trees
The features from a decision tree
A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains condit ...
or a tree ensemble
Ensemble may refer to:
Art
* Architectural ensemble
* ''Ensemble'' (album), Kendji Girac 2015 album
* Ensemble (band), a project of Olivier Alary
* Ensemble cast (drama, comedy)
* Ensemble (musical theatre), also known as the chorus
* ''En ...
are shown to be redundant. A recent method called regularized tree[H. Deng, G. Runger,]
Feature Selection via Regularized Trees
, Proceedings of the 2012 International Joint Conference on Neural Networks (IJCNN), IEEE, 2012 can be used for feature subset selection. Regularized trees penalize using a variable similar to the variables selected at previous tree nodes for splitting the current node. Regularized trees only need build one tree model (or one tree ensemble model) and thus are computationally efficient.
Regularized trees naturally handle numerical and categorical features, interactions and nonlinearities. They are invariant to attribute scales (units) and insensitive to outlier
In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to a variability in the measurement, an indication of novel data, or it may be the result of experimental error; the latter are ...
s, and thus, require little data preprocessing Data preprocessing can refer to manipulation or dropping of data before it is used in order to ensure or enhance performance, and is an important step in the data mining process. The phrase "garbage in, garbage out" is particularly applicable to ...
such as normalization
Normalization or normalisation refers to a process that makes something more normal or regular. Most commonly it refers to:
* Normalization (sociology) or social normalization, the process through which ideas and behaviors that may fall outside of ...
. Regularized random forest (RRF)[RRF: Regularized Random Forest](_blank)
R package on CRAN is one type of regularized trees. The guided RRF is an enhanced RRF which is guided by the importance scores from an ordinary random forest.
Overview on metaheuristics methods
A metaheuristic
In computer science and mathematical optimization, a metaheuristic is a higher-level procedure or heuristic designed to find, generate, or select a heuristic (partial search algorithm) that may provide a sufficiently good solution to an optimizati ...
is a general description of an algorithm dedicated to solve difficult (typically NP-hard
In computational complexity theory, NP-hardness ( non-deterministic polynomial-time hardness) is the defining property of a class of problems that are informally "at least as hard as the hardest problems in NP". A simple example of an NP-hard pr ...
problem) optimization problems for which there is no classical solving methods. Generally, a metaheuristic is a stochastic algorithm tending to reach a global optimum. There are many metaheuristics, from a simple local search to a complex global search algorithm.
Main principles
The feature selection methods are typically presented in three classes based on how they combine the selection algorithm and the model building.
Filter method
Filter type methods select variables regardless of the model. They are based only on general features like the correlation with the variable to predict. Filter methods suppress the least interesting variables. The other variables will be part of a classification or a regression model used to classify or to predict data. These methods are particularly effective in computation time and robust to overfitting.
Filter methods tend to select redundant variables when they do not consider the relationships between variables. However, more elaborate features try to minimize this problem by removing variables highly correlated to each other, such as the Fast Correlation Based Filter (FCBF) algorithm.
Wrapper method
Wrapper methods evaluate subsets of variables which allows, unlike filter approaches, to detect the possible interactions amongst variables.[T. M. Phuong, Z. Lin et R. B. Altman]
Choosing SNPs using feature selection.
Proceedings / IEEE Computational Systems Bioinformatics Conference, CSB. IEEE Computational Systems Bioinformatics Conference, pages 301-309, 2005. . The two main disadvantages of these methods are:
* The increasing overfitting risk when the number of observations is insufficient.
* The significant computation time when the number of variables is large.
Embedded method
Embedded methods have been recently proposed that try to combine the advantages of both previous methods. A learning algorithm takes advantage of its own variable selection process and performs feature selection and classification simultaneously, such as the FRMT algorithm.
Application of feature selection metaheuristics
This is a survey of the application of feature selection metaheuristics lately used in the literature. This survey was realized by J. Hammon in her 2013 thesis.
Feature selection embedded in learning algorithms
Some learning algorithms perform feature selection as part of their overall operation. These include:
* -regularization techniques, such as sparse regression, LASSO, and -SVM
* Regularized trees, e.g. regularized random forest implemented in the RRF package
* Decision tree
A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains condit ...
* Memetic algorithm
A memetic algorithm (MA) in computer science and operations research, is an extension of the traditional genetic algorithm. It may provide a sufficiently good solution to an optimization problem. It uses a local search technique to reduce the like ...
* Random multinomial logit
Random forests or random decision forests is an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of th ...
(RMNL)
* Auto-encoding networks with a bottleneck-layer
* Submodular
In mathematics, a submodular set function (also known as a submodular function) is a set function whose value, informally, has the property that the difference in the incremental value of the function that a single element makes when added to an ...
feature selection
* Local learning based feature selection. Compared with traditional methods, it does not involve any heuristic search, can easily handle multi-class problems, and works for both linear and nonlinear problems. It is also supported by a strong theoretical foundation. Numeric experiments showed that the method can achieve a close-to-optimal solution even when data contains >1M irrelevant features.
* Recommender system based on feature selection.[D.H. Wang, Y.C. Liang, D.Xu, X.Y. Feng, R.C. Guan(2018),]
A content-based recommender system for computer science publications
, ''Knowledge-Based Systems
A knowledge-based system (KBS) is a computer program that reasons and uses a knowledge base to solve complex problems. The term is broad and refers to many different kinds of systems. The one common theme that unites all knowledge based systems is ...
'', 157: 1-9 The feature selection methods are introduced into recommender system research.
See also
* Cluster analysis
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of ...
* Data mining
* Dimensionality reduction
Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally ...
* Feature extraction
In machine learning, pattern recognition, and image processing, feature extraction starts from an initial set of measured data and builds derived values (features) intended to be informative and non-redundant, facilitating the subsequent learning a ...
* Hyperparameter optimization In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process. By contrast, the va ...
* Model selection
Model selection is the task of selecting a statistical model from a set of candidate models, given data. In the simplest cases, a pre-existing set of data is considered. However, the task can also involve the design of experiments such that the ...
* Relief (feature selection)
Relief is an algorithm developed by Kira and Rendell in 1992 that takes a filter-method approach to feature selection that is notably sensitive to feature interactions. It was originally designed for application to binary classification problems w ...
References
Further reading
*
*
*
* {{cite journal , first1=Huan , last1=Liu , first2=Lei , last2=Yu , title=Toward Integrating Feature Selection Algorithms for Classification and Clustering , journal=IEEE Transactions on Knowledge and Data Engineering , volume=17 , issue=4 , year=2005 , pages=491–502 , doi=10.1109/TKDE.2005.66 , s2cid=1607600
External links
Feature Selection Package, Arizona State University (Matlab Code)
NIPS challenge 2003
(see also NIPS)
Naive Bayes implementation with feature selection in Visual Basic
(includes executable and source code)
* ttp://mloss.org/software/view/386/ FEAST(Open source Feature Selection algorithms in C and MATLAB)
Model selection
Dimension reduction