In machine learning, feature selection is the process of selecting a subset of relevant features (variables, predictors) for use in model construction. Feature selection techniques are used for several reasons: * simplification of models to make them easier to interpret, * shorter training times, * to avoid the

curse of dimensionality The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience. T ...

, * improve the compatibility of the data with a certain learning model class, * to encode inherent

symmetries Symmetry () in everyday life refers to a sense of harmonious and beautiful proportion and balance. In mathematics, the term has a more precise definition and is usually used to refer to an object that is invariant under some transformations ...

present in the input space. The central premise when using feature selection is that data sometimes contains features that are ''redundant'' or ''irrelevant'', and can thus be removed without incurring much loss of information. Redundancy and irrelevance are two distinct notions, since one relevant feature may be redundant in the presence of another relevant feature with which it is strongly correlated.

Feature extraction Feature may refer to: Computing * Feature recognition, could be a hole, pocket, or notch * Feature (computer vision), could be an edge, corner or blob * Feature (machine learning), in statistics: individual measurable properties of the phenome ...

creates new features from functions of the original features, whereas feature selection finds a subset of the features. Feature selection techniques are often used in domains where there are many features and comparatively few samples (data points).

Introduction

A feature selection algorithm can be seen as the combination of a search technique for proposing new feature subsets, along with an evaluation measure which scores the different feature subsets. The simplest algorithm is to test each possible subset of features finding the one which minimizes the error rate. This is an exhaustive search of the space, and is computationally intractable for all but the smallest of feature sets. The choice of evaluation metric heavily influences the algorithm, and it is these evaluation metrics which distinguish between the three main categories of feature selection algorithms: wrappers, filters and embedded methods. * Wrapper methods use a predictive model to score feature subsets. Each new subset is used to train a model, which is tested on a hold-out set. Counting the number of mistakes made on that hold-out set (the error rate of the model) gives the score for that subset. As wrapper methods train a new model for each subset, they are very computationally intensive, but usually provide the best performing feature set for that particular type of model or typical problem. * Filter methods use a proxy measure instead of the error rate to score a feature subset. This measure is chosen to be fast to compute, while still capturing the usefulness of the feature set. Common measures include the

mutual information In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual Statistical dependence, dependence between the two variables. More specifically, it quantifies the "Information conten ...

, the

pointwise mutual information In statistics, probability theory and information theory, pointwise mutual information (PMI), or point mutual information, is a measure of Association (statistics), association. It compares the probability of two events occurring together to what t ...

Pearson product-moment correlation coefficient In statistics, the Pearson correlation coefficient (PCC) is a correlation coefficient that measures linear correlation between two sets of data. It is the ratio between the covariance of two variables and the product of their standard deviation ...

, Relief-based algorithms, and inter/intra class distance or the scores of significance tests for each class/feature combinations. Filters are usually less computationally intensive than wrappers, but they produce a feature set which is not tuned to a specific type of predictive model. This lack of tuning means a feature set from a filter is more general than the set from a wrapper, usually giving lower prediction performance than a wrapper. However the feature set doesn't contain the assumptions of a prediction model, and so is more useful for exposing the relationships between the features. Many filters provide a feature ranking rather than an explicit best feature subset, and the cut off point in the ranking is chosen via cross-validation. Filter methods have also been used as a preprocessing step for wrapper methods, allowing a wrapper to be used on larger problems. One other popular approach is the Recursive Feature Elimination algorithm, commonly used with

Support Vector Machines In machine learning, support vector machines (SVMs, also support vector networks) are supervised max-margin models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laborato ...

to repeatedly construct a model and remove features with low weights. * Embedded methods are a catch-all group of techniques which perform feature selection as part of the model construction process. The exemplar of this approach is the

LASSO A lasso or lazo ( or ), also called reata or la reata in Mexico, and in the United States riata or lariat (from Mexican Spanish lasso for roping cattle), is a loop of rope designed as a restraint to be thrown around a target and tightened when ...

method for constructing a linear model, which penalizes the regression coefficients with an L1 penalty, shrinking many of them to zero. Any features which have non-zero regression coefficients are 'selected' by the LASSO algorithm. Improvements to the LASSO include Bolasso which bootstraps samples; Elastic net regularization, which combines the L1 penalty of LASSO with the L2 penalty of ridge regression; and FeaLect which scores all the features based on combinatorial analysis of regression coefficients. AEFS further extends LASSO to nonlinear scenario with autoencoders. These approaches tend to be between filters and wrappers in terms of computational complexity. In traditional regression analysis, the most popular form of feature selection is

stepwise regression In statistics, stepwise regression is a method of fitting regression models in which the choice of predictive variables is carried out by an automatic procedure. In each step, a variable is considered for addition to or subtraction from the set of ...

, which is a wrapper technique. It is a

greedy algorithm A greedy algorithm is any algorithm that follows the problem-solving heuristic of making the locally optimal choice at each stage. In many problems, a greedy strategy does not produce an optimal solution, but a greedy heuristic can yield locally ...

that adds the best feature (or deletes the worst feature) at each round. The main control issue is deciding when to stop the algorithm. In machine learning, this is typically done by cross-validation. In statistics, some criteria are optimized. This leads to the inherent problem of nesting. More robust methods have been explored, such as

branch and bound Branch and bound (BB, B&B, or BnB) is a method for solving optimization problems by breaking them down into smaller sub-problems and using a bounding function to eliminate sub-problems that cannot contain the optimal solution. It is an algorithm ...

and piecewise linear network.

Subset selection

Subset selection evaluates a subset of features as a group for suitability. Subset selection algorithms can be broken up into wrappers, filters, and embedded methods. Wrappers use a

search algorithm In computer science, a search algorithm is an algorithm designed to solve a search problem. Search algorithms work to retrieve information stored within particular data structure, or calculated in the Feasible region, search space of a problem do ...

to search through the space of possible features and evaluate each subset by running a model on the subset. Wrappers can be computationally expensive and have a risk of over fitting to the model. Filters are similar to wrappers in the search approach, but instead of evaluating against a model, a simpler filter is evaluated. Embedded techniques are embedded in, and specific to, a model. Many popular search approaches use greedy

hill climbing numerical analysis, hill climbing is a mathematical optimization technique which belongs to the family of local search. It is an iterative algorithm that starts with an arbitrary solution to a problem, then attempts to find a better soluti ...

, which iteratively evaluates a candidate subset of features, then modifies the subset and evaluates if the new subset is an improvement over the old. Evaluation of the subsets requires a scoring

metric Metric or metrical may refer to: Measuring * Metric system, an internationally adopted decimal system of measurement * An adjective indicating relation to measurement in general, or a noun describing a specific type of measurement Mathematics ...

that grades a subset of features. Exhaustive search is generally impractical, so at some implementor (or operator) defined stopping point, the subset of features with the highest score discovered up to that point is selected as the satisfactory feature subset. The stopping criterion varies by algorithm; possible criteria include: a subset score exceeds a threshold, a program's maximum allowed run time has been surpassed, etc. Alternative search-based techniques are based on

targeted projection pursuit Targeted projection pursuit is a type of statistical technique used for exploratory data analysis, information visualization, and feature selection. It allows the user to interactively explore very complex data (typically having tens to hundreds o ...

which finds low-dimensional projections of the data that score highly: the features that have the largest projections in the lower-dimensional space are then selected. Search approaches include: * Exhaustive * Best first *

Simulated annealing Simulated annealing (SA) is a probabilistic technique for approximating the global optimum of a given function. Specifically, it is a metaheuristic to approximate global optimization in a large search space for an optimization problem. ...

Genetic algorithm In computer science and operations research, a genetic algorithm (GA) is a metaheuristic inspired by the process of natural selection that belongs to the larger class of evolutionary algorithms (EA). Genetic algorithms are commonly used to g ...

* Greedy forward selection * Greedy backward elimination *

Particle swarm optimization In computational science, particle swarm optimization (PSO) is a computational method that Mathematical optimization, optimizes a problem by iterative method, iteratively trying to improve a candidate solution with regard to a given measure of qu ...

Targeted projection pursuit Targeted projection pursuit is a type of statistical technique used for exploratory data analysis, information visualization, and feature selection. It allows the user to interactively explore very complex data (typically having tens to hundreds o ...

* Scatter search *

Variable neighborhood search Variable neighborhood search (VNS), proposed by Mladenović & Hansen in 1997, is a metaheuristic method for solving a set of combinatorial optimization and global optimization problems. It explores distant neighborhoods of the current incumbent so ...

Two popular filter metrics for classification problems are

correlation In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data. Although in the broadest sense, "correlation" may indicate any type of association, in statistics ...

and

, although neither are true

metrics Metric or metrical may refer to: Measuring * Metric system, an internationally adopted decimal system of measurement * An adjective indicating relation to measurement in general, or a noun describing a specific type of measurement Mathematics ...

or 'distance measures' in the mathematical sense, since they fail to obey the

triangle inequality In mathematics, the triangle inequality states that for any triangle, the sum of the lengths of any two sides must be greater than or equal to the length of the remaining side. This statement permits the inclusion of Degeneracy (mathematics)#T ...

and thus do not compute any actual 'distance' – they should rather be regarded as 'scores'. These scores are computed between a candidate feature (or set of features) and the desired output category. There are, however, true metrics that are a simple function of the mutual information; see here. Other available filter metrics include: * Class separability ** Error probability ** Inter-class distance ** Probabilistic distance **

Entropy Entropy is a scientific concept, most commonly associated with states of disorder, randomness, or uncertainty. The term and the concept are used in diverse fields, from classical thermodynamics, where it was first recognized, to the micros ...

* Consistency-based feature selection * Correlation-based feature selection

Optimality criteria

The choice of optimality criteria is difficult as there are multiple objectives in a feature selection task. Many common criteria incorporate a measure of accuracy, penalised by the number of features selected. Examples include Akaike information criterion (AIC) and Mallows's ''C_p'', which have a penalty of 2 for each added feature. AIC is based on

information theory Information theory is the mathematical study of the quantification (science), quantification, Data storage, storage, and telecommunications, communication of information. The field was established and formalized by Claude Shannon in the 1940s, ...

, and is effectively derived via the maximum entropy principle. Other criteria are

Bayesian information criterion In statistics, the Bayesian information criterion (BIC) or Schwarz information criterion (also SIC, SBC, SBIC) is a criterion for model selection among a finite set of models; models with lower BIC are generally preferred. It is based, in part, on ...

(BIC), which uses a penalty of

\sqrt

for each added feature, minimum description length (MDL) which asymptotically uses

\sqrt

, Bonferroni / RIC which use

\sqrt

, maximum dependency feature selection, and a variety of new criteria that are motivated by

false discovery rate In statistics, the false discovery rate (FDR) is a method of conceptualizing the rate of type I errors in null hypothesis testing when conducting multiple comparisons. FDR-controlling procedures are designed to control the FDR, which is the exp ...

(FDR), which use something close to

\sqrt

. A maximum entropy rate criterion may also be used to select the most relevant subset of features.

Structure learning

Filter feature selection is a specific case of a more general paradigm called structure learning. Feature selection finds the relevant feature set for a specific target variable whereas structure learning finds the relationships between all the variables, usually by expressing these relationships as a graph. The most common structure learning algorithms assume the data is generated by a

Bayesian Network A Bayesian network (also known as a Bayes network, Bayes net, belief network, or decision network) is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). Whi ...

, and so the structure is a

directed Direct may refer to: Mathematics * Directed set, in order theory * Direct limit of (pre), sheaves * Direct sum of modules, a construction in abstract algebra which combines several vector spaces Computing * Direct access (disambiguation), a ...

graphical model. The optimal solution to the filter feature selection problem is the

Markov blanket In statistics and machine learning, a Markov blanket of a random variable is a minimal set of variables that renders the variable conditionally independent of all other variables in the system. This concept is central in probabilistic graphical ...

of the target node, and in a Bayesian Network, there is a unique Markov Blanket for each node.

Information Theory Based Feature Selection Mechanisms

There are different Feature Selection mechanisms around that utilize

for scoring the different features. They usually use all the same algorithm: # Calculate the

as score for between all features (

f_ \in F

) and the target class () # Select the feature with the largest score (e.g.

\underset\operatorname(I(f_,c))

) and add it to the set of selected features () # Calculate the score which might be derived from the

# Select the feature with the largest score and add it to the set of select features (e.g.

\underset\operatorname(I_(f_,c))

) # Repeat 3. and 4. until a certain number of features is selected (e.g.

, S, =l

) The simplest approach uses the

as the "derived" score.
/ref> However, there are different approaches, that try to reduce the redundancy between features.

Minimum-redundancy-maximum-relevance (mRMR) feature selection

Peng ''et al.'' proposed a feature selection method that can use either mutual information, correlation, or distance/similarity scores to select features. The aim is to penalise a feature's relevancy by its redundancy in the presence of the other selected features. The relevance of a feature set for the class is defined by the average value of all mutual information values between the individual feature and the class as follows: :

D(S,c) = \frac\sum_I(f_;c)

. The redundancy of all features in the set is the average value of all mutual information values between the feature and the feature : :

R(S) = \frac\sum_I(f_;f_)

The mRMR criterion is a combination of two measures given above and is defined as follows: :

\mathrm= \max_
\left frac\sum_I(f_;c) - 
\frac\sum_I(f_;f_)\right

Suppose that there are full-set features. Let be the set membership

indicator function In mathematics, an indicator function or a characteristic function of a subset of a set is a function that maps elements of the subset to one, and all other elements to zero. That is, if is a subset of some set , then the indicator functio ...

for feature , so that indicates presence and indicates absence of the feature in the globally optimal feature set. Let

c_i=I(f_i;c)

and

a_=I(f_i;f_j)

. The above may then be written as an optimization problem: :

\mathrm= \max_ 
\left frac -
\frac
\right

The mRMR algorithm is an approximation of the theoretically optimal maximum-dependency feature selection algorithm that maximizes the mutual information between the joint distribution of the selected features and the classification variable. As mRMR approximates the combinatorial estimation problem with a series of much smaller problems, each of which only involves two variables, it thus uses pairwise joint probabilities which are more robust. In certain situations the algorithm may underestimate the usefulness of features as it has no way to measure interactions between features which can increase relevancy. This can lead to poor performance when the features are individually useless, but are useful when combined (a pathological case is found when the class is a parity function of the features). Overall the algorithm is more efficient (in terms of the amount of data required) than the theoretically optimal max-dependency selection, yet produces a feature set with little pairwise redundancy. mRMR is an instance of a large class of filter methods which trade off between relevancy and redundancy in different ways.Nguyen, H., Franke, K., Petrovic, S. (2010). "Towards a Generic Feature-Selection Measure for Intrusion Detection", In Proc. International Conference on Pattern Recognition (ICPR), Istanbul, Turkey

/ref>

Quadratic programming feature selection

mRMR is a typical example of an incremental greedy strategy for feature selection: once a feature has been selected, it cannot be deselected at a later stage. While mRMR could be optimized using floating search to reduce some features, it might also be reformulated as a global

quadratic programming Quadratic programming (QP) is the process of solving certain mathematical optimization problems involving quadratic functions. Specifically, one seeks to optimize (minimize or maximize) a multivariate quadratic function subject to linear constr ...

optimization problem as follows: :

\mathrm: \min_\mathbf  \left\ \quad \mbox \ \sum_^n x_i=1, x_i\geq 0

where

F_= (f_1;c),\ldots, I(f_n;c) T

is the vector of feature relevancy assuming there are features in total,

H_= (f_i;f_j)

is the matrix of feature pairwise redundancy, and

\mathbf_

represents relative feature weights. QPFS is solved via quadratic programming. It is recently shown that QFPS is biased towards features with smaller entropy, due to its placement of the feature self redundancy term

I(f_i;f_i)

on the diagonal of .

Conditional mutual information

Another score derived for the mutual information is based on the conditional relevancy:Nguyen X. Vinh, Jeffrey Chan, Simone Romano and James Bailey, "Effective Global Approaches for Mutual Information based Feature Selection". Proceedings of the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'14), August 24–27, New York City, 2014.

:

\mathrm: \max_ \left\ \quad \mbox\ \, \mathbf\, =1, x_i\geq 0

where

Q_=I(f_i;c)

and

Q_=(I(f_i;c, f_j)+I(f_j;c, f_i))/2, i\ne j

. An advantage of is that it can be solved simply via finding the dominant eigenvector of , thus is very scalable. also handles second-order feature interaction.

Joint mutual information

In a study of different scores Brown et al. recommended the joint mutual information as a good score for feature selection. The score tries to find the feature, that adds the most new information to the already selected features, in order to avoid redundancy. The score is formulated as follows: :

\end

The score uses the conditional mutual information and the

to estimate the redundancy between the already selected features (

f_j \in S

) and the feature under investigation (

f_i

Hilbert-Schmidt Independence Criterion Lasso based feature selection

For high-dimensional and small sample data (e.g., dimensionality > and the number of samples < ), the Hilbert-Schmidt Independence Criterion Lasso (HSIC Lasso) is useful. HSIC Lasso optimization problem is given as :

\mathrm: \min_ \frac\sum_^n x_k x_l (f_k,f_l) - \sum_^n x_k (f_k,c) +  \lambda \, \mathbf\, _1, \quad \mbox \ x_1,\ldots, x_n \geq 0,

where

(f_k,c) =\mbox(\bar^  \bar)

is a kernel-based independence measure called the (empirical) Hilbert-Schmidt independence criterion (HSIC),

\mbox(\cdot)

denotes the trace,

\lambda

is the regularization parameter,

\bar^ = \mathbf \mathbf^ \mathbf

and

\bar = \mathbf \mathbf \mathbf

are input and output centered Gram matrices,

K^_ = K(u_,u_)

and

L_ = L(c_i,c_j)

are Gram matrices,

K(u,u')

and

L(c,c')

are kernel functions,

\mathbf = \mathbf_m - \frac\mathbf_m \mathbf_m^T

is the centering matrix,

\mathbf_m

is the -dimensional

identity matrix In linear algebra, the identity matrix of size n is the n\times n square matrix with ones on the main diagonal and zeros elsewhere. It has unique properties, for example when the identity matrix represents a geometric transformation, the obje ...

(: the number of samples),

\mathbf_m

is the -dimensional vector with all ones, and

\, \cdot\, _

is the

\ell_1

-norm. HSIC always takes a non-negative value, and is zero if and only if two random variables are statistically independent when a universal reproducing kernel such as the Gaussian kernel is used. The HSIC Lasso can be written as :

\mathrm: \min_ \frac\left\, \bar - \sum_^ x_k \bar^ \right\, ^2_  +  \lambda \, \mathbf\, _1, \quad \mbox \ x_1,\ldots,x_n \geq 0,

where

\, \cdot\, _

is the

Frobenius norm In the field of mathematics, norms are defined for elements within a vector space. Specifically, when the vector space comprises matrices, such norms are referred to as matrix norms. Matrix norms differ from vector norms in that they must also ...

. The optimization problem is a Lasso problem, and thus it can be efficiently solved with a state-of-the-art Lasso solver such as the dual augmented Lagrangian method.

Correlation feature selection

The correlation feature selection (CFS) measure evaluates subsets of features on the basis of the following hypothesis: "Good feature subsets contain features highly correlated with the classification, yet uncorrelated to each other". The following equation gives the merit of a feature subset ''S'' consisting of ''k'' features: :

\mathrm_ = \frac.

Here,

\overline

is the average value of all feature-classification correlations, and

\overline

is the average value of all feature-feature correlations. The CFS criterion is defined as follows: :

\mathrm = \max_
\left frac
\right

The

r_

and

r_

variables are referred to as correlations, but are not necessarily Pearson's correlation coefficient or Spearman's ρ. Hall's dissertation uses neither of these, but uses three different measures of relatedness, minimum description length (MDL), symmetrical uncertainty, and

relief Relief is a sculpture, sculptural method in which the sculpted pieces remain attached to a solid background of the same material. The term ''wikt:relief, relief'' is from the Latin verb , to raise (). To create a sculpture in relief is to give ...

. Let ''x_i'' be the set membership

for feature ''f_i''; then the above can be rewritten as an optimization problem: :

\mathrm = \max_ 
\left frac
\right

The combinatorial problems above are, in fact, mixed 0–1

linear programming Linear programming (LP), also called linear optimization, is a method to achieve the best outcome (such as maximum profit or lowest cost) in a mathematical model whose requirements and objective are represented by linear function#As a polynomia ...

problems that can be solved by using branch-and-bound algorithms.

Regularized trees

The features from a

decision tree A decision tree is a decision support system, decision support recursive partitioning structure that uses a Tree (graph theory), tree-like Causal model, model of decisions and their possible consequences, including probability, chance event ou ...

or a tree ensemble are shown to be redundant. A recent method called regularized treeH. Deng, G. Runger,
Feature Selection via Regularized Trees
, Proceedings of the 2012 International Joint Conference on Neural Networks (IJCNN), IEEE, 2012 can be used for feature subset selection. Regularized trees penalize using a variable similar to the variables selected at previous tree nodes for splitting the current node. Regularized trees only need build one tree model (or one tree ensemble model) and thus are computationally efficient. Regularized trees naturally handle numerical and categorical features, interactions and nonlinearities. They are invariant to attribute scales (units) and insensitive to

outlier In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to a variability in the measurement, an indication of novel data, or it may be the result of experimental error; the latter are ...

s, and thus, require little data preprocessing such as

normalization Normalization or normalisation refers to a process that makes something more normal or regular. Science * Normalization process theory, a sociological theory of the implementation of new technologies or innovations * Normalization model, used in ...

. Regularized random forest (RRF)RRF: Regularized Random Forest
R package on CRAN is one type of regularized trees. The guided RRF is an enhanced RRF which is guided by the importance scores from an ordinary random forest.

Overview on metaheuristics methods

metaheuristic In computer science and mathematical optimization, a metaheuristic is a higher-level procedure or heuristic designed to find, generate, tune, or select a heuristic (partial search algorithm) that may provide a sufficiently good solution to an op ...

is a general description of an algorithm dedicated to solve difficult (typically

NP-hard In computational complexity theory, a computational problem ''H'' is called NP-hard if, for every problem ''L'' which can be solved in non-deterministic polynomial-time, there is a polynomial-time reduction from ''L'' to ''H''. That is, assumi ...

problem) optimization problems for which there is no classical solving methods. Generally, a metaheuristic is a stochastic algorithm tending to reach a global optimum. There are many metaheuristics, from a simple local search to a complex global search algorithm.

Main principles

The feature selection methods are typically presented in three classes based on how they combine the selection algorithm and the model building.

Filter method

Filter type methods select variables regardless of the model. They are based only on general features like the correlation with the variable to predict. Filter methods suppress the least interesting variables. The other variables will be part of a classification or a regression model used to classify or to predict data. These methods are particularly effective in computation time and robust to overfitting. Filter methods tend to select redundant variables when they do not consider the relationships between variables. However, more elaborate features try to minimize this problem by removing variables highly correlated to each other, such as the Fast Correlation Based Filter (FCBF) algorithm.

Wrapper method

Wrapper methods evaluate subsets of variables which allows, unlike filter approaches, to detect the possible interactions amongst variables.T. M. Phuong, Z. Lin et R. B. Altman
Choosing SNPs using feature selection.
Proceedings / IEEE Computational Systems Bioinformatics Conference, CSB. IEEE Computational Systems Bioinformatics Conference, pages 301-309, 2005. . The two main disadvantages of these methods are: * The increasing overfitting risk when the number of observations is insufficient. * The significant computation time when the number of variables is large.

Embedded method

Embedded methods have been recently proposed that try to combine the advantages of both previous methods. A learning algorithm takes advantage of its own variable selection process and performs feature selection and classification simultaneously, such as the FRMT algorithm.

Application of feature selection metaheuristics

This is a survey of the application of feature selection metaheuristics lately used in the literature. This survey was realized by J. Hammon in her 2013 thesis.

Feature selection embedded in learning algorithms

Some learning algorithms perform feature selection as part of their overall operation. These include: * -regularization techniques, such as sparse regression, LASSO, and -SVM * Regularized trees, e.g. regularized random forest implemented in the RRF package *

Decision tree A decision tree is a decision support system, decision support recursive partitioning structure that uses a Tree (graph theory), tree-like Causal model, model of decisions and their possible consequences, including probability, chance event ou ...

* Memetic algorithm * Random multinomial logit (RMNL) * Auto-encoding networks with a bottleneck-layer *

Submodular In mathematics, a submodular set function (also known as a submodular function) is a set function that, informally, describes the relationship between a set of inputs and an output, where adding more of one input has a decreasing additional benefi ...

feature selection * Local learning based feature selection. Compared with traditional methods, it does not involve any heuristic search, can easily handle multi-class problems, and works for both linear and nonlinear problems. It is also supported by a strong theoretical foundation. Numeric experiments showed that the method can achieve a close-to-optimal solution even when data contains >1M irrelevant features. * Recommender system based on feature selection.D.H. Wang, Y.C. Liang, D.Xu, X.Y. Feng, R.C. Guan(2018),
A content-based recommender system for computer science publications
, ''

Knowledge-Based Systems A knowledge-based system (KBS) is a computer program that reasons and uses a knowledge base to solve complex problems. Knowledge-based systems were the focus of early artificial intelligence researchers in the 1980s. The term can refer to a b ...

'', 157: 1-9 The feature selection methods are introduced into recommender system research.

References

External links

Feature Selection Package, Arizona State University (Matlab Code)

NIPS challenge 2003
(see also NIPS)
Naive Bayes implementation with feature selection in Visual Basic
{{Webarchive, url=https://web.archive.org/web/20090214023159/http://paul.luminos.nl/documents/show_document.php?d=198 , date=2009-02-14 (includes executable and source code)

* ttp://mloss.org/software/view/386/ FEAST(Open source Feature Selection algorithms in C and MATLAB) Model selection Dimension reduction

Introduction

Subset selection

Optimality criteria

Structure learning

Information Theory Based Feature Selection Mechanisms

Minimum-redundancy-maximum-relevance (mRMR) feature selection

Quadratic programming feature selection

Conditional mutual information

Joint mutual information

Hilbert-Schmidt Independence Criterion Lasso based feature selection

Correlation feature selection

Regularized trees

Overview on metaheuristics methods

Main principles

Filter method

Wrapper method

Embedded method

Application of feature selection metaheuristics

Feature selection embedded in learning algorithms

See also

References

Further reading

External links