Feature engineering is a preprocessing step in supervised machine learning and statistical modeling which transforms raw data into a more effective set of inputs. Each input comprises several attributes, known as features. By providing models with relevant information, feature engineering significantly enhances their predictive accuracy and decision-making capability.
Beyond machine learning, the principles of feature engineering are applied in various scientific fields, including physics. For example, physicists construct dimensionless numbers such as the Reynolds number in fluid dynamics, the Nusselt number in heat transfer, and the Archimedes number in sedimentation. They also develop first approximations of solutions, such as analytical solutions for the strength of materials in mechanics.
Clustering
One of the applications of feature engineering has been clustering of feature-objects or sample-objects in a dataset. In particular, feature engineering based on matrix decomposition has been extensively used for data clustering under non-negativity constraints on the feature coefficients. Methods include ''Non-Negative Matrix Factorization'' (NMF), ''Non-Negative Matrix Tri-Factorization'' (NMTF), and ''Non-Negative Tensor Decomposition/Factorization'' (NTF/NTD). The non-negativity constraints on the coefficients of the feature vectors mined by these algorithms yield a parts-based representation, and the different factor matrices exhibit natural clustering properties. Several extensions of these feature engineering methods have been reported in the literature, including ''orthogonality-constrained factorization'' for hard clustering and ''manifold learning'' to overcome inherent issues with these algorithms.
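As an illustrative sketch (the toy data and the use of scikit-learn's `NMF` are assumptions for demonstration, not a specific method from the literature above), non-negativity-constrained factorization can cluster samples by factoring a non-negative data matrix X ≈ WH and assigning each sample to the latent factor with the largest coefficient:

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative data matrix: 6 samples x 4 features,
# with two clearly separated blocks of samples.
X = np.array([
    [5.0, 4.0, 0.1, 0.0],
    [4.0, 5.0, 0.0, 0.2],
    [5.0, 5.0, 0.1, 0.1],
    [0.0, 0.1, 4.0, 5.0],
    [0.2, 0.0, 5.0, 4.0],
    [0.1, 0.1, 5.0, 5.0],
])

# Factor X ~= W @ H with non-negative W (6x2) and H (2x4).
model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(X)
H = model.components_

# Each sample's cluster is the latent factor with the largest coefficient,
# exploiting the natural clustering property of the factor matrix W.
labels = W.argmax(axis=1)
```

The parts-based structure appears directly in W: each row is dominated by one factor, so the argmax acts as a hard cluster assignment.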
Other classes of feature engineering algorithms include leveraging a common hidden structure across multiple inter-related datasets to obtain a consensus (common) clustering scheme. An example is ''Multi-view Classification based on Consensus Matrix Decomposition'' (MCMD),
which mines a common clustering scheme across multiple datasets. MCMD is designed to output two types of class labels (scale-variant and scale-invariant clustering), and:
* is computationally robust to missing information,
* can obtain shape- and scale-based outliers,
* and can handle high-dimensional data effectively.
Coupled matrix and tensor decompositions are popular in multi-view feature engineering.
Predictive modelling
Feature engineering in machine learning and statistical modeling involves selecting, creating, transforming, and extracting data features. Key components include feature creation from existing data, transforming and imputing missing or invalid features, reducing data dimensionality through methods like principal component analysis (PCA), independent component analysis (ICA), and linear discriminant analysis (LDA), and selecting the most relevant features for model training based on importance scores and correlation matrices.
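A minimal sketch of one of these components, dimensionality reduction with PCA via scikit-learn (the synthetic data below is hypothetical, constructed so that five observed dimensions carry only two real degrees of freedom):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 100 samples in 5 dimensions, but the variance is concentrated
# in a 2-dimensional latent subspace plus a little noise.
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(100, 5))

# Project onto the top two principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Fraction of total variance retained by the two components.
retained = pca.explained_variance_ratio_.sum()
```

Because the data is nearly rank-2, the two retained components preserve almost all of the variance, which is the usual justification for feeding `X_reduced` to a downstream model instead of `X`.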
Features vary in significance. Even relatively insignificant features may contribute to a model.
Feature selection can reduce the number of features to prevent a model from becoming too specific to the training data set (overfitting).
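As a hedged illustration of feature selection (the synthetic dataset and the choice of a univariate scoring function are assumptions, not a method prescribed above), scikit-learn's `SelectKBest` keeps only the features most related to the target:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(1)
n = 200
# Two informative features plus eight pure-noise features.
informative = rng.normal(size=(n, 2))
noise = rng.normal(size=(n, 8))
X = np.hstack([informative, noise])
y = (informative[:, 0] + informative[:, 1] > 0).astype(int)

# Keep the k features with the strongest univariate relationship to y,
# discarding the noise columns that would invite overfitting.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)
```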
Feature explosion occurs when the number of identified features is too large for effective model estimation or optimization. Common causes include:
* Feature templates - implementing feature templates instead of coding new features
* Feature combinations - combinations that cannot be represented by a linear system
Feature explosion can be limited via techniques such as regularization, kernel methods, and feature selection.
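One way regularization limits feature explosion can be sketched with L1 (lasso) regression, which drives the coefficients of uninformative features to exactly zero; the data below is synthetic and the regularization strength is an assumption chosen for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 100, 20
X = rng.normal(size=(n, p))
# Only the first three of twenty candidate features drive the target.
coef_true = np.zeros(p)
coef_true[:3] = [3.0, -2.0, 1.5]
y = X @ coef_true + 0.1 * rng.normal(size=n)

# L1 regularization shrinks most coefficients to exactly zero,
# keeping the effective model small despite the many candidate features.
lasso = Lasso(alpha=0.1).fit(X, y)
n_active = int(np.count_nonzero(lasso.coef_))
```

Only a handful of coefficients survive, so the model's effective feature count stays close to the true sparsity even when many template-generated features are supplied.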
Automation
Automation of feature engineering is a research topic that dates back to the 1990s.
Machine learning software that incorporates
automated feature engineering has been commercially available since 2016. Related academic literature can be roughly separated into two types:
* Multi-relational decision tree learning (MRDTL) uses a supervised algorithm that is similar to a decision tree.
* Deep Feature Synthesis uses simpler methods.
Multi-relational decision tree learning (MRDTL)
Multi-relational Decision Tree Learning (MRDTL) extends traditional decision tree methods to relational databases, handling complex data relationships across tables. It uses selection graphs as decision nodes, refined systematically until a specific termination criterion is reached.
Most MRDTL studies base implementations on relational databases, which results in many redundant operations. These redundancies can be reduced by using techniques such as tuple id propagation.
Open-source implementations
There are a number of open-source libraries and tools that automate feature engineering on relational data and time series:
* featuretools is a Python library for transforming time series and relational data into feature matrices for machine learning.
* MCMD: an open-source feature engineering algorithm for joint clustering of multiple datasets.
* OneBM or One-Button Machine combines feature transformations with feature selection on relational data.
* getML community is an open-source tool for automated feature engineering on time series and relational data. It is implemented in C/C++ with a Python interface, and has been shown to be at least 60 times faster than tsflex, tsfresh, tsfel, featuretools or kats.
* tsfresh is a Python library for feature extraction on time series data. It evaluates the quality of the features using hypothesis testing.
* tsflex is an open source Python library for extracting features from time series data. Despite being 100% written in Python, it has been shown to be faster and more memory efficient than tsfresh, seglearn or tsfel.
* seglearn is an extension of the scikit-learn Python library for multivariate, sequential time series data.
* tsfel is a Python package for feature extraction on time series data.
* kats is a Python toolkit for analyzing time series data.
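The libraries above automate at scale what the following minimal pandas sketch does by hand (the column names and summary statistics are illustrative assumptions): turning a long-format time series table into a feature matrix with one row per series.

```python
import pandas as pd

# Long-format time series: one row per (series id, time step) observation.
df = pd.DataFrame({
    "id":    [0, 0, 0, 0, 1, 1, 1, 1],
    "time":  [0, 1, 2, 3, 0, 1, 2, 3],
    "value": [1.0, 2.0, 3.0, 4.0, 10.0, 10.0, 10.0, 10.0],
})

# One feature row per series: simple summary statistics of its values.
features = df.groupby("id")["value"].agg(
    mean="mean", std="std", minimum="min", maximum="max"
)
```

Tools such as tsfresh compute hundreds of such statistics per series and then filter them, e.g. by hypothesis testing, rather than relying on a hand-picked set.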
Deep feature synthesis
The deep feature synthesis (DFS) algorithm beat 615 of 906 human teams in a competition.
Feature stores
A feature store is where features are stored and organized for the explicit purpose of being used either to train models (by data scientists) or to make predictions (by applications that have a trained model). It is a central location where groups of features can be created or updated from multiple different data sources, and where new datasets can be built from those feature groups for training models or for applications that prefer to retrieve precomputed features rather than compute them at prediction time.
A feature store includes the ability to store code used to generate features, apply the code to raw data, and serve those features to models upon request. Useful capabilities include feature versioning and policies governing the circumstances under which features can be used.
Feature stores can be standalone software tools or built into machine learning platforms.
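The core capabilities described above can be sketched as a toy in-memory class (a hypothetical API for illustration, not the interface of any particular feature store product): register the code that generates a feature, apply it to raw data, and serve the result on request.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class FeatureStore:
    """Toy in-memory feature store: stores feature-generating code,
    applies it to raw data, and serves the materialized values."""
    _generators: dict = field(default_factory=dict)
    _values: dict = field(default_factory=dict)

    def register(self, name: str, fn: Callable) -> None:
        # Store the code used to generate the feature, not just values.
        self._generators[name] = fn

    def materialize(self, name: str, raw_data: Any) -> None:
        # Apply the registered code to raw data and cache the result.
        self._values[name] = self._generators[name](raw_data)

    def serve(self, name: str) -> Any:
        # Serve a precomputed feature to a model at prediction time.
        return self._values[name]

store = FeatureStore()
store.register("mean_purchase", lambda rows: sum(rows) / len(rows))
store.materialize("mean_purchase", [10.0, 20.0, 30.0])
```

A production feature store would add versioning of the registered code and access policies on top of this registration/materialization/serving cycle.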
Alternatives
Feature engineering can be a time-consuming and error-prone process, as it requires domain expertise and often involves trial and error.
Deep learning algorithms may be used to process a large raw dataset without having to resort to feature engineering. However, deep learning algorithms still require careful preprocessing and cleaning of the input data. In addition, choosing the right architecture, hyperparameters, and optimization algorithm for a deep neural network can be a challenging and iterative process.
See also
* Covariate
* Data transformation
* Feature extraction
* Feature learning
* Hashing trick
* Instrumental variables estimation
* Kernel method
* List of datasets for machine learning research
* Scale co-occurrence matrix
* Space mapping