Feature engineering is a preprocessing step in supervised machine learning and statistical modeling which transforms raw data into a more effective set of inputs. Each input comprises several attributes, known as features. By providing models with relevant information, feature engineering significantly enhances their predictive accuracy and decision-making capability.
Beyond machine learning, the principles of feature engineering are applied in various scientific fields, including physics. For example, physicists construct dimensionless numbers such as the Reynolds number in fluid dynamics, the Nusselt number in heat transfer, and the Archimedes number in sedimentation. They also develop first approximations of solutions, such as analytical solutions for the strength of materials in mechanics.
Clustering
One of the applications of feature engineering has been clustering of feature-objects or sample-objects in a dataset. In particular, feature engineering based on matrix decomposition has been extensively used for data clustering under non-negativity constraints on the feature coefficients. Methods include ''Non-Negative Matrix Factorization'' (NMF), ''Non-Negative Matrix Tri-Factorization'' (NMTF), and ''Non-Negative Tensor Decomposition/Factorization'' (NTF/NTD). The non-negativity constraints on the coefficients of the feature vectors mined by these algorithms yield a parts-based representation, and the different factor matrices exhibit natural clustering properties. Several extensions of these feature engineering methods have been reported in the literature, including ''orthogonality-constrained factorization'' for hard clustering and ''manifold learning'' to overcome inherent issues with these algorithms.
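As an illustrative sketch (the toy data and the use of scikit-learn's `NMF` are assumptions for demonstration, not a specific method from the literature above), non-negativity-constrained factorization can cluster samples by factoring a non-negative data matrix X ≈ WH and assigning each sample to the latent factor with the largest coefficient:

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative data matrix: 6 samples x 4 features,
# with two clearly separated blocks of samples.
X = np.array([
    [5.0, 4.0, 0.1, 0.0],
    [4.0, 5.0, 0.0, 0.2],
    [5.0, 5.0, 0.1, 0.1],
    [0.0, 0.1, 4.0, 5.0],
    [0.2, 0.0, 5.0, 4.0],
    [0.1, 0.1, 5.0, 5.0],
])

# Factor X ~= W @ H with non-negative W (6x2) and H (2x4).
model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(X)
H = model.components_

# Each sample's cluster is the latent factor with the largest coefficient,
# exploiting the natural clustering property of the factor matrix W.
labels = W.argmax(axis=1)
```

The parts-based structure appears directly in W: each row is dominated by one factor, so the argmax acts as a hard cluster assignment.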
Other classes of feature engineering algorithms include leveraging a common hidden structure across multiple inter-related datasets to obtain a consensus (common) clustering scheme. An example is ''Multi-view Classification based on Consensus Matrix Decomposition'' (MCMD),
which mines a common clustering scheme across multiple datasets. MCMD is designed to output two types of class labels (scale-variant and scale-invariant clustering), and:
* is computationally robust to missing information,
* can obtain shape- and scale-based outliers,
* and can handle high-dimensional data effectively.
Coupled matrix and tensor decompositions are popular in multi-view feature engineering.
Predictive modelling
Feature engineering in machine learning and statistical modeling involves selecting, creating, transforming, and extracting data features. Key components include feature creation from existing data, transforming and imputing missing or invalid features, reducing data dimensionality through methods like principal component analysis (PCA), independent component analysis (ICA), and linear discriminant analysis (LDA), and selecting the most relevant features for model training based on importance scores and correlation matrices.
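A minimal sketch of one of these components, dimensionality reduction with PCA via scikit-learn (the synthetic data below is hypothetical, constructed so that five observed dimensions carry only two real degrees of freedom):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 100 samples in 5 dimensions, but the variance is concentrated
# in a 2-dimensional latent subspace plus a little noise.
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(100, 5))

# Project onto the top two principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Fraction of total variance retained by the two components.
retained = pca.explained_variance_ratio_.sum()
```

Because the data is nearly rank-2, the two retained components preserve almost all of the variance, which is the usual justification for feeding `X_reduced` to a downstream model instead of `X`.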
Features vary in significance. Even relatively insignificant features may contribute to a model.
Feature selection can reduce the number of features to prevent a model from becoming too specific to the training data set (overfitting).
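As a hedged illustration of feature selection (the synthetic dataset and the choice of a univariate scoring function are assumptions, not a method prescribed above), scikit-learn's `SelectKBest` keeps only the features most related to the target:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(1)
n = 200
# Two informative features plus eight pure-noise features.
informative = rng.normal(size=(n, 2))
noise = rng.normal(size=(n, 8))
X = np.hstack([informative, noise])
y = (informative[:, 0] + informative[:, 1] > 0).astype(int)

# Keep the k features with the strongest univariate relationship to y,
# discarding the noise columns that would invite overfitting.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)
```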
Feature explosion occurs when the number of identified features is too large for effective model estimation or optimization. Common causes include:
* Feature templates - implementing feature templates instead of coding new features
* Feature combinations - combinations that cannot be represented by a linear system
Feature explosion can be limited via techniques such as regularization, kernel methods, and feature selection.
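One way regularization limits feature explosion can be sketched with L1 (lasso) regression, which drives the coefficients of uninformative features to exactly zero; the data below is synthetic and the regularization strength is an assumption chosen for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 100, 20
X = rng.normal(size=(n, p))
# Only the first three of twenty candidate features drive the target.
coef_true = np.zeros(p)
coef_true[:3] = [3.0, -2.0, 1.5]
y = X @ coef_true + 0.1 * rng.normal(size=n)

# L1 regularization shrinks most coefficients to exactly zero,
# keeping the effective model small despite the many candidate features.
lasso = Lasso(alpha=0.1).fit(X, y)
n_active = int(np.count_nonzero(lasso.coef_))
```

Only a handful of coefficients survive, so the model's effective feature count stays close to the true sparsity even when many template-generated features are supplied.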
Automation
Automation of feature engineering is a research topic that dates back to the 1990s.
Machine learning software that incorporates
automated feature engineering has been commercially available since 2016. Related academic literature can be roughly separated into two types:
* Multi-relational decision tree learning (MRDTL) uses a supervised algorithm that is similar to a decision tree.
* Deep Feature Synthesis uses simpler methods.
Multi-relational decision tree learning (MRDTL)
Multi-relational Decision Tree Learning (MRDTL) extends traditional decision tree methods to relational databases, handling complex data relationships across tables. It uses selection graphs as decision nodes, refined systematically until a specific termination criterion is reached.
Most MRDTL studies base implementations on relational databases, which results in many redundant operations. These redundancies can be reduced by using techniques such as tuple id propagation.
Open-source implementations
There are a number of open-source libraries and tools that automate feature engineering on relational data and time series:
* featuretools is a Python library for transforming time series and relational data into feature matrices for machine learning.
* MCMD: an open-source feature engineering algorithm for joint clustering of multiple datasets.
* OneBM or One-Button Machine combines feature transformations with feature selection on relational data.
* getML community is an open-source tool for automated feature engineering on time series and relational data. It is implemented in C/C++ with a Python interface, and has been shown to be at least 60 times faster than tsflex, tsfresh, tsfel, featuretools or kats.
* tsfresh is a Python library for feature extraction on time series data. It evaluates the quality of the features using hypothesis testing.
* tsflex is an open source Python library for extracting features from time series data. Despite being 100% written in Python, it has been shown to be faster and more memory efficient than tsfresh, seglearn or tsfel.
* seglearn is an extension of the scikit-learn Python library for multivariate, sequential time series data.
* tsfel is a Python package for feature extraction on time series data.
* kats is a Python toolkit for analyzing time series data.
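The libraries above automate at scale what the following minimal pandas sketch does by hand (the column names and summary statistics are illustrative assumptions): turning a long-format time series table into a feature matrix with one row per series.

```python
import pandas as pd

# Long-format time series: one row per (series id, time step) observation.
df = pd.DataFrame({
    "id":    [0, 0, 0, 0, 1, 1, 1, 1],
    "time":  [0, 1, 2, 3, 0, 1, 2, 3],
    "value": [1.0, 2.0, 3.0, 4.0, 10.0, 10.0, 10.0, 10.0],
})

# One feature row per series: simple summary statistics of its values.
features = df.groupby("id")["value"].agg(
    mean="mean", std="std", minimum="min", maximum="max"
)
```

Tools such as tsfresh compute hundreds of such statistics per series and then filter them, e.g. by hypothesis testing, rather than relying on a hand-picked set.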
Deep feature synthesis
The deep feature synthesis (DFS) algorithm beat 615 of 906 human teams in a competition.
Feature stores
A feature store is where features are stored and organized for the explicit purpose of being used either to train models (by data scientists) or to make predictions (by applications that have a trained model). It is a central location where groups of features can be created or updated from multiple different data sources, and where new datasets can be built from those feature groups for training models or for applications that prefer to retrieve precomputed features rather than compute them at prediction time.
A feature store includes the ability to store code used to generate features, apply the code to raw data, and serve those features to models upon request. Useful capabilities include feature versioning and policies governing the circumstances under which features can be used.
Feature stores can be standalone software tools or built into machine learning platforms.
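The core capabilities described above can be sketched as a toy in-memory class (a hypothetical API for illustration, not the interface of any particular feature store product): register the code that generates a feature, apply it to raw data, and serve the result on request.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class FeatureStore:
    """Toy in-memory feature store: stores feature-generating code,
    applies it to raw data, and serves the materialized values."""
    _generators: dict = field(default_factory=dict)
    _values: dict = field(default_factory=dict)

    def register(self, name: str, fn: Callable) -> None:
        # Store the code used to generate the feature, not just values.
        self._generators[name] = fn

    def materialize(self, name: str, raw_data: Any) -> None:
        # Apply the registered code to raw data and cache the result.
        self._values[name] = self._generators[name](raw_data)

    def serve(self, name: str) -> Any:
        # Serve a precomputed feature to a model at prediction time.
        return self._values[name]

store = FeatureStore()
store.register("mean_purchase", lambda rows: sum(rows) / len(rows))
store.materialize("mean_purchase", [10.0, 20.0, 30.0])
```

A production feature store would add versioning of the registered code and access policies on top of this registration/materialization/serving cycle.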
Alternatives
Feature engineering can be a time-consuming and error-prone process, as it requires domain expertise and often involves trial and error.
Deep learning algorithms may be used to process a large raw dataset without having to resort to feature engineering. However, deep learning algorithms still require careful preprocessing and cleaning of the input data. In addition, choosing the right architecture, hyperparameters, and optimization algorithm for a deep neural network can be a challenging and iterative process.
See also
* Covariate
* Data transformation
* Feature extraction
* Feature learning
* Hashing trick
* Instrumental variables estimation
* Kernel method
* List of datasets for machine learning research
* Scale co-occurrence matrix
* Space mapping