In predictive analytics and machine learning, concept drift means that the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways. This causes problems because the predictions become less accurate as time passes.
The term ''concept'' refers to the quantity to be predicted. More generally, it can also refer to other phenomena of interest besides the target concept, such as an input, but, in the context of concept drift, the term commonly refers to the target variable.
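One common way to make this precise, offered here as a sketch rather than as the definition the text above relies on, treats the data at time $t$ as drawn from a joint distribution $P_t(X, y)$ over inputs $X$ and target $y$. These symbols are assumptions introduced for illustration. Drift between two time points $t_0$ and $t_1$ then means:

```latex
\exists X : P_{t_0}(X, y) \neq P_{t_1}(X, y),
\qquad
P_t(X, y) = P_t(X)\, P_t(y \mid X)
```

Under this factorization, a change in $P_t(y \mid X)$ alters the relationship the model has learned, whereas a change confined to $P_t(X)$ shifts the inputs without necessarily changing that relationship.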
Examples
In a fraud detection application the target concept may be a binary attribute ''fraudulent'' with values "yes" or "no" that indicates whether a given transaction is fraudulent. Or, in a weather prediction application, there may be several target concepts such as temperature, pressure, and humidity.
The behavior of the customers in an online shop may change over time. For example, suppose weekly merchandise sales are to be predicted, and a predictive model has been developed that works satisfactorily. The model may use inputs such as the amount of money spent on advertising, promotions being run, and other metrics that may affect sales. The model is likely to become less and less accurate over time; this is concept drift. In the merchandise sales application, one reason for concept drift may be seasonality, which means that shopping behavior changes seasonally. Perhaps there will be higher sales in the winter holiday season than during the summer, for example. Concept drift generally occurs when the covariates that comprise the data set begin to explain the variation of the target variable less accurately: confounding variables may have emerged that the model cannot account for, which causes the model accuracy to decrease progressively with time. Generally, it is advised to perform health checks as part of the post-production analysis and to re-train the model with new assumptions upon signs of concept drift.
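The decay described above can be illustrated with a small simulation. This is a minimal sketch under invented assumptions: the sales formula, the rate at which the advertising effect weakens, and every function name are hypothetical and chosen only to make the effect visible. A line is fitted on the first 20 weeks and then scored, unchanged, on later weeks.

```python
import random

def simulate_weekly_sales(weeks, seed=0):
    """Yield (week, ad_spend, sales) where the advertising effect weakens over time."""
    rng = random.Random(seed)
    for week in range(weeks):
        ad_spend = rng.uniform(1.0, 10.0)
        # Hypothetical drift: each unit of advertising buys fewer sales as weeks pass.
        effect = 5.0 - 0.03 * week
        sales = 20.0 + effect * ad_spend + rng.gauss(0.0, 1.0)
        yield week, ad_spend, sales

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

data = list(simulate_weekly_sales(100))
train = data[:20]
intercept, slope = fit_line([row[1] for row in train], [row[2] for row in train])

def mean_abs_error(rows):
    """Mean absolute error of the frozen model on the given weeks."""
    return sum(abs(s - (intercept + slope * a)) for _, a, s in rows) / len(rows)

early_error = mean_abs_error(data[20:40])
late_error = mean_abs_error(data[80:100])
print(f"MAE weeks 20-39: {early_error:.2f}, weeks 80-99: {late_error:.2f}")
```

Comparing the error on a recent window with the error measured shortly after training is one simple form of the health check mentioned above.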
Possible remedies
To prevent deterioration in prediction accuracy because of concept drift, ''reactive'' and ''tracking'' solutions can be adopted. Reactive solutions retrain the model in response to a triggering mechanism, such as a change-detection test, that explicitly detects concept drift as a change in the statistics of the data-generating process. When concept drift is detected, the current model is no longer up-to-date and must be replaced by a new one to restore prediction accuracy. A shortcoming of reactive approaches is that performance may decay until the change is detected. Tracking solutions seek to track the changes in the concept by continually updating the model. Methods for achieving this include online machine learning, frequent retraining on the most recently observed samples, and maintaining an ensemble of classifiers where one new classifier is trained on the most recent batch of examples and replaces the oldest classifier in the ensemble.
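Both families of remedies can be sketched in a few lines. The code below is illustrative only and does not reproduce any published detector: `WindowedDriftDetector` is a hypothetical reactive trigger that compares a recent error rate with a reference rate, and `SlidingEnsemble` is a hypothetical tracking scheme in which the newest model replaces the oldest. The window size, margin, and toy trainer are assumptions.

```python
from collections import deque

class WindowedDriftDetector:
    """Flags drift when the recent error rate exceeds the reference rate by `margin`."""

    def __init__(self, window=50, margin=0.2):
        self.window = window
        self.margin = margin
        self.reference = []                 # errors seen right after (re)training
        self.recent = deque(maxlen=window)  # most recent errors

    def update(self, error):
        """Record one 0/1 prediction error; return True if drift is signalled."""
        if len(self.reference) < self.window:
            self.reference.append(error)
            return False
        self.recent.append(error)
        if len(self.recent) < self.window:
            return False
        ref_rate = sum(self.reference) / self.window
        recent_rate = sum(self.recent) / self.window
        return recent_rate - ref_rate > self.margin

    def reset(self):
        """Call after retraining so the new model gets a fresh reference window."""
        self.reference = []
        self.recent.clear()

class SlidingEnsemble:
    """Keeps the `size` most recent models; a new model evicts the oldest one."""

    def __init__(self, size=3):
        self.models = deque(maxlen=size)

    def add_batch(self, batch, train):
        self.models.append(train(batch))

    def predict(self, x):
        votes = [model(x) for model in self.models]
        return max(set(votes), key=votes.count)  # majority vote

def train_majority(batch):
    """Toy trainer (an assumption): the 'model' always predicts the batch's majority label."""
    label = max(set(batch), key=batch.count)
    return lambda x: label

# Reactive: roughly 10% errors for 200 steps, then roughly 60% errors after a drift.
detector = WindowedDriftDetector(window=50, margin=0.2)
errors = [1 if i % 10 == 0 else 0 for i in range(200)]
errors += [1 if i % 10 < 6 else 0 for i in range(200)]
detected_at = next(i for i, e in enumerate(errors) if detector.update(e))

# Tracking: four batches arrive, the ensemble keeps only the last three models.
ensemble = SlidingEnsemble(size=3)
for batch in (["no", "no"], ["no", "yes", "no"], ["yes", "yes"], ["yes"]):
    ensemble.add_batch(batch, train_majority)
```

In this sketch the detector fires some steps after the drift begins, which reflects the shortcoming noted above: performance decays until enough evidence has accumulated in the recent window.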
Contextual information, when available, can be used to better explain the causes of the concept drift: for instance, in the sales prediction application, concept drift might be compensated by adding information about the season to the model. By providing information about the time of the year, the rate of deterioration of the model is likely to decrease, but concept drift is unlikely to be eliminated altogether. This is because actual shopping behavior does not follow any static, finite model. New factors that influence shopping behavior may arise at any time, and the influence of the known factors or their interactions may change.
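One possible way to supply the time of the year to a model is a cyclical encoding, sketched below. The encoding choice and the names `season_features` and `build_inputs` are assumptions made for illustration; the text does not prescribe a particular representation.

```python
import math
from datetime import date

def season_features(day):
    """Encode the day of the year as a point on a circle so 31 Dec sits next to 1 Jan."""
    angle = 2.0 * math.pi * (day.timetuple().tm_yday - 1) / 365.25
    return math.sin(angle), math.cos(angle)

def build_inputs(day, ad_spend, promotion_running):
    """Hypothetical feature row for the sales model: original inputs plus season."""
    sin_t, cos_t = season_features(day)
    return [ad_spend, 1.0 if promotion_running else 0.0, sin_t, cos_t]

row = build_inputs(date(2023, 12, 20), ad_spend=1200.0, promotion_running=True)
```

A plain month number would place December and January eleven units apart even though they are adjacent in the shopping calendar; the sine and cosine pair avoids that discontinuity.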
Concept drift cannot be avoided for complex phenomena that are not governed by fixed laws of nature. All processes that arise from human activity, such as socioeconomic processes, and biological processes are likely to experience concept drift. Therefore, periodic retraining, also known as refreshing, of any model is necessary.
Software
* NannyML: an open-source Python library for detecting univariate and multivariate distribution drift and estimating machine learning model performance without ground truth labels.
* RapidMiner: formerly ''Yet Another Learning Environment'' (YALE); free open-source software for knowledge discovery, data mining, and machine learning, also featuring data stream mining, learning time-varying concepts, and tracking drifting concepts. It is used in combination with its data stream mining plugin (formerly concept drift plugin).
* EDDM (Early Drift Detection Method): free open-source implementation of drift detection methods in Weka.
* MOA (Massive Online Analysis): free open-source software specific for mining data streams with concept drift. It contains a prequential evaluation method, the EDDM concept drift methods, a reader of ARFF real datasets, and artificial stream generators such as SEA concepts, STAGGER, rotating hyperplane, random tree, and random radius based functions. MOA supports bi-directional interaction with Weka.
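Prequential evaluation, mentioned in the MOA entry above, is commonly described as test-then-train: each example is used to score the model before the model learns from it. The sketch below is a generic Python illustration of that idea and not MOA's API (MOA is written in Java); `prequential_accuracy` and `LastLabelModel` are hypothetical names.

```python
def prequential_accuracy(stream, predict, learn):
    """Test-then-train: score each example before the model is allowed to learn from it."""
    correct = 0
    total = 0
    for x, y in stream:
        if predict(x) == y:
            correct += 1
        total += 1
        learn(x, y)
    return correct / total if total else 0.0

class LastLabelModel:
    """Toy learner (an assumption): predicts the most recently seen label."""

    def __init__(self, default=0):
        self.last = default

    def predict(self, x):
        return self.last

    def learn(self, x, y):
        self.last = y

# A stream whose label flips from 0 to 1 halfway through (an abrupt drift).
stream = [(i, 0) for i in range(50)] + [(i, 1) for i in range(50, 100)]
model = LastLabelModel()
accuracy = prequential_accuracy(stream, model.predict, model.learn)
```

Because every example is tested before it is learned, the score reflects how the model would have performed on unseen data at each moment, which makes it suitable for streams where the concept changes.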
Datasets
Real
* USP Data Stream Repository, 27 real-world stream datasets with concept drift compiled by Souza et al. (2020)
* Airline, approximately 116 million flight arrival and departure records (cleaned and sorted) compiled by E. Ikonomovska. Reference: Data Expo 2009 Competition
* Chess.com (online games) and Luxembourg (social survey) datasets compiled by I. Zliobaite
* ECUE spam, 2 datasets each consisting of more than 10,000 emails collected over a period of approximately 2 years by an individual. Available from S. J. Delany's webpage
* Elec2, electricity demand, 2 classes, 45,312 instances. Reference: M. Harries, Splice-2 comparative evaluation: Electricity pricing, Technical report, The University of New South Wales, 1999. Available from J. Gama's webpage, with a comment on applicability
* PAKDD'09 competition data represents the credit evaluation task. It is collected over a five-year period. Unfortunately, the true labels are released only for the first part of the data
* Sensor stream and Power supply stream datasets are available from X. Zhu's Stream Data Mining Repository.
* SMEAR is a benchmark data stream with a lot of missing values. Environment observation data over 7 years. Predict cloudiness
* Text mining, a collection of text mining datasets with concept drift, maintained by I. Katakis
* Gas Sensor Array Drift Dataset, a collection of 13,910 measurements from 16 chemical sensors utilized for drift compensation in a discrimination task of 6 gases at various levels of concentrations
Other
* KDD'99 competition data contains ''simulated'' intrusions in a military network environment. It is often used as a benchmark to evaluate handling concept drift
Synthetic
* Extreme verification latency benchmark, available from Nonstationary Environments – Archive.
* Sine, Line, Plane, Circle and Boolean Data Sets, available from L. Minku's webpage.
* SEA concepts, available from J. Gama's webpage.
* STAGGER
* Mixed
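To give a sense of how a synthetic drifting stream such as SEA concepts can be produced, the sketch below follows the way that benchmark is commonly described: three attributes drawn uniformly from [0, 10], a label that depends on whether the sum of the first two falls below a threshold, and a threshold that changes from one concept to the next. The threshold values and noise rate used here are the commonly cited ones and should be treated as assumptions, as should the function name `sea_stream`.

```python
import random

# Commonly cited thresholds for the four SEA concepts; treat them as an assumption.
SEA_THRESHOLDS = (8.0, 9.0, 7.0, 9.5)

def sea_stream(samples_per_concept, noise=0.1, seed=0):
    """Yield (attributes, label) pairs; the labelling rule changes at each concept boundary."""
    rng = random.Random(seed)
    for theta in SEA_THRESHOLDS:
        for _ in range(samples_per_concept):
            x = [rng.uniform(0.0, 10.0) for _ in range(3)]  # third attribute is irrelevant
            label = 1 if x[0] + x[1] <= theta else 0
            if rng.random() < noise:                        # class noise flips the label
                label = 1 - label
            yield x, label

stream = list(sea_stream(samples_per_concept=1000, noise=0.0))
```

Each change of threshold is an abrupt drift at a known position, which is what makes generators of this kind useful for checking whether a detector fires where it should.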
Data generation frameworks
* Download from L. Minku's webpage.
Projects
* INFER: Computational Intelligence Platform for Evolving and Robust Predictive Systems (2010–2014), Bournemouth University (UK), Evonik Industries (Germany), Research and Engineering Centre (Poland)
* HaCDAIS: Handling Concept Drift in Adaptive Information Systems (2008–2012), Eindhoven University of Technology (the Netherlands)
* KDUS: Knowledge Discovery from Ubiquitous Streams, INESC Porto and Laboratory of Artificial Intelligence and Decision Support (Portugal)
* ADEPT: Adaptive Dynamic Ensemble Prediction Techniques, University of Manchester (UK), University of Bristol (UK)
* ALADDIN: Autonomous Learning Agents for Decentralised Data and Information Networks (2005–2010)
* GAENARI: C++ incremental decision tree algorithm that minimizes the damage caused by concept drift (2022)
Benchmarks
* NAB: The Numenta Anomaly Benchmark, a benchmark for evaluating algorithms for anomaly detection in streaming, real-time applications (2014–2018)
Meetings
* 2014
** Special Session on "Concept Drift, Domain Adaptation & Learning in Dynamic Environments" @IEEE IJCNN 2014
* 2013
** RealStream: Real-World Challenges for Data Stream Mining Workshop-Discussion at the ECML PKDD 2013, Prague, Czech Republic.
** LEAPS 2013: The 1st International Workshop on Learning stratEgies and dAta Processing in nonStationary environments
* 2011
** Special Session on Learning in evolving environments and its application on real-world problems at ICMLA'11
** HaCDAIS 2011: The 2nd International Workshop on Handling Concept Drift in Adaptive Information Systems
** ICAIS 2011: Track on Incremental Learning
** IJCNN 2011: Special Session on Concept Drift and Learning Dynamic Environments
** Symposium on Computational Intelligence in Dynamic and Uncertain Environments
* 2010
** HaCDAIS 2010: International Workshop on Handling Concept Drift in Adaptive Information Systems: Importance, Challenges and Solutions
** Special Session on Dynamic learning in non-stationary environments
** SAC 2010: Data Streams Track at ACM Symposium on Applied Computing
** SensorKDD 2010: International Workshop on Knowledge Discovery from Sensor Data
** StreamKDD 2010: Novel Data Stream Pattern Mining Techniques
** Concept Drift and Learning in Nonstationary Environments at IEEE World Congress on Computational Intelligence
** MLMDS’2010: Special Session on Machine Learning Methods for Data Streams at the 10th International Conference on Intelligent Design and Applications, ISDA’10
Bibliographic references
Many papers have been published describing algorithms for concept drift detection. Only reviews, surveys and overviews are listed here:
Reviews
* Zliobaite, I., Learning under Concept Drift: an Overview. Technical Report. 2009, Faculty of Mathematics and Informatics, Vilnius University: Vilnius, Lithuania
See also
* Data stream mining
* Data mining
* Machine learning