Concept Drift
   HOME

TheInfoList



OR:

In
predictive analytics Predictive analytics encompasses a variety of Statistics, statistical techniques from data mining, Predictive modelling, predictive modeling, and machine learning that analyze current and historical facts to make predictions about future or other ...
,
data science Data science is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processing, scientific visualization, algorithms and systems to extract or extrapolate knowledge from potentially noisy, stru ...
,
machine learning Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of Computational statistics, statistical algorithms that can learn from data and generalise to unseen data, and thus perform Task ( ...
and related fields, concept drift or drift is an evolution of data that invalidates the
data model A data model is an abstract model that organizes elements of data and standardizes how they relate to one another and to the properties of real-world entities. For instance, a data model may specify that the data element representing a car be ...
. It happens when the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways. This causes problems because the predictions become less accurate as time passes. Drift detection and drift adaptation are of paramount importance in the fields that involve dynamically changing data and data models.


Predictive model decay

In machine learning and
predictive analytics Predictive analytics encompasses a variety of Statistics, statistical techniques from data mining, Predictive modelling, predictive modeling, and machine learning that analyze current and historical facts to make predictions about future or other ...
this drift phenomenon is called concept drift. In machine learning, a common element of a data model are the statistical properties, such as
probability distribution In probability theory and statistics, a probability distribution is a Function (mathematics), function that gives the probabilities of occurrence of possible events for an Experiment (probability theory), experiment. It is a mathematical descri ...
of the actual data. If they deviate from the statistical properties of the training data set, then the learned predictions may become invalid, if the drift is not addressed.


Data configuration decay

Another important area is
software engineering Software engineering is a branch of both computer science and engineering focused on designing, developing, testing, and maintaining Application software, software applications. It involves applying engineering design process, engineering principl ...
, where three types of data drift affecting data fidelity may be recognized. Changes in the software environment ("infrastructure drift") may invalidate software infrastructure configuration. "Structural drift" happens when the data
schema Schema may refer to: Science and technology * SCHEMA (bioinformatics), an algorithm used in protein engineering * Schema (genetic algorithms), a set of programs or bit strings that have some genotypic similarity * Schema.org, a web markup vocab ...
changes, which may invalidate databases. "Semantic drift" is changes in the meaning of data while the structure does not change. In many cases this may happen in complicated applications when many independent developers introduce changes without proper awareness of the effects of their changes in other areas of the software system."Driftctl and Terraform, they're two of a kind!"
/ref>Girish Pancha
Big Data's Hidden Scourge: Data Drift
''CMSWire'', April 8, 2016
For many application systems, the nature of data on which they operate are subject to changes for various reasons, e.g., due to changes in business model, system updates, or switching the platform on which the system operates. In the case of
cloud computing Cloud computing is "a paradigm for enabling network access to a scalable and elastic pool of shareable physical or virtual resources with self-service provisioning and administration on-demand," according to International Organization for ...
, infrastructure drift that may affect the applications running on cloud may be caused by the updates of cloud software. There are several types of detrimental effects of data drift on data fidelity. Data corrosion is passing the drifted data into the system undetected. Data loss happens when valid data are ignored due to non-conformance with the applied schema. Squandering is the phenomenon when new data fields are introduced upstream the data processing pipeline, but somewhere downstream there data fields are absent.


Inconsistent data

"Data drift" may refer to the phenomenon when database records fail to match the real-world data due to the changes in the latter over time. This is a common problem with databases involving people, such as customers, employees, citizens, residents, etc. Human data drift may be caused by unrecorded changes in personal data, such as place of residence or name, as well as due to errors during data input. "Data drift" may also refer to inconsistency of data elements between several replicas of a database. The reasons can be difficult to identify. A simple drift detection is to run
checksum A checksum is a small-sized block of data derived from another block of digital data for the purpose of detecting errors that may have been introduced during its transmission or storage. By themselves, checksums are often used to verify dat ...
regularly. However the remedy may be not so easy.


Examples

The behavior of the customers in an
online shop Online shopping is a form of electronic commerce which allows consumers to directly buy goods or services from a seller over the Internet using a web browser or a mobile app. Consumers find a product of interest by visiting the website of the ...
may change over time. For example, if weekly merchandise sales are to be predicted, and a predictive model has been developed that works satisfactorily. The model may use inputs such as the amount of money spent on
advertising Advertising is the practice and techniques employed to bring attention to a Product (business), product or Service (economics), service. Advertising aims to present a product or service in terms of utility, advantages, and qualities of int ...
, promotions being run, and other metrics that may affect sales. The model is likely to become less and less accurate over time – this is concept drift. In the merchandise sales application, one reason for concept drift may be seasonality, which means that shopping behavior changes seasonally. Perhaps there will be higher sales in the winter holiday season than during the summer, for example. Concept drift generally occurs when the covariates that comprise the data set begin to explain the variation of your target set less accurately — there may be some
confounding In causal inference, a confounder is a variable that influences both the dependent variable and independent variable, causing a spurious association. Confounding is a causal concept, and as such, cannot be described in terms of correlatio ...
variables that have emerged, and that one simply cannot account for, which renders the model accuracy to progressively decrease with time. Generally, it is advised to perform health checks as part of the post-production analysis and to re-train the model with new assumptions upon signs of concept drift.


Possible remedies

To prevent deterioration in
prediction A prediction (Latin ''præ-'', "before," and ''dictum'', "something said") or forecast is a statement about a future event or about future data. Predictions are often, but not always, based upon experience or knowledge of forecasters. There ...
accuracy because of concept drift, ''reactive'' and ''tracking'' solutions can be adopted. Reactive solutions retrain the model in reaction to a triggering mechanism, such as a change-detection test, to explicitly detect concept drift as a change in the statistics of the data-generating process. When concept drift is detected, the current model is no longer up-to-date and must be replaced by a new one to restore prediction accuracy. A shortcoming of reactive approaches is that performance may decay until the change is detected. Tracking solutions seek to track the changes in the concept by continually updating the model. Methods for achieving this include
online machine learning In computer science, online machine learning is a method of machine learning in which data becomes available in a sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques whic ...
, frequent retraining on the most recently observed samples, and maintaining an ensemble of classifiers where one new classifier is trained on the most recent batch of examples and replaces the oldest classifier in the ensemble. Contextual information, when available, can be used to better explain the causes of the concept drift: for instance, in the sales prediction application, concept drift might be compensated by adding information about the season to the model. By providing information about the time of the year, the rate of deterioration of your model is likely to decrease, but concept drift is unlikely to be eliminated altogether. This is because actual shopping behavior does not follow any static, finite model. New factors may arise at any time that influence shopping behavior, the influence of the known factors or their interactions may change. Concept drift cannot be avoided for complex phenomena that are not governed by fixed laws of nature. All processes that arise from human activity, such as
socioeconomic Economics () is a behavioral science that studies the Production (economics), production, distribution (economics), distribution, and Consumption (economics), consumption of goods and services. Economics focuses on the behaviour and interac ...
processes, and
biological processes Biological processes are those processes that are necessary for an organism to live and that shape its capacities for interacting with its environment. Biological processes are made of many chemical reactions or other events that are involved in ...
are likely to experience concept drift. Therefore, periodic retraining, also known as refreshing, of any model is necessary.


See also

*
Data stream mining Data Stream Mining (also known as stream learning) is the process of extracting knowledge structures from continuous, rapid data records. A data stream is an ordered sequence of instances that in many applications of data stream mining can be read o ...
*
Data mining Data mining is the process of extracting and finding patterns in massive data sets involving methods at the intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary subfield of computer science and ...
*
Snyk Snyk Limited is a developer-oriented cybersecurity company, specializing in securing custom developed code, open-source dependencies and cloud infrastructure. It was founded in 2015 out of Tel Aviv and London and is headquartered in Boston. Hist ...
, a company whose portfolio includes drift detection in software applications


Further reading

Many papers have been published describing algorithms for concept drift detection. Only reviews, surveys and overviews are here:


Reviews

* * * * * * * * * * * *


External links


Software


Frouros
An open-source
Python Python may refer to: Snakes * Pythonidae, a family of nonvenomous snakes found in Africa, Asia, and Australia ** ''Python'' (genus), a genus of Pythonidae found in Africa and Asia * Python (mythology), a mythical serpent Computing * Python (prog ...
library for drift detection in
machine learning Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of Computational statistics, statistical algorithms that can learn from data and generalise to unseen data, and thus perform Task ( ...
systems.
NannyML
An open-source
Python Python may refer to: Snakes * Pythonidae, a family of nonvenomous snakes found in Africa, Asia, and Australia ** ''Python'' (genus), a genus of Pythonidae found in Africa and Asia * Python (mythology), a mythical serpent Computing * Python (prog ...
library for detecting
univariate In mathematics, a univariate object is an expression (mathematics), expression, equation, function (mathematics), function or polynomial involving only one Variable (mathematics), variable. Objects involving more than one variable are ''wikt:multi ...
and
multivariate distribution Multivariate is the quality of having multiple variables. It may also refer to: In mathematics * Multivariable calculus * Multivariate function * Multivariate polynomial * Multivariate interpolation * Multivariate optimization In computing * ...
drift and estimating
machine learning Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of Computational statistics, statistical algorithms that can learn from data and generalise to unseen data, and thus perform Task ( ...
model performance without ground truth labels. *
RapidMiner RapidMiner is a data science platform that analyses the collective impact of an organization's data. It was acquired by Altair Engineering in September 2022. History RapidMiner, formerly known as YALE (Yet Another Learning Environment), was deve ...
: Formerly ''Yet Another Learning Environment'' (YALE): free open-source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time-varying concepts, and tracking drifting concept. It is used in combination with its data stream mining plugin (formerly concept drift plugin). * EDDM
Early Drift Detection Method
: free open-source implementation of drift detection methods in
Weka The weka, also known as the Māori hen or woodhen (''Gallirallus australis'') is a flightless bird species of the rail family. It is endemic to New Zealand. Some authorities consider it as the only extant member of the genus '' Gallirallus''. ...
. * MOA (Massive Online Analysis): free open-source software specific for mining data streams with concept drift. It contains a prequential evaluation method, the EDDM concept drift methods, a reader of ARFF real datasets, and artificial stream generators as SEA concepts, STAGGER, rotating hyperplane, random tree, and random radius based functions. MOA supports bi-directional interaction with
Weka The weka, also known as the Māori hen or woodhen (''Gallirallus australis'') is a flightless bird species of the rail family. It is endemic to New Zealand. Some authorities consider it as the only extant member of the genus '' Gallirallus''. ...
.


Datasets


Real

* USP Data Stream Repository, 27 real-world stream datasets with concept drift compiled by Souza et al. (2020)
Access
* Airline, approximately 116 million flight arrival and departure records (cleaned and sorted) compiled by E. Ikonomovska. Reference: Data Expo 2009 Competitio
Access
* Chess.com (online games) and Luxembourg (social survey) datasets compiled by I. Zliobaite
Access
* ECUE spam 2 datasets each consisting of more than 10,000 emails collected over a period of approximately 2 years by an individual

from S.J.Delany webpage * Elec2, electricity demand, 2 classes, 45,312 instances. Reference: M. Harries, Splice-2 comparative evaluation: Electricity pricing, Technical report, The University of South Wales, 1999

from J.Gama webpage. arxiv:1301.3524, Comment on applicability. * PAKDD'09 competition data represents the credit evaluation task. It is collected over a five-year period. Unfortunately, the true labels are released only for the first part of the data
Access
* Sensor stream and Power supply stream datasets are available from X. Zhu's Stream Data Mining Repository.

* SMEAR is a benchmark data stream with a lot of missing values. Environment observation data over 7 years. Predict cloudiness
Access
* Text mining, a collection of
text mining Text mining, text data mining (TDM) or text analytics is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from differe ...
datasets with concept drift, maintained by I. Katakis
Access
* Gas Sensor Array Drift Dataset, a collection of 13,910 measurements from 16 chemical sensors utilized for drift compensation in a discrimination task of 6 gases at various levels of concentrations
Access


Other

* KDD'99 competition data contains ''simulated'' intrusions in a military network environment. It is often used as a benchmark to evaluate handling concept drift


Synthetic

* Extreme verification latency benchmark
Access
from Nonstationary Environments – Archive. * Sine, Line, Plane, Circle and Boolean Data Sets
Access
from L.Minku webpage. * SEA concepts

from J.Gama webpage. * STAGGER * Mixed


Data generation frameworks

*
Download
from L.Minku webpage. *


Projects


INFER
Computational Intelligence Platform for Evolving and Robust Predictive Systems (2010–2014), Bournemouth University (UK), Evonik Industries (Germany), Research and Engineering Centre (Poland)
HaCDAIS
Handling Concept Drift in Adaptive Information Systems (2008–2012), Eindhoven University of Technology (the Netherlands)
KDUS
Knowledge Discovery from Ubiquitous Streams, INESC Porto and Laboratory of Artificial Intelligence and Decision Support (Portugal)
ADEPT
Adaptive Dynamic Ensemble Prediction Techniques, University of Manchester (UK), University of Bristol (UK)
ALADDIN
autonomous learning agents for decentralised data and information networks (2005–2010)
GAENARI
C++ incremental decision tree algorithm. it minimize concept drifting damage. (2022)


Benchmarks


NAB
The Numenta Anomaly Benchmark, benchmark for evaluating algorithms for anomaly detection in streaming, real-time applications. (2014–2018)


Meetings

*2014 ** [] Special Session on "Concept Drift, Domain Adaptation & Learning in Dynamic Environments" @IEEE IJCNN 2014 *2013 *
RealStream
Real-World Challenges for Data Stream Mining Workshop-Discussion at the
ECML PKDD ECML PKDD, the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, is one of the leading academic conferences on machine learning and knowledge discovery, held in Europe every year. History ECM ...
2013, Prague, Czech Republic. *
LEAPS 2013
The 1st International Workshop on Learning stratEgies and dAta Processing in nonStationary environments *2011 *

Special Session on Learning in evolving environments and its application on real-world problems at ICMLA'11 *
HaCDAIS 2011
The 2nd International Workshop on Handling Concept Drift in Adaptive Information Systems *
ICAIS 2011
Track on Incremental Learning *
IJCNN 2011
Special Session on Concept Drift and Learning Dynamic Environments *

Symposium on Computational Intelligence in Dynamic and Uncertain Environments *2010 *
HaCDAIS 2010
International Workshop on Handling Concept Drift in Adaptive Information Systems: Importance, Challenges and Solutions *

Special Session on Dynamic learning in non-stationary environments *
SAC 2010
Data Streams Track at ACM Symposium on Applied Computing *
SensorKDD 2010
International Workshop on Knowledge Discovery from Sensor Data *
StreamKDD 2010
Novel Data Stream Pattern Mining Techniques ** Concept Drift and Learning in Nonstationary Environments a
IEEE World Congress on Computational Intelligence
*
MLMDS’2010
Special Session on Machine Learning Methods for Data Streams at the 10th International Conference on Intelligent Design and Applications, ISDA’10


References

{{reflist Data mining Machine learning Data analysis