Data stream mining (also known as stream learning) is the process of extracting knowledge structures from continuous, rapid data records. A data stream is an ordered sequence of instances that, in many applications of data stream mining, can be read only once or a small number of times using limited computing and storage capabilities.
In many data stream mining applications, the goal is to predict the class or value of new instances in the data stream given some knowledge about the class membership or values of previous instances in the data stream.
Machine learning techniques can be used to learn this prediction task from labeled examples in an automated fashion.
Often, concepts from the field of incremental learning are applied to cope with structural changes, on-line learning and real-time demands.
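The one-pass, bounded-memory setting described above can be illustrated with a minimal sketch. The class `IncrementalCentroidClassifier` and the helper `run_stream` are hypothetical names introduced here for illustration; they are not taken from any particular library. The learner keeps only a running mean per class, so each instance can be discarded after it has been seen once.

```python
from collections import defaultdict


class IncrementalCentroidClassifier:
    """Nearest-centroid classifier updated one instance at a time.

    Hypothetical illustration: memory grows with the number of classes
    and features, not with the number of instances seen.
    """

    def __init__(self):
        self.counts = defaultdict(int)   # instances seen per class
        self.centroids = {}              # running mean vector per class

    def predict(self, x):
        """Return the label of the closest centroid, or None before any training."""
        if not self.centroids:
            return None

        def squared_distance(label):
            return sum((a - b) ** 2 for a, b in zip(x, self.centroids[label]))

        return min(self.centroids, key=squared_distance)

    def learn_one(self, x, y):
        """Update the running mean of class y with instance x; x is not stored."""
        self.counts[y] += 1
        n = self.counts[y]
        if y not in self.centroids:
            self.centroids[y] = list(x)
            return
        centroid = self.centroids[y]
        for i, value in enumerate(x):
            centroid[i] += (value - centroid[i]) / n


def run_stream(stream, model):
    """Predict each instance before learning from it; return the predictions."""
    predictions = []
    for x, y in stream:      # a single pass over the stream
        predictions.append(model.predict(x))
        model.learn_one(x, y)
    return predictions


# A tiny made-up stream of (features, label) pairs.
stream = [((0.0, 0.0), "a"), ((10.0, 10.0), "b"), ((1.0, 1.0), "a"), ((9.0, 9.0), "b")]
model = IncrementalCentroidClassifier()
predictions = run_stream(iter(stream), model)
```

The stream is wrapped in `iter` to emphasise that it is consumed exactly once; the model never revisits earlier instances.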
In many applications, especially those operating within non-stationary environments, the distribution underlying the instances or the rules underlying their labeling may change over time, i.e. the goal of the prediction, the class to be predicted or the target value to be predicted, may change over time. This problem is referred to as concept drift. Detecting concept drift is a central issue in data stream mining. Other challenges that arise when applying machine learning to streaming data include partially and delayed labeled data, recovery from concept drifts, and temporal dependencies.
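One simple way to make drift detection concrete is to compare a model's recent error rate with its earlier error rate. The sketch below is an assumed, simplified scheme and not one of the published detectors; the class name `WindowDriftDetector` and the parameters `window_size` and `threshold` are hypothetical choices made for illustration.

```python
from collections import deque


class WindowDriftDetector:
    """Signals drift when the recent error rate exceeds the older error rate by a margin.

    Hypothetical illustration; window_size and threshold are assumed parameters.
    """

    def __init__(self, window_size=50, threshold=0.3):
        self.window_size = window_size
        self.threshold = threshold
        self.reference = deque(maxlen=window_size)  # older errors
        self.recent = deque(maxlen=window_size)     # newest errors

    def update(self, error):
        """Add one 0/1 prediction error; return True if drift is signalled."""
        if len(self.recent) == self.window_size:
            # The oldest recent error moves into the reference window.
            self.reference.append(self.recent[0])
        self.recent.append(error)
        if len(self.reference) < self.window_size:
            return False  # not enough history yet
        reference_rate = sum(self.reference) / self.window_size
        recent_rate = sum(self.recent) / self.window_size
        if recent_rate - reference_rate > self.threshold:
            # Restart both windows so that monitoring begins afresh after the drift.
            self.reference.clear()
            self.recent.clear()
            return True
        return False


# Simulated error sequence: a model that is always right, then always wrong.
errors = [0] * 200 + [1] * 100
detector = WindowDriftDetector(window_size=50, threshold=0.3)
drift_points = [i for i, e in enumerate(errors) if detector.update(e)]
```

In practice a detection would typically trigger some recovery step, such as retraining or resetting the model; that step is outside the scope of this sketch.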
Examples of data streams include computer network traffic, phone conversations, ATM transactions, web searches, and sensor data.
Data stream mining can be considered a subfield of data mining, machine learning, and knowledge discovery.
Software for data stream mining
* MOA (Massive Online Analysis): free open-source software, developed in Java, specifically for mining data streams with concept drift. It includes several machine learning algorithms (classification, regression, clustering, outlier detection and recommender systems). It also contains a prequential evaluation method, the EDDM concept drift methods, a reader for real datasets in ARFF format, and artificial stream generators such as SEA concepts, STAGGER, rotating hyperplane, random tree, and random radial basis functions. MOA supports bi-directional interaction with Weka.
* scikit-multiflow: a machine learning framework for multi-output/multi-label and stream data, implemented in Python. scikit-multiflow contains stream generators, stream learning methods for single-target and multi-target problems, concept drift detectors, and evaluation and visualisation methods. (This software is discontinued.)
* StreamDM: an open-source framework for big data stream mining that uses the Spark Streaming extension of the core Spark API. One advantage of StreamDM in comparison to existing frameworks is that it directly benefits from the Spark Streaming API, which handles many of the complex problems of the underlying data sources, such as out-of-order data and recovery from failures.
* RapidMiner: commercial software for knowledge discovery, data mining, and machine learning, also featuring data stream mining, learning time-varying concepts, and tracking drifting concepts (if used in combination with its data stream mining plugin, formerly the Concept Drift plugin).
* RiverML: River is a Python library for online machine learning. It is the result of a merger between creme and scikit-multiflow. River's ambition is to be the go-to library for doing machine learning on streaming data.
* GAENARI: a C++ incremental decision tree. It continuously executes inserts and updates of chunked data sets, and supports rebuilding the tree to address concept drift.
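Two ideas mentioned in the list above, artificial stream generators with concept drift and prequential (test-then-train) evaluation, can be sketched in a few lines. This is not MOA's implementation or API. The generator `sea_like_stream` is only loosely modelled on the SEA concepts idea (three uniform attributes, a label determined by whether the sum of the first two is at most a threshold, and a threshold that changes between blocks); the threshold values, the `MajorityClassBaseline` learner and all function names are assumptions made for illustration.

```python
import random


def sea_like_stream(n_per_concept, thresholds, seed=0):
    """Yield (x, y) pairs; the labelling threshold changes after each block (abrupt drift)."""
    rng = random.Random(seed)
    for theta in thresholds:
        for _ in range(n_per_concept):
            x = tuple(rng.uniform(0.0, 10.0) for _ in range(3))
            y = int(x[0] + x[1] <= theta)   # the third attribute is irrelevant
            yield x, y


class MajorityClassBaseline:
    """Predicts the most frequent label seen so far (hypothetical baseline learner)."""

    def __init__(self):
        self.counts = {}

    def predict(self, x):
        if not self.counts:
            return None
        return max(self.counts, key=self.counts.get)

    def learn_one(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1


def prequential_accuracy(stream, model):
    """Test-then-train: each instance is used for testing first, then for training."""
    correct = 0
    total = 0
    for x, y in stream:
        if model.predict(x) == y:
            correct += 1
        total += 1
        model.learn_one(x, y)
    return correct / total if total else 0.0


stream = sea_like_stream(n_per_concept=500, thresholds=[8.0, 9.0, 7.0, 9.5])
accuracy = prequential_accuracy(stream, MajorityClassBaseline())
print(f"prequential accuracy: {accuracy:.3f}")
```

Because every instance is tested before it is learned from, no separate hold-out set is needed, which suits streams that can be read only once. Any learner exposing `predict` and `learn_one` could be substituted for the baseline.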
Events
* International Workshop on Ubiquitous Data Mining, held in conjunction with the International Joint Conference on Artificial Intelligence (IJCAI) in Beijing, China, August 3–5, 2013.
* International Workshop on Knowledge Discovery from Ubiquitous Data Streams, held in conjunction with the 18th European Conference on Machine Learning (ECML) and the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD) in Warsaw, Poland, in September 2007.
* ACM Symposium on Applied Computing Data Streams Track, held in conjunction with the 2007 ACM Symposium on Applied Computing (SAC-2007) in Seoul, Korea, in March 2007.
* IEEE International Workshop on Mining Evolving and Streaming Data (IWMESD 2006), held in conjunction with the 2006 IEEE International Conference on Data Mining (ICDM-2006) in Hong Kong, in December 2006.
* Fourth International Workshop on Knowledge Discovery from Data Streams (IWKDDS), held in conjunction with the 17th European Conference on Machine Learning (ECML) and the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD) (ECML/PKDD-2006) in Berlin, Germany, in September 2006.
See also
* Concept drift
* Data mining
* Sequence mining
* Streaming algorithm
* Stream processing
* Wireless sensor network
* Lambda architecture