Evolutionary data mining, or genetic data mining is an
umbrella term
Hypernymy and hyponymy are the wikt:Wiktionary:Semantic relations, semantic relations between a generic term (''hypernym'') and a more specific term (''hyponym''). The hypernym is also called a ''supertype'', ''umbrella term'', or ''blanket term ...
for any
data mining
Data mining is the process of extracting and finding patterns in massive data sets involving methods at the intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary subfield of computer science and ...
using
evolutionary algorithm
Evolutionary algorithms (EA) reproduce essential elements of the biological evolution in a computer algorithm in order to solve "difficult" problems, at least Approximation, approximately, for which no exact or satisfactory solution methods are k ...
s. While it can be used for mining data from
DNA sequence
A nucleic acid sequence is a succession of bases within the nucleotides forming alleles within a DNA (using GACT) or RNA (GACU) molecule. This succession is denoted by a series of a set of five different letters that indicate the order of the nu ...
s,
[Wai-Ho Au, Keith C. C. Chan, and Xin Yao]
"A Novel Evolutionary Data Mining Algorithm With Applications to Churn Prediction"
''IEEE
The Institute of Electrical and Electronics Engineers (IEEE) is an American 501(c)(3) organization, 501(c)(3) public charity professional organization for electrical engineering, electronics engineering, and other related disciplines.
The IEEE ...
'', retrieved on 2008-12-4. it is not limited to biological contexts and can be used in any classification-based prediction scenario, which helps "predict the value ... of a user-specified goal attribute based on the values of other attributes."
[Freitas, Alex A]
"A Survey of Evolutionary Algorithms for Data Mining and Knowledge Discovery"
'' Pontifícia Universidade Católica do Paraná'', Retrieved on 2008-12-4. For instance, a banking institution might want to predict whether a customer's
credit
Credit (from Latin verb ''credit'', meaning "one believes") is the trust which allows one party to provide money or resources to another party wherein the second party does not reimburse the first party immediately (thereby generating a debt) ...
would be "good" or "bad" based on their age, income and current savings.
[ Evolutionary algorithms for data mining work by creating a series of ]random
In common usage, randomness is the apparent or actual lack of definite pattern or predictability in information. A random sequence of events, symbols or steps often has no order and does not follow an intelligible pattern or combination. ...
rules to be checked against a training dataset
A data set (or dataset) is a collection of data. In the case of tabular data, a data set corresponds to one or more database tables, where every column of a table represents a particular variable, and each row corresponds to a given record o ...
. The rules which most closely fit the data are selected and are mutated
In biology, a mutation is an alteration in the nucleic acid sequence of the genome of an organism, virus, or extrachromosomal DNA. Viral genomes contain either DNA or RNA. Mutations result from errors during DNA replication, DNA or viral rep ...
. The process is iterated many times and eventually, a rule will arise that approaches 100% similarity with the training data. This rule is then checked against a test dataset, which was previously invisible to the genetic algorithm.
Process
Data preparation
Before database
In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software that interacts with end users, applications, and the database itself to capture and a ...
s can be mined for data using evolutionary algorithms, it first has to be cleaned, which means incomplete, noisy or inconsistent data should be repaired. It is imperative that this be done before the mining takes place, as it will help the algorithms produce more accurate results.[Jiawei Han, Micheline Kamber ''Data Mining: Concepts and Techniques'' (2006), ]Morgan Kaufmann
Morgan Kaufmann Publishers is a Burlington, Massachusetts (San Francisco, California until 2008) based publisher specializing in computer science and engineering content.
Since 1984, Morgan Kaufmann has been publishing contents on information te ...
,
If data comes from more than one database, they can be integrated, or combined, at this point. When dealing with large datasets, it might be beneficial to also reduce the amount of data being handled.[ One common method of data reduction works by getting a normalized sample of data from the database, resulting in much faster, yet statistically equivalent results.][
At this point, the data is split into two equal but mutually exclusive elements, a test and a training dataset.] The training dataset will be used to let rules evolve which match it closely. The test dataset will then either confirm or deny these rules.
Data mining
Evolutionary algorithms work by trying to emulate natural evolution
Evolution is the change in the heritable Phenotypic trait, characteristics of biological populations over successive generations. It occurs when evolutionary processes such as natural selection and genetic drift act on genetic variation, re ...
. First, a random series of "rules" are set on the training dataset, which try to generalize the data into formulas. The rules are checked, and the ones that fit the data best are kept, the rules that do not fit the data are discarded. The rules that were kept are then mutated, and multiplied to create new rules.
This process iterates as necessary in order to produce a rule that matches the dataset as closely as possible. When this rule is obtained, it is then checked against the test dataset. If the rule still matches the data, then the rule is valid and is kept. If it does not match the data, then it is discarded and the process begins by selecting random rules again.
See also
*Data mining
Data mining is the process of extracting and finding patterns in massive data sets involving methods at the intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary subfield of computer science and ...
*Evolutionary algorithm
Evolutionary algorithms (EA) reproduce essential elements of the biological evolution in a computer algorithm in order to solve "difficult" problems, at least Approximation, approximately, for which no exact or satisfactory solution methods are k ...
*Knowledge discovery
Knowledge extraction is the creation of knowledge from structured ( relational databases, XML) and unstructured (text, documents, images) sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must ...
* Pattern mining
*Data analysis
Data analysis is the process of inspecting, Data cleansing, cleansing, Data transformation, transforming, and Data modeling, modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Da ...
References
{{Evolutionary computation
Data mining