decision tree learning Decision tree learning is a supervised learning approach used in statistics, data mining and machine learning. In this formalism, a classification or regression decision tree is used as a predictive model to draw conclusions about a set of ob ...

, ID3 (Iterative Dichotomiser 3) is an

algorithm In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific problems or to perform a computation. Algorithms are used as specifications for performing ...

invented by

Ross Quinlan John Ross Quinlan is a computer science researcher in data mining and decision theory. He has contributed extensively to the development of decision tree algorithms, including inventing the canonical C4.5 and ID3 algorithms. He also contributed to ...

Quinlan, J. R. 1986. Induction of Decision Trees. Mach. Learn. 1, 1 (Mar. 1986), 81–106 used to generate a

decision tree A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains co ...

from a dataset. ID3 is the precursor to the

C4.5 algorithm C4.5 is an algorithm used to generate a decision tree developed by Ross Quinlan. C4.5 is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason, C4.5 is often referr ...

, and is typically used in the

machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...

and

natural language processing Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to proc ...

domains.

Algorithm

The ID3 algorithm begins with the original set

S

as the root node. On each

iteration Iteration is the repetition of a process in order to generate a (possibly unbounded) sequence of outcomes. Each repetition of the process is a single iteration, and the outcome of each iteration is then the starting point of the next iteration. ...

of the algorithm, it iterates through every unused attribute of the set

S

and calculates the

entropy Entropy is a scientific concept, as well as a measurable physical property, that is most commonly associated with a state of disorder, randomness, or uncertainty. The term and the concept are used in diverse fields, from classical thermodyna ...

\Eta

or the

information gain Information is an abstract concept that refers to that which has the power to inform. At the most fundamental level information pertains to the interpretation of that which may be sensed. Any natural process that is not completely random, ...

IG(S)

of that attribute. It then selects the attribute which has the smallest entropy (or largest information gain) value. The set

S

is then split or partitioned by the selected attribute to produce subsets of the data. (For example, a node can be split into child nodes based upon the subsets of the population whose ages are less than 50, between 50 and 100, and greater than 100.) The algorithm continues to recurse on each subset, considering only attributes never selected before. Recursion on a subset may stop in one of these cases: * every element in the subset belongs to the same class; in which case the node is turned into a leaf node and labelled with the class of the examples. * there are no more attributes to be selected, but the examples still do not belong to the same class. In this case, the node is made a leaf node and labelled with the most common class of the examples in the subset. * there are no examples in the subset, which happens when no example in the parent set was found to match a specific value of the selected attribute. An example could be the absence of a person among the

population Population typically refers to the number of people in a single area, whether it be a city or town, region, country, continent, or the world. Governments typically quantify the size of the resident population within their jurisdiction using ...

with age over 100 years. Then a leaf node is created and labelled with the most common class of the examples in the parent node's set. Throughout the algorithm, the decision tree is constructed with each non-terminal node (

internal node In computer science, a tree is a widely used abstract data type that represents a hierarchical tree structure with a set of connected nodes. Each node in the tree can be connected to many children (depending on the type of tree), but must be con ...

) representing the selected attribute on which the data was split, and terminal nodes (leaf nodes) representing the class label of the final subset of this branch.

Summary

# Calculate the

of every attribute

a

of the data set

S

. # Partition ("split") the set

S

into

subset In mathematics, set ''A'' is a subset of a set ''B'' if all elements of ''A'' are also elements of ''B''; ''B'' is then a superset of ''A''. It is possible for ''A'' and ''B'' to be equal; if they are unequal, then ''A'' is a proper subset o ...

s using the attribute for which the resulting entropy after splitting is minimized; or, equivalently, information gain is

maximum In mathematical analysis, the maxima and minima (the respective plurals of maximum and minimum) of a function, known collectively as extrema (the plural of extremum), are the largest and smallest value of the function, either within a given r ...

. # Make a decision tree

node In general, a node is a localized swelling (a "knot") or a point of intersection (a vertex). Node may refer to: In mathematics * Vertex (graph theory), a vertex in a mathematical graph * Vertex (geometry), a point where two or more curves, line ...

containing that attribute. # Recurse on subsets using the remaining attributes.

Pseudocode

ID3 (Examples, Target_Attribute, Attributes) Create a root node for the tree If all examples are positive, Return the single-node tree Root, with label = +. If all examples are negative, Return the single-node tree Root, with label = -. If number of predicting attributes is empty, then Return the single node tree Root, with label = most common value of the target attribute in the examples. Otherwise Begin A ← The Attribute that best classifies examples. Decision Tree attribute for Root = A. For each possible value, , of A, Add a new tree branch below Root, corresponding to the test A = . Let Examples() be the subset of examples that have the value for A If Examples() is empty Then below this new branch add a leaf node with label = most common target value in the examples Else below this new branch add the subtree ID3 (Examples(), Target_Attribute, Attributes – ) End Return Root

Properties

ID3 does not guarantee an optimal solution. It can converge upon

local optima In applied mathematics and computer science, a local optimum of an optimization problem is a solution that is optimal (either maximal or minimal) within a neighboring set of candidate solutions. This is in contrast to a global optimum, which i ...

. It uses a greedy strategy by selecting the locally best attribute to split the dataset on each iteration. The algorithm's optimality can be improved by using

backtracking Backtracking is a class of algorithms for finding solutions to some computational problems, notably constraint satisfaction problems, that incrementally builds candidates to the solutions, and abandons a candidate ("backtracks") as soon as it de ...

during the search for the optimal decision tree at the cost of possibly taking longer. ID3 can

overfit mathematical modeling, overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit to additional data or predict future observations reliably". An overfitt ...

the training data. To avoid overfitting, smaller decision trees should be preferred over larger ones. This algorithm usually produces small trees, but it does not always produce the smallest possible decision tree. ID3 is harder to use on continuous data than on factored data (factored data has a discrete number of possible values, thus reducing the possible branch points). If the values of any given attribute are

continuous Continuity or continuous may refer to: Mathematics * Continuity (mathematics), the opposing concept to discreteness; common examples include ** Continuous probability distribution or random variable in probability and statistics ** Continuous g ...

, then there are many more places to split the data on this attribute, and searching for the best value to split by can be time-consuming.

Usage

The ID3 algorithm is used by training on a

data set A data set (or dataset) is a collection of data. In the case of tabular data, a data set corresponds to one or more database tables, where every column of a table represents a particular variable, and each row corresponds to a given record of the d ...

S

to produce a

which is stored in

memory Memory is the faculty of the mind by which data or information is encoded, stored, and retrieved when needed. It is the retention of information over time for the purpose of influencing future action. If past events could not be remembered ...

. At runtime, this decision tree is used to

classify Classification is a process related to categorization, the process in which ideas and objects are recognized, differentiated and understood. Classification is the grouping of related facts into classes. It may also refer to: Business, organizat ...

new test cases (

feature vector In machine learning and pattern recognition, a feature is an individual measurable property or characteristic of a phenomenon. Choosing informative, discriminating and independent features is a crucial element of effective algorithms in pattern ...

s) by traversing the decision tree using the features of the datum to arrive at a leaf node.

The ID3 metrics

Entropy

Entropy Entropy is a scientific concept, as well as a measurable physical property, that is most commonly associated with a state of disorder, randomness, or uncertainty. The term and the concept are used in diverse fields, from classical thermodyna ...

\Eta

is a measure of the amount of uncertainty in the (data) set

S

(i.e. entropy characterizes the (data) set

S

). :

\Eta = \sum_

Where, *

S

– The current dataset for which entropy is being calculated **This changes at each step of the ID3 algorithm, either to a subset of the previous set in the case of splitting on an attribute or to a "sibling" partition of the parent in case the recursion terminated previously. *

X

– The set of classes in

S

p(x)

– The proportion of the number of elements in class

x

to the number of elements in set

S

When

\Eta = 0

, the set

S

is perfectly classified (i.e. all elements in

S

are of the same class). In ID3, entropy is calculated for each remaining attribute. The attribute with the smallest entropy is used to split the set

S

on this iteration. Entropy in

information theory Information theory is the scientific study of the quantification, storage, and communication of information. The field was originally established by the works of Harry Nyquist and Ralph Hartley, in the 1920s, and Claude Shannon in the 1940s. ...

measures how much information is expected to be gained upon

measuring Measurement is the quantification of attributes of an object or event, which can be used to compare with other objects or events. In other words, measurement is a process of determining how large or small a physical quantity is as compared ...

random variable A random variable (also called random quantity, aleatory variable, or stochastic variable) is a mathematical formalization of a quantity or object which depends on random events. It is a mapping or a function from possible outcomes (e.g., the p ...

; as such, it can also be used to quantify the amount to which the

distribution Distribution may refer to: Mathematics *Distribution (mathematics), generalized functions used to formulate solutions of partial differential equations *Probability distribution, the probability of a particular value or value range of a varia ...

of the quantity's values is unknown. A constant quantity has zero entropy, as its distribution is perfectly known. In contrast, a uniformly distributed random variable ( discretely or continuously uniform) maximizes entropy. Therefore, the greater the entropy at a node, the less information is known about the classification of data at this stage of the tree; and therefore, the greater the potential to improve the classification here. As such, ID3 is a greedy

heuristic A heuristic (; ), or heuristic technique, is any approach to problem solving or self-discovery that employs a practical method that is not guaranteed to be optimal, perfect, or rational, but is nevertheless sufficient for reaching an immediat ...

performing a best-first search for

locally optimal In applied mathematics and computer science, a local optimum of an optimization problem is a solution that is optimal (either maximal or minimal) within a neighboring set of candidate solutions. This is in contrast to a global optimum, which ...

entropy values. Its accuracy can be improved by preprocessing the data.

Information gain

Information gain Information is an abstract concept that refers to that which has the power to inform. At the most fundamental level information pertains to the interpretation of that which may be sensed. Any natural process that is not completely random, ...

IG(A)

is the measure of the difference in entropy from before to after the set

S

is split on an attribute

A

. In other words, how much uncertainty in

S

was reduced after splitting set

S

on attribute

A

. :

IG(S, A) = \Eta - \sum_ p(t)\Eta
    = \Eta - \Eta.

Where, *

\Eta(S)

– Entropy of set

S

T

– The subsets created from splitting set

S

by attribute

A

such that

S = \bigcup_ t

p(t)

– The proportion of the number of elements in

t

to the number of elements in set

S

\Eta(t)

– Entropy of subset

t

In ID3, information gain can be calculated (instead of entropy) for each remaining attribute. The attribute with the largest information gain is used to split the set

S

on this iteration.

References

External links

* Seminars �
http://www2.cs.uregina.ca/
* Description and examples �

* Description and examples �
Decision Trees and Political Party Classification
Decision trees Classification algorithms Articles with example pseudocode