C4.5 is an algorithm developed by Ross Quinlan that generates a decision tree. It is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason C4.5 is often referred to as a statistical classifier. In 2011, the authors of the Weka machine learning software described the C4.5 algorithm as "a landmark decision tree program that is probably the machine learning workhorse most widely used in practice to date". Its popularity grew after it ranked #1 in the pre-eminent paper ''Top 10 Algorithms in Data Mining'', published by Springer LNCS in 2008.

Algorithm

C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. The training data is a set S = {s_1, s_2, ...} of already classified samples. Each sample s_i consists of a p-dimensional vector (x_{1,i}, x_{2,i}, ..., x_{p,i}), where the x_{j,i} represent attribute values or features of the sample, as well as the class in which s_i falls.

At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. The splitting criterion is the normalized information gain (difference in entropy). The attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recurses on the partitioned sublists.

This algorithm has a few base cases.
* All the samples in the list belong to the same class. When this happens, it simply creates a leaf node for the decision tree saying to choose that class.
* None of the features provide any information gain. In this case, C4.5 creates a decision node higher up the tree using the expected value of the class.
* An instance of a previously unseen class is encountered. Again, C4.5 creates a decision node higher up the tree using the expected value.
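The splitting criterion described above can be illustrated with a minimal Python sketch. It assumes a single discrete attribute and takes "normalized information gain" to mean the gain ratio (information gain divided by the entropy of the attribute values themselves); the function names `entropy` and `gain_ratio` are illustrative and do not come from Quinlan's implementation.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of a list of hashable items."""
    total = len(items)
    return -sum((n / total) * log2(n / total) for n in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain from splitting on a discrete attribute,
    normalized by the split information (entropy of the attribute values)."""
    total = len(labels)
    # Group the class labels by the attribute value of each sample.
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Information gain: entropy before the split minus weighted entropy after it.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information penalizes attributes with many distinct values.
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0
```

For example, an attribute whose two values separate the two classes perfectly gets a ratio of 1, while an attribute whose values are unrelated to the class gets 0.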

Pseudocode

In pseudocode, the general algorithm for building decision trees is:
# Check for the above base cases.
# For each attribute ''a'', find the normalized information gain ratio from splitting on ''a''.
# Let ''a_best'' be the attribute with the highest normalized information gain ratio.
# Create a decision ''node'' that splits on ''a_best''.
# Recurse on the sublists obtained by splitting on ''a_best'', and add those nodes as children of ''node''.
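The steps above can be sketched in Python as follows. This is a simplified illustration rather than Quinlan's implementation: it assumes discrete attributes only, stores samples as dictionaries, performs no pruning, and replaces the "expected value" fallback with a majority-class leaf. The names `build_tree` and `classify` and the dictionary-based tree layout are assumptions made for this example.

```python
from collections import Counter
from math import log2


def entropy(items):
    total = len(items)
    return -sum((n / total) * log2(n / total) for n in Counter(items).values())


def gain_ratio(rows, labels, attr):
    values = [row[attr] for row in rows]
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    split_info = entropy(values)
    return (entropy(labels) - remainder) / split_info if split_info > 0 else 0.0


def build_tree(rows, labels, attrs):
    majority = Counter(labels).most_common(1)[0][0]
    # Step 1, base case: every sample has the same class.
    if len(set(labels)) == 1:
        return {"leaf": majority}
    # Step 2: score each remaining attribute by its gain ratio.
    scores = {a: gain_ratio(rows, labels, a) for a in attrs}
    # Step 3: pick the best attribute; stop if none is informative
    # (simplification: fall back to a majority-class leaf).
    best = max(scores, key=scores.get) if scores else None
    if best is None or scores[best] <= 1e-12:
        return {"leaf": majority}
    # Step 4: create a decision node that splits on the best attribute.
    node = {"attr": best, "default": majority, "children": {}}
    rest = [a for a in attrs if a != best]
    # Step 5: recurse on each sublist and attach the results as children.
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx], rest
        )
    return node


def classify(tree, row):
    """Walk the tree; unseen attribute values fall back to the node's majority class."""
    while "leaf" not in tree:
        child = tree["children"].get(row[tree["attr"]])
        if child is None:
            return tree["default"]
        tree = child
    return tree["leaf"]
```

Calling `build_tree(rows, labels, ["outlook", "windy"])` on a small list of dictionaries returns a nested dictionary, which `classify` then walks from the root to a leaf.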

Implementations

J48 is an open source Java implementation of the C4.5 algorithm in the Weka data mining tool.

Improvements from ID3 algorithm

C4.5 made a number of improvements to ID3. Some of these are:
* Handling both continuous and discrete attributes - In order to handle continuous attributes, C4.5 creates a threshold and then splits the list into those whose attribute value is above the threshold and those that are less than or equal to it.
* Handling training data with missing attribute values - C4.5 allows attribute values to be marked as ? for missing. Missing attribute values are simply not used in gain and entropy calculations.
* Handling attributes with differing costs.
* Pruning trees after creation - C4.5 goes back through the tree once it has been created and attempts to remove branches that do not help by replacing them with leaf nodes.
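The handling of continuous attributes and missing values can be sketched in Python as below. The sketch makes several assumptions that the text does not specify: candidate thresholds are taken as midpoints between consecutive distinct values, candidates are ranked by plain information gain rather than gain ratio, and samples marked `?` are simply dropped before the calculation. The function name `best_threshold` is hypothetical.

```python
from collections import Counter
from math import log2

MISSING = "?"  # marker for a missing attribute value, as described above


def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) for the binary split
    value <= threshold versus value > threshold."""
    # Samples with a missing value take no part in the gain calculation.
    known = sorted((v, y) for v, y in zip(values, labels) if v != MISSING)
    known_labels = [y for _, y in known]
    base = entropy(known_labels) if known else 0.0
    best = (None, 0.0)
    for i in range(1, len(known)):
        low, high = known[i - 1][0], known[i][0]
        if low == high:
            continue  # no boundary between equal values
        # Assumption: the candidate threshold is the midpoint of the two values.
        threshold = (low + high) / 2
        left, right = known_labels[:i], known_labels[i:]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(known)
        gain = base - remainder
        if gain > best[1]:
            best = (threshold, gain)
    return best
```

If no split improves on the unsplit entropy, or if every value is missing, the function returns `(None, 0.0)`.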

Improvements in C5.0/See5 algorithm

Quinlan went on to create C5.0 and See5 (C5.0 for Unix/Linux, See5 for Windows), which he markets commercially. C5.0 offers a number of improvements on C4.5. Some of these are (M. Kuhn and K. Johnson, ''Applied Predictive Modeling'', Springer, 2013):
* Speed - C5.0 is significantly faster than C4.5 (several orders of magnitude).
* Memory usage - C5.0 is more memory efficient than C4.5.
* Smaller decision trees - C5.0 gets similar results to C4.5 with considerably smaller decision trees.
* Support for boosting - Boosting improves the trees and gives them more accuracy.
* Weighting - C5.0 allows you to weight different cases and misclassification types.
* Winnowing - A C5.0 option automatically winnows the attributes to remove those that may be unhelpful.

Source for a single-threaded Linux version of C5.0 is available under the GNU General Public License (GPL).

External links

* Original implementation on Ross Quinlan's homepage: http://www.rulequest.com/Personal/
