C4.5 Algorithm in Data Mining

By: Prof. Fazal Rehman Shamil
Last modified on July 12th, 2020

The C4.5 algorithm is a well-known algorithm in Data Mining. It generates a decision tree from a sample of data, and the resulting tree acts as a Decision Tree Classifier that can be used to make decisions about new data.

C4.5 is given a set of data representing things that are already classified.

The decision trees generated by the C4.5 algorithm can be used to classify a dataset, which is the main reason why C4.5 is also known as a statistical classifier.

So, before studying the C4.5 algorithm, you should review Decision Trees and how Decision Trees can be used as classifiers in data mining.

Example of Decision Trees

What is a classifier in data mining?

A classifier is a piece of code in data mining that takes data as input and tries its best to predict which class new data belongs to.

Example of a classifier in Data Mining

Suppose a dataset contains patient data. Let’s say that we know various things about each patient, such as age, heartbeat rate, blood pressure, and family history. Here, age, heartbeat rate, blood pressure, and family history are called attributes.
Now, with the help of these attributes, we would like to predict whether the patient will be a victim of Hepatitis or not. In this case, the patient can fall under one of the following two classes:

Class 1: The patient will be a victim of Hepatitis.

Class 2: The patient will not be a victim of Hepatitis.

The C4.5 algorithm can help us to predict the class for every patient.
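
To make the two classes concrete, here is a minimal Python sketch of how a finished decision tree could assign a class to one patient. The tree `toy_tree`, the attribute names `family_history` and `blood_pressure`, and their values are all hypothetical and chosen only for illustration. They were not learned from real data and carry no medical meaning.

```python
def classify(tree, patient):
    """Walk down the tree until a leaf (a class label string) is reached."""
    while isinstance(tree, dict):
        value = patient[tree["attribute"]]   # look up the patient's value for the tested attribute
        tree = tree["children"][value]       # follow the matching branch
    return tree

# Hypothetical hand-written tree (assumption: NOT produced by C4.5 on real data).
toy_tree = {
    "attribute": "family_history",
    "children": {
        "no": "Class 2",
        "yes": {
            "attribute": "blood_pressure",
            "children": {"high": "Class 1", "normal": "Class 2"},
        },
    },
}

patient = {"family_history": "yes", "blood_pressure": "high"}
predicted = classify(toy_tree, patient)
print(predicted)
```

In practice the tree would not be written by hand. C4.5 would build it from patients whose class is already known.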

Pseudocode of C4.5 algorithm

Let’s see the Pseudocode of the C4.5 algorithm in data mining.

  1. First, check for the base cases (for example, all samples belong to the same class, or no attribute provides any information gain).
  2. For each attribute X, find the normalized information gain (gain ratio) obtained by splitting on X.
  3. Let X be the attribute with the highest normalized information gain.
  4. Create a decision node that splits on attribute X.
  5. Repeat the procedure on the sublists obtained by splitting on attribute X, and add the resulting nodes as children of the decision node.
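
The steps above can be sketched in Python roughly as follows. This is a simplified illustration, not Quinlan's full C4.5: it assumes discrete attributes only, skips pruning and missing values, and uses hypothetical names (`entropy`, `gain_ratio`, `build_tree`) and a made-up four-row dataset.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain_ratio(rows, labels, attr):
    """Information gain of splitting on attr, normalized by the split information."""
    total = len(rows)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    gain = entropy(labels) - remainder
    split_info = -sum(len(p) / total * math.log2(len(p) / total) for p in partitions.values())
    return gain / split_info if split_info > 0 else 0.0

def majority(labels):
    """Most common class label, used when no further split is possible."""
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, attrs):
    # Step 1: base cases (one class only, or no attributes left to test).
    if len(set(labels)) == 1:
        return labels[0]
    if not attrs:
        return majority(labels)
    # Steps 2 and 3: pick the attribute with the highest gain ratio.
    best = max(attrs, key=lambda a: gain_ratio(rows, labels, a))
    if gain_ratio(rows, labels, best) == 0.0:
        return majority(labels)  # base case: no attribute is informative
    # Step 4: create a decision node that splits on the chosen attribute.
    node = {"attribute": best, "children": {}}
    remaining = [a for a in attrs if a != best]
    # Step 5: recurse on each sublist and attach the results as children.
    for value in sorted(set(row[best] for row in rows)):
        sub_rows = [r for r in rows if r[best] == value]
        sub_labels = [l for r, l in zip(rows, labels) if r[best] == value]
        node["children"][value] = build_tree(sub_rows, sub_labels, remaining)
    return node

# Made-up training data (assumption: purely illustrative values).
rows = [
    {"age": "young", "bp": "high"},
    {"age": "young", "bp": "normal"},
    {"age": "old", "bp": "high"},
    {"age": "old", "bp": "normal"},
]
labels = ["yes", "no", "yes", "no"]

tree = build_tree(rows, labels, ["age", "bp"])
print(tree)
```

In this toy dataset the `bp` attribute separates the two classes perfectly while `age` tells us nothing, so the sketch splits on `bp` at the root and stops.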

Advantages of C4.5 over other Decision Tree systems

  1. C4.5 helps to mitigate overfitting because it inherently employs a single-pass pruning process.
  2. C4.5 can work with both discrete data and continuous data.
  3. C4.5 can handle incomplete data, that is, records with missing attribute values.
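
To illustrate the second point, a continuous attribute is usually handled by choosing a threshold and splitting the data into "less than or equal" and "greater than" groups. The following Python sketch is a simplified version of that idea, assuming that candidate thresholds are the midpoints between consecutive distinct sorted values and that the best one is picked by information gain. The function name `best_threshold` and the heartbeat values are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, gain) for the binary split with the highest information gain."""
    pairs = sorted(zip(values, labels))
    total = len(pairs)
    base = entropy(labels)
    best_gain, best_t = 0.0, None
    for i in range(1, total):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no threshold fits between two equal values
        t = (low + high) / 2  # candidate threshold: midpoint (a simplifying assumption)
        left = [label for value, label in pairs if value <= t]
        right = [label for value, label in pairs if value > t]
        gain = base - len(left) / total * entropy(left) - len(right) / total * entropy(right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

# Made-up heartbeat rates and class labels (assumption: illustrative only).
heart_rates = [60, 65, 70, 90, 95, 100]
classes = ["no", "no", "no", "yes", "yes", "yes"]

threshold, gain = best_threshold(heart_rates, classes)
print(threshold, gain)
```

Once a threshold is chosen, the continuous attribute behaves like a two-valued discrete attribute for that split.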

Further, it is important to know that C4.5 is not the best algorithm in all cases, but it is very useful in some situations.

Implementations of the C4.5 algorithm in Data Mining

J48 is an open-source Java implementation of the C4.5 algorithm. J48 is available in Weka, a well-known data mining tool.

Comparison of C4.5 vs C5.0

The C4.5 data mining algorithm was developed by Ross Quinlan. C4.5 generates Decision Trees (DT), which can be used for classification of the dataset. C4.5 extends the ID3 algorithm: it deals with both continuous and discrete attributes, handles missing values, and prunes trees after construction.

C4.5 is an improvement over ID3, but the C5.0 data mining algorithm is better and faster than C4.5. Further, C5.0 is more memory efficient and builds smaller decision trees.
