Decision Tree Induction and Entropy in data mining

Decision Tree Induction

A decision tree is a tree-like structure that consists of the following parts (shown in Figure 1):

  1. Root node:
    • age is the root node
  2. Branches:
    • The branches are:
      • <20
      • 21…50
      • >50
      • USA
      • PK
      • High
      • Low
  3. Leaf node:
    • The leaf nodes are:
      • Yes
      • No

Figure 1: Decision tree induction example

 

Entropy:

Entropy is a measure of the uncertainty (impurity) in a set of records.

  • For two classes (Yes/No), entropy lies between 0 and 1.
  • High entropy means the classes are mixed: the records are spread more evenly between Yes and No.
  • Low entropy means the records belong mostly to one class.

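P = total number of Yes records = 9

N = total number of No records = 5

Note: the entropy formula uses logarithms to base 2. To calculate log2 of a number on a calculator, divide its logarithm by the logarithm of 2.

For example, what is log2 of 0.643?

Ans: log(0.643) / log(2) = -0.637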

Entropy(P, N) = -9/14 * log2(9/14) - 5/14 * log2(5/14)

= -9/14 * log2(0.643) - 5/14 * log2(0.357)

= -9/14 * (-0.637) - 5/14 * (-1.485)

= 0.940
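The calculation above can be checked with a few lines of Python. This is a minimal sketch, not part of the original tutorial; the function name `entropy` and its parameters `p` and `n` are illustrative choices.

```python
import math

def entropy(p: int, n: int) -> float:
    """Entropy (in bits) of a set with p 'Yes' and n 'No' records."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:  # a zero count contributes nothing (0 * log2(0) is taken as 0)
            frac = count / total
            result -= frac * math.log2(frac)
    return result

# Change of base: log2(x) equals log(x) / log(2), whatever base "log" uses.
x = 9 / 14
assert math.isclose(math.log10(x) / math.log10(2), math.log2(x))

print(f"{entropy(9, 5):.3f}")  # prints 0.940
```

Working at full precision gives 0.940; hand calculations that round the intermediate logarithms can differ in the last digit.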


For Age:

age      Pi (Yes)   Ni (No)   Info(Pi, Ni)
<20      2          3         0.971
21…50    4          0         0
>50      3          2         0.971

Note: entropy depends only on the two counts, not on their order. With Yes = 2 and No = 3 the entropy is 0.971, and it is the same 0.971 with Yes = 3 and No = 2.

So once the entropy for age <20 has been calculated, there is no need to calculate it again for age >50, because the Yes and No counts are the same pair with the order swapped.

 

Gain(Age) = Entropy(P, N) - [5/14 * Info(2, 3) + 4/14 * Info(4, 0) + 5/14 * Info(3, 2)]

Gain of Age: 0.247
Gain of Income: 0.029
Gain of Credit Rating: 0.048
Gain of Region: 0.151

Age has a greater gain than Income, Credit Rating, and Region, so Age is chosen as the root node.
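The gain of Age can be reproduced from the Yes/No counts in the Age table. The sketch below assumes the usual definition of information gain (entropy of the whole set minus the weighted entropy of the partitions); the helper names `entropy` and `information_gain` are illustrative, not from the tutorial.

```python
import math

def entropy(counts):
    """Entropy (in bits) of a tuple of class counts, e.g. (yes, no)."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(parent, partitions):
    """Parent entropy minus the size-weighted entropy of its partitions."""
    total = sum(parent)
    weighted = sum(sum(part) / total * entropy(part) for part in partitions)
    return entropy(parent) - weighted

# (Yes, No) counts for age <20, 21…50 and >50, taken from the Age table.
age_partitions = [(2, 3), (4, 0), (3, 2)]
gain_age = information_gain((9, 5), age_partitions)
print(f"{gain_age:.3f}")  # prints 0.247
```

At full precision the result is about 0.247; a hand calculation with rounded intermediate values can land one or two thousandths away. The gains for Income, Credit Rating, and Region cannot be recomputed here because their count tables for the full data set are not shown.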




Note that:

  • If the (Yes, No) counts are (0, any number) or (any number, 0), the entropy is always 0.
  • Swapping the counts does not change the entropy: (3, 5) and (5, 3) have the same entropy.
  • Entropy measures the impurity or uncertainty of the data.
  • A fair coin (heads and tails each have probability 1/2) represents maximum uncertainty, because it is hardest to guess the outcome, so its entropy is 1. A coin with heads on both sides always lands heads (probability 1), so there is no uncertainty and its entropy is 0.
  • If p is equal to q (the proportions of Yes and No), uncertainty is at its maximum.
  • The more p differs from q, the lower the uncertainty.
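These properties are easy to confirm numerically. The following is a small sketch with an illustrative `entropy(p, n)` helper (an assumption, not code from the tutorial) that checks each note in turn.

```python
import math

def entropy(p, n):
    """Entropy (in bits) for p 'Yes' and n 'No' records."""
    total = p + n
    return sum(-c / total * math.log2(c / total) for c in (p, n) if c)

# (0, any number) or (any number, 0): no uncertainty at all.
assert entropy(0, 7) == 0
assert entropy(7, 0) == 0

# Swapping the counts leaves the entropy unchanged.
assert math.isclose(entropy(3, 5), entropy(5, 3))

# Equal counts (a fair coin) give the maximum entropy of 1 bit.
assert entropy(1, 1) == 1.0

# The further the split moves from 50/50, the lower the entropy.
assert entropy(4, 4) > entropy(3, 5) > entropy(1, 7) > entropy(0, 8)
```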

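Now calculate the entropy again for the remaining attributes. The counts below total 2 Yes and 3 No, which matches the age <20 branch:

  1. Income
  2. Region
  3. Credit Rating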

For Income:

Income   Pi (Yes)   Ni (No)   Info(Pi, Ni)
High     0          2         0
Medium   1          1         1
Low      1          0         0

For Region:

Region   Pi (Yes)   Ni (No)   Info(Pi, Ni)
USA      0          3         0
PK       2          0         0

For Credit Rating:

Credit Rating   Pi (Yes)   Ni (No)   Info(Pi, Ni)
Low             1          2         0.918
High            1          1         1

 


Gain of Region: 0.971
Gain of Income: 0.571
Gain of Credit Rating: 0.020

Region has a greater gain than Income and Credit Rating, so Region is chosen as the next node on this branch.
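The three gains for this branch follow directly from the count tables above. As a rough sketch (the names `entropy`, `information_gain`, and `candidates` are illustrative assumptions), the attribute with the highest gain can be picked like this:

```python
import math

def entropy(counts):
    """Entropy (in bits) of a tuple of class counts, e.g. (yes, no)."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(parent, partitions):
    """Parent entropy minus the size-weighted entropy of its partitions."""
    total = sum(parent)
    weighted = sum(sum(part) / total * entropy(part) for part in partitions)
    return entropy(parent) - weighted

branch = (2, 3)  # (Yes, No) counts of the records on this branch

# (Yes, No) counts per attribute value, copied from the three tables.
candidates = {
    "Income": [(0, 2), (1, 1), (1, 0)],
    "Region": [(0, 3), (2, 0)],
    "Credit Rating": [(1, 2), (1, 1)],
}

gains = {name: information_gain(branch, parts) for name, parts in candidates.items()}
best = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("Best split:", best)
```

Region splits the branch into two pure groups, so its gain equals the full entropy of the branch and it wins the comparison.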

The remaining branches can be calculated in the same way.


Author: Prof. Fazal Rehman Shamil
