# Decision Tree Induction and Entropy in Data Mining

Last modified on December 9th, 2018 at 9:17 pm

# Decision Tree Induction

A decision tree is a tree-like structure that consists of the following parts (shown in Figure 1):

**Root node:** Age is the root node.

**Branches:** The branches are:
- <20
- 21…50
- >50
- USA
- PK
- High
- Low

**Leaf nodes:** The leaf nodes are:
- Yes
- No

**Entropy:**

Entropy measures the uncertainty (impurity) of a set of examples.

- For a two-class problem such as Yes/No, entropy lies between 0 and 1.
- High entropy means the classes are mixed almost evenly, so the outcome is hard to predict.
- Low entropy means one class dominates, so the outcome is easy to predict.

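P = total Yes = 9

N = total No = 5

Note that the logarithm to base 2 of a number can be calculated with the change-of-base rule. For example, what is $\log_2$ of 0.642?

Ans: $\log_2(0.642) = \log(0.642) / \log(2) \approx -0.639$

The entropy of the whole data set is then:

$$
\begin{aligned}
\text{Info}(P, N) &= -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \\
&= -\tfrac{9}{14}\log_2(0.642) - \tfrac{5}{14}\log_2(0.357) \\
&= -\tfrac{9}{14}(-0.639) - \tfrac{5}{14}(-1.485) \\
&\approx 0.941
\end{aligned}
$$

With unrounded fractions the result is 0.940; the difference comes only from rounding 9/14 and 5/14 to three decimals.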
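The calculation above can be checked with a few lines of Python. This is a minimal sketch, not part of the original tutorial; the function name `entropy` and its arguments `p` and `n` are illustrative assumptions.

```python
import math


def entropy(p: int, n: int) -> float:
    """Entropy (in bits) of a set with p 'Yes' and n 'No' examples."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count == 0:
            continue  # by convention, 0 * log2(0) contributes 0
        fraction = count / total
        result -= fraction * math.log2(fraction)
    return result


# Change-of-base rule from the text: log2(x) = log(x) / log(2)
log2_by_hand = math.log(0.642) / math.log(2)
print(f"log2(0.642) = {log2_by_hand:.3f}")

# Entropy of the whole data set: 9 Yes, 5 No
dataset_entropy = entropy(9, 5)
print(f"Info(9, 5) = {dataset_entropy:.3f}")
```

Because the code does not round the fractions, it gives a value of about 0.940 for the whole data set.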

**For Age:**

| Age | P_i (Yes) | N_i (No) | Info(P_i, N_i) |
|---|---|---|---|
| <20 | 2 | 3 | 0.970 |
| 21…50 | 4 | 0 | 0 |
| >50 | 3 | 2 | 0.970 |

Note: entropy depends only on the class proportions, not on which class is which. With Yes = 2 and No = 3 the entropy is 0.970, and it is the same 0.970 with Yes = 3 and No = 2.

So once the entropy for age <20 has been calculated, there is no need to calculate it again for age >50, because the two branches have the same counts with Yes and No swapped.

| Attribute | Gain |
|---|---|
| Age | 0.248 |
| Income | 0.029 |
| Credit Rating | 0.048 |
| Region | 0.151 |

The gain of Age (0.248) is greater than the gains of Income, Credit Rating, and Region, so Age is chosen as the root node.
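Information gain is the entropy of the whole set minus the weighted average entropy of the branches an attribute creates. The sketch below applies that to the Age counts listed earlier; `information_gain` and `age_branches` are hypothetical names chosen for illustration. Because it does not round intermediate values, it gives about 0.247 rather than the hand-rounded 0.248.

```python
import math


def entropy(p: int, n: int) -> float:
    """Entropy (in bits) of a set with p 'Yes' and n 'No' examples."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count == 0:
            continue  # 0 * log2(0) is treated as 0
        fraction = count / total
        result -= fraction * math.log2(fraction)
    return result


def information_gain(parent, branches):
    """parent: (yes, no) for the whole set; branches: one (yes, no) per attribute value."""
    total = sum(parent)
    # Weight each branch's entropy by the share of examples that fall into it
    weighted = sum((p + n) / total * entropy(p, n) for p, n in branches)
    return entropy(*parent) - weighted


# Counts for age <20, 21…50, and >50 from the Age table
age_branches = [(2, 3), (4, 0), (3, 2)]
gain_age = information_gain((9, 5), age_branches)
print(f"Gain(Age) = {gain_age:.3f}")
```

The gains for Income, Credit Rating, and Region would need the per-value Yes/No counts for those attributes, which are not listed in this section.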

Note that:

- If the Yes and No counts are (0, any number) or (any number, 0), the entropy is always 0.
- Mirrored counts such as (3, 5) and (5, 3) have the same entropy.
- Entropy measures the impurity or uncertainty of the data.
- A fair coin (heads and tails each with probability 1/2) represents maximum uncertainty, because it is hard to guess whether heads or tails will occur. If the coin has heads on both sides, the probability of heads is 1 and the uncertainty (entropy) is 0.
- If p is equal to q, where p and q are the probabilities of the two classes, the uncertainty is at its maximum.
- If p is not equal to q, the uncertainty is lower.
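The properties in the notes above can be checked numerically. The following is a small sketch under the same two-class assumption; the helper `entropy` and the ten-toss coin example are illustrative choices, not taken from the original.

```python
import math


def entropy(p: int, n: int) -> float:
    """Entropy (in bits) of a set with p 'Yes' and n 'No' examples."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count == 0:
            continue  # 0 * log2(0) is treated as 0
        fraction = count / total
        result -= fraction * math.log2(fraction)
    return result


# A pure set (one class only) has zero entropy
pure_a, pure_b = entropy(0, 7), entropy(7, 0)

# Mirrored counts give the same entropy (compared with a float tolerance)
mirrored_equal = math.isclose(entropy(3, 5), entropy(5, 3))

# A fair coin (p = q) gives the maximum of 1 bit
fair_coin = entropy(1, 1)

# Among all 10-toss outcomes, the even 5/5 split is the most uncertain
most_uncertain_heads = max(range(11), key=lambda h: entropy(h, 10 - h))

print(pure_a, pure_b, mirrored_equal, fair_coin, most_uncertain_heads)
```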

Now calculate the entropy again, this time within the age <20 branch (2 Yes, 3 No), for the remaining attributes:

- Income
- Region
- Credit Rating

**For Income:**

| Income | P_i (Yes) | N_i (No) | Info(P_i, N_i) |
|---|---|---|---|
| High | 0 | 2 | 0 |
| Medium | 1 | 1 | 1 |
| Low | 1 | 0 | 0 |

**For Region:**

| Region | P_i (Yes) | N_i (No) | Info(P_i, N_i) |
|---|---|---|---|
| USA | 0 | 3 | 0 |
| PK | 2 | 0 | 0 |

**For Credit Rating:**

| Credit Rating | P_i (Yes) | N_i (No) | Info(P_i, N_i) |
|---|---|---|---|
| Low | 1 | 2 | 0.918 |
| High | 1 | 1 | 1 |

| Attribute | Gain |
|---|---|
| Region | 0.970 |
| Credit Rating | 0.02 |
| Income | 0.57 |

The gain of Region (0.970) is greater than the gains of Income and Credit Rating, so Region is chosen as the next splitting node.
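The same gain calculation can be repeated on the subset of examples described by the Income, Region, and Credit Rating tables, which together contain 2 Yes and 3 No. This is a hedged sketch: the names `information_gain`, `splits`, and `best_attribute` are assumptions made for illustration, and the counts are copied from those tables.

```python
import math


def entropy(p: int, n: int) -> float:
    """Entropy (in bits) of a set with p 'Yes' and n 'No' examples."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count == 0:
            continue  # 0 * log2(0) is treated as 0
        fraction = count / total
        result -= fraction * math.log2(fraction)
    return result


def information_gain(parent, branches):
    """Parent entropy minus the size-weighted entropy of the branches."""
    total = sum(parent)
    weighted = sum((p + n) / total * entropy(p, n) for p, n in branches)
    return entropy(*parent) - weighted


parent = (2, 3)  # Yes/No totals of the subset being split
splits = {
    "Income": [(0, 2), (1, 1), (1, 0)],   # High, Medium, Low
    "Region": [(0, 3), (2, 0)],           # USA, PK
    "Credit Rating": [(1, 2), (1, 1)],    # Low, High
}

gains = {name: information_gain(parent, branches) for name, branches in splits.items()}
best_attribute = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"Gain({name}) = {gain:.3f}")
print("Split on:", best_attribute)
```

Region separates the subset into two pure groups, so its gain equals the full entropy of the subset and it comes out as the best attribute to split on.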

Similarly, you can calculate the gains for the remaining branches.