How does a Decision Tree decide where to split?

There are various metrics for choosing the variable on which a node is split. Different algorithms deploy different metrics to decide which variable splits the dataset the best way.

Source: AnalyticsVidhya

The decision of making strategic splits heavily affects a tree’s accuracy. The decision criteria are different for classification and regression trees.

Decision trees use multiple algorithms to decide to split a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of the resultant sub-nodes. In other words, the purity of each node increases with respect to the target variable. A decision tree splits the nodes on all available variables and then selects the split which results in the most homogeneous sub-nodes.
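As a rough sketch of this exhaustive search, consider the Python snippet below. It is illustrative only: `best_split`, `candidate_splits` and `homogeneity` are placeholder names of my own, not from any particular library, and higher scores are assumed to mean purer sub-nodes, matching the Gini convention used later in this post.

```python
# Illustrative sketch of greedy split selection; not any specific library's API.
def best_split(rows, candidate_splits, homogeneity):
    """Try every candidate split and keep the one whose weighted
    sub-node homogeneity is highest (i.e. purest sub-nodes)."""
    best, best_score = None, float("-inf")
    n = len(rows)
    for split in candidate_splits:
        sub_nodes = split(rows)  # e.g. partition the rows on Gender or Class
        score = sum(len(s) / n * homogeneity(s) for s in sub_nodes)
        if score > best_score:
            best, best_score = split, score
    return best, best_score
```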

The algorithm selection also depends on the type of target variable.

Example:

Let’s say we have a sample of 30 students with three variables: Gender (Boy/Girl), Class (IX/X) and Height (5 to 6 ft). 15 of these 30 play cricket in their leisure time. We want to create a model to predict who will play cricket during leisure time. For this, we need to segregate the students who play cricket in their leisure time based on the most significant input variable among the three.

[Figure: the 30 students split on each of the three variables]

This is where a decision tree helps: it segregates the students based on all values of the three variables and identifies the variable that creates the best homogeneous sets of students (sets which are heterogeneous to each other). In the snapshot above, you can see that the variable Gender is able to identify the best homogeneous sets compared to the other two variables.

As mentioned above, the decision tree identifies the most significant variable and the value of that variable that gives the most homogeneous sets of population. Now the question that arises is: how does it identify the variable and the split? To do this, decision trees use various algorithms, which are discussed below.

Structure of a Decision Tree:

[Figure: structure of a decision tree]


Gini Index

The Gini index says that if we select two items from a population at random, then they must be of the same class, and the probability of this is 1 if the population is pure.

  • It works with a categorical target variable (“Success” or “Failure”).
  • It performs only binary splits.
  • The higher the value of Gini, the higher the homogeneity.
  • CART (Classification and Regression Tree) uses the Gini method to create binary splits.

Steps to Calculate Gini for a split

  • Calculate Gini for the sub-nodes using the formula $p^2 + q^2$, the sum of the squared probabilities of success and failure.
  • Calculate Gini for the split as the weighted Gini score of each node of that split.

Example: referring to the example used above, where we want to segregate the students based on the target variable (playing cricket or not). In the snapshot below, we split the population using the two input variables Gender and Class. Now I want to identify which split produces more homogeneous sub-nodes using the Gini index.

[Figure: population split on Gender (Female: 2 of 10 play cricket, Male: 13 of 20) and on Class (IX: 6 of 14 play, X: 9 of 16)]

Split on Gender:

  • Calculate Gini for sub-node Female = (0.2)(0.2) + (0.8)(0.8) = 0.68
  • Gini for sub-node Male = (0.65)(0.65) + (0.35)(0.35) = 0.55
  • Calculate weighted Gini for the split on Gender = (10/30)(0.68) + (20/30)(0.55) = 0.59

Similarly for the split on Class:

  • Gini for sub-node Class IX = (0.43)(0.43) + (0.57)(0.57) = 0.51
  • Gini for sub-node Class X = (0.56)(0.56) + (0.44)(0.44) = 0.51
  • Calculate weighted Gini for the split on Class = (14/30)(0.51) + (16/30)(0.51) = 0.51

Above, you can see that the Gini score for the split on Gender is higher than for the split on Class; hence, the node split will take place on Gender.
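As a quick sanity check, here is a minimal Python sketch (the helper names are mine, not from any library) that reproduces both weighted Gini scores from the counts in the example:

```python
def gini_node(success, total):
    """Gini score p^2 + q^2 for one node; higher means more homogeneous."""
    p = success / total
    return p * p + (1 - p) * (1 - p)

def gini_split(nodes):
    """Weighted Gini of a split; nodes is a list of (success, total) pairs."""
    population = sum(total for _, total in nodes)
    return sum(t / population * gini_node(s, t) for s, t in nodes)

# Gender: Female 2/10 play cricket, Male 13/20; Class: IX 6/14, X 9/16
print(gini_split([(2, 10), (13, 20)]))  # ~0.59
print(gini_split([(6, 14), (9, 16)]))   # ~0.51
```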

Chi-Square

It is an algorithm that finds the statistical significance of the differences between the sub-nodes and the parent node. We measure it using the standardized differences between the observed and expected frequencies of the target variable.

  • It works with a categorical target variable (“Success” or “Failure”).
  • It can produce two or more splits.
  • The higher the value of Chi-square, the higher the statistical significance of the differences between the sub-node and the parent node.
  • The Chi-square of each node is calculated using the formula:

Chi-square = $\sqrt{\frac{(\text{Actual} - \text{Expected})^2}{\text{Expected}}}$

It generates a tree called CHAID (Chi-square Automatic Interaction Detector).

Steps to Calculate Chi-square for a split:

  • Calculate the Chi-square for each individual node by calculating the deviation for both Success and Failure.
  • Calculate the Chi-square of the split as the sum of the Chi-square values of Success and Failure of each node of the split.

Example: let’s work with the example above that we used to calculate Gini.

Split on Gender:

  • First we populate the node Female: the actual values for “Play Cricket” and “Not Play Cricket” are 2 and 8 respectively.
  • Calculate the expected values for “Play Cricket” and “Not Play Cricket”; here it would be 5 for both, because the parent node has a probability of 50% and we apply the same probability to the Female count (10).
  • Calculate the deviations using the formula Actual - Expected: for “Play Cricket” it is 2 - 5 = -3 and for “Not Play Cricket” it is 8 - 5 = 3.
  • Calculate the Chi-square of the node for “Play Cricket” and “Not Play Cricket” using the formula above. You can refer to the table below for the calculation.
  • Follow similar steps to calculate the Chi-square values for the Male node.
  • Now add all the Chi-square values to calculate the Chi-square for the split on Gender.

[Table: Chi-square calculation for the split on Gender]

Split on Class:

Perform similar calculation steps for the split on Class and you will come up with the table below.

[Table: Chi-square calculation for the split on Class]

Above, you can see that Chi-square also identifies the split on Gender as more significant than the split on Class.
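The tables above can be reproduced with a short script. This is a sketch under the formula used in this post (the square root of (Actual - Expected)^2 / Expected per cell, with expected counts taken from the parent node’s 50/50 proportion); the helper name is mine. The totals, roughly 4.58 for Gender versus 1.46 for Class, follow directly from the counts in the example:

```python
from math import sqrt

def chi_square_split(nodes, parent_rate=0.5):
    """Sum of per-cell sqrt((Actual - Expected)^2 / Expected) over all
    sub-nodes; nodes is a list of (play_actual, total) pairs."""
    total_chi = 0.0
    for play, total in nodes:
        expected_play = parent_rate * total  # e.g. 5 for the Female node
        expected_not = total - expected_play
        total_chi += sqrt((play - expected_play) ** 2 / expected_play)
        total_chi += sqrt((total - play - expected_not) ** 2 / expected_not)
    return total_chi

print(chi_square_split([(2, 10), (13, 20)]))  # Gender: ~4.58
print(chi_square_split([(6, 14), (9, 16)]))   # Class:  ~1.46
```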

Information Gain:

Look at the image below and think about which node can be described most easily. I am sure your answer is C, because it requires less information: all of its values are similar. On the other hand, B requires more information to describe, and A requires the maximum information. In other words, we can say that C is a pure node, B is less impure and A is the most impure.

[Figure: three nodes A, B and C, from most impure to pure]

Now we can conclude that a less impure node requires less information to describe it, and a more impure node requires more information. Information theory provides a measure of this degree of disorganization in a system, known as entropy. If the sample is completely homogeneous, the entropy is zero; if the sample is equally divided (50%-50%), it has an entropy of one.

Entropy can be calculated using formula:

Entropy $= -p \log_2 p - q \log_2 q$

Here p and q are the probabilities of success and failure, respectively, in that node. Entropy is also used with a categorical target variable. It chooses the split which has the lowest entropy compared to the parent node and other splits. The lower the entropy, the better.

Steps to calculate entropy for a split:

  • Calculate the entropy of the parent node.
  • Calculate the entropy of each individual node of the split, and take the weighted average of all sub-nodes in the split.

Example: let’s use this method to identify the best split for the student example.

  • Entropy for the parent node = -(15/30) log2(15/30) - (15/30) log2(15/30) = 1. Here 1 shows that it is an impure node (in fact, the maximum possible impurity).
  • Entropy for the Female node = -(2/10) log2(2/10) - (8/10) log2(8/10) = 0.72, and for the Male node, -(13/20) log2(13/20) - (7/20) log2(7/20) = 0.93.
  • Entropy for the split on Gender = weighted entropy of the sub-nodes = (10/30)(0.72) + (20/30)(0.93) = 0.86.
  • Entropy for the Class IX node = -(6/14) log2(6/14) - (8/14) log2(8/14) = 0.99, and for the Class X node, -(9/16) log2(9/16) - (7/16) log2(7/16) = 0.99.
  • Entropy for the split on Class = (14/30)(0.99) + (16/30)(0.99) = 0.99.

Above, you can see that the entropy for the split on Gender is the lowest, so the tree will split on Gender. Information gain is the entropy of the parent node minus the weighted entropy of the split; since the parent entropy here is 1, the information gain is simply 1 - entropy of the split (1 - 0.86 = 0.14 for Gender versus 1 - 0.99 = 0.01 for Class).
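Here is a minimal sketch (helper names are mine) that reproduces the entropies above and the resulting information gain for each split:

```python
from math import log2

def entropy_node(success, total):
    """Entropy -p*log2(p) - q*log2(q); 0 for a pure node, 1 for a 50/50 node."""
    p = success / total
    if p in (0.0, 1.0):  # a pure node has zero entropy by convention
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def entropy_split(nodes):
    """Weighted entropy of a split; nodes is a list of (success, total) pairs."""
    population = sum(total for _, total in nodes)
    return sum(t / population * entropy_node(s, t) for s, t in nodes)

parent = entropy_node(15, 30)                # 1.0
gender = entropy_split([(2, 10), (13, 20)])  # ~0.86
klass = entropy_split([(6, 14), (9, 16)])    # ~0.99
print(parent - gender, parent - klass)       # information gain: ~0.14 vs ~0.01
```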

Reduction in Variance

So far, we have discussed the algorithms for a categorical target variable. Reduction in variance is an algorithm used for continuous target variables (regression problems). It uses the standard formula of variance to choose the best split: the split with the lower weighted variance is selected as the criterion to split the population.

Variance = $\frac{\sum (X - \bar X)^2}{n}$

Above, $\bar X$ is the mean of the values, $X$ is an actual value and $n$ is the number of values.

Steps to calculate Variance:

  • Calculate variance for each node.
  • Calculate variance for each split as the weighted average of each node’s variance.

Example: let’s assign the numerical value 1 for playing cricket and 0 for not playing cricket. Now follow the steps to identify the right split:

  • Variance for the Root node: the mean value is (15×1 + 15×0)/30 = 0.5 and we have 15 ones and 15 zeros. The variance is ((1-0.5)^2 + (1-0.5)^2 + … 15 times + (0-0.5)^2 + (0-0.5)^2 + … 15 times) / 30, which can be written as (15(1-0.5)^2 + 15(0-0.5)^2) / 30 = 0.25.
  • Mean of the Female node = (2×1 + 8×0)/10 = 0.2 and Variance = (2(1-0.2)^2 + 8(0-0.2)^2) / 10 = 0.16.
  • Mean of the Male node = (13×1 + 7×0)/20 = 0.65 and Variance = (13(1-0.65)^2 + 7(0-0.65)^2) / 20 = 0.23.
  • Variance for the split on Gender = weighted variance of the sub-nodes = (10/30)(0.16) + (20/30)(0.23) = 0.21.

  • Mean of the Class IX node = (6×1 + 8×0)/14 = 0.43 and Variance = (6(1-0.43)^2 + 8(0-0.43)^2) / 14 = 0.24.
  • Mean of the Class X node = (9×1 + 7×0)/16 = 0.56 and Variance = (9(1-0.56)^2 + 7(0-0.56)^2) / 16 = 0.25.
  • Variance for the split on Class = (14/30)(0.24) + (16/30)(0.25) = 0.25.

Above, you can see that the Gender split has lower variance compared to the parent node (and to the split on Class), so the split would take place on the Gender variable.
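Finally, a minimal sketch (helper names are mine) reproducing the variance numbers above, with playing cricket encoded as 1 and not playing as 0:

```python
def variance_node(ones, total):
    """Variance sum((X - mean)^2) / n for a node of 0/1 values."""
    mean = ones / total
    return (ones * (1 - mean) ** 2 + (total - ones) * mean ** 2) / total

def variance_split(nodes):
    """Weighted variance of a split; nodes is a list of (ones, total) pairs."""
    population = sum(total for _, total in nodes)
    return sum(t / population * variance_node(o, t) for o, t in nodes)

print(variance_node(15, 30))                # root: 0.25
print(variance_split([(2, 10), (13, 20)]))  # Gender: ~0.21
print(variance_split([(6, 14), (9, 16)]))   # Class:  ~0.25
```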

Written on January 5, 2018