Decision Tree Algorithm using Excel with GINI Index

A beginner's guide to learning the decision tree algorithm using Excel.

Quick facts about decision trees

  • Decision trees try to construct small, consistent hypotheses
  • Decision trees are very bad for some functions:
    – Parity function
    – Majority function
  • Aim for:
    – Small decision trees
    – Robustness to misclassification
  • Constructing the shortest decision tree is intractable
    – Standard approaches are greedy
    – Classical approach is to split tree using an information-theoretic criterion
  • For errorless data, you can always construct a decision tree that correctly labels every element of the training set, but it may be exponential in size


What is a Decision Tree?

As the name implies, in simple terms a decision tree helps us make a decision about a data item. For instance, if you are a banker, you can decide whether or not to give a loan to a person on the basis of their occupation, age, and education level by using a decision tree. In any decision tree, we start at the root node, answer a particular question at each node, and take the branch that corresponds to our answer. In this way, we traverse from the root node to a leaf and form a conclusion about our data item. Let us see an example of a decision tree below.

Fig: A tree showing the survival of passengers on the Titanic (“SIBSP” is the number of spouses or siblings aboard). (Source: Wikipedia).

This tree is used to predict whether a person aboard the ship survived or not on the basis of characteristics such as age, sex and the number of spouses or siblings aboard. As you can see from the figure, decision trees are very easy to understand and very intuitive.
Decision trees are used for prediction in statistics, data mining and machine learning. They are an integral algorithm in predictive machine learning. The classical decision tree algorithms have been around for decades, and modern variations like random forests are among the most powerful classification techniques available.
We will discuss a simple decision tree algorithm known as CART, which stands for Classification And Regression Trees.

Learn about CART in Decision Tree

CART is just a fancy term for Classification and Regression Trees, which was introduced by Leo Breiman to refer to the decision trees used for classification and regression. Decision tree models where the target variable can take a discrete set of values are called Classification Trees and decision trees where the target variable can take continuous values are known as Regression Trees.

The representation for the CART model is a binary tree. Binary means that at each node there are two branches. At each node, we make a decision about an input variable for the data item and then take the corresponding branch. Take, for instance, a purely fictitious (not to mention naive) classification tree based on age and employment status.

Fig: A decision tree for deciding whether to give a loan or not.

This tree can be stored as a set of rules:

  • If (age<20) then do not give a loan.
  • If (age>20) AND employed then give a loan.
  • If (age>20) AND not employed then do not give a loan.
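
As a rough illustration, the same rule set could be written as a small Python function. The function name and the handling of an age of exactly 20 are assumptions made for this sketch, since the rules above leave that case unspecified.

```python
def approve_loan(age, employed):
    """Hypothetical loan decision following the rules above."""
    if age < 20:
        return False      # under 20: do not give a loan
    if employed:
        return True       # 20 or older and employed: give a loan
    return False          # 20 or older but not employed: do not give a loan

# A 25-year-old who is not employed would be refused a loan.
print(approve_loan(25, False))   # False
```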

Making predictions from a decision tree:

Making a prediction from a decision tree is so simple that a child can follow the steps. Given a new data item, we traverse the tree by evaluating the relevant input variable, starting at the root node of the tree (which in our example is age). We then take the corresponding branch, evaluate the next input, and so on until we reach a leaf. For example, consider the input age = 25 and employed = no.

We would traverse the above tree as follows:

  • Age > 20? Yes
  • Employed? No
  • Therefore: no loan

We have looked at how to make predictions from decision trees. Now let’s look at how the decision trees are constructed (or learned) in the first place.

How do we learn the decision tree from the data?

A binary decision actually partitions the input space into two at each node. So, our aim in constructing the binary tree is to decide which input variable to place at a particular node. One approach is to use a cost function to decide: generally, the variable with the lowest cost is selected. There are many kinds of cost functions suitable for different purposes, such as Gini impurity, information gain, and variance reduction.

At each node, the input set of data is partitioned into two halves. For example, in the case considered above, the node "is the age less than 20" separates the dataset into two halves, i.e., one set with age less than 20 and another set with age greater than 20. How do we go about selecting a variable at a particular node? Well, as we said earlier, the variable which corresponds to the lowest value of a cost function is selected. The concept will be much clearer when we look at a simple example below:

Let us consider a dataset (again purely hypothetical) as shown below:

Fig: The sample dataset (values of X1, X2 and the target Y).

The value of Y (1 or 0) is the target value to be predicted on the basis of the values of X1 and X2.
At each node the dataset splits into two groups: left and right. To determine the splitting variable, let us use a simple cost function called the Gini index.

The Gini index is widely used in CART and other decision tree algorithms. It gives the probability of incorrectly labeling a randomly chosen element of a group if we label it according to the distribution of labels in that group. It sounds a little complicated, so let's see what it means for the previous example. Suppose we have a data set with the age, employment status and loan status of each item. The Gini impurity of a split on age gives the probability that we would be wrong if we predicted the loan status of each item in the dataset based on age only.

Mathematically, the Gini index for a split is calculated as

Gini(split) = sum over each group g (left or right) of sum over each class i of g(i)*(1 - g(i)),

where g(i) is the proportion of instances of class i in group g (written left(i) and right(i) below).

In simple terms, for our example:
Gini(split) = left(0)*(1 - left(0)) + right(0)*(1 - right(0)) + left(1)*(1 - left(1)) + right(1)*(1 - right(1))

Here left(0) is the proportion of data instances in the left group with class 0, and so on.
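
Translated into code, the same computation looks like the short Python sketch below. It follows the article's unweighted version of the split index; the function name gini_split and the use of proportion lists are illustrative assumptions, not something taken from the Excel workbook.

```python
def gini_split(left_props, right_props):
    """Gini split index as defined above.

    left_props and right_props hold the class proportions in each group,
    e.g. [left(0), left(1)] and [right(0), right(1)].
    """
    total = 0.0
    for props in (left_props, right_props):
        for p in props:
            total += p * (1 - p)   # p(i) * (1 - p(i)) for each class in each group
    return total

# A group that contains only one class contributes 0 to the index.
print(gini_split([1.0, 0.0], [0.5, 0.5]))   # 0.5
```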

For example, if we take the first split point (or node) to be X1 < 7, then 4 data items will be on the left of the splitting node and 6 will be on the right.
Left(0) = 4/4 = 1, as all four of the data items on the left (X1 < 7) have classification value 0.
Right(0) = 1/6.
Left(1) = 0.
Right(1) = 5/6.

Using the above formula we can calculate the Gini index for the split.

Gini(X1=7) = 0 + 5/6*1/6 + 0 + 1/6*5/6 = 5/36 + 5/36 = 5/18.

We can similarly evaluate the Gini index for each candidate split on the values of X1 and X2 and choose the one with the lowest Gini index. In this case, if we look at the graph, we can see that we could draw a vertical line at X1 = 8. Try this value yourself and find the Gini index.

You will find the index to be zero. This is an ideal split, which means that the classes are perfectly separated by it. (Both of the calculations are done in detail in the Excel sheet. You can check your calculations there.)
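
The whole split-selection step can also be scripted as a sanity check. The X1 values below are hypothetical stand-ins chosen only to reproduce the proportions described above (four class-0 items with X1 below 7, plus one class-0 and five class-1 items at 7 or above); they are not the actual values from the Excel sheet.

```python
# Hypothetical (X1, Y) pairs consistent with the proportions in the text.
data = [(2, 0), (3, 0), (5, 0), (6, 0), (7, 0),
        (9, 1), (10, 1), (11, 1), (12, 1), (14, 1)]

def gini_for_threshold(data, t):
    """Gini split index (per the formula above) for the split X1 < t."""
    left = [y for x, y in data if x < t]
    right = [y for x, y in data if x >= t]
    total = 0.0
    for group in (left, right):
        if not group:
            continue
        for cls in (0, 1):
            p = group.count(cls) / len(group)
            total += p * (1 - p)
    return total

for t in (7, 8):
    print(t, round(gini_for_threshold(data, t), 4))
# X1 < 7 gives about 0.2778 (= 5/18); X1 < 8 gives 0.0, a perfect split.
```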

Let's predict using the decision tree:

According to our decision tree:

  • If X1 < 8, then Y = 0
  • If X1 >= 8, then Y = 1

So if we have a data item with X1 = 15 and X2 = 10, then Y = 1.
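
In code, this final rule is a one-liner; the function name predict is just a placeholder for the sketch.

```python
def predict(x1, x2):
    # The learned tree splits only on X1 at the value 8; X2 is never used.
    return 0 if x1 < 8 else 1

print(predict(15, 10))   # 1
```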

Pros and Cons of Decision Trees:

Pros:

  • Because decision trees are very graphic and simple, they are extremely easy to visualize and understand. By providing an intuitive picture, they help us understand non-linear relationships in the data.
  • Decision trees do not require any assumptions about the dataset.
  • In most cases, the prediction process of a decision tree is extremely fast, as only Boolean comparisons are required.

Cons:

  • Decision trees that are grown very deep often overfit the training data, so they show high variation even for a small change in the input data. This is not a desirable property, as we do not want our model to be sensitive to noise in the data.
  • They are sensitive to the specific data on which they are trained, so they are error-prone on test data sets.
  • If the tree is very deep and complex, it becomes hard to visualize.
  • In addition, though decision trees are easy to understand, the preparation process requires a high level of mathematical and statistical knowledge.