An Intuitive Introduction to Decision Trees
Decision Trees are among the most powerful yet easy-to-understand machine learning algorithms. A decision tree lets the practitioner ask a series of questions that help her choose between the alternatives at hand. Decision trees are ubiquitous in day-to-day life; we use them daily, knowingly or unknowingly. The algorithm assumes that the data follows a set of rules, and these rules are identified using various mathematical techniques. Decision Trees are used in both the Classification (This or That?) and the Regression (How much of This?) settings.
In this article, I will introduce a basic decision tree: its intuition, its elements, and the techniques for building one. To start, note that a decision tree is similar to a flowchart, the kind we come across almost every day in offices, but with a decision at the end of it.
“Decision Trees are everywhere.”
An intuition into Decision Trees
Suppose you are out to buy a new laptop for yourself.
After reaching a shop, you are confused about which one to buy among so many options. So, you ask the shopkeeper to help you decide. The shopkeeper then asks you a series of questions. These questions help you decide which laptop to buy.
Q1: “How much storage space are you looking for?”
“Umm… around 1 TB space, preferably with an HDD.”
Q2: “Perfect, and how about the RAM?”
“Definitely 8 GB or more.”
Q3: “Alright, and any preferences on the Graphics Card?”
“I want at least 2 GB GPU.”
Q4: “Alright, and any preferences on the Processor?”
“I want an octa-core processor with at least 2.2 GHz speed.”
“Sure! I’ve got the perfect laptop for you.” And he hands over a laptop to you.
What the shopkeeper just did was to help you walk through alternatives to narrow down your choices. Pictorially, we can represent this process as:
This figure corresponds to a decision-making process. This structure is called a Decision Tree.
These decision trees can be built for almost any decision-making in day-to-day life. For example, imagine you want to pick between an umbrella, raincoat, or nothing while going out on a rainy day. This process can be described as:
Now suppose we say, “if it is windy outside, I’ll use a raincoat; otherwise, I’ll use an umbrella.” This statement adds a little detail to our tree in this way:
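Read as code, the rainy-day tree is just a pair of nested questions. The sketch below is a minimal illustration, not a reproduction of the original figure: the root question "is it raining?" and the function and argument names are assumptions made for this example.

```python
def what_to_carry(is_raining: bool, is_windy: bool) -> str:
    """Walk the rainy-day tree from the first question down to a decision."""
    # First question (assumed root of the tree): is it raining at all?
    if not is_raining:
        return "nothing"
    # Second question, added by the "windy" rule: wind rules out the umbrella.
    if is_windy:
        return "raincoat"
    return "umbrella"


for raining, windy in [(False, False), (True, False), (True, True)]:
    print(raining, windy, "->", what_to_carry(raining, windy))
```

Each `if` corresponds to one question in the tree, and each `return` corresponds to one final decision.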
In the above section, we dealt with what a decision tree is and how we can build one with a simple intuitive process. Let us now discuss the different elements of a decision tree with our previous case.
In this Decision Tree diagram, we have:
- Node: This is where we either ask a question or make a decision. In the diagram, nodes sit at the ends of the arrows. In our case, Storage space, RAM, GPU, Processor, Buy, and Don’t buy are all individual nodes.
- Root Node: This is the place where the first separation takes place. In our case, the question about the Storage space of the laptops forms our root node.
- Splitting: It is the process of dividing any node into two or more nodes. In our case, every question split into two nodes: one that asks a further question and one that decides that we don’t want to buy.
- Decision Node: If, after any split, a resulting node asks another question, then the resulting node is called a decision node. Here, the RAM, GPU, and Processor nodes are decision nodes.
- Leaf: If, after any split, a resulting node outputs a decision (categorical or continuous value), it is called the leaf node. This node doesn’t ask further questions. In this example, Don’t buy and Buy nodes are leaf nodes.
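The elements above map naturally onto a small data structure. The following is a sketch under stated assumptions: the class names `DecisionNode` and `Leaf`, the `decide` helper, and the dictionary keys describing a laptop (`storage_tb`, `ram_gb`, `gpu_gb`, `cores`, `ghz`) are all hypothetical names chosen for illustration. The thresholds come from the shopkeeper conversation.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    """A node that outputs a decision and asks nothing further."""
    decision: str


@dataclass
class DecisionNode:
    """A node that asks a yes/no question and splits into two nodes."""
    name: str
    question: Callable[[dict], bool]
    if_yes: Union["DecisionNode", Leaf]
    if_no: Union["DecisionNode", Leaf]


def decide(node, laptop: dict) -> str:
    """Start at the root and follow the answers until a leaf is reached."""
    while isinstance(node, DecisionNode):
        node = node.if_yes if node.question(laptop) else node.if_no
    return node.decision


buy, dont_buy = Leaf("Buy"), Leaf("Don't buy")

# Build bottom-up: every "no" answer ends in the "Don't buy" leaf.
processor = DecisionNode(
    "Processor", lambda s: s["cores"] >= 8 and s["ghz"] >= 2.2, buy, dont_buy
)
gpu = DecisionNode("GPU", lambda s: s["gpu_gb"] >= 2, processor, dont_buy)
ram = DecisionNode("RAM", lambda s: s["ram_gb"] >= 8, gpu, dont_buy)
root = DecisionNode("Storage space", lambda s: s["storage_tb"] >= 1, ram, dont_buy)

laptop = {"storage_tb": 1, "ram_gb": 16, "gpu_gb": 4, "cores": 8, "ghz": 2.4}
print(decide(root, laptop))
```

Here `root` is the root node, `ram`, `gpu`, and `processor` are decision nodes, and `buy` and `dont_buy` are the leaves.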
Asking the Right Questions
So far, we have seen how we can build a simple decision tree and what its different elements are. But while creating a decision tree, it is crucial to ask the right questions at the right stage in the tree. That is essentially what building a decision tree, or decision tree learning, means. Asking irrelevant questions only complicates the problem.
For example, imagine that during the laptop purchase the shopkeeper had asked whether you wanted a laptop with or without a Graphics Card. Would that have narrowed down your options?
Mathematically, we have two commonly used techniques to determine the best question to ask at any stage:
- Entropy and Information Gain
- Gini Index
These techniques help us decide what, when & where to start and stop asking questions. These techniques are popularly called the splitting criteria. I’ll give a brief description of these concepts in the subsequent sections.
Entropy and Information Gain
Imagine that, in our laptop purchase case, the shopkeeper did not keep an ordered shelf. In other words, what if laptops of every quality were mixed together on a single display? Would that make your decision-making easier?
“Entropy is an indicator of how messy your data is.”
For a dataset, messiness corresponds to a mixture of available options (target variable).
In decision tree learning, our goal is to separate this mixture. Entropy lets us choose the right questions to ask in order to separate the desired outcome from all available options.
The higher the entropy of a dataset, the higher the degree of mixing; lower entropy corresponds to well-separated data.
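To make "messiness" concrete, here is a minimal sketch of the standard Shannon entropy of a list of labels, computed with base-2 logarithms. The function name `entropy` and the "buy"/"skip" labels are assumptions for illustration only.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    total = len(labels)
    # Sum -p * log2(p) over the proportion p of each class that is present.
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


mixed = ["buy", "skip", "buy", "skip"]      # evenly mixed: as messy as two classes get
separated = ["buy", "buy", "buy", "buy"]    # only one class: perfectly separated

print(entropy(mixed))      # 1.0
print(entropy(separated))  # 0.0
```

With two classes, entropy ranges from 0 (only one class present) to 1 (an even 50/50 mix).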
Once we have asked the right questions, we have narrowed down our options and know what not to choose from. In return, we have more information about where to find the answer. For example, knowing that we needed a laptop with HDD >= 1 TB established a bare minimum: we don’t want laptops that fall short of it.
This process of finding a desirable direction for our exploration is called Information Gain. Entropy helps in calculating this gain numerically. We will skip these numerical parts in the article.
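For readers who would still like to see a number, information gain is commonly computed as the entropy before a split minus the size-weighted entropy of the groups after it. The sketch below assumes that definition; the function names and the tiny "buy"/"skip" dataset are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its children."""
    total = len(parent)
    after = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - after


parent = ["buy"] * 4 + ["skip"] * 4

# A useful question sends all "buy" laptops one way and all "skip" the other.
useful = [["buy"] * 4, ["skip"] * 4]
# An irrelevant question leaves both groups exactly as mixed as before.
irrelevant = [["buy", "buy", "skip", "skip"], ["buy", "buy", "skip", "skip"]]

print(information_gain(parent, useful))      # 1.0
print(information_gain(parent, irrelevant))  # 0.0
```

A question that separates the options cleanly earns the full gain, while a question that leaves the mixture unchanged earns none.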
Gini Index
Similar to entropy, the Gini Index also helps us decide the right set of questions to ask. But instead of describing a dataset in terms of its messiness, it describes it in terms of its impurity.
“Gini Index is the measure of how impure your data is”
For a dataset, impurity corresponds to a mixture of decisions (target variable). If the dataset after a particular splitting remains mixed with all available options to choose from, we have impurity in the data. If we have reached a decision, it implies that we have data that is pure in terms of the options that we have.
While building a decision tree, we try to find out the series of questions that lead to the maximum decrease in the impurity of the dataset.
A higher Gini Index corresponds to mixed (impure) data, while a lower Gini Index corresponds to well-separated data.
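As a rough sketch of how this plays out, the code below computes the usual Gini impurity (one minus the sum of squared class proportions) and then tries every candidate question of the form "is the value at least t?" on a single feature, keeping the one that leaves the lowest weighted impurity. The function names, the RAM values, and the labels are made up for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0 for a pure group, larger for a more mixed group."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (t, weighted_gini) for the best question 'value >= t'."""
    n = len(labels)
    best_t, best_score = None, gini(labels)  # must improve on not splitting at all
    for t in sorted(set(values)):
        below = [lab for v, lab in zip(values, labels) if v < t]
        at_or_above = [lab for v, lab in zip(values, labels) if v >= t]
        if not below or not at_or_above:
            continue  # this question does not separate anything
        score = len(below) / n * gini(below) + len(at_or_above) / n * gini(at_or_above)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Hypothetical data: RAM in GB and whether the laptop was worth buying.
ram_gb = [4, 4, 8, 16, 8, 4]
labels = ["skip", "skip", "buy", "buy", "buy", "skip"]

print(gini(labels))                    # 0.5
print(best_threshold(ram_gb, labels))  # (8, 0.0)
```

On this toy data, asking "is the RAM at least 8 GB?" removes all impurity, so it is the question a tree builder using this criterion would ask first.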
We can use either the Gini Index or the Entropy to build a decision tree.
Both these concepts can be quite confusing to grasp at first. So, let us do a simple exercise to understand these better.
Suppose you have a bag of blue and yellow balls. Your target is to separate the two colors into different packs. Given below is a diagram of 3 such bags, each with a different combination of yellow and blue balls. Find out which one has:
- Maximum Entropy
- Minimum Entropy
- Maximum Gini impurity
- Minimum Gini impurity
Advantages and Disadvantages of using a Decision Tree:
Advantages:
- Decision trees can be used in both Regression and Classification settings.
- No parameters need to be known before building the model.
- Decision trees are easy to interpret.
- Decision trees are fast to build.
- No scaling of features is needed to fit a decision tree.
Disadvantages:
- There is a very high chance of overfitting if we keep on splitting.
- Each split is optimized on its own (greedily), so the finished tree is not guaranteed to be the best one overall.
- Decision trees are less suited to imbalanced datasets, where they can be biased toward the majority class.
Answers to the bag exercise above (maximum entropy, minimum entropy, maximum Gini impurity, minimum Gini impurity): A, C, A, C respectively.
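To check an answer like this numerically, one can compute both measures for each bag. The ball counts below are hypothetical, since the diagram's actual counts are not reproduced here; they were chosen only so that bag A is the most mixed and bag C the least, in line with the stated answers. The function names are likewise assumptions.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (in bits) from per-color ball counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


def gini(counts):
    """Gini impurity from per-color ball counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Hypothetical (blue, yellow) counts, NOT taken from the original diagram.
bags = {"A": (5, 5), "B": (8, 2), "C": (10, 0)}

for name, counts in bags.items():
    print(name, round(entropy(counts), 3), round(gini(counts), 3))

print("Maximum entropy:", max(bags, key=lambda b: entropy(bags[b])))
print("Minimum entropy:", min(bags, key=lambda b: entropy(bags[b])))
print("Maximum Gini impurity:", max(bags, key=lambda b: gini(bags[b])))
print("Minimum Gini impurity:", min(bags, key=lambda b: gini(bags[b])))
```

For two colors, both measures peak at an even mix and drop to zero for a single-color bag, so they rank these bags the same way.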