What is the Confusion Matrix in Machine Learning?
In any machine learning application, we always want to know how good or bad our model is. After all, evaluating a model is as important as building one. There are multiple ways to gauge a model’s performance, but the confusion matrix is a must-have for classification problems. It summarizes a classifier’s predictions against the actual labels in a tabular format, helping us understand how our model performed, where it went wrong, and how to correct its mistakes.
But wait a second! Can’t we just calculate how accurate our predictions are to know our model’s performance? I mean, what is the need for a whole matrix when a single number can evaluate my model? It’s confusing, right?
“Let’s make Confusion Matrix less confusing!”
The answer is that we often have business objectives beyond plain accuracy that we want to optimize. Learning and understanding a whole umbrella of model-evaluation techniques comes in handy in those scenarios. But what exactly is the motivation behind them?
Why an umbrella of evaluation methods?
In classification problems, accuracy is the ratio of correct predictions to the total predictions made by the model.
In some cases, accuracy is not a good evaluator of our model. Let’s take the example of email spam detection. Assume we have a dataset where 95% of the emails are regular ones and only 5% are spam. Now, suppose our model predicts every email as a regular one. In fact, it is hard even to call it a “model”, as it makes its predictions without any calculation. However, it will still be correct 95% of the time. In other words, even a meaningless model can achieve accuracy as high as 95% in this case.
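To see this concretely, here is a minimal sketch (the label counts are made up to match the 95%/5% split above) showing that a classifier that blindly predicts “regular” for every email still scores 95% accuracy:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 950 regular emails (0) and 50 spam emails (1),
# matching the 95% / 5% split described above.
y_true = np.array([0] * 950 + [1] * 50)

# A "model" that predicts every email as regular (0), without any calculation.
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.95 -- high accuracy, yet zero spam caught
```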

In other cases, accuracy alone should not be used to evaluate our model. For example, consider cancer detection, where we want to predict whether a given patient has cancer. Here, we cannot afford to misclassify a patient who has cancer as someone who doesn’t. Accuracy alone cannot capture such requirements. A confusion matrix comes in handy for understanding and handling such complicated use cases.
Understanding a confusion matrix
A confusion matrix is a tool to understand and evaluate how a model performs on a classification problem. These problems can have two or more target responses, e.g.,
- Spam or Not Spam in the case of email spam classification (2 target responses)
- Yes, No, or Maybe in the case of daily rainfall prediction (3 target responses)
Based on the number of such responses, a confusion matrix can be a 2x2 matrix (for two target responses), a 3x3 matrix (for three target responses), and so on for any higher number of responses.
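As a quick illustration, here is a minimal sketch (with made-up rainfall labels, not part of the article’s example) showing that scikit-learn’s confusion_matrix returns one row and one column per target response, so three responses give a 3x3 matrix:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical 3-class rainfall labels: "Yes", "No", "Maybe"
y_true = ["Yes", "No", "Maybe", "Yes", "No", "No",    "Maybe", "Yes"]
y_pred = ["Yes", "No", "No",    "Yes", "No", "Maybe", "Maybe", "No"]

cm = confusion_matrix(y_true, y_pred, labels=["Yes", "No", "Maybe"])
print(cm.shape)  # (3, 3) -- one row/column per target response
print(cm)        # rows are actual classes, columns are predicted classes
```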
Let’s try to understand it for the binary (binary means two) response case of email spam classification. In binary problems, we generally represent our primary (favorable) target as positive and the other (unfavorable) as negative. Here, our primary aim is to identify Spam emails, so we mark Spam as (+ve) and Not Spam as (-ve). The confusion matrix then represents the four different combinations of predicted and actual values. As an example, assume that a total of 300 emails were used to evaluate a model, each hand-labeled as either Spam or Not Spam.
|                        | Predicted: Spam (+ve) | Predicted: Not Spam (-ve) |
|------------------------|-----------------------|---------------------------|
| Actual: Spam (+ve)     | 30                    | 8                         |
| Actual: Not Spam (-ve) | 12                    | 250                       |
Notice how each cell of the matrix quantifies how our model performed on the data we have. A general confusion matrix for binary classification looks like this:
|                  | Predicted Positive  | Predicted Negative  |
|------------------|---------------------|---------------------|
| Actual Positive  | True Positive (TP)  | False Negative (FN) |
| Actual Negative  | False Positive (FP) | True Negative (TN)  |
Just remember that, when naming each cell, we write the correctness of the prediction first (True or False) and the predicted class (Positive or Negative) second.
Things to note here:
- Total predictions made = TP + FP + FN + TN
- Total correct predictions = TP + TN (True cells)
- Total incorrect predictions = FP + FN (False cells)
“False Positive is known as type I error, and False Negative is known as type II error.”
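As an aside, here is a minimal sketch of how this matrix can be computed with scikit-learn. The label arrays are constructed to match the example’s cell counts (TP = 30, FN = 8, FP = 12, TN = 250), with 1 standing for Spam and 0 for Not Spam:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Actual labels: 38 spam emails followed by 262 non-spam emails.
y_true = np.array([1] * 38 + [0] * 262)
# Predictions: 30 spam caught (TP), 8 spam missed (FN),
# 12 false alarms (FP), 250 correct non-spam (TN).
y_pred = np.array([1] * 30 + [0] * 8 + [1] * 12 + [0] * 250)

# With labels=[0, 1], ravel() returns the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(tp, fp, fn, tn)     # 30 12 8 250
print(tp + fp + fn + tn)  # 300 total predictions
```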
Now, we will see how we can use this matrix to evaluate our model.
Evaluating a model using Confusion Matrix
In the above example, a total of 300 emails were used to evaluate the model. Let us now see what metrics are generally used to evaluate our models using this matrix. We will later do some elementary calculations to understand this better.
Accuracy:
It tells us how much of the data we predicted correctly. It is the ratio of correct predictions to the total predictions made: Accuracy = (TP + TN) / (TP + FP + FN + TN).
Error Rate:
It signifies the proportion of erroneous predictions a model makes. It is the ratio of wrong predictions to the total predictions made: Error Rate = (FP + FN) / (TP + FP + FN + TN), which is simply 1 - Accuracy.
Often, we need metrics that focus on the predictions for our primary target class only. Precision, Recall, and the F1 score do precisely that.
Precision:
It signifies how many of the total primary target (positive) predictions are correct. It is the ratio of true positive predictions to the total positive predictions made: Precision = TP / (TP + FP).
Recall:
It signifies how many of the actual primary target (positive) samples were predicted as the primary target. It is the ratio of true positive predictions to the total positive samples in the data: Recall = TP / (TP + FN).
F1 score:
In many real-life cases, especially with imbalanced data, we can get high accuracy while having low Precision or Recall. In these cases, F1 is a go-to metric. It is the harmonic mean of Precision and Recall: F1 = 2 * Precision * Recall / (Precision + Recall).
Specificity and Sensitivity:
Sensitivity is the same as Recall, which is defined only with respect to the positive responses.
Specificity, on the other hand, does the same for the negative responses. It measures the proportion of actual negative responses that are correctly predicted as negative: Specificity = TN / (TN + FP).
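Putting the definitions above together, here is a minimal sketch of these metrics written as plain Python functions of the four cell counts (the function names are purely illustrative):

```python
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def error_rate(tp, fp, fn, tn):
    return (fp + fn) / (tp + fp + fn + tn)

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):        # also called Sensitivity
    return tp / (tp + fn)

def f1(tp, fp, fn):        # harmonic mean of Precision and Recall
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

def specificity(tn, fp):
    return tn / (tn + fp)
```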
In terms of the matrix, Sensitivity is computed only from the Actual Positive row (TP and FN), while Specificity is computed only from the Actual Negative row (TN and FP).
Let us now do some calculations on the email spam case as described above.
- Accuracy = (30+250)/300 = 0.933
- Error Rate = (12+8)/300 = 0.067
- Precision = 30/(30+12) = 0.714
- Recall (Sensitivity) = 30/(30+8) = 0.789
- F1 score = 2*0.714*0.789/(0.714+0.789) ≈ 0.75
- Specificity = 250/(250+12) = 0.954
From the above calculations, we see that the F1 score provides a balance between Precision and Recall. It also gives a good estimate of the model’s performance on our target response (detecting Spam emails), despite the model’s accuracy being ~93%. High Specificity and comparatively lower Recall mean that our model does well at recognizing emails that are not spam but still misses a noticeable share of the actual spam emails.
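For completeness, here is a minimal sketch showing that scikit-learn’s built-in metric functions reproduce the same numbers, using label arrays that match the example’s cell counts (1 = Spam, 0 = Not Spam):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Label arrays matching the example: TP=30, FN=8, FP=12, TN=250.
y_true = np.array([1] * 38 + [0] * 262)
y_pred = np.array([1] * 30 + [0] * 8 + [1] * 12 + [0] * 250)

print(accuracy_score(y_true, y_pred))                # ~0.933
print(precision_score(y_true, y_pred, pos_label=1))  # ~0.714
print(recall_score(y_true, y_pred, pos_label=1))     # ~0.789
print(f1_score(y_true, y_pred, pos_label=1))         # ~0.75
# Specificity is simply Recall computed with the negative class as the target.
print(recall_score(y_true, y_pred, pos_label=0))     # ~0.954
```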
Conclusion:
Dealing with a machine learning problem can sometimes be a tedious task, and evaluating the resulting model with a proper metric is just as important. That choice does not come naturally in many cases. Hence, knowing the confusion matrix can be useful for identifying and targeting the specific problem at hand.