K-Nearest Neighbors Algorithm
K-Nearest Neighbors (kNN) is a simple, easy-to-implement, yet effective supervised machine learning algorithm. It can be used for both classification (This or That) and regression (How much of This or That) problems. It finds extensive application in real-life scenarios such as pattern recognition, data mining, and predicting loan defaults. It predicts the response for new data (testing data) based on its similarity to known data (training) samples. It assumes that data with similar traits sit together.
“Birds of a feather flock together.”
Consider the following examples as business applications of kNN:
- Loan Default: We can check whether a new customer is similar to known defaulters. (Defaulter or Not)
- Email spam: We can find the similarity between a new email and previous spam emails. (Spam or Not)
- Credit Rating: kNN can help calculate an individual’s credit score by comparing them with people who have similar traits.
- Insurance: By finding the similarity with other events, the premium charged for an insurance policy can be predicted.
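The regression case (How much) works the same way as classification, except that the answer is a number taken from the neighbors, commonly their average. Below is a minimal sketch of that idea, assuming Euclidean distance and a plain mean. The function `knn_regress`, the feature choice (driver age, past claims), and every number in `history` are hypothetical and invented for illustration only.

```python
from math import dist

# Hypothetical records: (driver age, past claims) -> yearly premium.
# All values are invented for illustration.
history = [
    ((25, 2), 900.0),
    ((30, 1), 700.0),
    ((45, 0), 400.0),
    ((50, 0), 380.0),
    ((60, 1), 500.0),
]

def knn_regress(train, query, k):
    """Average the targets of the k training points closest to query."""
    # Sort known samples by Euclidean distance to the new sample.
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    # The prediction is the mean target value of those k neighbors.
    return sum(target for _, target in neighbors) / k

estimate = knn_regress(history, (28, 1), k=2)
print(estimate)
```

With these invented records, the two closest customers are the 30-year-old and the 25-year-old, so the estimate is the mean of their premiums.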
Imagine you own a mobile phone company named ABC. Each of your phones has a different combination of Battery and RAM capacity. Based on these features, your marketing team launches each phone in either the Expensive or the Cheap market segment.
You want to launch a new phone with 1744 MB RAM capacity and 819 mAh Battery capacity. You turn to your marketing team to find out which market segment to target. Below is the catalog you carried and the conversation that followed.
[Table: phone catalog. Only the column header "RAM Capacity (MB)" survives.]
You: “We want to launch a new phone in the market with RAM capacity 1744 MB and Battery as 819 mAh.”
Mktg: “Okay. We can price this phone based on its similarity to other phones.”
You: “Seems a good strategy.”
Mktg: “What are the most similar phones in RAM configuration?”
You: “Model ABC008 and ABC006 are the most similar ones from the catalog.”
Mktg: “What are their price segments?”
You: “Both belong to the Cheap market segment.”
Mktg: “Umm. Okay! What about the Battery?”
You: “Model ABC006 and ABC005 are the most similar ones. Again, both are Cheap models.”
Mktg: “Well, since the phones most similar to our new phone are in the Cheap range, we should launch this phone in the Cheap market segment.”
What your marketing team just did was to find you an appropriate price range for your new phone based on overall similarity with other phones.
The above example provides the basic principle of k Nearest Neighbors (kNN) based classification. It lets you decide a response (This or That) for a new data sample (testing data) based upon the gathered evidence (training data).
Why k and Neighbors?
When we plot known data samples on a graph, data with similar traits usually lie near each other. Hence, the known data points most similar to the new data are called its neighbors. The number of similar known data samples that we consult when making a decision is k. It is a positive integer, and it is up to the user to choose it.
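The idea of neighbors and k can be sketched in a few lines of Python. This is a minimal illustration, assuming Euclidean distance as the measure of similarity and a simple majority vote. The function `knn_classify` and the six-phone `catalog` are hypothetical: the values are invented here and are not the catalog from the conversation above. Only the new phone's figures (1744 MB, 819 mAh) come from the example.

```python
from collections import Counter
from math import dist

# Hypothetical catalog: (RAM in MB, battery in mAh) -> market segment.
# These values are invented for illustration, not the article's catalog.
catalog = [
    ((1500, 800), "Cheap"),
    ((1800, 900), "Cheap"),
    ((2000, 1000), "Cheap"),
    ((3500, 1800), "Expensive"),
    ((3800, 1900), "Expensive"),
    ((4000, 2000), "Expensive"),
]

def knn_classify(train, query, k):
    """Return the majority label among the k training points closest to query."""
    # Sort the known samples by Euclidean distance to the new sample.
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    # Vote: the most common label among the k neighbors wins.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

new_phone = (1744, 819)  # RAM and battery of the new phone from the example
segment = knn_classify(catalog, new_phone, k=2)
print(segment)
```

Unlike the conversation, which compares RAM and Battery one at a time, this sketch measures similarity on both features together. In practice the features are often rescaled first so that the one with the larger numeric range does not dominate the distance.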
For example, let us look at the graphical representation of the mobile pricing case.
Two things can be noted here:
- We tagged the new phone as Cheap because it was closer to the cheaper ones than the expensive ones.
- We arbitrarily chose k=2 by looking at only the two most similar phones. This method is thus called a 2-Nearest Neighbors algorithm.
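Because k is chosen by the user, it is worth seeing that the answer can depend on it. The sketch below, again assuming Euclidean distance and majority voting, classifies the same new sample with k = 1, 3, and 5. The data in `phones`, the `query` point, and the helper `knn_classify` are hypothetical and were constructed so that the prediction changes with k.

```python
from collections import Counter
from math import dist

# Hypothetical samples: (RAM in MB, battery in mAh) -> segment.
# Invented so that the prediction changes with k.
phones = [
    ((2700, 1350), "Expensive"),
    ((2300, 1150), "Cheap"),
    ((2200, 1100), "Cheap"),
    ((3800, 1900), "Expensive"),
    ((4000, 2000), "Expensive"),
]

def knn_classify(train, query, k):
    """Majority label among the k nearest training points."""
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

query = (2600, 1300)
# Classify the same point with several odd values of k.
results = {k: knn_classify(phones, query, k) for k in (1, 3, 5)}
print(results)
```

With this invented data the single nearest phone is Expensive, the three nearest are mostly Cheap, and all five together are mostly Expensive. Odd values of k are used here because, with two classes, an even k can produce a tied vote.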
Pictorially the process of using a kNN algorithm can be depicted with a simple animation as below.
Advantages of kNN:
- Simple and easy to understand
- No statistical assumptions regarding the data need to be satisfied
- Can be robust to noisy training samples, provided k is large enough to average out the noise
- Few settings to tune: mainly the choice of k
Drawbacks of kNN:
- Computationally expensive, because the similarity between the new sample and the training samples has to be calculated
- Lazy learner, i.e., it uses all training samples at prediction time and is therefore slow for large datasets
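The two drawbacks above can be made concrete with a small sketch of a brute-force kNN, assuming Euclidean distance. The class `LazyKNN`, its `distance_calls` counter, and the synthetic data are hypothetical and exist only to show where the work happens: nothing is computed when the data is stored, and every prediction measures the distance to every stored sample.

```python
from collections import Counter
from math import dist

class LazyKNN:
    """Brute-force kNN that counts how many distances it computes."""

    def __init__(self, k):
        self.k = k
        self.samples = []
        self.distance_calls = 0

    def fit(self, points, labels):
        # "Training" only stores the data; no model is built.
        self.samples = list(zip(points, labels))
        return self

    def predict(self, query):
        scored = []
        for point, label in self.samples:
            # One distance computation per stored sample, for every query.
            self.distance_calls += 1
            scored.append((dist(point, query), label))
        scored.sort(key=lambda pair: pair[0])
        votes = Counter(label for _, label in scored[: self.k])
        return votes.most_common(1)[0][0]

# Synthetic data, invented for illustration: 1000 points on a diagonal.
points = [(i, i) for i in range(1000)]
labels = ["low" if i < 500 else "high" for i in range(1000)]

model = LazyKNN(k=3).fit(points, labels)
calls_after_fit = model.distance_calls        # no work done yet
prediction = model.predict((10.2, 9.9))
calls_after_one_query = model.distance_calls  # one distance per stored sample
print(calls_after_fit, calls_after_one_query, prediction)
```

The cost of each prediction grows with the number of stored samples, which is why this brute-force form becomes slow on large datasets.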