# Learn Support Vector Machine using Excel – Machine Learning Algorithm

A beginner's guide to one of the most well-known and well-understood algorithms in statistics and machine learning. In this post, you will discover the Support Vector Machine algorithm, how it works using Excel, its applications, and its pros and cons.

## Quick facts about Support Vector Machine Algorithm

- Basic idea of support vector machines: like single-layer or multi-layer neural networks, they learn a decision boundary for pattern recognition
- SVM finds the optimal hyperplane for linearly separable patterns
- It extends to patterns that are not linearly separable by transforming the original data into a new space via a kernel function
- Support vectors are the data points that lie closest to the decision surface (or hyperplane)
- They are the data points most difficult to classify
- They have a direct bearing on the optimum location of the decision surface
- The optimal hyperplane stems from the function class with the lowest "capacity", i.e. the smallest number of independent features/parameters we can twiddle

## What is a Support Vector Machine Algorithm?

Support Vector Machine is a supervised machine learning method that can be used to solve both regression and classification problems. It is generally used as a classifier, so we will discuss SVM as a classifier. Unlike other machines it doesn't have gears, valves, or electronic parts; nevertheless, it does what any machine does: take input, manipulate it, and produce output. To be exact, given labeled training data, SVM outputs an optimal hyperplane which categorizes new examples.

## What’s in the name?

Fig 1: Possible planes that separates two classes

Fig 2: Optimal plane that separates two classes

When *Vladimir N. Vapnik* and *Alexey Ya. Chervonenkis* invented SVM, they didn't name it by consulting an astrologer, nor did the name come out of the blue. They derived it from the algorithm itself.

The main philosophy behind SVM is finding the optimal plane by **maximizing the margin**, the distance between the nearest data points of the two classes. In SVM, each datum is represented as a vector, and the data points of either class that lie nearest to the plane are called **support vectors** (in Fig 2, the two filled rectangles and the filled circle are support vectors). They are called so because they effectively support the optimal plane.

To elaborate, let's observe the two graphs in **Fig 1** and **Fig 2**. In Fig 1, we see that there are many possible planes that can separate our data into two classes. But these planes are not optimal and don't work well for all examples. So we find the optimal plane by maximizing the margin. The margin is maximized when its width equals the distance between the support vectors, and when the margin is maximized we also get the plane that best divides the classes into two groups. In other words, those vectors are like pillars that hold up the optimal plane; the optimal plane stands on their support, and it isn't affected by any other vectors except the support vectors. To conclude, the whole approach is named after these vectors, which seems quite fair given how influential a role they play.

## Dismantling Support Vector Machine Algorithm

To understand the working mechanism of SVM we are going to disassemble it and see how the different parts work. We will use Fig 1 as our guiding diagram. w.x + b = 0 in the figure is the equation of a straight line (in higher dimensions, the equation of a plane, as written in the figure), where **w** is the vector normal to the line and **b** is the intercept term. The other two lines pass through the support vectors and support the optimal plane:

w.x + b = -1

w.x + b = +1

Or, equivalently:

w.x + (b + 1) = 0

w.x + (b - 1) = 0

Our mission in SVM is to maximize the distance between these two lines (planes). Before doing that, we need the distance between them, and for that we use the equations above. The distance between two parallel lines w.x + b_1 = 0 and w.x + b_2 = 0 is:

D = |b_1 - b_2| / ||w||, where here b_1 = b + 1 and b_2 = b - 1, so D = 2 / ||w||

From the equation, we see that the term |b_1 - b_2| is constant (it equals 2) and the distance depends only on ||w||. We also see that when ||w|| is at its minimum, D has its maximum value (they are inversely proportional!). So we will minimize ||w|| using Lagrange multipliers.
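The distance formula can be checked numerically. A minimal sketch, with a made-up weight vector chosen for illustration:

```python
import math

# Distance between the parallel lines w.x + b1 = 0 and w.x + b2 = 0 is
# |b1 - b2| / ||w||.  With b1 = b + 1 and b2 = b - 1, the margin is 2/||w||.
w = [3.0, 4.0]               # hypothetical weight vector, ||w|| = 5
b1, b2 = 1.0, -1.0           # b + 1 and b - 1 with b = 0

norm_w = math.sqrt(sum(wi * wi for wi in w))
D = abs(b1 - b2) / norm_w    # 2 / 5 = 0.4
print(D)
```

Halving ||w|| doubles D, which is exactly why minimizing ||w|| maximizes the margin.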

Besides that, we also need to impose some constraints so that new instances are correctly predicted. In our case:

w.x + b ≤ -1 if y = -1

w.x + b ≥ +1 if y = +1

Or, equivalently:

y(w.x + b) ≥ 1

This constraint determines where a new example falls on the graph: whether it lands above or below the plane decides its class. To sum up:

We need to minimize ||w||^2, subject to y(w.x + b) ≥ 1. Then,

for a given new instance x, the classifier is f(x) = sign(w.x + b)
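In practice this constrained minimization is solved with Lagrange multipliers or quadratic programming. As a rough stand-in, here is a tiny sub-gradient-descent trainer on the hinge loss with L2 regularization; the toy data, learning rate, and iteration count are all invented for illustration, and this is a sketch rather than the exact procedure used in the spreadsheet later on:

```python
import random

# Two made-up, linearly separable classes in 2-D.
data = [([1.0, 2.0], -1), ([2.0, 1.0], -1), ([4.0, 5.0], 1), ([5.0, 4.0], 1)]
w, b = [0.0, 0.0], 0.0
lr, lam = 0.1, 0.01          # learning rate and regularization strength

random.seed(0)
for epoch in range(200):
    random.shuffle(data)
    for x, y in data:
        margin = y * (w[0] * x[0] + w[1] * x[1] + b)
        if margin < 1:       # point violates y(w.x + b) >= 1: push w toward it
            w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
            b += lr * y
        else:                # constraint satisfied: only shrink w (regularize)
            w = [wi * (1 - lr * lam) for wi in w]

def f(x):
    """Classifier f(x) = sign(w.x + b)."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else -1

print([f(x) == y for x, y in data])
```

After training, every training point sits on the correct side of the learned plane.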

## Putting everything together

Suppose we have two classes of fruits: '**Apple**' and '**Orange**'. What we do is take numerous examples of those classes and use them to train our SVM algorithm. When training finishes, we get an optimal plane that separates the examples into exactly two classes. Now we can use the model to predict the class of any unknown example. We start with a fruit '**F**'. We know it is either an **Apple** or an **Orange**, but not which.

To find the class of our example **F**, we feed it to the equation f(x) = sign(w.x + b). Here **x** in the equation will now be **F**. Now,

If f(x) is negative, then:

**F** is an **Apple**

If f(x) is positive, then:

**F** is an **Orange**

## Kernel Trick

Not all data are linearly separable in nature; some data are nonlinear. What are we supposed to do then? Should we surrender when such nonlinear data challenge us? Of course not. We have our rescuer – the Kernel Trick.

The kernel trick is built on a mapping function that transforms the given space into another space of higher dimension; the kernel itself lets us work with inner products in that higher-dimensional space without ever constructing it explicitly. The kernel trick is all about projecting your data into a higher dimension using such a function: data that were linearly inseparable in the lower dimension become separable in the higher dimension, as shown in the figure.
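The idea is easiest to see with an explicit mapping rather than a full kernel. A small sketch with made-up 1-D points, where the class -1 points sit between the class +1 points so no single threshold on x can separate them, but the mapping x → (x, x²) makes them separable:

```python
# Made-up 1-D data: (value, label).  Class -1 lies between the class +1 points.
points = [(-3, 1), (-1, -1), (1, -1), (3, 1)]

def phi(x):
    """Explicit map into 2-D: (x, x**2)."""
    return (x, x * x)

# In the lifted space the second coordinate alone separates the classes:
# x**2 is 1 for class -1 and 9 for class +1, so the plane x2 = 5 works.
preds = [1 if phi(x)[1] > 5 else -1 for x, _ in points]
print(preds)
```

A real kernel (e.g. the RBF kernel) lets SVM work with such lifted spaces implicitly, without ever computing phi.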

## Multiple Classes!!! What to do now?

Everything we did up to this point enables only binary classification: we can separate just two classes. But real-life problems involve more than two classes. How do we tackle this situation? Should we dump SVM and use some other method? Fortunately not. We have two techniques that come in handy in these situations: **One versus One** classification and **One versus All** classification.

In **one versus one** classification, if we have n classes we train **n(n-1)/2** binary classifiers, one for each pair of classes. For example, with 4 classes we train 6 classifiers. At prediction time we feed the new instance to all the classifiers and apply a voting scheme: each time the instance is classified into a class, that class gets one upvote, and the class with the most upvotes wins, meaning the instance is assigned to the winning class.
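The voting scheme can be sketched as follows; the per-pair classifiers here are hypothetical nearest-centroid stubs, used only to make the counting concrete:

```python
from itertools import combinations
from collections import Counter

# Made-up class centroids standing in for trained pairwise SVMs.
centroids = {"A": (0.0, 0.0), "B": (5.0, 0.0), "C": (0.0, 5.0), "D": (5.0, 5.0)}

def pairwise_classifier(a, b, x):
    """Stub binary classifier: pick the nearer of the two class centroids."""
    da = sum((xi - ci) ** 2 for xi, ci in zip(x, centroids[a]))
    db = sum((xi - ci) ** 2 for xi, ci in zip(x, centroids[b]))
    return a if da < db else b

classes = sorted(centroids)
pairs = list(combinations(classes, 2))   # n(n-1)/2 = 6 classifiers for n = 4

x_new = (4.6, 4.4)
votes = Counter(pairwise_classifier(a, b, x_new) for a, b in pairs)
winner = votes.most_common(1)[0][0]      # class with the most upvotes
print(winner)
```

With 4 classes there are exactly 6 pairwise classifiers, and the new instance goes to whichever class collects the most votes.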

In **one versus all** classification, if we have **n** classes, we train n binary classifiers, each one trained to separate its class from all the others. When we feed in a new instance, we look at the confidence score of each decision rather than just the class label, and classify the instance into the class with the highest confidence score.
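A matching sketch for one versus all, with invented linear scorers whose raw value w.x + b serves as the confidence score:

```python
# One hypothetical linear scorer (w, b) per class; the raw value w.x + b
# acts as the confidence score, and the highest score wins.
scorers = {
    "A": ((1.0, 0.0), -2.0),
    "B": ((0.0, 1.0), -2.0),
    "C": ((-1.0, -1.0), 3.0),
}

def predict(x):
    scores = {c: w[0] * x[0] + w[1] * x[1] + b for c, (w, b) in scorers.items()}
    return max(scores, key=scores.get)   # class with the highest confidence

print(predict((4.0, 0.5)))
```

Unlike one versus one, only n classifiers are trained here, but each one must learn to tell its class apart from everything else at once.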

## Did you know that?

Support Vector Machine can be applied to solve various kinds of classification problems. One of the simplest applications is classifying a tumor as malignant (harmful) or benign (nonthreatening). We can use different features of the cell nucleus, such as radius, texture, and smoothness. Then we collect sample data (as much as possible); each record contains values for the different features, plus a label (the type of tumor).

For example, a cell nucleus with radius **x**, texture **y**, and smoothness **z** might be labeled malignant, while the next nucleus with radius **a**, texture **b**, and smoothness **c** might be labeled benign. These values and labels are based on real research. We then use the data to train our classification model. Later, when we have a cell nucleus and we don't know whether it is malignant or benign, we can use the trained SVM, which will assign a label to the nucleus according to its features.

## Implementation in excel

Fig: Table of two classes Non-technical (-1) and technical (1)

Fig: Scatter Plot of two classes Non-technical (-1) and technical (1)

Excel can be used to implement SVM. To begin, we need labeled data for training. In our case, we create hypothetical data with two labels: non-technical **(-1)** and technical **(1)**. Here a technical article is one relating to technology and science, and non-technical refers to works of fiction. The data have two features: the average time required to read an article, i.e. **Time**, and the number of sentences in the article, i.e. **Sentences** (2.2 represents 2.2k sentences, or 2200 sentences).

Training is the process of finding the optimal values of the coefficients (in our case B1 and B2). In this example, we iterated only 20 times. To get a more accurate model, we would iterate until B1 and B2 stabilize, or, put differently, until the accuracy stops changing. We then use the coefficients to predict the class of a new sample.

We get, after 20 iterations,

B1 = 0.19047619

B2 = -0.16931217

Now, for **Time** = 1 hour and **Sentences** = 2.4 thousand sentences, the decision function is

f(x) = B1*Time + B2*Sentences

(Here we aren't using an intercept, for simplicity.)

Or, f(x) = 0.19047619*1 - 0.16931217*2.4

f(x) = -0.21587302

As we have discussed earlier, we check the sign of the output of the function (f(x)) to classify the sample. As the output has a negative sign, we can safely classify it into a class **Non-technical article**.
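The spreadsheet's final step can be reproduced in a few lines. Note that with these coefficients the product works out to about -0.2159; the sign, which is all the classifier uses, is negative either way:

```python
# Coefficients from the 20 training iterations above; no intercept is used.
B1, B2 = 0.19047619, -0.16931217
time_hours, sentences_k = 1.0, 2.4      # the new article to classify

f = B1 * time_hours + B2 * sentences_k  # decision function f(x)
label = "Technical" if f > 0 else "Non-technical"
print(round(f, 8), label)
```

Because f is negative, the article lands in the Non-technical class, matching the conclusion above.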

## Pros and Cons of Support Vector Machine Algorithm

**Pros**

SVM offers several benefits to its user. It provides a clear margin of separation and works well for both linearly separable and inseparable data. Besides that, it is quite efficient when data are high-dimensional, that is, when we have a large number of features. It is especially effective for classification problems where the number of dimensions exceeds the number of samples. This means that if you have few data points but a very large number of features, you can rely on SVM.

**Cons**

However, SVM doesn't perform well on large datasets, because training time grows quickly with the number of samples. Similarly, its performance degrades when the dataset is noisy: if the data are not well separated and the target classes overlap, SVM performs poorly.

**You Learn Best By Implementing Algorithms From Scratch in Excel**

**No Fancy Math**