# List of Machine Learning Algorithms

- Machine Learning Algorithms
- Supervised Learning
  - Decision Trees
  - Naive Bayes Classification
  - Support vector machines for classification problems (SVM)
  - Random forest for classification and regression problems
  - Linear regression for regression problems
  - Ordinary Least Squares Regression
  - Logistic Regression
  - Ensemble Methods
- Unsupervised Learning
  - K-means clustering algorithm
  - Apriori algorithm for association rule learning problems
  - Principal Component Analysis
  - Singular Value Decomposition
  - Independent Component Analysis
- Reinforcement or Semi-Supervised Machine Learning
- Must have Machine learning algorithm cheat sheet

## Machine Learning Algorithms:

There are many machine learning algorithms, and each has its own rules for how and when it should be used. Learning about the algorithms in this list also teaches you more about AI and about designing machine learning systems.

**The Machine Learning Algorithm list includes:**

- Linear Regression
- Logistic Regression
- Support Vector Machines
- Random Forest
- Naïve Bayes Classification
- Ordinary Least Squares Regression
- K-means
- Ensemble Methods
- Apriori Algorithm
- Principal Component Analysis
- Singular Value Decomposition
- Reinforcement or Semi-Supervised Machine Learning
- Independent Component Analysis

These are among the most important algorithms in machine learning. If you know them well, you can apply them to almost any data problem. Data scientists and machine learning enthusiasts use them to build working machine learning projects. The algorithms fall into three types, or categories, of machine learning technique.

### The three categories of these Machine Learning algorithms are:

The three categories are supervised learning, unsupervised learning, and reinforcement or semi-supervised learning. To choose well, you need to understand each algorithm, so that you can pick the one that matches your problem and learning requirements.

## Supervised Learning

Most practical machine learning uses supervised learning. It is called supervised learning because the algorithm learns from a training dataset, and the process can be thought of as a teacher supervising the learning. The correct answers are already known. The algorithm repeatedly makes predictions on the training data and is corrected by the teacher. Learning stops when the algorithm achieves an acceptable level of performance.

Supervised learning problems can be further grouped into two types: classification and regression problems.

**Classification:** A classification problem is one where the output variable is a category, such as “red” or “blue”, or “disease” and “no disease”.

**Regression:** A regression problem is one where the output variable is a real value, such as “dollars” or “weight”.

Some common types of problems are built on top of classification and regression; these include recommendation and time-series prediction, respectively.

**Some popular supervised machine learning algorithms are:**

**Fig:** How supervised machine learning works. (Image source: Boozallen.com)

## Decision Trees

As the name suggests, a decision tree **helps you make a decision about a data item**. For instance, if you are a banker, you can use a decision tree to decide whether to give a person a loan on the basis of their age, occupation and education level. To use a decision tree, *we start at the root node, answer a particular question at each **node**, and take the branch that corresponds to the answer*. In this way we traverse from the root node to a leaf and form a conclusion about the data item. Consider the example decision tree below.

**Fig:** A tree showing the survival of passengers on the Titanic (“SIBSP” is the number of spouses or siblings aboard). (Source: Wikipedia).
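The root-to-leaf traversal described above can be sketched in a few lines of Python. This is only an illustration for the banker example: the questions, thresholds, and the names `TREE` and `decide` are invented here and do not come from any real lending policy or from a learned tree.

```python
# Hypothetical loan-approval tree. Each internal node holds a yes/no question;
# each leaf is a decision string. All thresholds are made up for illustration.
TREE = {
    "question": lambda p: p["age"] >= 25,
    "yes": {
        "question": lambda p: p["occupation"] == "employed",
        "yes": "approve",
        "no": {
            "question": lambda p: p["education"] == "graduate",
            "yes": "approve",
            "no": "reject",
        },
    },
    "no": "reject",
}


def decide(node, person):
    """Walk from the root to a leaf, answering one question per node."""
    while isinstance(node, dict):  # internal node: ask its question
        branch = "yes" if node["question"](person) else "no"
        node = node[branch]        # follow the branch matching the answer
    return node                    # leaf: the decision for this data item


applicant = {"age": 30, "occupation": "student", "education": "graduate"}
print(decide(TREE, applicant))
```

In practice the tree is learned from labeled data instead of being written by hand; the traversal at prediction time works the same way.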

## Naive Bayes Classification

Naive Bayes is a classification technique based on Bayes’ Theorem, with an assumption of independence among the predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

For example, a fruit may be considered an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or on the existence of other features, all of these properties are treated as contributing independently to the probability that the fruit is an apple, which is why the method is called “naive”.

A Naive Bayes model is easy to build and is particularly useful for very large datasets. Along with its simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods on some problems.
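To make the independence assumption concrete, here is a minimal sketch of a categorical Naive Bayes classifier. The tiny fruit dataset, the function names, and the use of add-one (Laplace) smoothing are assumptions made for this example, not something prescribed by the text above.

```python
from collections import Counter, defaultdict

# Toy, made-up training data: (color, shape) -> label.
DATA = [
    (("red", "round"), "apple"),
    (("red", "round"), "apple"),
    (("green", "round"), "apple"),
    (("yellow", "long"), "banana"),
    (("yellow", "long"), "banana"),
    (("green", "long"), "banana"),
]


def train(data):
    """Count class frequencies and per-class frequencies of each feature value."""
    class_counts = Counter(label for _, label in data)
    # value_counts[label][feature_index][value] = number of occurrences
    value_counts = defaultdict(lambda: defaultdict(Counter))
    vocab = defaultdict(set)  # distinct values seen for each feature index
    for features, label in data:
        for i, value in enumerate(features):
            value_counts[label][i][value] += 1
            vocab[i].add(value)
    return class_counts, value_counts, vocab


def predict(model, features):
    """Return the class maximizing P(class) * product of P(feature | class)."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior P(class)
        for i, value in enumerate(features):
            # Each feature contributes independently (the "naive" part).
            # Add-one smoothing avoids zero probabilities for unseen values.
            score *= (value_counts[label][i][value] + 1) / (n + len(vocab[i]))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(DATA)
print(predict(model, ("red", "round")))
```

Because training only needs counts, the model can be built in a single pass over the data, which is one reason the method scales to large datasets.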

## Support vector machines for classification problems (SVM)

A Support Vector Machine (SVM) is a supervised machine learning method that can be used to solve both regression and classification problems. It is most commonly used as a classifier, so here we discuss SVM as a classifier. Unlike other machines, it has no gears, valves, or electronic parts; nevertheless, it does what normal machines do: it takes an input, manipulates it, and provides an output.

*To be precise, given labeled training data, an SVM outputs an optimal hyperplane, which then categorizes new examples.*
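The sketch below shows only the last step, how a hyperplane categorizes new examples. It does not train an SVM: the weight vector `w` and offset `b` are hand-picked, hypothetical values standing in for whatever a real training procedure would produce.

```python
def svm_decision(w, b, x):
    """Signed score w.x + b; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def svm_classify(w, b, x):
    """Map the signed score to a class label, +1 or -1."""
    return 1 if svm_decision(w, b, x) >= 0 else -1


# Hypothetical hyperplane x1 + x2 = 5 (not learned from data).
w, b = [1.0, 1.0], -5.0
print(svm_classify(w, b, [4.0, 3.0]))  # point above the line
print(svm_classify(w, b, [1.0, 2.0]))  # point below the line
```

Training an SVM amounts to choosing `w` and `b` so that the hyperplane separates the classes with the largest possible margin.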

## Random forest for classification and regression problems

You have probably already guessed the answer, having learned about decision trees. Just as a forest is a collection of trees, a random forest is a collection of decision trees. Decision trees that are grown very deep often overfit the training data, so they show high variance: even a small change in the input data can change the result considerably.

Such trees are sensitive to the specific data on which they are trained, which makes them error-prone on test datasets. The random forest algorithm grows many such decision trees and outputs the mode of their predictions (for classification) or the average (for regression). This reduces the variance. The different trees are trained on different parts of the training dataset. To classify a new object from an input vector, pass the input vector down each of the trees in the forest. Each tree gives a classification, and the forest chooses the classification having the most votes (or, for regression, the average over all the trees in the forest).
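The voting and averaging step can be sketched as follows. The three "trees" here are hypothetical hand-written stand-ins (simple functions), not trees grown from data; the point is only to show how the forest combines their outputs.

```python
from collections import Counter
from statistics import mean


def forest_classify(trees, x):
    """Classification: every tree votes, the most common label wins (the mode)."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


def forest_regress(trees, x):
    """Regression: the forest returns the average of the tree outputs."""
    return mean(tree(x) for tree in trees)


# Hypothetical stand-in classification trees over made-up email features.
class_trees = [
    lambda e: "spam" if e["links"] > 2 else "ham",
    lambda e: "spam" if e["caps_ratio"] > 0.5 else "ham",
    lambda e: "spam" if e["links"] > 5 else "ham",
]
email = {"links": 4, "caps_ratio": 0.7}
print(forest_classify(class_trees, email))

# Hypothetical stand-in regression trees that ignore their input.
reg_trees = [lambda x: 10.0, lambda x: 12.0, lambda x: 14.0]
print(forest_regress(reg_trees, None))
```

Using an odd number of trees for a two-class problem avoids tied votes in this simple scheme.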

## Linear regression for regression problems

As the name indicates, linear regression is an approach for modeling the relationship between a dependent variable ‘y’ and one or more independent variables, denoted ‘x’, expressed in a linear form. The word “linear” indicates that the dependent variable changes at a constant rate with the independent variables: if x is increased or decreased, then y also changes linearly. Mathematically, the relationship is expressed in its simplest form as:

**y = Ax + B**

Here A and B are constants. The goal of supervised learning using linear regression is to find the values of the constants ‘A’ and ‘B’ from the dataset. These values can then be used to predict the value of ‘y’ for any future value of ‘x’. When there is a single independent variable, this is termed simple linear regression; when there is more than one independent variable, it is called multiple linear regression.
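As a small illustration of finding ‘A’ and ‘B’ from data, the sketch below fits a simple linear regression with the standard least-squares formulas. The function name `fit_line` and the sample data are made up for this example.

```python
def fit_line(xs, ys):
    """Least-squares estimates of A (slope) and B (intercept) for y = A*x + B."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of x and y, and variation of x, around their means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Made-up data generated from y = 2x + 1, so the fit should recover A=2, B=1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
a, b = fit_line(xs, ys)
print(a, b)

# Use the fitted constants to predict y for a new x.
print(a * 10.0 + b)
```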

## Ordinary Least Squares Regression

In statistics, ordinary least squares (OLS), or linear least squares, is a method for estimating the unknown parameters in a linear regression model. Its goal is to minimize the differences between the observed responses in some arbitrary dataset and the responses predicted by the linear approximation of the data. Visually, this can be seen as the sum of the squared vertical distances between each data point in the set and the corresponding point on the regression line; the smaller the differences, the better the model fits the data. The resulting estimator can be expressed by a simple formula, especially in the case of a single regressor on the right-hand side.

The OLS estimator is consistent when the regressors are exogenous and there is no perfect multicollinearity, and it is optimal in the class of linear unbiased estimators when the errors are homoscedastic and serially uncorrelated. Under these conditions, the method of OLS provides minimum-variance mean-unbiased estimation when the errors have finite variances. Under the additional assumption that the errors are normally distributed, OLS is the maximum likelihood estimator. OLS is used in economics (econometrics), political science and electrical engineering (control theory and signal processing), among many other areas of application. The multi-fractional order estimator is an expanded version of OLS.
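For reference, the “simple formula” can be written out in standard notation. The symbols here are assumptions introduced for this note: $X$ is the matrix of regressors, $y$ the vector of observed responses, and $\beta$ the vector of unknown parameters. The closed form requires $X^{\top}X$ to be invertible, which is where the "no perfect multicollinearity" condition comes in.

$$
\hat{\beta} = \arg\min_{\beta} \lVert y - X\beta \rVert^{2} = (X^{\top}X)^{-1}X^{\top}y
$$

In the case of a single regressor with an intercept, $y = Ax + B$, this reduces to:

$$
\hat{A} = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i}(x_i - \bar{x})^{2}}, \qquad \hat{B} = \bar{y} - \hat{A}\,\bar{x}
$$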

## Logistic Regression

Logistic regression is a supervised machine learning algorithm used for classification. The “regression” in its name can be misleading: it should not be mistaken for a regression algorithm in the sense of predicting a continuous value. The name comes from a special function, the logistic function, which plays a central role in the method.

A **logistic regression model is a probabilistic model**. It finds the probability that a new instance belongs to a certain class. Since it is a probability, the output lies between 0 and 1. When we use logistic regression as a binary classifier (classification into two classes), we can call the classes the positive class and the negative class. We then find the probability of the positive class. If the probability is high (greater than 0.5), the instance is more likely to fall into the positive class; if the probability is low (less than 0.5), we classify it into the negative class.

Consider the example of classifying emails into spam (malignant) and ham (not spam, benign). We assume that spam is the positive class and ham is the negative class. We first take several labeled examples of emails and use them to train the model. After training, the model can be used to predict the class of new email examples. When we feed an example to the model, it returns a value y such that 0≤y≤1. Suppose the value we get is 0.8. From this value, we can say that there is an 80% probability that the tested example is spam, so it is classified as spam.
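The prediction step can be sketched as follows. The logistic function and the 0.5 threshold come from the description above; the weights, bias, and the two made-up email features are hypothetical values standing in for a trained model.

```python
import math


def sigmoid(z):
    """Logistic function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Probability of the positive class (spam) for one example."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def classify(probability, threshold=0.5):
    """Positive class above the threshold, negative class otherwise."""
    return "spam" if probability > threshold else "ham"


# Hypothetical hand-picked parameters, not learned from real emails.
weights = [1.2, 0.8]
bias = -2.0
email = [2.0, 1.0]  # e.g. made-up counts of suspicious words and links

p = predict_proba(weights, bias, email)
print(round(p, 2), classify(p))
```

Training consists of choosing the weights and bias so that the predicted probabilities match the labeled examples as closely as possible.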

## Ensemble Methods

Ensemble methods are meta-algorithms that combine several machine learning techniques into one predictive model in order to decrease variance (bagging), decrease bias (boosting), or improve predictions (stacking).

**The Ensemble methods can be divided into two groups:**

- Sequential ensemble methods, where the base learners are generated sequentially (e.g. AdaBoost). The basic motivation of sequential methods is to exploit the dependence between the base learners. The overall performance can be boosted by giving previously mislabeled examples a higher weight.
- Parallel ensemble methods, where the base learners are generated in parallel (e.g. Random Forest). The basic motivation of parallel methods is to exploit the independence between the base learners, since the error can be reduced dramatically by averaging.

Most ensemble methods use a single base learning algorithm to **produce homogeneous base learners**, i.e. learners of the same type, leading to homogeneous ensembles.

Some methods use heterogeneous learners, i.e. learners of different types, leading to heterogeneous ensembles. For an ensemble method to be more accurate than any of its individual members, the base learners have to be as accurate as possible and as diverse as possible.

**Bagging**

Bagging stands for bootstrap aggregation. One way to reduce the variance of an estimate is to average together multiple estimates. For example, we can train M different trees on different subsets of the data (chosen randomly with replacement) and compute the ensemble as the average of their predictions, $f(x) = \frac{1}{M}\sum_{m=1}^{M} f_m(x)$, where $f_m$ is the m-th tree.
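A minimal sketch of bagging is shown below. To keep it short, the base learner is a deliberately trivial stand-in that predicts the mean target of its bootstrap sample, where the text above uses trees; the function names and the toy data are also assumptions of this example.

```python
import random
from statistics import mean


def bootstrap_sample(data, rng):
    """Draw len(data) items from data at random, with replacement."""
    return [rng.choice(data) for _ in data]


def fit_mean_model(sample):
    """Stand-in base learner: always predicts the mean target of its sample."""
    m = mean(y for _, y in sample)
    return lambda x: m


def bagging_fit(data, n_models, seed=0):
    """Train n_models base learners, each on its own bootstrap sample."""
    rng = random.Random(seed)
    return [fit_mean_model(bootstrap_sample(data, rng)) for _ in range(n_models)]


def bagging_predict(models, x):
    """Ensemble prediction: the average of the individual predictions."""
    return mean(model(x) for model in models)


# Made-up (x, y) pairs.
data = [(0, 1.0), (1, 2.0), (2, 3.0), (3, 4.0)]
models = bagging_fit(data, n_models=5)
print(bagging_predict(models, 1.5))
```

Replacing `fit_mean_model` with a real tree learner gives the bagged-trees setup described above; the sampling and averaging code stays the same.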

**Boosting**

Boosting refers to a family of algorithms that are able to convert weak learners into strong learners. The main principle of boosting is to fit a sequence of weak learners (models that are only slightly better than random guessing, such as small decision trees) to weighted versions of the data. More weight is given to examples that were misclassified in earlier rounds.

The predictions are then combined through a weighted majority vote (classification) or a weighted sum (regression) to produce the final prediction. The principal difference between boosting and committee methods such as bagging is that the base learners are trained in sequence on a weighted version of the data.

The most widely used form of boosting algorithm is **AdaBoost**, which stands for adaptive boosting.
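The following is a compact sketch of AdaBoost for binary labels in {+1, -1}, using one-dimensional threshold rules (decision stumps) as the weak learners. It follows the usual textbook formulation; the function names and the toy data are assumptions of this example, and it is not meant as a production implementation.

```python
import math


def train_stump(xs, ys, weights):
    """Find the threshold and polarity with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= threshold else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost(xs, ys, rounds):
    """Fit a sequence of stumps, re-weighting the data after each round."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []  # list of (alpha, threshold, polarity)
    for _ in range(rounds):
        err, threshold, polarity = train_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # this stump's vote weight
        ensemble.append((alpha, threshold, polarity))
        preds = [polarity if x >= threshold else -polarity for x in xs]
        # Misclassified points (y * p == -1) gain weight, the rest lose weight.
        weights = [w * math.exp(-alpha * y * p)
                   for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted majority vote of all stumps."""
    score = sum(alpha * (pol if x >= thr else -pol)
                for alpha, thr, pol in ensemble)
    return 1 if score >= 0 else -1


# Toy data that no single stump can classify correctly.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

On this toy set, each individual stump misclassifies at least two points, while the weighted vote of three stumps labels all six correctly.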

**Stacking**

Stacking is an ensemble learning technique that combines multiple classification or regression models via a meta-classifier or a meta-regressor. The base-level models are trained on the complete training set, and then the meta-model is trained on the outputs of the base-level models as features.

The base level often consists of different learning algorithms, and therefore stacking ensembles are often heterogeneous.

## Unsupervised Learning

Unsupervised learning is where you only have input data (X) and no corresponding output variables.

The goal of unsupervised learning is to model the underlying structure or distribution of the data in order to learn more about the data.

It is called unsupervised learning because, unlike the supervised learning described above, there are no correct answers and there is no teacher. Algorithms are left to their own devices to discover and present the interesting structure in the data.

Unsupervised learning problems can be further grouped into clustering and association problems.

**Clustering:** A clustering problem is one where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.

**Association:** An association rule learning problem is one where you want to discover rules that describe large portions of your data, such as “people who buy X also tend to buy Y”.

**Some popular examples of unsupervised learning algorithms are:**

## K-means clustering algorithm

K-means is **one of the simplest unsupervised learning algorithms that solve the well-known clustering problem**. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The **main idea** is to define k centers, one for each cluster. These centers should be placed carefully, because different locations cause different results. The better choice is to place them as far away from each other as possible.

The next step is to take each point belonging to the data set and associate it with the nearest center. When no point is pending, the first step is completed and an early grouping is done. At this point we need to re-calculate **k** new centroids as the barycenters of the clusters resulting from the previous step. After we have these **k** new centroids, a new binding has to be done between the same data set points and the nearest new center.

A **loop has been generated**. As a result of this loop, the k centers change their location step by step until no more changes are made; in other words, the centers do not move any more. Finally, this algorithm aims at minimizing an objective function, known as the squared error function, given by:

$$
J(V) = \sum_{i=1}^{c} \sum_{j=1}^{c_i} \left( \lVert x_i - v_j \rVert \right)^{2}
$$

where:

- $\lVert x_i - v_j \rVert$ is the Euclidean distance between $x_i$ and $v_j$.
- $c_i$ is the number of data points in the $i$-th cluster.
- $c$ is the number of cluster centers.
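The assign-and-recompute loop can be sketched in plain Python as follows. The initial centers are passed in explicitly (how to choose them well is a separate question), and the function names and sample points are made up for this example. `math.dist` requires Python 3.8 or later.

```python
import math


def kmeans(points, centers, max_iters=100):
    """Alternate assignment and update steps until the centers stop moving."""
    for _ in range(max_iters):
        # Assignment step: bind each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: move each center to the barycenter of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
        if new_centers == centers:  # centers no longer move: converged
            break
        centers = new_centers
    return centers, clusters


def squared_error(centers, clusters):
    """Sum of squared distances from every point to its assigned center."""
    return sum(math.dist(p, c) ** 2
               for c, cluster in zip(centers, clusters) for p in cluster)


# Made-up 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, clusters = kmeans(points, [(1.0, 1.0), (9.0, 8.0)])
print(centers)
print(squared_error(centers, clusters))
```

Different starting centers can lead to different final clusters, which is why the placement of the initial centers matters.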

## Apriori algorithm for association rule learning problems

Apriori is an algorithm for frequent itemset mining and association rule learning over transactional databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger itemsets, as long as those itemsets appear sufficiently often in the database. The frequent itemsets determined by Apriori can then be used to determine association rules, which highlight general trends in the database. This has applications in domains such as market basket analysis.
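A minimal sketch of the level-by-level search is given below. It finds frequent itemsets only (rule generation is left out), treats `min_support` as an absolute count, and uses made-up shopping baskets; these choices and the function name are assumptions of this example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        # Number of transactions that contain every item of the itemset.
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent individual items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


# Made-up market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent = apriori(baskets, min_support=2)
for itemset, count in sorted(frequent.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The prune step is what keeps the search manageable: an itemset can only be frequent if all of its subsets are frequent.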

## Principal Component Analysis

The main idea of principal component analysis (PCA) is to **reduce the dimensionality of a dataset** consisting of many variables that are correlated with each other, either heavily or lightly, while retaining as much as possible of the variation present in the dataset. This is done by transforming the variables to a new set of variables, known as the principal components (or simply, the PCs), which are orthogonal and ordered such that the retention of variation present in the original variables decreases as we move down the order. In this way, the 1st principal component retains the maximum variation that was present in the original variables. The principal components are the eigenvectors of a covariance matrix, and hence they are orthogonal.

Importantly, the dataset on which PCA is to be used must be scaled, because the results are sensitive to the relative scaling of the variables. In layman’s terms, PCA is a method of summarizing data. Imagine some wine bottles on your dining table. Each wine is described by its attributes, such as colour, age, strength, etc. But redundancy will arise, because many of these attributes measure related properties. What PCA does in this case is summarize each wine in the stock with fewer characteristics.
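As a rough sketch of the idea, the code below computes only the first principal component. It centers the data, builds the covariance matrix, and finds its leading eigenvector with power iteration, which is a simple technique chosen here for brevity and not the only way to do it. The function names and the sample data are made up, and the two features are assumed to be on comparable scales already.

```python
import math


def first_principal_component(rows, iters=200):
    """Return the leading eigenvector of the covariance matrix, plus centered data."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeatedly multiply by cov and renormalize. This
    # converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v, centered


def project(centered, v):
    """Summarize each row by a single number: its coordinate along v."""
    return [sum(x * c for x, c in zip(row, v)) for row in centered]


# Made-up data with two strongly correlated attributes.
rows = [(2.0, 1.9), (4.0, 4.1), (6.0, 6.0), (8.0, 8.2)]
pc1, centered = first_principal_component(rows)
print(pc1)
print(project(centered, pc1))
```

Here two correlated attributes are reduced to one number per row, which is the "summarizing" described above in miniature.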

## Singular Value Decomposition

In linear algebra, the **singular-value decomposition** (**SVD**) is a factorization of a real or complex matrix. It is the generalization of the eigendecomposition of a positive semidefinite normal matrix (for example, a symmetric matrix with positive eigenvalues) to any $m \times n$ matrix via an extension of the polar decomposition. It has many useful applications in signal processing and statistics.

The singular-value decomposition can be computed using the following observations:

- The left-singular vectors of $\mathbf{M}$ are a set of orthonormal eigenvectors of $\mathbf{M}\mathbf{M}^{*}$.
- The right-singular vectors of $\mathbf{M}$ are a set of orthonormal eigenvectors of $\mathbf{M}^{*}\mathbf{M}$.
- The non-zero singular values of $\mathbf{M}$ (found on the diagonal entries of $\boldsymbol{\Sigma}$) are the square roots of the non-zero eigenvalues of both $\mathbf{M}^{*}\mathbf{M}$ and $\mathbf{M}\mathbf{M}^{*}$.

Applications that employ the SVD include computing the pseudoinverse, least squares fitting of data, multivariable control, matrix approximation, and determining the rank, range and null space of a matrix.
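To see why the eigenvector observations hold, it helps to write the factorization out. The names $\mathbf{U}$ and $\mathbf{V}$ for the matrices of left- and right-singular vectors are the conventional ones and are an assumption of this note, since the text above names only $\mathbf{M}$ and $\boldsymbol{\Sigma}$.

$$
\mathbf{M} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{*}
$$

Since $\mathbf{U}$ and $\mathbf{V}$ are unitary ($\mathbf{U}^{*}\mathbf{U} = \mathbf{I}$, $\mathbf{V}^{*}\mathbf{V} = \mathbf{I}$), multiplying $\mathbf{M}$ by its conjugate transpose gives:

$$
\mathbf{M}\mathbf{M}^{*} = \mathbf{U}\,(\boldsymbol{\Sigma}\boldsymbol{\Sigma}^{*})\,\mathbf{U}^{*}, \qquad
\mathbf{M}^{*}\mathbf{M} = \mathbf{V}\,(\boldsymbol{\Sigma}^{*}\boldsymbol{\Sigma})\,\mathbf{V}^{*}
$$

Both right-hand sides are eigendecompositions: the columns of $\mathbf{U}$ and $\mathbf{V}$ are the eigenvectors, and the diagonal entries of $\boldsymbol{\Sigma}\boldsymbol{\Sigma}^{*}$ and $\boldsymbol{\Sigma}^{*}\boldsymbol{\Sigma}$ are the squared singular values.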

## Independent Component Analysis

Independent component analysis (ICA) is a statistical and computational technique for revealing hidden factors that underlie sets of random variables, measurements, or signals.

ICA defines a generative model for the observed multivariate data, which is typically given as a large database of samples. In the model, the data variables are assumed to be linear mixtures of some unknown latent variables, and the mixing system is also unknown. The latent variables are assumed to be non-Gaussian and mutually independent, and they are called the independent components of the observed data. These independent components, also called sources or factors, can be found by ICA.

ICA is superficially related to principal component analysis and factor analysis. ICA is a much more powerful technique, however, capable of finding the underlying factors or sources when these classic methods fail completely.

The data analyzed by ICA could originate from many different kinds of application fields, including digital images, document databases, economic indicators and psychometric measurements. In many cases, the measurements are given as a set of parallel signals or time series; the term blind source separation is used to characterize this problem. Typical examples are mixtures of simultaneous speech signals that have been picked up by several microphones, brain waves recorded by multiple sensors, interfering radio signals arriving at a mobile phone, or parallel time series obtained from some industrial process.

## Reinforcement or Semi-Supervised Machine Learning

Problems where you have a large amount of input data (X) and only some of the data is labeled (Y) are called semi-supervised learning problems.

These problems sit in between supervised learning and unsupervised learning.

A good example is a photo archive where only some of the images are labeled (e.g. dog, cat, person) and the majority are unlabeled.

Many real-world machine learning problems fall into this category. This is because labeling data can be expensive or time-consuming, as it may require access to domain experts, whereas unlabeled data is cheap and comparatively easy to collect and store.

You can use unsupervised learning techniques to discover and learn the structure in the input variables.

You can also use supervised learning techniques to make best-guess predictions for the unlabeled data, feed that data back into the supervised learning algorithm as training data, and then use the model to make predictions on new unseen data.
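The last idea, often called self-training or pseudo-labeling, can be sketched as follows. The base classifier is a deliberately simple nearest-centroid rule on one-dimensional data; that choice, the function names, and the sample values are assumptions made to keep the example short.

```python
from statistics import mean


def fit_centroids(labeled):
    """Supervised step: one centroid (mean value) per class."""
    by_class = {}
    for x, y in labeled:
        by_class.setdefault(y, []).append(x)
    return {y: mean(xs) for y, xs in by_class.items()}


def predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))


def self_train(labeled, unlabeled):
    # 1. Train on the small labeled set.
    centroids = fit_centroids(labeled)
    # 2. Make best-guess predictions (pseudo-labels) for the unlabeled data.
    pseudo = [(x, predict(centroids, x)) for x in unlabeled]
    # 3. Feed the pseudo-labeled data back in as training data and refit.
    return fit_centroids(labeled + pseudo)


# Made-up data: two labeled examples and five unlabeled ones.
labeled = [(1.0, "small"), (9.0, "large")]
unlabeled = [1.5, 2.0, 8.0, 8.5, 2.5]

model = self_train(labeled, unlabeled)
print(model)
print(predict(model, 4.0))  # prediction on new, unseen data
```

A common refinement is to feed back only the pseudo-labels the model is most confident about, since wrong pseudo-labels can reinforce the model's own mistakes.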