Support Vector Regression: A Comprehensive Guide with Example

In this article, we will discuss a complete overview of Support Vector Regression and how it can be applied to solve regression problems.

Introduction

You may have heard that Support Vector Machines are among the best classification algorithms in machine learning. In fact, SVM is a versatile algorithm that can be used for both classification and regression problems. Support Vector Regression (SVR) is the variant of the Support Vector Machine used for regression. It is a powerful and robust algorithm that can be applied to a wide variety of regression problems.

If you want to learn how SVM actually works, check out this article: Primal formulation of SVM

What is Support Vector Regression?

Support Vector Regression is essentially a Support Vector Machine with a small modification that lets it handle regression. It is a supervised learning algorithm used for regression problems. It is not as widely known, since SVM is more popular for classification than for regression, but SVR is a very powerful algorithm that works for both linear and non-linear regression problems. It is robust to outliers and can be extended with kernel functions to perform regression in higher-dimensional spaces.

SVR works by fitting a regression function, known as a hyperplane, through the space of the data points. The equation of this hyperplane is then used to predict the output for new data points. Note the contrast with classification: there, we try to separate the classes with a hyperplane whose margin should contain as few points as possible. In SVR, we instead try to fit a hyperplane whose margin should contain as many data points as possible; any data point that falls far outside this margin is considered an outlier.

Support Vector Classifier vs Support Vector Regression
SVC vs SVR

SVR can be extended using kernel functions, which transform the data points into a higher-dimensional space where the hyperplane can then be fit. The kernel functions commonly used in SVR are linear, polynomial, radial basis function (RBF), and sigmoid. Which kernel to select depends entirely on the characteristics of the dataset: if the relationship is roughly linear, we can use the linear kernel; if it is non-linear, the polynomial, RBF, or sigmoid kernels are better choices.
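As a quick illustration of how the kernel choice appears in code, here is a minimal sklearn sketch (the variable names are just for demonstration; only the kernel strings come from sklearn's SVR API):

from sklearn.svm import SVR

# Each kernel maps the data differently before the ε-tube is fit.
linear_svr = SVR(kernel='linear')        # roughly linear relationships
poly_svr = SVR(kernel='poly', degree=3)  # polynomial relationships
rbf_svr = SVR(kernel='rbf')              # default; general non-linear data
sigmoid_svr = SVR(kernel='sigmoid')      # sigmoid-shaped relationships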

How can SVM also be used for Regression?

As we said before, SVM is quite versatile and can be used for both classification and regression problems. In the case of regression, the goal is to predict a continuous value instead of discrete values as in classification problems. So we can alter the SVM algorithm a little bit to make it a regression algorithm.
SVR epsilon Graph


This is possible by changing the loss function and introducing a new hyperparameter, epsilon (ε). The new loss function is called the epsilon-insensitive loss function, and it defines a threshold of acceptable error. By adjusting this hyperparameter, we can control the trade-off between the number of support vectors and the accuracy of the model. We'll discuss this in detail further on.
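To make this concrete, the ε-insensitive loss ignores any error smaller than ε and only charges for the part of the error that exceeds it. Below is a minimal NumPy sketch of that idea (the function name and sample numbers are purely illustrative):

import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.5):
    # Errors inside the ε-tube cost nothing; only the excess beyond ε is penalized.
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

print(epsilon_insensitive_loss(np.array([3.0, 3.0]), np.array([3.2, 4.0])))
# [0.  0.5]  -> the 0.2 error falls inside the tube, the 1.0 error costs 0.5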

Terms to note in SVR

Hyperplane - The hyperplane is the decision function used to predict continuous values. In the support vector regression algorithm, it is found by fitting the line (or surface) that best fits the data; when the data lives in more than two dimensions, this decision surface is called a hyperplane.

Epsilon(ε) - Epsilon is a hyperparameter that can be tuned to increase or decrease the width of the tube around the hyperplane. It defines the threshold of error that is considered acceptable.

Xi(𝛏) - Slack variables are introduced to measure how far a data point deviates beyond the two boundaries of the ε-tube.

Kernel - Kernel functions are used to map the data into a higher-dimensional space and find the best-fitting hyperplane there when a linear fit is not possible in the original space.

The Mathematical formulation of SVR

We have discussed what SVR is and the important terms around it. Now let's combine these ideas to formulate the mathematical optimization problem for SVR.

The term that we need to optimize for SVM is the following:

SVM minimization term
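For reference, the standard hard-margin SVM objective (the one derived in the primal-formulation article linked above) can be written as

\min_{w,\,b} \; \frac{1}{2}\lVert w \rVert^{2} \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1 \;\; \text{for all } i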

The main goal of SVR is to include as many data points as possible inside the ε-tube (margin); data points that fall far outside the ε-tube are considered outliers. However, we still need to tolerate some errors so that the problem of overfitting can be avoided. This is possible by adding the slack variables discussed above, so the optimization term becomes:

SVR optimization term
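For reference, the standard soft-margin ε-SVR objective matching the description above is usually written as

\min_{w,\,b,\,\xi,\,\xi^{*}} \; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \left( \xi_i + \xi_i^{*} \right)

where: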

C = Regularization parameter that controls how strongly errors (deviations beyond ε) are penalized during training
𝛏 = Slack variable denoting how far a point lies beyond the upper (positive) edge of the ε-tube
𝛏* = Slack variable denoting how far a point lies beyond the lower (negative) edge of the ε-tube.

SVR Graph - xi and epsilon


In order to minimize the above term for SVR, we also need to satisfy some constraints. The constraints are what drive the solution toward the optimum: whenever a point violates them, the model pays a penalty through the slack variables, which pushes the optimization toward a more generalized solution. The constraints look like this:

SVR optimization term with constraints
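Written out, these are the standard ε-SVR constraints, one pair for each training point (x_i, y_i):

y_i - (w \cdot x_i + b) \le \varepsilon + \xi_i
(w \cdot x_i + b) - y_i \le \varepsilon + \xi_i^{*}
\xi_i,\; \xi_i^{*} \ge 0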


Implementing SVR using sklearn

Now let's do a practical implementation of SVR using sklearn. For that we need a dataset; here we are using the California housing dataset.

Importing the dataset

from sklearn.datasets import fetch_california_housing

dataset = fetch_california_housing()

X = dataset.data
y = dataset.target

Splitting the dataset

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Importing and training SVR

from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error


svr = SVR(epsilon=0.5)
svr.fit(X_train, y_train)

Here we set the epsilon (ε) value to 0.5; the kernel is left at its default, RBF.

Making Predictions

svr_pred = svr.predict(X_test)

mse = mean_squared_error(y_test, svr_pred)

print("MSE:", mse)

-----

MSE: 1.3184895700041785

The MSE is around 1.3 with an epsilon value of 0.5. Now let's see if we can get more accurate predictions by tuning the epsilon hyperparameter with the Grid Search algorithm.

Finding the optimal ε hyperparameter using Grid Search

from sklearn.model_selection import GridSearchCV
import numpy as np

epsilon = np.arange(0, 1.5, 0.1).tolist()

params = {'epsilon':epsilon}

grid = GridSearchCV(svr, param_grid=params, cv=5, scoring='r2', verbose=1)

grid.fit(X_train, y_train)

print("Best epsilon parameter:", grid.best_estimator_)

----

Best epsilon parameter: SVR(epsilon=1.1)

Alright, we got 1.1 as the best value for epsilon when performing Grid Search over the range 0 to 1.5. Now let's see what the MSE will be with the new epsilon value.

svr = SVR(epsilon=1.1)

svr.fit(X_train, y_train)

svr_pred = svr.predict(X_test)

mse = mean_squared_error(y_test, svr_pred)

print("MSE:", mse)

----

MSE: 1.2889827565929928

Conclusion

SVR is a powerful machine learning algorithm for regression tasks, but it is not suitable for every dataset. When the dataset becomes large, SVR might not be the best option; in that case, we can use Linear Regression or another regression algorithm. SVR shines when the dataset is small and the data points are widely scattered in the feature space. It also generalizes well and is robust to outliers, which makes it a strong tool when those properties match our needs.


Articles to read: Overview to SVM

If you have any queries, please leave them in the comment box.