Classification of Iris dataset using SVM in Python

In this article, we will classify the Iris dataset using different SVM kernels with the help of the Scikit-Learn package in Python.

Introduction

SVMs, or Support Vector Machines, are used in machine learning and pattern recognition for classification and regression problems, and they are especially effective on small to medium-sized datasets. They are relatively simple to understand and use, yet very powerful. In this article, we are going to classify the Iris dataset using different SVM kernels with Python's Scikit-Learn package. To keep things simple and easy to visualize, we will use only two features from the dataset: petal length and petal width.

The Iris dataset

The Iris dataset consists of 150 samples from three species of Iris. The first column is sepal length, the second sepal width, the third petal length, and the fourth petal width. We are going to use scikit-learn to classify these instances by species based on their measurements. A picture of the three Iris species is given below:

[Image: the three Iris species, Iris setosa, Iris versicolor, and Iris virginica]

The three species look quite similar, but the differences in their measurements can be used to tell them apart. This dataset is a classic example of supervised learning. The input variables are sepal length and width and petal length and width; each row represents an instance or observation. The output variable is the species: Iris-setosa, Iris-versicolor, or Iris-virginica; this is the class label attached to each row.

To get the dataset, we can simply use scikit-learn's built-in datasets module, which includes the Iris dataset. First, we need to import it.

from sklearn import datasets
import pandas as pd
import numpy as np

iris = datasets.load_iris() #Loading the dataset
iris.keys()


dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

As we can see, the dataset comes with several keys, including data, target, frame, and so on. The important keys here are data, which holds the sepal and petal measurements that distinguish the iris flowers, and target, which holds the corresponding label (output) for each sample.
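To get a feel for these keys, we can inspect their shapes and the class names. This quick check (using the Bunch object loaded above) confirms that there are 150 samples, 4 features, and 3 classes:

print(iris['data'].shape)      # (150, 4): 150 samples, 4 features
print(iris['target'].shape)    # (150,): one label per sample
print(iris['target_names'])    # ['setosa' 'versicolor' 'virginica']
print(iris['feature_names'])   # the four measurement columns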

Converting to Pandas Data Frame

Now we have our dataset loaded, but to get a more well-structured view of it, it's worth converting the dataset into a pandas data frame. Pandas is a great tool for all kinds of dataset preprocessing. So let's see how we can convert the NumPy arrays into a pandas data frame.

iris = pd.DataFrame(
    data= np.c_[iris['data'], iris['target']],
    columns= iris['feature_names'] + ['target']
    )


iris.head(10) # look at the first 10 rows

    sepal length (cm)   sepal width (cm)    petal length (cm)   petal width (cm)    target
0       5.1                  3.5                 1.4                  0.2            0.0
1       4.9                  3.0                 1.4                  0.2            0.0
2       4.7                  3.2                 1.3                  0.2            0.0
3       4.6                  3.1                 1.5                  0.2            0.0
4       5.0                  3.6                 1.4                  0.2            0.0
5       5.4                  3.9                 1.7                  0.4            0.0
6       4.6                  3.4                 1.4                  0.3            0.0
7       5.0                  3.4                 1.5                  0.2            0.0
8       4.4                  2.9                 1.4                  0.2            0.0
9       4.9                  3.1                 1.5                  0.1            0.0

You can see that the data frame contains the sepal and petal lengths and widths of the three species of Iris, along with the target column that encodes the species as 0, 1, or 2.
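As a side note, if you are using a reasonably recent version of scikit-learn (0.23 or newer), load_iris can return a pandas data frame directly through its as_frame argument, which skips the manual conversion above:

iris_bunch = datasets.load_iris(as_frame=True)
iris = iris_bunch.frame   # the same feature columns plus a 'target' column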

Since the data frame has no column with the species names, let's add one more column that holds the species name corresponding to each numerical target value. This makes it easy to access the different classes by name instead of by number.

species = []

for i in range(len(iris['target'])):
    if iris['target'][i] == 0:
        species.append("setosa")
    elif iris['target'][i] == 1:
        species.append('versicolor')
    else:
        species.append('virginica')


iris['species'] = species

iris.head()

Now the dataset looks like this:

    sepal length (cm)   sepal width (cm)    petal length (cm)   petal width (cm)    target  species
0       5.1                  3.5                 1.4                  0.2            0.0    setosa
1       4.9                  3.0                 1.4                  0.2            0.0    setosa
2       4.7                  3.2                 1.3                  0.2            0.0    setosa
3       4.6                  3.1                 1.5                  0.2            0.0    setosa
4       5.0                  3.6                 1.4                  0.2            0.0    setosa
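The loop above works fine, but a more concise pandas idiom does the same mapping in one line with Series.map:

# equivalent one-liner; the float keys still match since 0.0 == 0 in Python
iris['species'] = iris['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})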
Plotting the dataset

Now let's plot the dataset to understand how the data is distributed. To plot this data we can use the matplotlib package.

import matplotlib.pyplot as plt

setosa = iris[iris.species == "setosa"]
versicolor = iris[iris.species=='versicolor']
virginica = iris[iris.species=='virginica']

fig, ax = plt.subplots()
fig.set_size_inches(13, 7) # adjusting the length and width of the plot

# labels and scatter points
ax.scatter(setosa['petal length (cm)'], setosa['petal width (cm)'], label="Setosa", facecolor="blue")
ax.scatter(versicolor['petal length (cm)'], versicolor['petal width (cm)'], label="Versicolor", facecolor="green")
ax.scatter(virginica['petal length (cm)'], virginica['petal width (cm)'], label="Virginica", facecolor="red")

ax.set_xlabel("petal length (cm)")
ax.set_ylabel("petal width (cm)")
ax.grid()
ax.set_title("Iris petals")
ax.legend()
plt.show()

Note that here we are plotting the distribution of the petal length and width of the Iris flowers, not the sepal measurements. We only need two features, and the petal measurements separate the three species more cleanly than the sepal measurements do.

The plot will look like this:

[Scatter plot: petal width (cm) versus petal length (cm); the three species form three distinct clusters]
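The same separation also shows up numerically. A quick group-by on the data frame (a small check, not part of the plotting code) shows how far apart the per-species petal averages are:

# average petal measurements per species
print(iris.groupby('species')[['petal length (cm)', 'petal width (cm)']].mean())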
Performing Classification using SVC

Now let's perform the classification on the Iris dataset. Here we only need two features of the dataset, petal length and petal width, to classify the species of Iris, since these two measurements differ clearly among the three species and are enough to separate them.

from sklearn.model_selection import train_test_split

X = iris.drop(['sepal length (cm)', 'sepal width (cm)', 'target', 'species'], axis=1)
y = iris.target


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.55, random_state=42)

Here the X and y values, containing the features and labels respectively, are split into train and test sets using scikit-learn's train_test_split method.
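With 150 samples and test_size=0.55, the split leaves 67 samples for training and 83 for testing (scikit-learn rounds the test share up). We can confirm this directly:

print(X_train.shape, X_test.shape)   # (67, 2) (83, 2)
print(y_train.shape, y_test.shape)   # (67,) (83,)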

Training and testing the SVC classifier

Alright, we have everything necessary for our SVC model. So let's import it from scikit-learn and train it.

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

kernels = ['linear', 'rbf', 'poly']

# train and evaluate an SVC with each kernel
for kernel in kernels:
    model = SVC(kernel=kernel)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print("Accuracy using {}:".format(kernel), accuracy_score(y_test, pred))


Accuracy using linear: 0.9518072289156626
Accuracy using rbf: 0.9879518072289156
Accuracy using poly: 1.0

You can see the accuracy of the SVC classifier using the different kernels. The polynomial kernel scored the highest accuracy at 100%, followed by RBF at roughly 99% and linear at roughly 95%. However, this doesn't mean that one kernel is inherently better than the others; which kernel works best depends on how the data is distributed.
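These scores use SVC's default hyperparameters (C=1.0, and gamma='scale' for the RBF and polynomial kernels). The ranking between kernels can change once those are tuned, so treat the following as a minimal grid-search sketch with an illustrative, not authoritative, parameter grid:

from sklearn.model_selection import GridSearchCV

# illustrative grid: common starting values, not carefully tuned choices
param_grid = {
    'kernel': ['linear', 'rbf', 'poly'],
    'C': [0.1, 1, 10],
    'gamma': ['scale', 0.1, 1],
}

search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)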

Plotting the decision boundary of SVC

For better understanding, let's plot how SVC classified the three species of Iris into their corresponding classes.

For plotting, we can modify the above code like this:

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

def make_meshgrid(x, y, h=.02):
    # build a grid of points covering the feature space, with step size h
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    return xx, yy

def plot_contours(ax, clf, xx, yy, **params):
    # predict a class for every grid point and draw the class regions
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out

kernels = ['linear', 'rbf', 'poly']

for kernel in kernels:
    model = SVC(kernel=kernel)
    model.fit(X_train, y_train)

    pred = model.predict(X_test)

    print("Accuracy using {}:".format(kernel), accuracy_score(y_test, pred))

    fig, ax = plt.subplots()
    # title for the plot
    title = ('Decision surface of SVC ' + model.kernel)
    # set up the grid for plotting (X is a data frame, so take .values first)
    X0, X1 = X.values[:, 0], X.values[:, 1]
    xx, yy = make_meshgrid(X0, X1)

    plot_contours(ax, model, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8)
    ax.scatter(X0, X1, c=y, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
    ax.set_xlabel('petal length (cm)')
    ax.set_ylabel('petal width (cm)')
    ax.set_xticks(())
    ax.set_yticks(())
    ax.set_title(title)
    plt.show()

If you execute this, you will get plots showing the decision boundaries that separate the three classes under each kernel.

It looks like this:

[Decision-surface plots for the linear, RBF, and polynomial kernels, with the three species shown as colored points]
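As an aside, if you are on scikit-learn 1.1 or newer, the DecisionBoundaryDisplay helper can replace the hand-rolled meshgrid code. A minimal sketch, assuming the model and two-column X from above:

from sklearn.inspection import DecisionBoundaryDisplay

# draws the same filled decision regions without a manual meshgrid
disp = DecisionBoundaryDisplay.from_estimator(model, X, response_method="predict",
                                              cmap=plt.cm.coolwarm, alpha=0.8)
disp.ax_.scatter(X.values[:, 0], X.values[:, 1], c=y, cmap=plt.cm.coolwarm,
                 s=20, edgecolors='k')
plt.show()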