Multinomial Logistic Regression: Definition, Math, and Implementation

An in-depth overview of Multinomial Logistic Regression (Softmax Regression): its definition, the math behind it, and its implementation in Python.

Introduction

Logistic Regression is one of the most widely used models for solving problems that require categorizing instances into groups that share similar features, which we call classification problems. Logistic Regression, as we know it, is a binary classifier: it deals with two classes, for instance predicting whether a person has a risk of heart disease or not. But classification does not always end at two classes. In some cases we need more than two classes, and in such cases we can extend binary Logistic Regression to the multiclass setting, known as Multinomial Logistic Regression or Softmax Regression.

This article focuses on an in-depth understanding of Multinomial Logistic Regression: what it is, the math behind it, and when and where it can be used in machine learning.

What is Multinomial Logistic Regression?

Training multiple binary classifiers and combining them for multiclass classification makes sense, but there is a better way of doing multiclass classification: generalizing Logistic Regression to support multiple classes directly, which is known as Multinomial Logistic Regression or Softmax Regression.

In binary Logistic Regression, we deal with two classes, i.e. y ∈ {0, 1}, which we used to predict whether a person has a risk of heart disease (1) or not (0). Multinomial Logistic Regression handles K classes, i.e. y ∈ {1, 2, 3, ..., K}.

The softmax function is used to generalize Logistic Regression so that it supports multiple classes. We provide an input vector along with the coefficients to the softmax function, and it outputs a vector of K probabilities, one for each class, indicating which class the data most likely belongs to. The class with the highest probability is taken as the predicted class. The true class, in turn, is represented as 1 and the remaining classes as 0; the vector representation of such a class is known as a one-hot vector. For example, with K = 4 classes and the third class being the correct one, the one-hot vector is y = [0, 0, 1, 0].

We can represent the probability of getting the correct result mathematically as p(y = 1|x): the probability of getting the correct class for a given input, i.e., the probability that y = 1 for a given x.

The Softmax Function 

In the case of Logistic Regression, we use a sigmoid function for predicting the results. The generalized version of the sigmoid is known as the softmax function. The softmax function accepts an input vector z = [z1, z2, ..., zK] and produces an output vector of probabilities. It can be expressed as:

σ(z)i = e^(zi) / Σ(j=1 to K) e^(zj)

where: 
  • σ = the softmax function.
  • z = the input vector of the given data.
  • e^(zi) = the exponential of the i-th element of the input vector.
  • Σ(j=1 to K) e^(zj) = the normalization term, the sum of the exponentials of all K elements.
  • K = the number of classes.
The softmax function is used in many places, for example as the activation function in the output layer of neural networks, in classification problems, in Reinforcement Learning, etc.

Let's understand the softmax function with an example. Consider an input vector z = [3, 4.5, -1].

The softmax for a vector z = [z1, z2, ..., zK] is:

softmax(z)i = e^(zi) / (e^(z1) + e^(z2) + ... + e^(zK))

The softmax for the input vector z = [3, 4.5, -1] is:

softmax(z) = [e^3, e^4.5, e^(-1)] / (e^3 + e^4.5 + e^(-1))
           = [20.0855, 90.0171, 0.3679] / 110.4705

The resulting rounded vector softmax(z) will be:

softmax(z) = [0.1818, 0.8149, 0.0033]

The probabilities produced by the softmax function are always between 0 and 1, and they sum up to 1.

Now let's see how softmax can be implemented in Python.

import numpy as np

def softmax(x):
    """Compute the softmax of vector x."""
    return np.exp(x) / np.sum(np.exp(x), axis=0)

input_vector = [3, 4.5, -1]
softmax(input_vector)

output:
array([0.18181803, 0.81485186, 0.00333011])

Applying Softmax in Logistic Regression

So we have looked at the softmax function; now we need to know how to apply it to Logistic Regression to switch to multiclass. When we apply softmax in Logistic Regression, the input will be the dot product of the weight vector (w) and the input vector (x) plus a bias term (b), i.e. w·x + b.

input = [w1, w2, w3, ..., wn] · [x1, x2, x3, ..., xn] + bias


Can you recall the slope-intercept equation y = mx + b? You can see that the input has the same form as the slope-intercept equation, so our objective is to find the optimal parameters (w and b) that minimize the cost and maximize the probability of predicting the correct class for the given data. We don't need to worry much about finding the optimal parameters, since Gradient Descent will take care of it.

We can write this in a single expression for the output ŷ:

ŷ = softmax(Wx + b) or ŷ = σ(Wx + b)

Here W is a weight matrix with one row of weights per class and b is a vector with one bias per class, so Wx + b produces K scores that the softmax turns into K probabilities.
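
To make this concrete, here is a minimal NumPy sketch of computing ŷ = softmax(Wx + b) for a single input; the shapes and values are made up purely for illustration:

import numpy as np

# Hypothetical example: n = 4 features, K = 3 classes
x = np.array([0.5, -1.2, 3.0, 0.7])      # input vector (n,)
W = np.random.random((3, 4))             # one row of weights per class (K x n)
b = np.random.random(3)                  # one bias per class (K,)

z = W @ x + b                            # raw score for each class
y_hat = np.exp(z) / np.sum(np.exp(z))    # softmax: probabilities summing to 1

print(y_hat, y_hat.sum())                # three probabilities, and their sum is 1.0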

Learning in Multinomial Logistic Regression

How are the weights (w) and bias (b) learned? Well, the Logistic model learns by comparing its predictions with the labels provided in the training data. What we want is to learn parameters w and b that make the predictions ŷ close to the actual y values and thereby reduce the cost.

To get the optimal weights that reduce the cost, we measure the distance from the predicted values ŷ to the actual values y. This distance is known as the loss function, more specifically the cross-entropy loss function in the case of binary Logistic Regression. Minimizing the loss function reduces the distance from the predicted values ŷ to the actual values y. Generalizing this loss function to support K classes gives us the loss function of Multinomial Logistic Regression.

Deriving cross-entropy loss function

For an observation x, the parameters that maximize the likelihood of the observed data can be estimated by a method known as Maximum Likelihood Estimation (MLE). Taking the negative log of this likelihood gives us the cross-entropy loss function.

Now that you have some intuition about the loss function, let's look at it more closely and see how we can derive the loss function for binary Logistic Regression and then generalize it to get the loss function of Multinomial Logistic Regression.

We are interested in learning the weights (parameters) that maximize the probability of getting the correct result, p(y|x), the probability of y for a given x.

Since binary Logistic Regression deals with only two outcomes (0, 1), we can use the Bernoulli distribution here. Notice that we are first deriving the loss function of binary Logistic Regression and then generalizing it, so don't get confused.

When we apply the Bernoulli distribution to p(y|x) we get:

p(y|x) = ŷ^y (1 - ŷ)^(1 - y)

Let's apply log on both sides:

log p(y|x) = y log ŷ + (1 - y) log(1 - ŷ)

The result is a log-likelihood that needs to be maximized, but what we need is a loss function to be minimized. To get the loss function, all we need to do is add a negative sign in front of the equation:

L(ŷ, y) = -[y log ŷ + (1 - y) log(1 - ŷ)]

When we apply the value ŷ = σ(w·x + b) and express the equation as the cross-entropy loss LCE, we get:

LCE(ŷ, y) = -[y log σ(w·x + b) + (1 - y) log(1 - σ(w·x + b))]

Great, we have derived the loss function of binary Logistic Regression. Note that the cost will be large when the model predicts a probability close to 0 for a positive instance, and likewise the cost will be high when the model predicts a probability close to 1 for a negative instance.

So a good model predicts probabilities close to 0 for negative instances and close to 1 for positive instances.

Let's understand this concept with an example.

Suppose our Logistic model predicts a probability of 0.70 for a positive class, where y = 1:

LCE(ŷ, y) = -[1 · ln(0.70) + (1 - 1) · ln(1 - 0.70)]
          = -ln(0.70)
          = 0.36

When the model predicts a probability of 0.70 for a negative class, where y = 0:

LCE(ŷ, y) = -[0 · ln(0.70) + (1 - 0) · ln(1 - 0.70)]
          = -ln(1 - 0.70)
          = 1.2


In the first case, when the model predicts a probability of 0.70 for the positive class where y = 1, the cost is a fairly small 0.36. But in the second case, the model predicts 0.70 for a negative class where y = 0, and the cost is 1.2, much higher than in the first case.

In the first case the model predicts a probability close to 1 for a positive class, which is the correct result; in the second case it predicts a probability close to 1 for a negative class, which is the wrong result, and because of that the cost increases.
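
We can quickly verify these two numbers with NumPy (just a sanity check, not part of the model itself):

import numpy as np

y_hat = 0.70

# Positive instance (y = 1): the cost is -ln(y_hat)
print(round(-np.log(y_hat), 2))      # 0.36

# Negative instance (y = 0): the cost is -ln(1 - y_hat)
print(round(-np.log(1 - y_hat), 2))  # 1.2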

Now that you have a better understanding of the loss function, let's see how we can generalize it to support K classes.

Generalizing loss function

For Multinomial Logistic Regression, we represent both the true label y and the output ŷ as vectors. The actual label y is a vector containing K elements, where yc = 1 if c is the correct class and the remaining elements are 0. With these labels, the model predicts a vector ŷ containing K estimated class probabilities.

If c is the correct class, where yc = 1, the right-hand term of the binary cost function drops out. The label y and the output ŷ now range over K classes, with elements yk and ŷk respectively, and the loss becomes the sum of the logs of the K predicted probabilities weighted by the true labels. So the loss function for a single training sample will be:

LCE(ŷ, y) = -Σ(k=1 to K) yk log ŷk

For all m training samples, the loss function will be:

J = -(1/m) Σ(i=1 to m) Σ(k=1 to K) yk(i) log ŷk(i)

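As a small illustration with made-up numbers, the multiclass loss for a single sample simply picks out the negative log-probability of the correct class:

import numpy as np

y = np.array([0, 0, 1, 0])              # one-hot label: class 3 is correct
y_hat = np.array([0.1, 0.2, 0.6, 0.1])  # predicted class probabilities

loss = -np.sum(y * np.log(y_hat))       # only the correct class contributes
print(loss)                             # -ln(0.6) ≈ 0.51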

Implementing Multinomial Logistic Regression in Python

Now let's implement all the crazy stuff we learned so far programmatically in Python. First of all, we need a dataset for our model to work on. Here I'm using the MNIST dataset, a collection of 70,000 small images of handwritten digits written by US high school students and Census Bureau employees.

The MNIST dataset is one of the best practical datasets for getting started with classification problems.

So let's load the dataset available in keras.datasets

import pandas as pd
import numpy as np
from keras.datasets import mnist

# Splitting the dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()

First, we loaded the dataset from Keras and assigned it to the corresponding train and test variables. The images in this dataset are stored as matrices of pixel values, so if you want to view an image of a digit, use matplotlib's imshow method.

import matplotlib.pyplot as plt

# Displaying the first training image
plt.imshow(X_train[0])
plt.show()

Reshaping the dataset:

# Flattening each 28x28 image into a single 784-feature row
X_train = X_train.reshape(60000, 28*28)
X_test = X_test.reshape(10000, 28*28)

Each image is a 28 × 28 matrix of pixel values, so we flatten every image into a row of 28 * 28 = 784 features, giving the model one feature per pixel.

One-Hot Encoding

def OneHot(y, c):

    # Constructing a zero matrix with one row per label and one column per class
    y_encoded = np.zeros((len(y), c))

    # Setting the column corresponding to each label to 1
    y_encoded[np.arange(len(y)), y] = 1
    return y_encoded

The above function converts the integer class labels into one-hot vectors of ones and zeros for the given number of classes c.
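
For instance, a quick check with a few made-up labels (not part of the MNIST pipeline) looks like this:

print(OneHot(np.array([1, 0, 2]), 3))

# [[0. 1. 0.]
#  [1. 0. 0.]
#  [0. 0. 1.]]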

Softmax Function

# Softmax Function
def Softmax(z):
  # Subtracting the maximum value for numerical stability (softmax is shift-invariant)
  exp = np.exp(z - np.max(z))
  # Normalizing each row so its probabilities sum to 1
  for i in range(len(z)):
    exp[i] /= np.sum(exp[i])
  return exp
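
As a quick sanity check (again just an illustration, not part of the training code), each row of the output should sum to 1:

probs = Softmax(np.array([[3.0, 4.5, -1.0]]))
print(probs)              # approximately [[0.1818 0.8149 0.0033]]
print(probs.sum(axis=1))  # [1.]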

Training the model

Now comes the important part: we are going to implement a function that trains the model, finding the optimal parameters (w and b) that reduce the cost. In each iteration the cost will shrink, giving us better and better parameters. So let's see how we can implement that.

def fit(X,y, c, epochs, learn_rate):

    # Getting the number of training examples (m) and features (n)
    (m,n) = X.shape

    # Selecting random weights and bias
    w = np.random.random((n,c))
    b = np.random.random(c)

    loss_arr = []

    # Training 
    for epoch in range(epochs):

        # Hypothesis function
        z = X @ w + b

        # Computing gradient of loss w.r.t w and b
        grad_for_w = (1/m)*np.dot(X.T,Softmax(z) - OneHot(y, c))
        grad_for_b = (1/m)*np.sum(Softmax(z) - OneHot(y, c), axis=0)

        # Updating w and b
        w = w - learn_rate * grad_for_w
        b = b - learn_rate * grad_for_b

        # Computing the loss
        loss = -np.mean(np.log(Softmax(z)[np.arange(len(y)), y]))
        loss_arr.append(loss)
        print("Epoch: {} , Loss: {}".format(epoch, loss))

    return w, b, loss_arr

# Normalizing the training set
X_train = X_train/300

# Training the model
w, b, loss = fit(X_train, y_train, c=10, epochs=1000, learn_rate=0.10)

output:

Epoch: 0 , Loss: 4.048884450473747
Epoch: 1 , Loss: 3.1123844928930318
Epoch: 2 , Loss: 2.4359512147361935
Epoch: 3 , Loss: 2.0743646205439106
Epoch: 4 , Loss: 1.7426834190627996
Epoch: 5 , Loss: 1.5270608054318329
Epoch: 6 , Loss: 1.3661773662434502
Epoch: 7 , Loss: 1.253514604554096
Epoch: 8 , Loss: 1.1604735954233256
Epoch: 9 , Loss: 1.092909563196898
Epoch: 10 , Loss: 1.0287242505816592
Epoch: 11 , Loss: 0.9819879297108901
Epoch: 12 , Loss: 0.9330864749109451
Epoch: 13 , Loss: 0.8970693055728086
Epoch: 14 , Loss: 0.8597687440748668
Epoch: 15 , Loss: 0.8307884325356042
Epoch: 16 , Loss: 0.8026402805231958
.
.
.
.

The function accepts five parameters: the training data X and y, the number of classes c, the number of iterations (epochs), and the learning rate.

When you run the code, you can see that the cost is reduced in each iteration. At the end of the iterations, you'll end up with a low cost and near-optimal weights.
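
Since fit also returns the list of losses, you can optionally plot the loss curve to see the cost shrinking over the epochs; a quick sketch with matplotlib:

import matplotlib.pyplot as plt

# Plotting the training loss recorded at each epoch
plt.plot(loss)
plt.xlabel("Epoch")
plt.ylabel("Cross-entropy loss")
plt.show()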

Making Predictions

Now it's time to test our model, let's see how our model performs for the training data.

def predict(X, w, b):
    
    z = X @ w + b
    y_hat = Softmax(z)
    
    # Returning highest probability class.
    return np.argmax(y_hat, axis=1)

predictions = predict(X_train, w, b)
actual_values = y_train

print("Predictions:", predictions)
print("Actual values:", actual_values)

accuracy = np.sum(actual_values==predictions)/len(actual_values)
print(accuracy)


Predictions: [5 0 4 ... 5 6 8]
Actual values: [5 0 4 ... 5 6 8]
0.8780666666666667

The accuracy score is pretty good in my case: the model is able to classify the handwritten digits with an accuracy of about 88%.

What about the test data?

# Normalizing the test set
X_test = X_test/300

test_predictions = predict(X_test, w, b)
test_actual = y_test

print("Test predictions:", test_predictions)
print("Test actual:", test_actual)

test_accuracy = np.sum(test_actual==test_predictions)/len(test_actual)

print(test_accuracy)


Test predictions: [7 2 1 ... 4 8 6]
Test actual: [7 2 1 ... 4 5 6]
0.8853

The test score looks good as well: the model predicts the test set with an accuracy of approximately 88.5%.