## Introduction

Logistic Regression
could be one of the best models for solving problems that require categorizing
different instances into groups of similar features, which we basically call
classification problems. Logistic Regression as we know is a binary classifier
that deals with binary classes, for instance, it can be used to predict
whether a person has a risk of heart disease or not. But classification will
not only end up in two classes. Suppose In some cases, we need more than
two classes, in such a case we can extend the binary Logistic Regression to
multiclass known as a

**Multinomial Logistic Regression**or**Softmax Regression**.
This article solely focuses on an in-depth understanding of Multinomial
Logistic Regression, when and where it can be used in machine learning etc.

## What is Multinomial Logistic Regression?

Training multiple binary classifiers and combining them together for
multiclass classification makes sense, but there is another better way of
doing multiclass classification which is by generalizing the Logistic
Regression to support multiple classes directly known as Multinomial Logistic
Regression or Softmax Regression.

In binary Logistic Regression, we deal with
two classes, ie,

**y ∈ {0, 1}**which we used to predict whether a person has a risk of heart disease(1) or not (0). In the case of Multinomial Logistic Regression, it handles up to**K**classes, ie,**y ∈ {1, 2,3, ..., K}.**
The

**softmax function**is used to generalize the Logistic Regression for supporting multiple classes. We provide an input vector along with the coefficients to the softmax function and it gives an output vector of K classes with probabilities of which class the data belongs to. The class that has the highest probability will be the correct class and is represented as**1**and the remaining classes as**0.**The vector representation of such a class is known as a**one-hot**vector.
We can represent the probability of getting the correct result mathematically
as

**p(**y**= 1|x), the**probability of getting a correct class for a given input, ie, the probability of y = 1 for a given x.## The Softmax Function

In the case of Logistic Regression, we use a

**sigmoid function**for predicting the results. The generalized version of the sigmoid is known as the**softmax**function. The softmax function accepts an input vector z = [ z1, z2, ..., zk] and produces another output vector of probability distributions. It can be expressed as :where:

**σ**= Softmax- vector
**z**= Input vector of the given data. **e^(zi)**= Exponential function for input vector.**e^(zj)**= Exponential function for output vector.- K = Number of classes.

The softmax function is used in many cases, like the activation function in
neural networks, classification problems, Reinforcement Learning, etc.

Let's understand the softmax function with an example, consider an input
vector

**z = [3, 4.5, -1]**softmax for vector

**z = [z1, z2,....,zk]**is:softmax for input vector

**z = [3, 4.5, -1] is:**

The resulting rounded vector

**softmax(z)**will be:**z = [0.1819, 0.8149, 0.0034]**

The probability distribution of the softmax function is always between 0 and 1
and the probabilities will sum up to 1.

Now let's see how softmax can be implemented in python.

import numpy as npdef softmax(x):"""Compute softmax for each x values"""return np.exp(x) / np.sum(np.exp(x), axis=0)input_vector = [3, 4.5, -1]softmax(input_vector)output:array([0.18181803, 0.81485186, 0.00333011])

## Applying Softmax in Logistic Regression

So we looked at the softmax function, now we need to know how to apply this
function to Logistic Regression to switch into multiclass. When we apply
softmax in logistic regression the inputs will be the dot product of the

**weight vector(w)**and the**input vector(x)**plus a**bias(b)**term (**w.x + b**).input = [w1, w2, w3, .... , wk].[x1, x2, x3,.] + bias

Can you recall the slope-intercept equation

**y = mx + b?**You can see the inputs are given in the form of the slope-intercept equation, so our objective is to find the optimal weights(**w**and**b**) to minimize the cost and maximize the probability of predicting the correct class for given data. For finding the optimal weights we don't need to bother much since Gradient Descent will take care of it.
We can write this equation into a single form for output

**ŷ**just like this:

ŷ = softmax(Wx + b) or ŷ = σ(Wx + b)

## Learning in Multinomial Logistic Regression

How are the weight(w) and bias (b) learned? Well, the Logistic model predicts
output by analyzing the provided labels in the training data. What we want to
do is to learn the parameters

**w**and**b**for**ŷ****that produce an output close to the actual****y**values and reduce the cost.
To get the optimal weights to reduce the cost, we can find the

**distance**from the predicted values**ŷ**to the actual values**y**. This distance is known as the**loss**function more specifically**the cross-entropy loss function**of**Binary Logistic Regression. Minimizing the loss function reduces the distance from predicted values****ŷ**to the actual**y**values. This loss function can be generalized to support K classes will give us the loss function of Multinomial Logistic Regression.## Deriving cross-entropy loss function

For an observation x, which parameters maximize the likelihood of the observed
data from our actual data can be estimated by a method known as

**Maximum Likelihood Estimation(MLE).**After estimating the values we will get the**cross-entropy loss**function.
So now you have a little bit of intuition about the loss function as we
discussed in brief, let's understand it more closely and see how we can derive
the loss function for Binary Logistic Regression and then generalize it to get
the loss function of Multinomial Logistic Regression,

We are interested in learning the weights(parameters) that maximize the
probability of getting the correct result

**p(y|x)(**probability of y for given x)
Since Logistic Regression deals with binary outcomes we have only two outcomes
(0, 1), so here we can use the

**Bernoulli distribution**. Notice that here we are first deriving the loss function of Binary Logistic Regression and then generalizing it. So don't get confused,
When we apply

**Bernoulli distribution**for p(y|x) we will get:Let's apply

**log**on both sides:
When we apply the value of

**ŷ = σ(Wx + b)**and express the equation as cross-entropy loss LCE, we will get:So a good
model predicts probabilities close to 0 for negative instances and close to 1
for positive instances.

Let's understand this concept with an example.

Suppose our Logistic model predicts a probability of

**0.70**for a positive class, where**y = 1**

*Lce( ŷ, y) = -[ 1 . ln (0.70) + (1 - 1) ln (1 - σ(w.x + b))]*

*= - ln(0.70)*

*= 0.36*
When the model predicts a probability of

**0.70**for a negative class, where**y = 0**

*Lce( ŷ, y) = -[ 0 . ln*

*σ(w.x + b)*

*+ (1 - 0) ln (1 - 0.70)]*

*= - ln(1 - 0.70)*

*= 1.2*

In the first case, when the model predicts a probability of 0.70 for the
positive class where y = 1, the cost is less than 0.36. But in the second
case, the model predicts 0.70 for a negative class where y = 0, and the cost
is higher than 1.2 as compared to the first case.

The model predicts a
probability close to 1 for a positive class, which is the correct result,
but in the second case, the model predicts a probability close to 1 for a
negative class, which is the wrong result. Because of that, the cost also increased.

Now you have a better understanding of the loss function, let's see how we
can generalize this loss function to support K classes.

## Generalizing loss function

For Multinomial Logistic Regression, we represent both input

**y**and output**ŷ**as vectors. The actual**y**label is a vector containing K classes where**yc**= 1 if c is the correct class and the remaining elements will be 0. With these labels, the model predicts a**ŷ**vector containing K classes.
If c is the correct class, where yc =
1, the right side of the cost function drops out. The input vector

**y**and output vector**ŷ**change to K classes**yk**and**ŷk**respectively. The loss function will be the sum of logs of K classes. So the loss function for a single training sample will be:For all training samples the loss function will be:

## Implementing Multinomial Logistic Regression in python

Now let's implement all the crazy stuff we learned so far programmatically in
python. First of all, we need a dataset for our model to work on. Here I'm
using the MNIST dataset which is a collection of 70000 small images of
handwritten digits of school students from the US.

The MNIST dataset is one of the best practical datasets for classification
problems,

So let's load the dataset available in

**keras.datasets**

import pandas as pdimport numpy as npfrom keras.datasets import mnist# Splitting the dataset(X_train, y_train), (X_test, y_test) = mnist.load_data()

First, we loaded the dataset from Keras and assigned it to the corresponding
train and test variables. The images in this dataset are in the form of
matrices, so if you want to see the images of numbers use the

**matplotlib imshow**method.import matplotlib.pyplot as pltplt.imshow(X_train[0])

### Reshaping the dataset:

# Reshaping the data into 28*28 equal matrixX_train = X_train.reshape(60000,28*28)X_test = X_test.reshape(10000,28*28)

The dataset contains 784 features and we need to represent as 28 * 28 equal
number of rows and columns.

### One-Hot Encoding

def OneHot(y, c):# Constructing zero matrixy_encoded = np.zeros((len(y), c))# Giving 1 for some columsy_encoded[np.arange(len(y)), y] = 1return y_encoded

The above function converts the training data into a series of ones and zeros
for the classes given.

### Softmax Function

# Softmax Functiondef Softmax(z):exp = np.exp(z - np.max(z))for i in range(len(z)):exp[i]/=np.sum(exp[i])return exp

### Training the model

Now comes the important part, we are going to implement a function that trains
the model with the optimal parameters(

**w**and**b**) and reduce the cost. In each iteration, the cost will shrink to provide the best possible parameters. So let's see how we can implement that,def fit(X,y, c, epochs, learn_rate):# Splitting the number of training examples and features(m,n) = X.shape# Selecting random weights and biasw = np.random.random((n,c))b = np.random.random(c)loss_arr = []# Trainingfor epoch in range(epochs):# Hypothesis functionz = X@w + b# Computing gradient of loss w.r.t w and bgrad_for_w = (1/m)*np.dot(X.T,Softmax(z) - OneHot(y, c))grad_for_b = (1/m)*np.sum(Softmax(z) - OneHot(y, c))# Updating w and bw = w - learn_rate * grad_for_wb = b - learn_rate * grad_for_b# Computing the lossloss = -np.mean(np.log(Softmax(z)[np.arange(len(y)), y]))loss_arr.append(loss)print("Epoch: {} , Loss: {}".format(epoch, loss))return w, b, loss_arr# Normalizing the training setX_train = X_train/300# Training the modelw, b, loss = fit(X_train, y_train, c=10, epochs=1000, learn_rate=0.10)output:Epoch: 0 , Loss: 4.048884450473747Epoch: 1 , Loss: 3.1123844928930318Epoch: 2 , Loss: 2.4359512147361935Epoch: 3 , Loss: 2.0743646205439106Epoch: 4 , Loss: 1.7426834190627996Epoch: 5 , Loss: 1.5270608054318329Epoch: 6 , Loss: 1.3661773662434502Epoch: 7 , Loss: 1.253514604554096Epoch: 8 , Loss: 1.1604735954233256Epoch: 9 , Loss: 1.092909563196898Epoch: 10 , Loss: 1.0287242505816592Epoch: 11 , Loss: 0.9819879297108901Epoch: 12 , Loss: 0.9330864749109451Epoch: 13 , Loss: 0.8970693055728086Epoch: 14 , Loss: 0.8597687440748668Epoch: 15 , Loss: 0.8307884325356042Epoch: 16 , Loss: 0.8026402805231958....

The function mainly accepts 5 parameters, The training data

**X**and**Y**, the number of classes, the number of iterations(epochs), and the learning rate.
When you run the code, you can see that the cost is reduced in each iteration.
At the end of the iteration, you'll get the minimum cost and the optimal
weights.

## Making Predictions

Now it's time to test our model, let's see how our model performs for the
training data.

def predict(X, w, b):z = X@w + by_hat = Softmax(z)# Returning highest probability class.return np.argmax(y_hat, axis=1)predictions = predict(X_train, w, b)actual_values = y_trainprint("Predictions:", predictions)print("Actual values:", actual_values)accuracy = np.sum(actual_values==predictions)/len(actual_values)print(accuracy)Predictions: [5 0 4 ... 5 6 8]Actual values: [5 0 4 ... 5 6 8]0.8780666666666667

The accuracy score is pretty good in my case, the model is able to classify
the handwritten digits with an accuracy of 88%.

What about the case of testing data,

# Normalizing the training setX_test = X_test/300test_predictions = predict(X_test, w, b)test_actual = y_testprint("Test predictions:", test_predictions)print("Test actual:", test_actual)test_accuracy = np.sum(test_actual==test_predictions)/len(test_actual)print(test_accuracy)Test predictions: [7 2 1 ... 4 8 6]Test actual: [7 2 1 ... 4 5 6]0.8853

The test score seems great, the model predicts values with an accuracy of
approximately 88%.

You may also like:
Concept of Logistic Regression | Machine Learning