Build A Convolutional Neural Network (CNN) From Scratch Using Python

In this article, we are going to build a Convolutional Neural Network from scratch with the NumPy library in Python.

Introduction

The Convolutional Neural Network (CNN) is one of the most popular Deep Learning algorithms, widely used for image-related tasks such as Image Recognition and Object Detection, as well as in advanced Computer Vision projects. Popular libraries like TensorFlow, Keras, and PyTorch offer convenient ways to build efficient CNN models, but there is nothing wrong with building a CNN completely from the ground up: the attempt answers your curiosity about how all of these pieces work at a deeper level. If you are curious like me, let's build the thing together. In this article, we are going to build a Convolutional Neural Network from scratch with the NumPy library in Python. We'll code the different layers of a CNN, such as Convolution, Pooling, Flattening, and Full Connection, including the forward and backward pass (backpropagation), and finally train the network on the famous Fashion MNIST dataset to see how it performs. Let's start.

Wrapping the Basic Intuition


Okay, let's start by discussing the basics in brief. A CNN is made up of several layers: the Convolution Layer, the Pooling Layer, the Flattening Layer, and the Fully Connected (Dense) Layer.

The Convolution Layer consists of filters or kernels. These are small sets of learnable weights that capture the most important features of the input image. This layer performs convolutions on the input data to extract those relevant features.

The Pooling Layer helps to reduce the spatial dimensions of the input. It does this by downsampling the feature maps generated by the Convolution Layer, retaining the most important information while reducing computational complexity.

The Flattening Layer transforms the pooled feature maps into a one-dimensional vector. This step is necessary to connect the output of the previous layers to the Fully Connected Layer.

The Fully Connected Layer, also known as the Dense Layer, makes the final decision or prediction based on the extracted features. It connects every neuron in the previous layer to every neuron in this layer, allowing for decision-making.

The input is passed through each of these layers from left to right to produce the output; this process is known as the forward pass. To train the network, we pass the gradient of the loss backward through each of these layers, a process known as the backward pass or backpropagation. During the backward pass, we update the parameters of the CNN using Gradient Descent. If you're not familiar with the math and concepts, you can check this article, which discusses the whole math behind CNNs at a deeper level, highly recommended!
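As a rough sketch (the helper names here are my own, but the layer objects and their forward/backward methods mirror the classes we build later in this article), the whole pipeline boils down to chaining forward calls one way and backward calls the other way:

# A minimal sketch of the forward/backward flow implemented in this article.
# conv, pool and full are the layer objects built in the sections below;
# each one exposes a forward() and a backward() method.

def forward_pass(image, conv, pool, full):
    return full.forward(pool.forward(conv.forward(image)))

def backward_pass(loss_gradient, conv, pool, full, lr):
    # Gradients flow in reverse; each layer updates its own parameters
    # with gradient descent using the learning rate lr.
    grad = full.backward(loss_gradient, lr)
    grad = pool.backward(grad, lr)
    grad = conv.backward(grad, lr)
    return grad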

The Fashion MNIST Dataset

Here we are using the Fashion MNIST dataset. Some of you may have heard of Fashion MNIST: it consists of 70,000 grayscale images of clothing and accessories, divided into 10 different classes. Each image is a 28x28 pixel square, which is tiny compared to other image datasets, yet correctly classifying the images is still a challenging task for machine learning models because of the subtle variations that differentiate the classes.
Image Source: TensorFlow

Each image is a 28 x 28 grid of pixels, 784 pixels in total. The images already come as matrices in the dataset itself, so there is very little preprocessing left for us to do. These matrices are fed into the CNN, which performs the computations to produce the desired output.

Building the Convolution Layer

Now that we have covered the fundamental understanding of CNNs and the dataset we will be working with, it's time to focus on constructing the most crucial element of CNN architecture, which is the convolution layer. 

Initializing Instance Variables

Let's initialize some variables that are necessary for our CNN, including the kernels and biases and their shapes. Here is the code for creating the Convolution class and initializing its instance variables:

import numpy as np
from scipy.signal import correlate2d

class Convolution:

    def __init__(self, input_shape, filter_size, num_filters):
        input_height, input_width = input_shape
        self.num_filters = num_filters
        self.input_shape = input_shape

        # Shapes of the filters and of the layer output
        self.filter_shape = (num_filters, filter_size, filter_size)  # e.g. (1, 6, 6)
        self.output_shape = (num_filters, input_height - filter_size + 1, input_width - filter_size + 1)

        # Randomly initialized weights (filters) and biases
        self.filters = np.random.randn(*self.filter_shape)
        self.biases = np.random.randn(*self.output_shape)

We have initialized a few things here: first, the dimensions (shapes) of the inputs and outputs; then the number of filters or kernels we need; and finally the filters and biases, which are NumPy arrays holding the weights and biases of the Convolution Layer.

Forward Pass

Now let's create the forward pass method where we make the predictions. Here is the formula for the forward pass in the convolution layer:

\[(I * K)(i,j) = \sum_{m=1}^{k} \sum_{n=1}^{k} I_{(i+m-1),(j+n-1)} \cdot \mathrm{rot180}(K)_{m,n} + b_{i,j}\]

This is the convolution operation that we perform on the convolution layer. However, it's important to note that the definition of convolution in the context of CNNs differs from the traditional implementation.

Traditionally, convolution involves flipping the kernel by 180 degrees before (as shown in the equation) performing the element-wise multiplication and accumulation. However, in CNNs, libraries such as Keras and PyTorch do not apply this rule of flipping the kernel. They perform a cross-correlation operation without kernel rotation.

The decision to omit the rotation in these libraries is based on practical considerations. In the context of CNNs, the exact orientation of the kernel is not crucial for feature extraction, and flipping the kernel adds unnecessary complexity to the implementation.
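To see the relationship concretely, here is a small check using SciPy (the 4x4 input and 2x2 kernel below are made up): true convolution is the same as cross-correlation with the kernel rotated by 180 degrees.

import numpy as np
from scipy.signal import correlate2d, convolve2d

image = np.arange(16, dtype=float).reshape(4, 4)   # a made-up 4x4 "image"
kernel = np.array([[1.0, 2.0],
                   [3.0, 4.0]])

conv = convolve2d(image, kernel, mode="valid")                # true convolution (flips the kernel)
corr = correlate2d(image, np.rot90(kernel, 2), mode="valid")  # cross-correlation with a flipped kernel

print(np.allclose(conv, corr))   # True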

So note that we are using the cross-correlation operation here,

    def forward(self, input_data):
        # Store the input for use in the backward pass
        self.input_data = input_data

        output = np.zeros(self.output_shape)
        for i in range(self.num_filters):
            # Cross-correlate the input with each filter and add its bias
            output[i] = correlate2d(self.input_data, self.filters[i], mode="valid") + self.biases[i]

        # Applying the ReLU activation function
        output = np.maximum(output, 0)
        self.output = output  # cache the activated output (handy for backprop through ReLU)
        return output

The forward pass method accepts the input data as an argument, performs cross-correlation with each filter, adds the bias, and applies the ReLU activation function. ReLU makes the outputs non-linear by zeroing out negative responses, which lets the relevant features propagate while suppressing the rest.
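As a tiny illustration of what np.maximum(output, 0) does (the sample array below is made up):

# ReLU zeroes out negative responses and passes positive ones through unchanged
sample = np.array([[-1.5, 0.0, 2.3],
                   [ 0.7, -0.2, 4.1]])
print(np.maximum(sample, 0))
# [[0.  0.  2.3]
#  [0.7 0.  4.1]]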

Backward Pass

Let's create the backward method, which passes the gradients (partial derivatives) of the loss backward through the convolution layer. For this we need the gradient of the loss with respect to the weights (kernels) and the gradient of the loss with respect to the inputs. Here is the mathematical form:

Derivative of loss with respect to weights/filters,

\[\frac{\partial L}{\partial K_{ijk}} = \frac{\partial L}{\partial \hat{y}_{ij}} \cdot \frac{\partial \hat{y}_{ij}}{\partial K_{ijk}} = \frac{\partial L}{\partial \hat{y}_{ij}} \cdot x_{ij}\]

Updating the filters,

\[K_{ijk}^\text{new} = K_{ijk}^\text{old} - \eta \cdot \frac{{\partial L}}{{\partial K_{ijk}}}\]

Updating the bias,

\[b_j^\text{new} = b_j^\text{old} - \eta \cdot \frac{{\partial L}}{{\partial b_j}}\]

Now, let's implement this in code. The method takes the gradient of the loss with respect to the layer's outputs and the learning rate as arguments. Here is how we can do it:

    def backward(self, dL_dout, lr):
        # Gradients of the loss with respect to the inputs and the filters
        dL_dinput = np.zeros_like(self.input_data)
        dL_dfilters = np.zeros_like(self.filters)

        for i in range(self.num_filters):
            # Calculating the gradient of loss with respect to the kernels
            dL_dfilters[i] = correlate2d(self.input_data, dL_dout[i], mode="valid")

            # Calculating the gradient of loss with respect to the inputs
            dL_dinput += correlate2d(dL_dout[i], self.filters[i], mode="full")

        # Updating the parameters with the learning rate
        self.filters -= lr * dL_dfilters
        self.biases -= lr * dL_dout

        # Returning the gradient with respect to the inputs
        return dL_dinput

This method performs the backward pass through the convolution layer: it loops over each filter to accumulate the gradients, and finally updates the filters and biases, which is the most important step. Note that, for simplicity, this backward pass does not propagate through the ReLU applied in the forward method; a stricter implementation would first multiply dL_dout by the ReLU derivative, as sketched below.
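Here is what that variant could look like. This is a sketch, not part of the class as written above; it relies on forward() having cached its activated output as self.output, which the version above does.

    # Hypothetical variant of backward() that also applies the ReLU derivative
    def backward_with_relu(self, dL_dout, lr):
        # ReLU derivative: 1 where the activation is positive, 0 elsewhere
        dL_dout = dL_dout * (self.output > 0)

        dL_dinput = np.zeros_like(self.input_data)
        dL_dfilters = np.zeros_like(self.filters)
        for i in range(self.num_filters):
            dL_dfilters[i] = correlate2d(self.input_data, dL_dout[i], mode="valid")
            dL_dinput += correlate2d(dL_dout[i], self.filters[i], mode="full")

        self.filters -= lr * dL_dfilters
        self.biases -= lr * dL_dout
        return dL_dinput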

So we have finished coding the Convolution Layer; let's head over to the MaxPooling Layer.

Building the MaxPool Layer

MaxPooling is a simple pooling operation that passes only the maximum value from each window to the subsequent layer. To construct the MaxPooling layer, we need to specify the pool size, which is the size of the window over which pooling is applied. Let's see how:

class MaxPool:

    def __init__(self, pool_size):
        self.pool_size = pool_size

That's all we need as instance variables here.

Forward Pass in MaxPooling Layer

During the forward pass of MaxPooling, the input is the output of the convolution operation. It is essential to account for the number of channels, since convolving with multiple filters produces multiple feature maps. Here is the code for the forward pass in MaxPooling:

    def forward(self, input_data):

        self.input_data = input_data
        self.num_channels, self.input_height, self.input_width = input_data.shape
        self.output_height = self.input_height // self.pool_size
        self.output_width = self.input_width // self.pool_size

        # Determining the output shape
        self.output = np.zeros((self.num_channels, self.output_height, self.output_width))

        # Iterating over different channels
        for c in range(self.num_channels):
            # Looping through the height
            for i in range(self.output_height):
                # looping through the width
                for j in range(self.output_width):

                    # Starting position
                    start_i = i * self.pool_size
                    start_j = j * self.pool_size

                    # Ending Position
                    end_i = start_i + self.pool_size
                    end_j = start_j + self.pool_size

                    # Creating a patch from the input data
                    patch = input_data[c, start_i:end_i, start_j:end_j]

                    #Finding the maximum value from each patch/window
                    self.output[c, i, j] = np.max(patch)

        return self.output

In the given code, we have multiple kernels, resulting in multiple feature maps. To handle this, we need three nested loops. The outermost loop iterates over each of these feature maps, while the other two loops traverse the height and width of the feature maps.

Within each iteration, we initialize the start and end indices based on the current position in the loops. These indices determine the region in the input data from which we will extract a patch or window.

Subsequently, we create a patch by extracting the relevant section of the input data using the calculated indices. Then, from each patch, we identify the maximum value.

By repeating this process for each patch in every feature map, we populate the output feature map with the maximum values obtained.

Finally, the output feature map, which has undergone the max pooling operation, is returned as the result.
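Here is a quick worked example on a single 4x4 feature map (the numbers and variable names are made up), just to confirm the behaviour of the class above:

feature_map = np.array([[[1.0, 3.0, 2.0, 4.0],
                         [5.0, 6.0, 1.0, 2.0],
                         [7.0, 2.0, 9.0, 1.0],
                         [3.0, 4.0, 5.0, 8.0]]])   # shape (1, 4, 4): one channel

pool_demo = MaxPool(2)
print(pool_demo.forward(feature_map))
# [[[6. 4.]
#   [7. 9.]]]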

Backward Pass in MaxPooling Layer

MaxPooling has no learnable parameters, so unlike the convolution or fully connected layers, its backward pass does not involve any parameter updates.

Instead, during the backward pass of MaxPooling, we simply route the incoming gradients back to the positions where the maximum values came from: each gradient from the next layer is placed at the location of the maximum within its window, and every other position receives zero. This ensures the gradients keep flowing through the MaxPooling layer and continue propagating through the network.

    def backward(self, dL_dout, lr):
        dL_dinput = np.zeros_like(self.input_data)

        for c in range(self.num_channels):
            for i in range(self.output_height):
                for j in range(self.output_width):
                    start_i = i * self.pool_size
                    start_j = j * self.pool_size

                    end_i = start_i + self.pool_size
                    end_j = start_j + self.pool_size
                    patch = self.input_data[c, start_i:end_i, start_j:end_j]

                    mask = patch == np.max(patch)

                    dL_dinput[c,start_i:end_i, start_j:end_j] = dL_dout[c, i, j] * mask

        return dL_dinput

Nothing fancy here: the method takes the gradient of the loss with respect to the layer's outputs as an argument. It then walks over the input data window by window, exactly as in the forward pass.

During this process, a mask variable is used to decide which positions receive the gradient and which are discarded: the gradient is routed to the position of the maximum value within each window, while all other positions are set to 0.

If you are not familiar with the concepts of MaxPooling, I highly recommend checking this article about the math of CNNs, including the backward pass.

That's also done; let's get into the decision-making part of the CNN, the Fully Connected Layer.

Building the Fully Connected (Dense) Layer

You may already be familiar with the Fully Connected Layer: it is simply a Multi-Layer Perceptron (MLP), the classic neural network architecture. The Dense Layer is like the brain of the CNN, making decisions based on the features extracted by the convolution and pooling layers.

To construct a Fully Connected Layer, we need to initialize some variables, like the input size, output size, weights, and biases. Here is how you can do it:

class Fully_Connected:

    def __init__(self, input_size, output_size):
        self.input_size = input_size # Size of the inputs coming
        self.output_size = output_size # Size of the output producing
        self.weights = np.random.randn(output_size, self.input_size)
        self.biases = np.random.rand(output_size, 1)

Now, the equation for performing a forward pass in the Fully Connected Layer for each neuron can be written like this,

\[\text{output}_i = \Phi\left(\sum_{j} W_{ij} X_{j} + b_{i}\right)\]

where W is the weight matrix, X is the input vector from the preceding layer, b is the bias, and Φ is the activation function. This equation is simple, and we learned it when starting with the basics of Neural Networks: compute the weighted sum of the inputs and add a bias.

But which activation function are we going to use? The dataset we are working with contains 10 different classes, and to produce an output layer of 10 classes with a probability distribution, the natural choice is the Softmax function as the activation for our Dense Layer.

The equation for Softmax looks like this,

\[\text{Softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad \text{for } i = 1, 2, \ldots, K\]

The Softmax function creates a probability distribution by exponentiating each element of the input vector and normalizing them through the sum of all exponentiated values. This normalization ensures that the resulting probabilities sum up to 1, representing a valid probability distribution across the classes.

Here is the code for constructing the Softmax Activation function,

    def softmax(self, z):
        # Shift the input values to avoid numerical instability
        shifted_z = z - np.max(z)
        exp_values = np.exp(shifted_z)
        sum_exp_values = np.sum(exp_values, axis=0)

        # Compute the softmax probabilities
        probabilities = exp_values / sum_exp_values

        return probabilities

The method accepts the weighted sum of the inputs (the dot product of weights and inputs plus the bias) and applies the softmax function to return the probabilities. The length of this probability vector equals the number of classes.
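A quick way to convince yourself of this (the instance and the sample vector below are just for illustration):

# Throwaway instance just to call softmax(); its weights are not used here
fc_demo = Fully_Connected(input_size=4, output_size=3)

z = np.array([[2.0], [1.0], [0.1]])   # a sample pre-activation column vector
probs = fc_demo.softmax(z)
print(probs.ravel())                  # roughly [0.659 0.242 0.099]
print(probs.sum())                    # sums to 1 (up to floating point)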

One more thing we need is the derivative of the softmax, because during backpropagation we must compute the gradient of the loss with respect to the pre-activation outputs, and that requires the softmax derivative.

The derivative of softmax can be expressed using this equation,

\[\frac{{\partial \text{Softmax}(\mathbf{z})_i}}{{\partial z_j}} = \text{Softmax}(\mathbf{z})_i \cdot (\delta_{ij} - \text{Softmax}(\mathbf{z})_j)\]

Let's code this,

    def softmax_derivative(self, s):
        return np.diagflat(s) - np.dot(s, s.T)
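A quick sanity check on this Jacobian (the vector s below is a made-up softmax output): because the softmax probabilities sum to 1, every column of the Jacobian must sum to zero.

s = np.array([[0.2], [0.3], [0.5]])   # a softmax output column vector (sums to 1)
J = np.diagflat(s) - np.dot(s, s.T)   # same formula as softmax_derivative above
print(np.allclose(J.sum(axis=0), 0))  # True: each column of the Jacobian sums to zero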

Forward Pass in Dense Layer

The forward pass in the Dense Layer is simply the equation shown above; all we need to do is code it. However, note that the output of the Convolution and Pooling Layers is a 2D matrix (per channel). To pass it to the Dense Layer, it needs to be converted into a 1D vector, and this is where the concept of Flattening comes in.

While some implementations treat Flattening as a separate layer, in this code we handle it within the Dense Layer itself: the forward pass flattens the 2D matrix into a 1D vector, and the backward pass reshapes the gradient back. Here is the code for performing the forward pass:

    def forward(self, input_data):
        self.input_data = input_data
        # Flattening the inputs from the previous layer into a vector
        flattened_input = input_data.flatten().reshape(1, -1)
        self.z = np.dot(self.weights, flattened_input.T) + self.biases

        # Applying Softmax
        self.output = self.softmax(self.z)
        return self.output

Simple: the method accepts the input data as an argument, flattens it, computes the weighted sum of the inputs, applies the softmax activation, and finally returns the output.

Backward pass in Dense Layer

This is the final piece of the architecture. The backward pass in the Dense Layer is based on these three gradient computations:

Derivative of loss with respect to weights,

\[\frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial w_{ij}}\]

Derivative of loss with respect to inputs,

\[\frac{\partial L}{\partial x_j} = \sum_{i=1}^{n} \frac{\partial L}{\partial \hat{y}_i} \cdot w_{ij}\]

Derivative of loss with respect to bias,

\[\frac{\partial L}{\partial b_j} = \frac{\partial L}{\partial \hat{y}_j}\]

After computing each of these gradients, we update the weights and biases and, at the same time, pass the input gradients backward to the preceding layer. Here is the code for doing that:

    def backward(self, dL_dout, lr):
        # Calculate the gradient of the loss with respect to the pre-activation (z)
        dL_dy = np.dot(self.softmax_derivative(self.output), dL_dout)
        # Calculate the gradient of the loss with respect to the weights (dw)
        dL_dw = np.dot(dL_dy, self.input_data.flatten().reshape(1, -1))

        # Calculate the gradient of the loss with respect to the biases (db)
        dL_db = dL_dy

        # Calculate the gradient of the loss with respect to the input data (dL_dinput)
        dL_dinput = np.dot(self.weights.T, dL_dy)
        dL_dinput = dL_dinput.reshape(self.input_data.shape)

        # Update the weights and biases based on the learning rate and gradients
        self.weights -= lr * dL_dw
        self.biases -= lr * dL_db

        # Return the gradient of the loss with respect to the input data
        return dL_dinput

And that's it! We have coded the main layers of the CNN, but we're not done yet: we still need a loss function and its derivative, a predict function, and of course a training function.

Categorical Cross Entropy & its Derivative

The loss function we are choosing is categorical cross-entropy. But why? Since we are dealing with multiple classes, we need a loss function that measures the error across multiple classes effectively, and categorical cross-entropy does exactly that. Here is the code for calculating the loss and its gradient (derivative):

def cross_entropy_loss(predictions, targets):

    # 10 output classes in Fashion MNIST; the loss is averaged over the output vector
    num_classes = 10

    # Avoid numerical instability by adding a small epsilon value
    epsilon = 1e-7
    predictions = np.clip(predictions, epsilon, 1 - epsilon)
    loss = -np.sum(targets * np.log(predictions)) / num_classes
    return loss

def cross_entropy_loss_gradient(actual_labels, predicted_probs):
    # actual_labels is a single one-hot vector here, so this is again the number of classes
    num_classes = actual_labels.shape[0]
    gradient = -actual_labels / (predicted_probs + 1e-7) / num_classes

    return gradient
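As a side note, when this raw gradient -y/p is pushed through the softmax Jacobian in the Dense layer's backward pass, it collapses to the familiar predictions - targets shortcut (ignoring the epsilon and the averaging factor). A quick check with made-up numbers:

# Combining the softmax Jacobian with the raw cross-entropy gradient -y/p
# reproduces the well-known shortcut p - y for a one-hot target y.
p = np.array([[0.7], [0.2], [0.1]])           # example softmax output (column vector)
y = np.array([[0.0], [1.0], [0.0]])           # one-hot target

jacobian = np.diagflat(p) - np.dot(p, p.T)    # same formula as softmax_derivative above
raw_grad = -y / p                             # cross-entropy gradient without epsilon/scaling

print(np.allclose(np.dot(jacobian, raw_grad), p - y))   # True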

If you want to know more about the Cross-Entropy loss function and its different variants, check this article, Cross-Entropy Loss function

Training Function

Now, let's create a training function that ties everything together for training our CNN. First we create instances of the layers (X_train, the training images used here for the input shape, will be prepared in the Performing Classification section below):

conv = Convolution(X_train[0].shape, 6, 1)
pool = MaxPool(2)
full = Fully_Connected(121, 10)

def train_network(X, y, conv, pool, full, lr=0.01, epochs=200):
    for epoch in range(epochs):
        total_loss = 0.0
        correct_predictions = 0

        for i in range(len(X)):
            # Forward pass
            conv_out = conv.forward(X[i])
            pool_out = pool.forward(conv_out)
            full_out = full.forward(pool_out)
            loss = cross_entropy_loss(full_out.flatten(), y[i])
            total_loss += loss

            # Converting the prediction to a one-hot vector
            one_hot_pred = np.zeros_like(full_out)
            one_hot_pred[np.argmax(full_out)] = 1
            one_hot_pred = one_hot_pred.flatten()

            num_pred = np.argmax(one_hot_pred)
            num_y = np.argmax(y[i])

            if num_pred == num_y:
                correct_predictions += 1
            # Backward pass
            gradient = cross_entropy_loss_gradient(y[i], full_out.flatten()).reshape((-1, 1))
            full_back = full.backward(gradient, lr)
            pool_back = pool.backward(full_back, lr)
            conv_back = conv.backward(pool_back, lr)

        # Print epoch statistics
        average_loss = total_loss / len(X)
        accuracy = correct_predictions / len(X) * 100.0
        print(f"Epoch {epoch + 1}/{epochs} - Loss: {average_loss:.4f} - Accuracy: {accuracy:.2f}%")

Let's explain each piece one by one. The first thing we did is create instances of the classes we have built so far. The train_network function takes several arguments: the inputs and labels, the convolution layer, the pooling layer, the fully connected (Dense) layer, the learning rate, and finally the number of epochs (iterations).

During training, each epoch iterates over every input. For each input, a forward pass is performed through all the layers, and the output of the Fully Connected Layer is used to calculate the loss. The loss gradient is then propagated backward through each layer, updating the parameters along the way.

Predict Function

The predict function simply calls the forward pass of each layer. Here is how you can do it:

def predict(input_sample, conv, pool, full):
    # Forward pass through Convolution and pooling
    conv_out = conv.forward(input_sample)
    pool_out = pool.forward(conv_out)
    # Flattening
    flattened_output = pool_out.flatten()
    # Forward pass through fully connected layer
    predictions = full.forward(flattened_output)
    return predictions

Nothing to explain here: we just pass the input through each layer one after the other and take the output of the Fully Connected Layer as the prediction.

Performing Classification

Let's see our CNN in action. For that, we need the Fashion MNIST dataset; here I'm using the Keras API from TensorFlow to download the dataset, which already comes split into train and test sets:

import tensorflow.keras as keras
import numpy as np

# Load the Fashion MNIST dataset
(train_images, train_labels), (test_images, test_labels) = keras.datasets.fashion_mnist.load_data()

Once we have the dataset, you can print train_images to inspect the pixel matrices. Here I'm using the first 5,000 images for training the network and the next 5,000 images for testing (both slices are taken from the original training images). It is also better to scale the pixel values down to the 0-1 range, since this keeps the computations numerically well behaved and helps training converge.

X_train = train_images[:5000] / 255.0
y_train = train_labels[:5000]

X_test = train_images[5000:10000] / 255.0
y_test = train_labels[5000:10000]

X_train.shape

-------------

(5000, 28, 28)

Converting Labels to One-Hot Vectors

Converting labels to one-hot vectors is common practice when it comes to Neural Networks because it matches the shape of the network's output, makes the loss calculation straightforward, and keeps training stable.

from keras.utils import to_categorical

y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

y_test[0]

-----------------

array([0., 0., 0., 0., 1., 0., 0., 0., 0., 0.], dtype=float32)
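Since the rest of the article sticks to NumPy, here is an equivalent one-hot encoder without Keras, just as a reference (the function name is my own):

def one_hot(labels, num_classes=10):
    # labels: integer class ids, e.g. train_labels[:5000]
    encoded = np.zeros((labels.shape[0], num_classes), dtype=np.float32)
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

# one_hot(train_labels[:5000]) should match to_categorical(train_labels[:5000])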

Training CNN

Alright, it's time to train our CNN. Simply call the train_network function, passing the training data, the labels, and the layers.

train_network(X_train, y_train, conv, pool, full)

This will start training the network, and you should see the loss decreasing and the accuracy increasing at varying rates. Note that you can choose any architecture here, and how fast the network converges depends on that choice: you can experiment with different filter sizes, numbers of filters, pool sizes, and so on. The architecture used here, a single 6x6 filter with a pooling size of 2, is simply what I found to perform well in this case. Change these values to see which combination gives the best results, for example as sketched below.
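For instance, switching to a 3x3 filter changes the flattened size: 28 - 3 + 1 = 26, then 26 // 2 = 13 after pooling, so the Dense layer needs 13 * 13 = 169 inputs. A sketch of that alternative configuration (the variable names are my own, keeping the single-filter setup):

conv_alt = Convolution(X_train[0].shape, 3, 1)   # 28x28 -> 26x26 feature map
pool_alt = MaxPool(2)                            # 26x26 -> 13x13
full_alt = Fully_Connected(169, 10)              # 13 * 13 = 169 flattened inputs

# train_network(X_train, y_train, conv_alt, pool_alt, full_alt)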

Using this architecture, I got around 82% accuracy on training and 78% accuracy on testing after 200 epochs, which is not outstanding but satisfactory. The interesting part is that even when testing on a larger set of images, the network still gives a reasonable accuracy of around 75%. That is the beauty of Convolutional Neural Networks: they are very good at picking out the most relevant patterns rather than memorizing entire images.

Predictions

Now, let's call the predict function to see the outputs produced after training the network:

predictions = []

for data in X_test:
    pred = predict(data, conv, pool, full)
    one_hot_pred = np.zeros_like(pred)
    one_hot_pred[np.argmax(pred)] = 1
    predictions.append(one_hot_pred.flatten())

predictions = np.array(predictions)

predictions

The predictions are stored in a Python list, which is then converted to a NumPy array. Using these predictions you can compute the accuracy of the model as well as other metrics like precision, recall, and a full classification report; see the example after the accuracy score below.

from sklearn.metrics import accuracy_score

accuracy_score(predictions, y_test)

--------

0.784625
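Beyond plain accuracy, a per-class report shows precision and recall for each of the 10 classes. Here is how you could get one with scikit-learn (shown as an illustration of the API, not output from the run above):

from sklearn.metrics import classification_report

# Per-class precision, recall and F1 for the one-hot encoded predictions
print(classification_report(y_test, predictions))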

For the full version of the code, you can see my Kaggle notebook here: CNN from scratch

Conclusion

In conclusion, we have built a Convolutional Neural Network (CNN) from scratch. We started by creating the Convolution Layer to extract important features from the input data. Then, the MaxPooling Layer helped reduce the data size. Finally, the Fully Connected Layer allowed us to make predictions by passing the output through the network.

During backpropagation, we updated the weights and biases by passing gradients backward through the layers. We trained the network using the Fashion MNIST dataset and performed predictions.

Although the training and testing accuracy were not exceptional, it was interesting to see that even when tested with a larger dataset, the network still achieved reasonably accurate results. This highlights CNN's ability to efficiently find relevant patterns without processing the entire image.

In summary, we built a CNN, trained it on the Fashion MNIST dataset, and witnessed its capability. This hands-on experience provided valuable insights into how CNNs work and their effectiveness in image classification tasks.