Let's create a spam classifier using Naive Bayes and a TF-IDF vectorizer.
Loading the Dataset
The dataset can be downloaded from Kaggle, which hosts many datasets of labelled messages of the kinds you might receive.
The dataset can be loaded with the read_csv function in pandas. This particular dataset contains 5559 rows: 747 spam and 4812 ham messages. One thing is missing, however: the model needs the labels as numbers, so we add another column to the dataset containing 0 for ham and 1 for spam.
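As a minimal sketch of this step, the snippet below loads a CSV with pandas and adds the numeric label column. The inline CSV is a small made-up stand-in for the Kaggle file, and the column names type, text and label are assumptions; adjust them to match the file you download.

```python
import io

import pandas as pd

# Made-up stand-in for the Kaggle CSV. With the real file, pass its path
# to pd.read_csv instead. The column names "type" and "text" are assumptions.
csv_data = io.StringIO(
    "type,text\n"
    "ham,Are we still meeting for lunch today?\n"
    "spam,WINNER! Claim your free prize now\n"
    "ham,I'll call you later tonight\n"
    "spam,Urgent! Your account has won a cash reward\n"
)
df = pd.read_csv(csv_data)

# Add a numeric label column: 0 for ham, 1 for spam.
df["label"] = df["type"].map({"ham": 0, "spam": 1})

print(df)
```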
Running this code creates a new column at the end of the dataset with the numeric label for each message.
Splitting the Dataset
Usually, we split the entire dataset into training and test data. The model is trained on the training data and its corresponding labels. It is then evaluated on the test data, which it has not seen during training, to check how well it predicts.
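One way to do the split is scikit-learn's train_test_split, sketched below on a tiny made-up dataset. The 80/20 ratio, the random_state value and the use of stratify (which keeps the ham/spam ratio similar in both parts) are illustrative choices, not necessarily the ones used for the results in this post.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny made-up stand-in for the labelled dataset (0 = ham, 1 = spam).
df = pd.DataFrame({
    "text": [
        "See you at lunch", "Win a free prize now",
        "Call me when you get home", "Claim your cash reward today",
        "Running late, sorry", "Free entry to our prize draw",
        "Can you send me the notes", "Urgent: you have won a voucher",
        "Happy birthday!", "Reply now to get free ringtones",
    ],
    "label": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    df["text"],
    df["label"],
    test_size=0.2,
    random_state=42,
    stratify=df["label"],
)

print("rows in train and test set -", X_train.shape[0], X_test.shape[0])
```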
Printing the sizes of the two parts shows the number of rows in the train and test sets.
TF-IDF (term frequency-inverse document frequency) is a statistical method for scoring how relevant a word is to a text, sentence, or paragraph. A word scores high when it appears often in one message but rarely in the others. The messages are text, but the model needs numbers, so TF-IDF converts each message into a vector of numerical values that can be fitted to the model.
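The vectorizing and training step can be sketched as follows, using scikit-learn's TfidfVectorizer and MultinomialNB with their default settings. The training messages are made up for illustration; in the real workflow they would be the training split of the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training messages and labels (0 = ham, 1 = spam).
train_texts = [
    "Win a free prize now",
    "Free cash reward, claim now",
    "See you at lunch",
    "Can you call me tonight",
]
train_labels = [1, 1, 0, 0]

# Learn the vocabulary from the training messages and convert them to a
# sparse matrix: one row per message, one column per word.
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(train_texts)

# Fit the Multinomial Naive Bayes classifier on the TF-IDF features.
model = MultinomialNB()
model.fit(X_train_tfidf, train_labels)

print(X_train_tfidf.shape)
```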
Here we applied the TF-IDF vectorizer and fitted the Multinomial Naive Bayes classifier on the transformed training data. Now let's look at the predictions made by the model.
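A small sketch of how the predictions might be put side by side with the actual labels is shown below. The data is made up, and the names comparison, actual and predicted are only illustrative. Note that the test messages go through transform, not fit_transform, so they are encoded with the vocabulary learned from the training data.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up data standing in for the real train/test split (0 = ham, 1 = spam).
X_train = [
    "Win a free prize now",
    "Free cash reward, claim now",
    "See you at lunch",
    "Can you call me tonight",
]
y_train = [1, 1, 0, 0]
X_test = ["Claim your free prize", "Lunch tonight?"]
y_test = [1, 0]

vectorizer = TfidfVectorizer()
model = MultinomialNB()
model.fit(vectorizer.fit_transform(X_train), y_train)

# Reuse the fitted vocabulary for the test messages.
predictions = model.predict(vectorizer.transform(X_test))

# Put the actual and predicted labels next to each message.
comparison = pd.DataFrame(
    {"message": X_test, "actual": y_test, "predicted": predictions}
)
print(comparison)
```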
Run this code to see the predictions and compare them with the actual values.
Evaluating the Model
The scores are not bad at all. The recall (sensitivity) and precision scores are 1.0 and 0.75.. respectively. A recall of 1.0 means the model catches all of the spam messages in the test set, with no false negatives. The lower precision means that some ham messages are also flagged as spam: roughly one in four messages predicted as spam is actually ham. Even so, the precision and F1 scores are not bad.
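To make the relationship between these scores concrete, here is a small worked example using sklearn.metrics. The labels are hypothetical, chosen only to reproduce the same pattern (every spam message caught, one ham message wrongly flagged); they are not the actual test results from this post.

```python
from sklearn.metrics import (
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels (0 = ham, 1 = spam): 3 spam messages, 5 ham messages.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
# The model catches all 3 spam messages but also flags 1 ham message.
y_pred = [1, 1, 1, 1, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 3
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

# Rows are actual classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred)

print("precision:", precision)
print("recall:", recall)
print("f1:", f1)
print(matrix)
```

The single false positive is what pulls precision below 1.0, while recall stays perfect because no spam message is missed.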
Let's predict some real messages. Here are some messages that I received in the past.
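The sketch below shows how new messages could be passed to a trained model. Both the training data and the four messages are made-up placeholders, not the messages actually received, and a model trained on so little data will not behave like one trained on the full dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data (0 = ham, 1 = spam); use the real training split instead.
train_texts = [
    "Win a free prize now",
    "Free cash reward, claim now",
    "Urgent: reply now to win cash",
    "See you at lunch",
    "Can you call me tonight",
    "Are you coming tomorrow?",
]
train_labels = [1, 1, 1, 0, 0, 0]

vectorizer = TfidfVectorizer()
model = MultinomialNB()
model.fit(vectorizer.fit_transform(train_texts), train_labels)

# Placeholder messages: three spam-like ones followed by one ham-like one.
new_messages = [
    "Congratulations! You have won a free prize, claim now",
    "Urgent: your account has a cash reward waiting",
    "Free entry to win cash, reply now",
    "Are you coming to lunch tomorrow?",
]

# Encode the new messages with the already fitted vectorizer, then predict.
new_predictions = model.predict(vectorizer.transform(new_messages))

for message, label in zip(new_messages, new_predictions):
    print("spam" if label == 1 else "ham", "-", message)
```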
The first three messages I received were spam and the last one is probably ham. When I run this code, the model's predictions are correct. However, this model may not be good enough for a real-life application, since it is only a basic introduction to spam classification. You can improve it by tuning its hyperparameters with a technique such as grid search. Also try collecting more data and working with some of the good datasets available on platforms like Kaggle.