NLP | Neural Networks | LSTM | RNN | Machine Learning

This is a spam email classifier, using the synthetic Spam and Ham dataset from Kaggle.

This work was done in collaboration with Arvin Ymson and Warren de la Cruz.

Dataframe

The dataset consists of 150 emails: 50 spam emails and 100 “ham” emails. That gives a proportional chance criterion of 1.25 * ((1/3)^2 + (2/3)^2) ≈ 69.4%, i.e., 1.25 times the accuracy we would expect by assigning labels at random in proportion to the class sizes. This is the minimum accuracy our model must beat to be of any use.
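The criterion is just arithmetic on the class proportions; a minimal sketch (the 1.25 multiplier is the conventional factor used with the proportional chance criterion):

```python
# Proportional chance criterion for a 50-spam / 100-ham dataset.
# PCC = sum of squared class proportions; the conventional benchmark
# multiplies it by 1.25.
n_spam, n_ham = 50, 100
total = n_spam + n_ham

p_spam = n_spam / total  # 1/3
p_ham = n_ham / total    # 2/3

pcc = p_spam**2 + p_ham**2  # ≈ 0.556
benchmark = 1.25 * pcc      # ≈ 0.694

print(f"Accuracy to beat: {benchmark:.1%}")  # → Accuracy to beat: 69.4%
```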

Distribution

Principal component analysis on the TF-IDF matrix of the data shows that the spam and ham emails are easily separated by vocabulary alone: one can draw a horizontal line and get most of the way there. To confirm this, an XGBoost model trained on the PCA-reduced data achieved 91% test accuracy (with a 0.15 test split).

PCA

Three neural networks were created and compared using 5-fold cross-validation: a simple RNN model, a simple LSTM model, and a multilayer LSTM model. These were designed to improve on the baseline model by taking word order into account.
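A minimal Keras sketch of the three architectures. The vocabulary size, sequence length, and layer widths are assumed values for illustration, not the notebook's actual hyperparameters:

```python
import numpy as np
from tensorflow.keras import layers, models

VOCAB_SIZE, MAX_LEN, EMBED_DIM = 5000, 100, 32  # assumed hyperparameters

def build_simple_rnn():
    return models.Sequential([
        layers.Embedding(VOCAB_SIZE, EMBED_DIM),
        layers.SimpleRNN(32),
        layers.Dense(1, activation="sigmoid"),  # binary spam/ham output
    ])

def build_simple_lstm():
    return models.Sequential([
        layers.Embedding(VOCAB_SIZE, EMBED_DIM),
        layers.LSTM(32),
        layers.Dense(1, activation="sigmoid"),
    ])

def build_multilayer_lstm():
    return models.Sequential([
        layers.Embedding(VOCAB_SIZE, EMBED_DIM),
        layers.LSTM(32, return_sequences=True),  # feed full sequence to next LSTM
        layers.LSTM(16),
        layers.Dense(1, activation="sigmoid"),
    ])

# Each model maps a batch of token-ID sequences to a spam probability.
dummy = np.zeros((2, MAX_LEN), dtype="int32")
for build in (build_simple_rnn, build_simple_lstm, build_multilayer_lstm):
    model = build()
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    assert model(dummy).shape == (2, 1)
```

For the 5-fold comparison itself, each builder would be re-fit on every fold (e.g. with `sklearn.model_selection.KFold`) and the mean validation accuracy compared across the three models.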

NNs

As expected, cross-validation showed the multilayer LSTM to be the best model. On a 0.15 test split, it achieved 100% test accuracy. I do note, however, that real-world results would not be nearly as good; the data was easy to separate in the first place. Still, this suggests that recurrent neural network models can outperform simpler machine-learning baselines on NLP classification tasks.

Results

You will find the full notebook here.