---
title: "SMS Spam Classification"
author: "subhash"
date: "4 May 2018"
output: github_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,include=TRUE,warning=FALSE,message = FALSE)
```
### Business Objective
Many email services today provide spam filters that are able to classify emails into spam and non-spam with high accuracy. This project aims to build a spam classifier for SMS messages received by mobile phone users.
We will build the application from scratch and follow the traditional data science workflow:
         
*  Data Collection     
*  Understanding Data     
*  Data Cleaning     
*  Data preparation      
*  Analyze Data     
*  Modelling     
*  Testing   
 
Loading the required libraries.
```{r}
library(caret)
library(readr)
library(ggplot2)
library(gridExtra)
library(tm)
library(knitr)
```

#### Data Collection   
Kaggle is a home for data science learners. It hosts competitions and has many free-to-use datasets which can be used for practicing your ML skills. For this project, we obtained a structured dataset of SMS messages in CSV format with two variables: Category and Message.
```{r}
smsdata <- read.csv("SPAM text message 20170820 - Data.csv", stringsAsFactors = FALSE)
kable(head(smsdata),caption = "SMS spam dataset")
```
#### Understanding the data   
The Category variable is of character class and has to be converted to a factor variable. There are about 5572 observations/messages. The class distribution is skewed towards the "ham" class; there are only around 750 spam messages.
```{r}
str(smsdata)
smsdata$Category<-as.factor(smsdata$Category)
table(smsdata$Category)
ggplot(smsdata) + geom_bar(aes(x = Category)) + xlab("category") + ggtitle("SMS categories")
```

We create a corpus from the messages and examine it. The Corpus() function creates an R object to store text documents. Since we have already read the SMS messages into an R vector, we specify VectorSource(), which tells Corpus() to use the messages in the vector smsdata$Message.
```{r}
message_corpus <- Corpus(VectorSource(smsdata$Message))
inspect(message_corpus[1:3])
```
We see that the messages are similar to the ones we receive ourselves. The examples above show that a message may contain URLs, numbers, and dollar amounts.
To process these SMS messages we need to clean them before doing any analysis.

#### Data Cleaning   
Preprocessing and cleaning the text data can improve the performance of our spam classifier, so we apply some basic preprocessing steps:

*  Lower-casing: the entire SMS is converted to lower case, so that capitalization is ignored.
*  Numbers: all numbers are removed.
*  Stop words: stop words are removed using R's built-in stopwords list.
*  Removal of non-words: non-words and punctuation are removed.
*  Trimming: all white space (tabs, newlines, spaces) is trimmed to a single space character.
```{r}
# content_transformer() keeps the documents as PlainTextDocuments inside the corpus
corpus_clean <- tm_map(message_corpus, content_transformer(tolower))
inspect(corpus_clean[1:5])
corpus_clean <- tm_map(corpus_clean, removeNumbers)
inspect(corpus_clean[1:5])
```
```{r}
# Removing Stop Words
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords())
inspect(corpus_clean[1:3])
# Removing punctuation:
corpus_clean <- tm_map(corpus_clean, removePunctuation)
inspect(corpus_clean[1:3])
# Strip White Spaces
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
inspect(corpus_clean[1:3])
```
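To sanity-check the cleaning, we can compare one raw message with its cleaned counterpart. The first document is used here purely as an illustration, and this assumes the corpus elements are still stored as plain text documents.
```{r}
# Raw message as read from the CSV
smsdata$Message[1]
# The same message after lower-casing, number/stop-word/punctuation removal and whitespace stripping
as.character(corpus_clean[[1]])
```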
#### Data Preparation    
As usual for any data science problem, to evaluate the performance of our ML model we split the data into a training set and a test set (used for evaluation). It is important to check the class distribution of both the training set and the test set, so that the classes are split evenly.
```{r}
set.seed(99)
# We use the dataset to create a partition (75% training 25% testing)
index <- sample(1:nrow(smsdata), 0.75*nrow(smsdata))
# select 25% of the data for testing
testset <- smsdata[-index,]
# select 75% of data to train the models
trainset <- smsdata[index,]
par(mfrow=c(1,2))
plot(trainset$Category,xlab="category",ylab="counts",main = "Train")
plot(testset$Category, xlab="category", ylab="counts", main = "Test")
```
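As a quick numerical complement to the plots, we can also compare the class proportions in the two splits:
```{r}
# Class proportions in each split; the two should be roughly equal
prop.table(table(trainset$Category))
prop.table(table(testset$Category))
```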

#### Analyze the Data  
We can find the most frequent words in the messages using a term-document matrix, a mathematical matrix that describes the frequency of terms occurring in a collection of documents. In a term-document matrix, rows correspond to terms/words and columns correspond to documents in the collection (the document-term matrix is simply its transpose).
```{r}
dtm <- TermDocumentMatrix(corpus_clean)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
barplot(d[1:10,]$freq, las = 2, names.arg = d[1:10,]$word,
        col ="lightblue", main ="Most frequent words",
        ylab = "Word frequencies")
```
These are the top 10 words in the whole message dataset irrespective of category. We can also use a word cloud, a way to visually depict the frequency with which words appear in text data. The cloud is made up of words scattered somewhat randomly around the figure; words appearing more often in the text are shown in a larger font, while less common terms are shown in smaller fonts. We draw one cloud per category.
```{r}
library(RColorBrewer)
library(wordcloud)
#wordcloud for spam messages
wordcloud(trainset$Message[trainset$Category=="spam"], min.freq = 5,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))
#wordcloud for good messages
wordcloud(trainset$Message[trainset$Category=="ham"], min.freq = 35,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))
```
We see that the words "call", "now" and "free" are more frequent in the spam SMS, which is what we would normally expect.

#### Modelling

At the heart of our application lies the machine learning model. We will use Naive Bayes, which is a good algorithm for text classification. When dealing with text, it is very common to treat each unique word as a feature, as we are doing with the term document matrix. Naive Bayes performs well when we have multiple classes and are working with text classification; a toy illustration of how it scores a message follows the list of advantages below.
 
Advantages of the Naive Bayes algorithm are:

*  It is simple, and if the conditional independence assumption actually holds, a Naive Bayes classifier converges quicker than discriminative models like logistic regression, so you need less training data; even when the assumption does not hold, it often performs well in practice.
*  It requires little model training time.
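
Before returning to our data, here is a toy illustration of how such a classifier scores a message: it multiplies the class prior by the per-word conditional probabilities and then normalises. The numbers below are made up purely to show the arithmetic and are not taken from our dataset.
```{r}
# Hypothetical probabilities for a two-word message containing "free" and "call"
p_spam   <- 0.13                          # assumed prior P(spam)
p_ham    <- 0.87                          # assumed prior P(ham)
p_w_spam <- c(free = 0.20, call = 0.25)   # assumed P(word | spam)
p_w_ham  <- c(free = 0.01, call = 0.05)   # assumed P(word | ham)

# Unnormalised scores: P(class) * product of P(word | class)
score_spam <- p_spam * prod(p_w_spam)
score_ham  <- p_ham  * prod(p_w_ham)

# Posterior probability that the message is spam
score_spam / (score_spam + score_ham)
```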

Currently we have a document term matrix covering our entire corpus, which holds the count of each word in each document/message. Since our features are terms/words, we convert each count into a binary factor variable indicating whether the word is present in the message or not, which is the representation our NB model expects.
```{r}
# convert word counts into a two-level factor: "1" if the word occurs, "0" otherwise
convert_factor <- function(x) {
  x <- factor(ifelse(x > 0, 1, 0))
  return(x)
}
```
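A quick sanity check of this helper on a made-up count vector:
```{r}
# Any positive count becomes "1" (word present); zero stays "0" (word absent)
convert_factor(c(0, 3, 1, 0))
```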
From the "d" dataframe we know there are 7596 terms/words. We could use all these terms as features in our model but it isnt viable. So lets
rescrict ourselves by using the top frequent words, this is a clever way of  reducing the features to train our model. In order to choose the 
optimal number of frequent words("n") to be used as features, we use a validation set split from the training set, to measure the 
classification accuracy for different values of "n" and choose the minimum "n" which gives a good performance.
```{r}
library(e1071)
set.seed(99)
index_validation<-sample(1:nrow(trainset), 0.75*nrow(trainset))
train_corpus<-corpus_clean[index]
```
"train_corpus" is corpus of documents from the training set, which is further indexed for training and validation. Below is a function which 
returns the vector of training accuracy and validation accuracy.The methodoly for building the NB classifier is fairly simple, one can use 
the "e1071" package.    

Here are the steps involved in building the model:

*  Create a vector of frequent words to be used as features.
*  Create a document term matrix for both the training set and the validation set.
*  Convert the DTM into a matrix of factor variables.
*  Build the Naive Bayes model.
*  Make predictions and calculate the accuracy.
```{r}
naiveclass <- function(n) {
  frequentWords <- as.character(d$word[1:n])
  # Create document term matrices using only the high-frequency words
  sms_train <- DocumentTermMatrix(train_corpus[index_validation], list(dictionary = frequentWords))
  sms_test  <- DocumentTermMatrix(train_corpus[-index_validation], list(dictionary = frequentWords))
  # Convert the DTMs into matrices of factor variables for the model
  sms_train <- apply(sms_train, MARGIN = 2, convert_factor)
  sms_test  <- apply(sms_test, MARGIN = 2, convert_factor)
  # Build the model on the training part
  sms_classifier <- naiveBayes(sms_train, trainset[index_validation, ]$Category)
  # Make predictions on both parts
  sms_train_pred <- predict(sms_classifier, sms_train)
  sms_test_pred  <- predict(sms_classifier, sms_test)
  # Return c(training accuracy, validation accuracy)
  cbind(mean(trainset[index_validation, ]$Category == sms_train_pred),
        mean(trainset[-index_validation, ]$Category == sms_test_pred))
}
 
```
We can test the model for different values of "n" and choose the smallest value that gives a good validation accuracy.
```{r}
acc=matrix(NA,5,2)
n=c(10,100,250,500,1000)
for(i in 1:5){
  #print(i)
  acc[i,]=naiveclass(n[i])
}
plot(n,acc[,1],pch=19,type="b",col="red",ylab = "Classification Accuracy",ylim  =c(0.89,1.02),xlab="No of word features")
points(n,acc[,2],col="blue",pch=19,type="b",ylim=c(0.89,1.0))
legend("topright",legend=c("Training","Validation"),col=c("red","blue"),pch=19)
```


We see that the validation accuracy closely follows the training accuracy and that the curves flatten out as the number of included terms increases. We reach a maximum of 98% with just 1000 features out of a possible 7956. There is a good improvement when increasing the number of features from 10 to 100. We choose n = 500 as our optimal value. We can now use our model to make predictions on the held-out test dataset and draw conclusions.

#### Testing      
We use n = 500 and the entire training set to build the model and make predictions on the test set.

```{r}
frequentWords<-as.character(d$word[1:500])
# Creating a Document Term Matrix using words that have high frequency
sms_train<- DocumentTermMatrix(corpus_clean[index], list(dictionary = frequentWords))
sms_test <- DocumentTermMatrix(corpus_clean[-index], list(dictionary = frequentWords))
#convert the DM into matrix for passing into the model
sms_train <- apply(sms_train, MARGIN = 2, convert_factor)
sms_test <- apply(sms_test, MARGIN = 2, convert_factor)
#build the model
sms_classifier <- naiveBayes(sms_train, trainset$Category)
#make predictions
sms_test_pred <- predict(sms_classifier, sms_test)
sms_train_pred <- predict(sms_classifier, sms_train)
confusionMatrix(sms_test_pred,testset$Category)
```
We see a classification accuracy of 97.42%, which is very good. There are a total of 36 misclassified messages, of which only 7 are legitimate messages misclassified as spam (false positives).
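To look at the misclassified messages themselves, for example to inspect those false positives, one simple way (relying on sms_test_pred being aligned with the rows of testset, as it is here) is:
```{r}
# Messages where the prediction disagrees with the true label
misclassified <- testset[sms_test_pred != testset$Category, ]
head(misclassified[, c("Category", "Message")])
```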
The Naive Bayes model is easy to build and particularly useful for very large datasets. It uses Bayes' theorem, which provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c). We can inspect the conditional probability tables learned by the NB model.
```{r}
sms_classifier$tables[1:5]
```
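The tables above hold the per-feature conditional probabilities; the class priors that the model estimated from the training labels are also stored in the fitted object (e1071 keeps them as class counts), so we can recover them as proportions:
```{r}
# Class prior probabilities estimated from the training set
sms_classifier$apriori / sum(sms_classifier$apriori)
```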

#### Further Work 

Our ML model is ready to be deployed. Further improvements can be made by using word stemming, where words are reduced to their stemmed form. For example, "discount", "discounts", "discounted" and "discounting" would all be replaced with "discount". This can substantially reduce the number of features in the document term matrix. Other classifiers can also be tested for comparison.
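
As a sketch of how stemming could be added to the existing pipeline (this assumes the SnowballC package, which tm's stemDocument() relies on, is installed), one extra cleaning step before building the document term matrix would look like:
```{r, eval=FALSE}
library(SnowballC)
# Reduce words to their stems, e.g. "discounted"/"discounts" -> "discount"
corpus_stemmed <- tm_map(corpus_clean, stemDocument)
inspect(corpus_stemmed[1:3])
```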
