Thursday 28 December 2017

Customer Behavior Targeting: A Short Analysis of Ad-Clicks Rate Prediction


This is one of my exciting analyses showing what a data scientist does to understand customer behavior and extract insights such as when, where, and what type of advertisement to offer a customer. The dataset is from Kaggle; a major contribution came from Prof. Shoban Babu, PhD, at IIT-H, who personally looked into this dataset and helped me solve it.

For this prediction we could use Linear Regression, Logistic Regression, or a Support Vector Machine. Here I will use Logistic Regression, as it gives better predictions than the others on this dataset.
For further reading please visit: http://www.sciencedirect.com/science/article/pii/S1877050913002366

About dataset

The dataset contains daily time spent on the site, age, area income, daily internet usage, ad topic line, city, sex (Male), country, timestamp, and whether the ad was clicked.

Our objective is to classify which people can be targeted based on a few attributes, so that we can build a model, hit the bull’s eye of STP (Segmentation, Targeting, Positioning), and assess where our ad campaigns are heading.

Let’s load the data and explore.

# Load the advertising dataset (adjust the path to your machine)
mydata <- read.csv("C:/Users/Sangmesh/Downloads/advertising.csv", header = TRUE)
# Ensure Age and the Male indicator are numeric
mydata$Age <- as.numeric(mydata$Age)
mydata$Male <- as.numeric(mydata$Male)
summary(mydata)
##  Daily.Time.Spent.on.Site      Age         Area.Income   
##  Min.   :32.60            Min.   :19.00   Min.   :13996  
##  1st Qu.:51.36            1st Qu.:29.00   1st Qu.:47032  
##  Median :68.22            Median :35.00   Median :57012  
##  Mean   :65.00            Mean   :36.01   Mean   :55000  
##  3rd Qu.:78.55            3rd Qu.:42.00   3rd Qu.:65471  
##  Max.   :91.43            Max.   :61.00   Max.   :79485  
##                                                          
##  Daily.Internet.Usage                                 Ad.Topic.Line
##  Min.   :104.8        Adaptive 24hour Graphic Interface      :  1  
##  1st Qu.:138.8        Adaptive asynchronous attitude         :  1  
##  Median :183.1        Adaptive context-sensitive application :  1  
##  Mean   :180.0        Adaptive contextually-based methodology:  1  
##  3rd Qu.:218.8        Adaptive demand-driven knowledgebase   :  1  
##  Max.   :270.0        Adaptive uniform capability            :  1  
##                       (Other)                                :994  
##               City          Male                 Country   
##  Lisamouth      :  3   Min.   :0.000   Czech Republic:  9  
##  Williamsport   :  3   1st Qu.:0.000   France        :  9  
##  Benjaminchester:  2   Median :0.000   Afghanistan   :  8  
##  East John      :  2   Mean   :0.481   Australia     :  8  
##  East Timothy   :  2   3rd Qu.:1.000   Cyprus        :  8  
##  Johnstad       :  2   Max.   :1.000   Greece        :  8  
##  (Other)        :986                   (Other)       :950  
##                Timestamp   Clicked.on.Ad
##  2016-01-01 02:52:10:  1   Min.   :0.0  
##  2016-01-01 03:35:35:  1   1st Qu.:0.0  
##  2016-01-01 05:31:22:  1   Median :0.5  
##  2016-01-01 08:27:06:  1   Mean   :0.5  
##  2016-01-01 15:14:24:  1   3rd Qu.:1.0  
##  2016-01-01 20:17:49:  1   Max.   :1.0  
##  (Other)            :994

After loading the data, I removed insignificant features by fitting logistic regressions stepwise and checking which predictors are significant and which are not.

# Drop the non-numeric columns by position. Note that the column
# indices shift after each removal, so these four lines remove
# Ad.Topic.Line, City, Timestamp, and Country in turn.
mydata <- mydata[-5]  # Ad.Topic.Line
mydata <- mydata[-5]  # City (now in position 5)
mydata <- mydata[-7]  # Timestamp
mydata <- mydata[-6]  # Country
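Dropping columns by numeric position is fragile because the indices shift after each removal. An equivalent, arguably safer selection by name (a sketch assuming the column names shown in the summary above) would be:

```r
# Keep only the modelling features, selected by name
keep <- c("Daily.Time.Spent.on.Site", "Age", "Area.Income",
          "Daily.Internet.Usage", "Male", "Clicked.on.Ad")
mydata <- mydata[, keep]
```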

Now let's look at the names of the remaining features.

names(mydata)
## [1] "Daily.Time.Spent.on.Site" "Age"                     
## [3] "Area.Income"              "Daily.Internet.Usage"    
## [5] "Male"                     "Clicked.on.Ad"

Now let's apply logistic regression.

n <- nrow(mydata)
n1 <- floor(n * 0.7)  # 70% of rows for training
n2 <- n - n1          # remaining 30% for testing

set.seed(123)         # for reproducibility
ind <- sample(1:n, n1, replace = FALSE)
train <- mydata[ind, ]
test <- mydata[-ind, ]

# Fit a logistic regression on all remaining predictors
log_fit1 <- glm(Clicked.on.Ad ~ ., family = "binomial", data = train)
summary(log_fit1)
## 
## Call:
## glm(formula = Clicked.on.Ad ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8573  -0.1130  -0.0452   0.0091   3.2810  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               3.179e+01  3.972e+00   8.005 1.20e-15 ***
## Daily.Time.Spent.on.Site -2.314e-01  3.018e-02  -7.667 1.76e-14 ***
## Age                       1.715e-01  3.232e-02   5.307 1.12e-07 ***
## Area.Income              -1.416e-04  2.356e-05  -6.011 1.84e-09 ***
## Daily.Internet.Usage     -6.981e-02  9.316e-03  -7.494 6.68e-14 ***
## Male                     -7.578e-01  5.298e-01  -1.430    0.153    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 970.35  on 699  degrees of freedom
## Residual deviance: 112.55  on 694  degrees of freedom
## AIC: 124.55
## 
## Number of Fisher Scoring iterations: 8

We can now see the *** significance codes in the output. The Male feature shows no significance (p = 0.153). Features that do not contribute to the model should be removed; this stepwise approach leads to a more parsimonious model, which we will then use for prediction.
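As a sketch of that stepwise step, we could refit the model without the insignificant Male predictor and compare the two fits by AIC (the name log_fit2 is my own; the rest of the post continues with log_fit1):

```r
# Refit without Male, then compare the two models by AIC
log_fit2 <- glm(Clicked.on.Ad ~ . - Male, family = "binomial", data = train)
summary(log_fit2)
AIC(log_fit1, log_fit2)
```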

First we will apply the model to the training dataset and then to the test data, to see whether there is any variance between the two.

ptrain <- predict(log_fit1, newdata = train, type = "response")
gg1 <- floor(ptrain + 0.5)              # classify with a 0.5 cutoff
ttt <- table(train$Clicked.on.Ad, gg1)  # training confusion matrix
ttt
##    gg1
##       0   1
##   0 347   6
##   1  12 335
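From this confusion matrix we can read off the training accuracy directly (a small sketch; acc_train is my own name):

```r
# Diagonal = correct predictions: (347 + 335) / 700, about 0.974
acc_train <- sum(diag(ttt)) / sum(ttt)
acc_train
```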

The model looks good. Now we will apply it to the test dataset and compare the accuracy.

ptest <- predict(log_fit1, newdata = test, type = "response")
gg2 <- floor(ptest + 0.5)              # same 0.5 cutoff
ttt <- table(test$Clicked.on.Ad, gg2)  # test confusion matrix
ttt
##    gg2
##       0   1
##   0 142   5
##   1   5 148
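The corresponding test accuracy (acc_test is my own name) confirms that the model generalizes well:

```r
# (142 + 148) / 300, about 0.967 -- close to the training accuracy,
# suggesting little overfitting
acc_test <- sum(diag(ttt)) / sum(ttt)
acc_test
```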

Now let's look at the ROC curve for our training data.

library(ROSE)
## Warning: package 'ROSE' was built under R version 3.4.3
## Loaded ROSE 0.0-3
roc.curve(train$Clicked.on.Ad,gg1)

## Area under the curve (AUC): 0.974

Now let's look at the ROC curve for our test data.

roc.curve(test$Clicked.on.Ad,gg2)

## Area under the curve (AUC): 0.967
