Customer Behavior Targeting: A Short Analysis of Ad-Clicks Rate Prediction
Sangamesh K S
December 28, 2017
This is one of my favorite analyses: it shows what a data scientist does to understand customer behavior and gain insights such as when, where, and what type of advertisement to offer a customer. The dataset is from Kaggle; a major contribution came from Prof. Shoban Babu, PhD (IIT-H), who personally looked into this dataset and helped me work through it.
For this prediction we could use Linear Regression, Logistic Regression, or a Support Vector Machine. Here I will use Logistic Regression, as it gives a better prediction than the alternatives.
For further reading please visit: http://www.sciencedirect.com/science/article/pii/S1877050913002366
About dataset
The dataset contains daily time spent on internet, age of the person, income, daily internet usage, ad topic, city, sex, country, timestamp and clicked on Ad.
Our objective is to classify the people who can be targeted based on a few attributes, build a model from that information, and hit the bull's-eye of STP (segmentation, targeting, positioning), or at least assess where our ad campaigns are heading.
Let’s load the data and explore.
mydata<-read.csv("C:/Users/Sangmesh/Downloads/advertising.csv",header=TRUE)
# Coerce Age and the Male indicator to numeric for modelling
mydata$Age<-as.numeric(mydata$Age)
mydata$Male<-as.numeric(mydata$Male)
summary(mydata)
## Daily.Time.Spent.on.Site Age Area.Income
## Min. :32.60 Min. :19.00 Min. :13996
## 1st Qu.:51.36 1st Qu.:29.00 1st Qu.:47032
## Median :68.22 Median :35.00 Median :57012
## Mean :65.00 Mean :36.01 Mean :55000
## 3rd Qu.:78.55 3rd Qu.:42.00 3rd Qu.:65471
## Max. :91.43 Max. :61.00 Max. :79485
##
## Daily.Internet.Usage Ad.Topic.Line
## Min. :104.8 Adaptive 24hour Graphic Interface : 1
## 1st Qu.:138.8 Adaptive asynchronous attitude : 1
## Median :183.1 Adaptive context-sensitive application : 1
## Mean :180.0 Adaptive contextually-based methodology: 1
## 3rd Qu.:218.8 Adaptive demand-driven knowledgebase : 1
## Max. :270.0 Adaptive uniform capability : 1
## (Other) :994
## City Male Country
## Lisamouth : 3 Min. :0.000 Czech Republic: 9
## Williamsport : 3 1st Qu.:0.000 France : 9
## Benjaminchester: 2 Median :0.000 Afghanistan : 8
## East John : 2 Mean :0.481 Australia : 8
## East Timothy : 2 3rd Qu.:1.000 Cyprus : 8
## Johnstad : 2 Max. :1.000 Greece : 8
## (Other) :986 (Other) :950
## Timestamp Clicked.on.Ad
## 2016-01-01 02:52:10: 1 Min. :0.0
## 2016-01-01 03:35:35: 1 1st Qu.:0.0
## 2016-01-01 05:31:22: 1 Median :0.5
## 2016-01-01 08:27:06: 1 Mean :0.5
## 2016-01-01 15:14:24: 1 3rd Qu.:1.0
## 2016-01-01 20:17:49: 1 Max. :1.0
## (Other) :994
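As a quick sanity check (not part of the original write-up), the summary above shows that Clicked.on.Ad has a mean of 0.5, i.e., the two classes are balanced. This can be confirmed directly:

```r
# Proportion of clicks vs. non-clicks; a balanced target means plain
# accuracy is a reasonable evaluation metric for this problem.
prop.table(table(mydata$Clicked.on.Ad))
```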
After loading the data, I removed insignificant features by applying Logistic Regression stepwise and checking which features are significant and which are not.
mydata<-mydata[-5]  # drop Ad.Topic.Line (column 5)
mydata<-mydata[-5]  # drop City (now column 5 after the previous drop)
mydata<-mydata[-7]  # drop Timestamp (now column 7)
mydata<-mydata[-6]  # drop Country (now column 6)
Now look at the names of the features that remain.
names(mydata)
## [1] "Daily.Time.Spent.on.Site" "Age"
## [3] "Area.Income" "Daily.Internet.Usage"
## [5] "Male" "Clicked.on.Ad"
Now let's apply Logistic Regression.
n=nrow(mydata)
n1=floor(n*(0.7))   # 70% of the observations for training
n2=n-n1             # remaining 30% for testing
set.seed(123)       # reproducible split
ind<-sample(1:n,n1,replace = FALSE)
train<-mydata[ind,]
test<-mydata[-ind,]
log_fit1<-glm(Clicked.on.Ad~.,family = "binomial",data = train)
summary(log_fit1)
##
## Call:
## glm(formula = Clicked.on.Ad ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8573 -0.1130 -0.0452 0.0091 3.2810
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.179e+01 3.972e+00 8.005 1.20e-15 ***
## Daily.Time.Spent.on.Site -2.314e-01 3.018e-02 -7.667 1.76e-14 ***
## Age 1.715e-01 3.232e-02 5.307 1.12e-07 ***
## Area.Income -1.416e-04 2.356e-05 -6.011 1.84e-09 ***
## Daily.Internet.Usage -6.981e-02 9.316e-03 -7.494 6.68e-14 ***
## Male -7.578e-01 5.298e-01 -1.430 0.153
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 970.35 on 699 degrees of freedom
## Residual deviance: 112.55 on 694 degrees of freedom
## AIC: 124.55
##
## Number of Fisher Scoring iterations: 8
Now we can see the *** significance codes. The Male feature shows no significance in the model (p = 0.153). Removing features that do not contribute, step by step, is how we reach an optimal model, and we will deploy that model for prediction.
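The stepwise removal described above can be sketched with base R's step(), or by dropping Male manually with update(). This is an illustrative sketch of the approach, not necessarily the exact procedure used:

```r
# Backward stepwise selection by AIC: iteratively drops predictors whose
# removal lowers the AIC; here it would consider dropping the weak Male term.
log_fit2 <- step(log_fit1, direction = "backward", trace = FALSE)

# Equivalently, remove Male by hand and refit:
log_fit3 <- update(log_fit1, . ~ . - Male)
summary(log_fit3)
```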
First we will apply the model to the training dataset and then to the testing dataset, to see whether there is any variance between the two.
ptrain <- predict(log_fit1,newdata=train,type="response")
gg1=floor(ptrain+0.5)            # round probabilities to 0/1 (threshold 0.5)
ttt=table(train$Clicked.on.Ad,gg1)  # confusion matrix: actual vs. predicted
ttt
## gg1
## 0 1
## 0 347 6
## 1 12 335
The model looks good. Now we will apply it to the testing dataset and compare the accuracy.
ptest <- predict(log_fit1,newdata=test,type="response")
gg2=floor(ptest+0.5)             # round probabilities to 0/1 (threshold 0.5)
ttt=table(test$Clicked.on.Ad,gg2)   # confusion matrix: actual vs. predicted
ttt
## gg2
## 0 1
## 0 142 5
## 1 5 148
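From the two confusion matrices, overall accuracy is the sum of the diagonal over the total: roughly 97.4% on training and 96.7% on test, so the model generalizes well with little variance between the two sets. A small sketch of that calculation:

```r
# Accuracy = correct predictions / total predictions
accuracy <- function(tab) sum(diag(tab)) / sum(tab)

train_tab <- matrix(c(347, 12, 6, 335), nrow = 2)  # training confusion matrix
test_tab  <- matrix(c(142, 5, 5, 148), nrow = 2)   # test confusion matrix

accuracy(train_tab)  # ~0.974
accuracy(test_tab)   # ~0.967
```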
Now let's see the ROC of our training data.
library(ROSE)
## Warning: package 'ROSE' was built under R version 3.4.3
## Loaded ROSE 0.0-3
roc.curve(train$Clicked.on.Ad,gg1)
## Area under the curve (AUC): 0.974
Now let's see the ROC of our test data.
roc.curve(test$Clicked.on.Ad,gg2)
## Area under the curve (AUC): 0.967
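One caveat: the AUCs above are computed on the rounded 0/1 predictions, which evaluates only a single threshold. Passing the predicted probabilities instead gives the usual threshold-independent AUC, which is typically a bit higher. A sketch, assuming ROSE is still loaded:

```r
# AUC from predicted probabilities rather than hard class labels
roc.curve(train$Clicked.on.Ad, ptrain)
roc.curve(test$Clicked.on.Ad, ptest)
```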