Linear Regression on Marketing Survey Data
Sangamesh K S
November 7, 2017
Introduction
In this article I will show how linear regression can be applied to marketing data. This marketing data comes from a customer-satisfaction survey in which the overall experience is the output (response) variable and the remaining features are the predictors.
Now let's go ahead and have a look at our data.
Please visit https://experimentswithdatascience.blogspot.in/2017/10/german-credit-linear-regression-analysis_8.html where you can find a different approach to linear regression and feature selection.
## 'data.frame': 500 obs. of 8 variables:
## $ weekend : Factor w/ 2 levels "no","yes": 2 2 1 2 1 1 2 1 1 2 ...
## $ num.child: int 0 2 1 0 4 5 1 0 0 3 ...
## $ distance : num 114.6 27 63.3 25.9 54.7 ...
## $ rides : int 87 87 85 88 84 81 77 82 90 88 ...
## $ games : int 73 78 80 72 87 79 73 70 88 86 ...
## $ wait : int 60 76 70 66 74 48 58 70 79 55 ...
## $ clean : int 89 87 88 89 87 79 85 83 95 88 ...
## $ overall : int 47 65 61 37 68 27 40 30 58 36 ...
## weekend num.child distance rides games wait clean overall
## 1 yes 0 114.64826 87 73 60 89 47
## 2 yes 2 27.01410 87 78 76 87 65
## 3 no 1 63.30098 85 80 70 88 61
## 4 yes 0 25.90993 88 72 66 89 37
## 5 no 4 54.71831 84 87 74 87 68
## 6 no 5 22.67934 81 79 48 79 27
## weekend num.child distance rides games wait clean overall
## 495 no 5 41.47010 83 84 77 90 55
## 496 no 0 11.05258 90 72 68 90 46
## 497 yes 0 8.18774 91 83 82 91 47
## 498 no 2 45.17740 95 92 85 93 71
## 499 no 3 27.08838 83 83 80 88 54
## 500 no 1 38.40876 86 88 77 85 62
Looking at the data, we have 500 observations and 8 features, of which overall is the output. The objective of this study is to find out whether the other 7 features affect the output; if they do, we will examine the significance of the influencing features. Survey data of this kind is commonly affected by the “halo effect”: responders form a generalised perception and tend to give extreme answers, either good or bad. Later in this analysis we will also look at how to handle such data.
Now we will check the data for abnormalities: outliers, skewness and missing values. This is the first step in analysing any data, and in R it can be done with a couple of packages; the most popular functions are summary() and describe(). Let's look at the summary of the data.
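These first-pass checks can be run as below. This is only a sketch: the survey file is not reproduced here, so a small simulated stand-in for `mydata` is built first; the output printed after this block comes from the real survey frame.

```r
# Sketch of the first-pass checks; `mydata` here is a simulated
# stand-in with a few columns in the style of the survey data.
set.seed(1)
mydata <- data.frame(
  weekend  = factor(sample(c("no", "yes"), 500, replace = TRUE)),
  distance = rlnorm(500, meanlog = 3, sdlog = 1),  # right-skewed, like the real column
  overall  = round(rnorm(500, mean = 51, sd = 16))
)
str(mydata)      # column types and first few values
summary(mydata)  # min / quartiles / median / mean / max per column
# psych::describe(mydata)  # adds skew, kurtosis, trimmed mean, se
```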
## weekend num.child distance rides
## no :259 Min. :0.000 Min. : 0.5267 Min. : 72.00
## yes:241 1st Qu.:0.000 1st Qu.: 10.3181 1st Qu.: 82.00
## Median :2.000 Median : 19.0191 Median : 86.00
## Mean :1.738 Mean : 31.0475 Mean : 85.85
## 3rd Qu.:3.000 3rd Qu.: 39.5821 3rd Qu.: 90.00
## Max. :5.000 Max. :239.1921 Max. :100.00
## games wait clean overall
## Min. : 57.00 Min. : 40.0 Min. : 74.0 Min. : 6.00
## 1st Qu.: 73.00 1st Qu.: 62.0 1st Qu.: 84.0 1st Qu.: 40.00
## Median : 78.00 Median : 70.0 Median : 88.0 Median : 50.00
## Mean : 78.67 Mean : 69.9 Mean : 87.9 Mean : 51.26
## 3rd Qu.: 85.00 3rd Qu.: 77.0 3rd Qu.: 91.0 3rd Qu.: 62.00
## Max. :100.00 Max. :100.0 Max. :100.0 Max. :100.00
Now I can see that something is wrong with distance: the mean and median are far apart, and the min and max are also abnormal. We will cross-check this with the skewness and kurtosis values to find out whether it is abnormally skewed.
## vars n mean sd median trimmed mad min max range
## weekend* 1 500 1.48 0.50 1.00 1.48 0.00 1.00 2.00 1.00
## num.child 2 500 1.74 1.50 2.00 1.61 1.48 0.00 5.00 5.00
## distance 3 500 31.05 33.15 19.02 24.65 17.26 0.53 239.19 238.67
## rides 4 500 85.85 5.46 86.00 85.81 5.93 72.00 100.00 28.00
## games 5 500 78.67 8.12 78.00 78.72 8.90 57.00 100.00 43.00
## wait 6 500 69.90 10.77 70.00 70.00 11.86 40.00 100.00 60.00
## clean 7 500 87.90 5.12 88.00 87.87 5.19 74.00 100.00 26.00
## overall 8 500 51.26 15.88 50.00 50.92 16.31 6.00 100.00 94.00
## skew kurtosis se
## weekend* 0.07 -2.00 0.02
## num.child 0.44 -0.75 0.07
## distance 2.57 9.00 1.48
## rides 0.06 -0.47 0.24
## games -0.05 -0.35 0.36
## wait -0.07 -0.24 0.48
## clean 0.01 -0.45 0.23
## overall 0.19 -0.12 0.71
The skewness (2.57) and kurtosis (9.00) values for distance clearly show that it is skewed. So let me plot a histogram of the distance data and check whether it is normally distributed.
Here we will choose between a log transformation and Box-Cox; the reason I select the log transformation is the skewness value and the histogram plot. When data is strongly right-skewed, a log transformation works well, as it visually pulls the data towards normal.
Please note: whenever we find skewness, we transform the data. One reason for the transformation is that linear regression assumes approximately normally distributed residuals, so transforming a heavily skewed variable often helps.
Types of Data Transformation
Log Transformation
Square Root Transformation
Box-Cox Transformation
Removal of outliers
So we will do a log transformation and check it by plotting a histogram.
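The transformation step can be sketched as below; `loggdist` is the name the models use later. The distance column is simulated here as a stand-in, since the survey file is not bundled with this post.

```r
# Log-transform the right-skewed distance column; the transformed
# variable is stored as loggdist, matching the name used in the models.
set.seed(1)
mydata <- data.frame(distance = rlnorm(500, meanlog = 3, sdlog = 1))  # simulated stand-in
mydata$loggdist <- log(mydata$distance)
hist(mydata$distance, main = "distance (raw)")  # long right tail
hist(mydata$loggdist, main = "log(distance)")   # roughly symmetric
```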
Now the data is concentrated towards the center and looks fine.
(Correlation plot of the features, drawn with the corrplot package.)
Now we can clearly see that clean and rides are highly correlated, and overall has a moderate correlation with clean and rides. Beyond that we do not see multicollinearity in the data. If interested, we can further apply VIF and check the readings; I am satisfied with the data.
Please note: if you find multicollinearity in the data, it is better to remove the offending features.
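If you do want the VIF check mentioned above, `car::vif()` gives it directly; the sketch below computes it by hand on simulated predictors, using VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing predictor j on the others.

```r
# Hand-rolled variance inflation factors on simulated predictors;
# car::vif(fit) does the same for a fitted model. Values near 1 mean
# little collinearity; above roughly 5-10 is a common cause for concern.
set.seed(1)
d <- data.frame(rides = rnorm(200), games = rnorm(200), wait = rnorm(200))
d$clean <- 0.8 * d$rides + 0.2 * rnorm(200)  # deliberately collinear with rides
vif <- sapply(names(d), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(d), v), v), data = d))$r.squared
  1 / (1 - r2)
})
round(vif, 2)  # rides and clean show inflated values; games and wait stay near 1
```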
lm.model.fit1<-lm(overall~rides+games+wait+clean+weekend+loggdist,data = mydata)
summary(lm.model.fit1)
##
## Call:
## lm(formula = overall ~ rides + games + wait + clean + weekend +
## loggdist, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.366 -6.349 0.978 7.283 28.629
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -133.49660 8.45908 -15.781 < 2e-16 ***
## rides 0.53815 0.14184 3.794 0.000167 ***
## games 0.15385 0.06885 2.235 0.025896 *
## wait 0.55270 0.04766 11.598 < 2e-16 ***
## clean 0.96975 0.15950 6.080 2.41e-09 ***
## weekendyes -0.83767 0.95033 -0.881 0.378508
## loggdist 1.00272 0.48212 2.080 0.038060 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.55 on 493 degrees of freedom
## Multiple R-squared: 0.5635, Adjusted R-squared: 0.5582
## F-statistic: 106.1 on 6 and 493 DF, p-value: < 2.2e-16
Looking at this model, I can clearly see that weekend has no significant impact, so I will try removing it. One reason for removing it is its effect on the model: the more unwanted features you add, the more noise is created. Thus we remove the weekend feature.
lm.model.fit2<-lm(overall~rides+games+wait+clean+loggdist,data = mydata)
summary(lm.model.fit2)
##
## Call:
## lm(formula = overall ~ rides + games + wait + clean + loggdist,
## data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.8343 -6.6544 0.8957 7.2803 29.0467
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -134.29859 8.40811 -15.973 < 2e-16 ***
## rides 0.54326 0.14169 3.834 0.000142 ***
## games 0.15465 0.06883 2.247 0.025092 *
## wait 0.55197 0.04764 11.587 < 2e-16 ***
## clean 0.96778 0.15945 6.070 2.56e-09 ***
## loggdist 1.04366 0.47977 2.175 0.030080 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.55 on 494 degrees of freedom
## Multiple R-squared: 0.5628, Adjusted R-squared: 0.5584
## F-statistic: 127.2 on 5 and 494 DF, p-value: < 2.2e-16
As I said, we are interested in reducing noise, and dropping this feature reduces the Multiple R² only marginally, from 0.5635 to 0.5628.
This means that 0.5628, i.e. 56.28%, of the variation is explained by the model; the remaining 43.72% is still unexplained, which means the model does not capture it. We can loosely treat the Multiple R² of 56.28% for lm.model.fit2 as a measure of the model's fit.
The “*” marks give the significance of the features in the model; rides, clean and wait have the highest significance.
The residual standard error of the model, 10.55, is the typical size of the error in the model; our objective is to reduce it as far as possible, which in turn increases the predictive power of the model. We can also look for systematic structure in the residuals using the residual 1st and 3rd quartiles: they show how far the residuals spread around the median, and if Q1 and Q3 are roughly symmetric about it, the residuals are approximately symmetric, consistent with a normal distribution.
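Both numbers discussed here are easy to pull out of a fitted model; a sketch on simulated data (the values quoted in the post come from lm.model.fit2):

```r
# Residual standard error and residual quartiles from a fitted lm.
set.seed(1)
d <- data.frame(x = rnorm(200))
d$y <- 3 + 2 * d$x + rnorm(200)
fit <- lm(y ~ x, data = d)
summary(fit)$sigma        # residual standard error
quantile(residuals(fit))  # check whether Q1 and Q3 sit symmetrically about the median
```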
Now we will do a few important visualisations of the same data.
If you look at the Normal Q-Q plot, you can see the points do not fall on a straight line; the numbered observations far from the line are outliers, and this departure says the residuals are not normally distributed.
Looking at Residuals vs Fitted, we can clearly see there is no systematic pattern left in the residuals.
In Residuals vs Leverage, we see no dominant point, i.e. no single observation exerts outsized influence; the data sits towards the left. It also flags a few outliers, numbered 55, 478 and 441.
When we want to compare two models, we usually start with R²: the higher the R², the better the fit. Let's look at the R² of model 1.
## [1] 0.5634968
Now we will do the same for model-2
## [1] 0.5628089
We can see the weekend feature contributes a negligible amount to the model.
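The R² values printed above can be extracted from a fitted model as below; this is a sketch on simulated data, since the survey frame is not reproduced here.

```r
# Extracting Multiple R^2 from fitted models for comparison.
set.seed(1)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 2 * d$x1 + rnorm(100)
fit.full    <- lm(y ~ x1 + x2, data = d)
fit.reduced <- lm(y ~ x1, data = d)
summary(fit.full)$r.squared     # full model
summary(fit.reduced)$r.squared  # barely lower when x2 is pure noise
```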
Let me construct a model based on weekend alone:
## [1] 0.002812546
We can conclude that a model with all features will always have the highest R², whether or not each feature genuinely contributes: R² never decreases when features are added.
Now let's test the null hypothesis that M1 = M2, against the alternative hypothesis that they differ.
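In R this nested-model test is anova(); the table printed below comes from anova(lm.model.fit1, lm.model.fit2). A sketch of the call on simulated data:

```r
# Nested-model F test: does dropping x2 significantly worsen the fit?
set.seed(1)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 2 * d$x1 + rnorm(100)
full    <- lm(y ~ x1 + x2, data = d)
reduced <- lm(y ~ x1, data = d)
anova(full, reduced)  # a large p-value means the reduced model is adequate
```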
## Analysis of Variance Table
##
## Model 1: overall ~ rides + games + wait + clean + weekend + loggdist
## Model 2: overall ~ rides + games + wait + clean + loggdist
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 493 54918
## 2 494 55005 -1 -86.548 0.7769 0.3785
We can see the p-value is 0.3785, which is greater than 0.05, so we fail to reject the null hypothesis: the simpler model is adequate.
Now we will generate predictions from the model.
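The fitted values printed below come from predict() on lm.model.fit2; a sketch of the call on simulated data:

```r
# In-sample predictions from a fitted lm; with no newdata argument,
# predict() returns the fitted values for the training rows.
set.seed(1)
d <- data.frame(x = rnorm(50))
d$y <- 3 + 2 * d$x + rnorm(50)
fit <- lm(y ~ x, data = d)
head(predict(fit), 20)  # predictions for the first 20 observations
```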
## 1 2 3 4 5 6 7 8
## 47.94096 54.16450 52.74094 50.15020 54.37468 28.51799 36.11193 43.01148
## 9 10 11 12 13 14 15 16
## 66.92028 45.06093 64.04987 37.79224 65.60674 43.54052 59.96347 58.07422
## 17 18 19 20
## 50.63371 30.73789 45.00922 62.63889