Wednesday, 8 November 2017

Linear Regression on Market Survey Data



Introduction

In this article I will show how linear regression can be applied to marketing data. The data is a customer satisfaction survey in which the overall experience rating is the output (response) variable and the remaining survey items are the predictors.

Now let's go ahead and have a look at our data.

Please visit: https://experimentswithdatascience.blogspot.in/2017/10/german-credit-linear-regression-analysis_8.html where you can find a different approach to linear regression and feature selection.

## 'data.frame':    500 obs. of  8 variables:
##  $ weekend  : Factor w/ 2 levels "no","yes": 2 2 1 2 1 1 2 1 1 2 ...
##  $ num.child: int  0 2 1 0 4 5 1 0 0 3 ...
##  $ distance : num  114.6 27 63.3 25.9 54.7 ...
##  $ rides    : int  87 87 85 88 84 81 77 82 90 88 ...
##  $ games    : int  73 78 80 72 87 79 73 70 88 86 ...
##  $ wait     : int  60 76 70 66 74 48 58 70 79 55 ...
##  $ clean    : int  89 87 88 89 87 79 85 83 95 88 ...
##  $ overall  : int  47 65 61 37 68 27 40 30 58 36 ...
##   weekend num.child  distance rides games wait clean overall
## 1     yes         0 114.64826    87    73   60    89      47
## 2     yes         2  27.01410    87    78   76    87      65
## 3      no         1  63.30098    85    80   70    88      61
## 4     yes         0  25.90993    88    72   66    89      37
## 5      no         4  54.71831    84    87   74    87      68
## 6      no         5  22.67934    81    79   48    79      27
##     weekend num.child distance rides games wait clean overall
## 495      no         5 41.47010    83    84   77    90      55
## 496      no         0 11.05258    90    72   68    90      46
## 497     yes         0  8.18774    91    83   82    91      47
## 498      no         2 45.17740    95    92   85    93      71
## 499      no         3 27.08838    83    83   80    88      54
## 500      no         1 38.40876    86    88   77    85      62
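A listing like the one above is the kind of output produced by str(), head() and tail(). The sketch below is only an illustration: the original data file is not given in this post, so it builds a small simulated stand-in (the values are made up, only the column names follow the listing).

```r
# Simulated stand-in for the survey data (NOT the real file, values are made up).
set.seed(1)
mydata <- data.frame(
  weekend   = factor(sample(c("no", "yes"), 500, replace = TRUE)),
  num.child = sample(0:5, 500, replace = TRUE),
  distance  = rlnorm(500, meanlog = 3, sdlog = 1),
  overall   = round(rnorm(500, mean = 51, sd = 16))
)

str(mydata)   # number of observations, column types, first few values
head(mydata)  # first six rows
tail(mydata)  # last six rows
```

With the real data, only the three calls at the end would be needed.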

Looking at the data, we have 500 observations and 8 variables, of which overall is the output. The objective of this study is to find out whether the other 7 features affect the output and, if so, how significant each influencing feature is. Survey data of this kind is commonly affected by the “halo effect”: respondents form a generalised impression and tend to give extreme ratings, either good or bad. Later in this analysis we will also look at how to deal with such data.

Next we check the data for abnormalities: outliers, skewness and missing values. This is the first step in analysing any data. In R it can be done with several packages, but the most popular functions are describe() and summary(). Let's look at the summary of the data.

##  weekend     num.child        distance            rides       
##  no :259   Min.   :0.000   Min.   :  0.5267   Min.   : 72.00  
##  yes:241   1st Qu.:0.000   1st Qu.: 10.3181   1st Qu.: 82.00  
##            Median :2.000   Median : 19.0191   Median : 86.00  
##            Mean   :1.738   Mean   : 31.0475   Mean   : 85.85  
##            3rd Qu.:3.000   3rd Qu.: 39.5821   3rd Qu.: 90.00  
##            Max.   :5.000   Max.   :239.1921   Max.   :100.00  
##      games             wait           clean          overall      
##  Min.   : 57.00   Min.   : 40.0   Min.   : 74.0   Min.   :  6.00  
##  1st Qu.: 73.00   1st Qu.: 62.0   1st Qu.: 84.0   1st Qu.: 40.00  
##  Median : 78.00   Median : 70.0   Median : 88.0   Median : 50.00  
##  Mean   : 78.67   Mean   : 69.9   Mean   : 87.9   Mean   : 51.26  
##  3rd Qu.: 85.00   3rd Qu.: 77.0   3rd Qu.: 91.0   3rd Qu.: 62.00  
##  Max.   :100.00   Max.   :100.0   Max.   :100.0   Max.   :100.00

Something looks wrong with distance. The mean (31.05) is well above the median (19.02), and the maximum (239.19) is far beyond the third quartile (39.58). We will cross-check this using skewness and kurtosis.

##           vars   n  mean    sd median trimmed   mad   min    max  range
## weekend*     1 500  1.48  0.50   1.00    1.48  0.00  1.00   2.00   1.00
## num.child    2 500  1.74  1.50   2.00    1.61  1.48  0.00   5.00   5.00
## distance     3 500 31.05 33.15  19.02   24.65 17.26  0.53 239.19 238.67
## rides        4 500 85.85  5.46  86.00   85.81  5.93 72.00 100.00  28.00
## games        5 500 78.67  8.12  78.00   78.72  8.90 57.00 100.00  43.00
## wait         6 500 69.90 10.77  70.00   70.00 11.86 40.00 100.00  60.00
## clean        7 500 87.90  5.12  88.00   87.87  5.19 74.00 100.00  26.00
## overall      8 500 51.26 15.88  50.00   50.92 16.31  6.00 100.00  94.00
##            skew kurtosis   se
## weekend*   0.07    -2.00 0.02
## num.child  0.44    -0.75 0.07
## distance   2.57     9.00 1.48
## rides      0.06    -0.47 0.24
## games     -0.05    -0.35 0.36
## wait      -0.07    -0.24 0.48
## clean      0.01    -0.45 0.23
## overall    0.19    -0.12 0.71
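To make the skew and kurtosis columns less of a black box, here is a minimal sketch of how such values can be computed by hand in base R. It uses simulated right-skewed data, not the survey data, and the helper names skewness() and excess_kurtosis() are my own. Packages may apply small-sample corrections, so their numbers can differ slightly from these plain moment-based versions.

```r
# Simulated right-skewed variable (illustration only, not the real distance column).
set.seed(1)
distance <- rlnorm(500, meanlog = 3, sdlog = 1)

summary(distance)  # mean well above median hints at right skew

# Moment-based skewness: third central moment over variance^1.5.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Excess kurtosis: fourth central moment over variance^2, minus 3 (the normal value).
excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

skewness(distance)         # clearly positive for right-skewed data
excess_kurtosis(distance)  # positive when the tails are heavier than normal
```

A value near 0 for both, as for rides or clean in the table, is what a roughly symmetric, bell-shaped variable gives.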

The skew (2.57) and kurtosis (9.00) of distance clearly show that it is skewed to the right. So let me plot a histogram of distance and check whether it is normally distributed.

Here we will use a log transformation, with Box-Cox as an alternative. I chose the log transformation based on the skewness and the histogram: when data is heavily right-skewed, a log transformation works well because it visibly pulls the distribution towards normal.

Please note: whenever we find skewness we transform the data. Strictly speaking, linear regression assumes normally distributed residuals rather than normally distributed predictors, but a heavily skewed predictor lets a few extreme values dominate the fit, so transforming it usually helps.

Types of Data Transformation

Log-Normal Transformation

Square Root Transformation

Box-Cox Transformation

Removal of outliers

So we will apply the log transformation and check the result by plotting a histogram.
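A minimal sketch of this step, assuming the data frame is called mydata and using simulated distances in place of the real ones. The column name loggdist is the one that appears in the model formulas later in this post; everything else here is illustrative.

```r
# Simulated right-skewed distances (stand-in for the real column).
set.seed(1)
mydata <- data.frame(distance = rlnorm(500, meanlog = 3, sdlog = 1))

# Log-transform the skewed variable.
mydata$loggdist <- log(mydata$distance)

if (!interactive()) pdf(NULL)  # avoid writing Rplots.pdf when run as a script

# Compare the histograms before and after the transformation.
par(mfrow = c(1, 2))
hist(mydata$distance, main = "distance", xlab = "distance")
hist(mydata$loggdist, main = "log(distance)", xlab = "loggdist")
par(mfrow = c(1, 1))
```

Note that log() needs strictly positive values; here the minimum distance is 0.53, so that is not a problem.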

Now the data is concentrated towards the center and looks fine.

## Warning: package 'corrplot' was built under R version 3.4.1

The correlation plot shows that clean and rides are highly correlated with each other, and that overall has a moderate correlation with both clean and rides. Beyond that, we do not see serious multicollinearity in the data. If interested, we can go further and compute the VIF (variance inflation factor) to confirm this; I am satisfied with the data as it is.

Please note: if you find multicollinearity in the data, it is better to remove one of the features involved.
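For readers who want to try the VIF check, here is a rough base-R sketch. It runs on simulated predictors (rides and clean are deliberately made correlated), and vif_by_hand() is a hypothetical helper written for this example, not a function from any package. It uses the definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors.

```r
# Simulated predictors; clean is built to correlate with rides (illustration only).
set.seed(1)
n <- 500
rides <- rnorm(n, 86, 5)
clean <- 0.7 * rides + rnorm(n, 28, 3.5)
games <- rnorm(n, 79, 8)
wait  <- rnorm(n, 70, 11)
X <- data.frame(rides, games, wait, clean)

round(cor(X), 2)  # pairwise correlations

# VIF for each column: regress it on the remaining columns and use 1 / (1 - R^2).
vif_by_hand <- function(df) {
  sapply(names(df), function(v) {
    f  <- reformulate(setdiff(names(df), v), response = v)
    r2 <- summary(lm(f, data = df))$r.squared
    1 / (1 - r2)
  })
}

vif_by_hand(X)
```

A VIF of 1 means a predictor is unrelated to the others. A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity.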

# Model 1: all rating features plus weekend and log-distance
lm.model.fit1 <- lm(overall ~ rides + games + wait + clean + weekend + loggdist, data = mydata)

summary(lm.model.fit1)
## 
## Call:
## lm(formula = overall ~ rides + games + wait + clean + weekend + 
##     loggdist, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.366  -6.349   0.978   7.283  28.629 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -133.49660    8.45908 -15.781  < 2e-16 ***
## rides          0.53815    0.14184   3.794 0.000167 ***
## games          0.15385    0.06885   2.235 0.025896 *  
## wait           0.55270    0.04766  11.598  < 2e-16 ***
## clean          0.96975    0.15950   6.080 2.41e-09 ***
## weekendyes    -0.83767    0.95033  -0.881 0.378508    
## loggdist       1.00272    0.48212   2.080 0.038060 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.55 on 493 degrees of freedom
## Multiple R-squared:  0.5635, Adjusted R-squared:  0.5582 
## F-statistic: 106.1 on 6 and 493 DF,  p-value: < 2.2e-16

Looking at this model, I can clearly see that weekend has no significant effect (p = 0.379), so I will remove it. The reason is the effect of unwanted features on the model: the more of them you add, the more noise is created. Thus we remove the weekend feature.

# Model 2: same as model 1 without weekend
lm.model.fit2 <- lm(overall ~ rides + games + wait + clean + loggdist, data = mydata)
summary(lm.model.fit2)
## 
## Call:
## lm(formula = overall ~ rides + games + wait + clean + loggdist, 
##     data = mydata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -30.8343  -6.6544   0.8957   7.2803  29.0467 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -134.29859    8.40811 -15.973  < 2e-16 ***
## rides          0.54326    0.14169   3.834 0.000142 ***
## games          0.15465    0.06883   2.247 0.025092 *  
## wait           0.55197    0.04764  11.587  < 2e-16 ***
## clean          0.96778    0.15945   6.070 2.56e-09 ***
## loggdist       1.04366    0.47977   2.175 0.030080 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.55 on 494 degrees of freedom
## Multiple R-squared:  0.5628, Adjusted R-squared:  0.5584 
## F-statistic: 127.2 on 5 and 494 DF,  p-value: < 2.2e-16

As I said, we are interested in reducing noise. Removing weekend lowers the Multiple R^2 only marginally, from 0.5635 to 0.5628, while the Adjusted R^2 improves slightly, from 0.5582 to 0.5584.

This says that 0.5628, i.e. 56.28% of the variation in overall, is explained by the model; the remaining 43.72% is unexplained, which means the model does not predict it. We can loosely take the Multiple R^2 of lm.model.fit2, 56.28%, as a measure of how well the model fits.

The “*” symbols give the significance of the features in the model. Rides, clean and wait have the highest significance (three stars, p < 0.001).

The residual standard error of the model, 10.55, is the typical size of the model's prediction error in units of overall. Our objective is to make it as small as possible, which increases the predictive power of the model. We can look at the systematic nature of the residuals using their 1st and 3rd quartiles, which show how far they lie from the median and whether they are similar in size. If Q1 and Q3 are similar in magnitude (here -6.65 and 7.28), we can say the residuals are roughly symmetric, which is consistent with a normal distribution.
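The quantities reported by summary() can be reproduced by hand, which may help in reading them. The sketch below uses a simulated response and two simulated predictors (not the survey data) and computes R^2, adjusted R^2 and the residual standard error from the residual and total sums of squares.

```r
# Simulated data (illustration only).
set.seed(1)
n <- 500
wait    <- rnorm(n, 70, 11)
clean   <- rnorm(n, 88, 5)
overall <- -80 + 0.55 * wait + 1.0 * clean + rnorm(n, sd = 10)
fit <- lm(overall ~ wait + clean)

rss <- sum(residuals(fit)^2)             # residual sum of squares
tss <- sum((overall - mean(overall))^2)  # total sum of squares
p   <- length(coef(fit)) - 1             # number of predictors (without intercept)

r2     <- 1 - rss / tss                           # Multiple R-squared
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)    # Adjusted R-squared
rse    <- sqrt(rss / (n - p - 1))                 # Residual standard error

c(r2 = r2, adj_r2 = adj_r2, rse = rse)
```

The same arithmetic can be checked against the output in this post: model 2 has a residual sum of squares of 55005 on 494 degrees of freedom (see the analysis of variance table), and sqrt(55005 / 494) is about 10.55, the reported residual standard error.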

Now we will do a few important data visualisations for the same model.

If you look at the Normal Q-Q plot, you can see that the points do not all lie on a straight line. The numbered observations are outliers that fall away from the line, which says the residuals are not perfectly normally distributed.

Looking at Residuals vs Fitted, we can clearly see that there is no systematic pattern in the residuals.

In Residuals vs Leverage, we see no dominance, i.e. no single observation has an outsized influence; the points are concentrated towards the left. The plot also flags a few outliers, numbered 55, 478 and 441.

When we want to compare two models we basically go by R^2: the higher the R^2, the higher the predictability. First we look at the R^2 of model 1.
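Diagnostic plots of this kind are what plot() gives for a fitted lm object. Here is a small sketch on simulated data (the model and variable values are made up for illustration), together with the numeric quantities behind the plots.

```r
# Simulated data and a simple fitted model (illustration only).
set.seed(1)
n <- 500
wait    <- rnorm(n, 70, 11)
overall <- 10 + 0.6 * wait + rnorm(n, sd = 10)
fit <- lm(overall ~ wait)

if (!interactive()) pdf(NULL)  # avoid writing Rplots.pdf when run as a script

# Four default panels: Residuals vs Fitted, Normal Q-Q,
# Scale-Location, Residuals vs Leverage.
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))

# Numeric counterparts of what the plots show.
std_res <- rstandard(fit)   # standardized residuals (Q-Q plot, y-axis of leverage plot)
lev     <- hatvalues(fit)   # leverage of each observation

head(sort(abs(std_res), decreasing = TRUE), 3)  # the three most extreme residuals
```

The observations that plot() labels with numbers are simply the most extreme ones in each panel, so they are candidates to inspect rather than automatic deletions.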

## [1] 0.5634968

Now we will do the same for model-2

## [1] 0.5628089

We can see that the weekend feature makes a negligible contribution to the model.

Let me construct a model based on weekend alone.

## [1] 0.002812546

We can conclude that the model with all features will always have the highest R^2, whether or not each feature contributes to the model.

Now let's test the null hypothesis that the two models are the same (M1 = M2), against the alternative hypothesis that they are not.

## Analysis of Variance Table
## 
## Model 1: overall ~ rides + games + wait + clean + weekend + loggdist
## Model 2: overall ~ rides + games + wait + clean + loggdist
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1    493 54918                           
## 2    494 55005 -1   -86.548 0.7769 0.3785

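We can see that the p-value is 0.3785, which is greater than 0.05, so we fail to reject the null hypothesis: dropping weekend does not significantly worsen the fit. This is the same p-value as the t-test for weekendyes in model 1.

Now we will use the model to make predictions.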
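The comparison above can be sketched as follows on simulated data in which weekend has no true effect. The data frame d and the models m1 and m2 are made up for this example. It shows that the R^2 of the larger model is never lower, and that the F statistic for dropping a single term equals the square of that term's t value.

```r
# Simulated data: weekend is pure noise with respect to overall (illustration only).
set.seed(1)
n <- 500
d <- data.frame(
  wait    = rnorm(n, 70, 11),
  weekend = factor(sample(c("no", "yes"), n, replace = TRUE))
)
d$overall <- 10 + 0.6 * d$wait + rnorm(n, sd = 10)

m1 <- lm(overall ~ wait + weekend, data = d)  # larger model
m2 <- lm(overall ~ wait, data = d)            # nested model without weekend

# R^2 of the larger model is at least as high as that of the nested one.
c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)

# F-test of the null hypothesis that the extra term adds nothing.
cmp <- anova(m1, m2)
cmp

# For a single dropped term, F equals the squared t value of that term.
t_weekend <- coef(summary(m1))["weekendyes", "t value"]
c(F = cmp$F[2], t_squared = t_weekend^2)
```

In the output of this post the same relation holds: the t value of weekendyes is -0.881, and its square is about 0.777, the F value in the table.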

##        1        2        3        4        5        6        7        8 
## 47.94096 54.16450 52.74094 50.15020 54.37468 28.51799 36.11193 43.01148 
##        9       10       11       12       13       14       15       16 
## 66.92028 45.06093 64.04987 37.79224 65.60674 43.54052 59.96347 58.07422 
##       17       18       19       20 
## 50.63371 30.73789 45.00922 62.63889
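Predictions like these come from predict(). The post does not show the exact call, so the following is only a sketch on simulated data with a made-up model, predicting the first 20 rows.

```r
# Simulated data and model (illustration only).
set.seed(1)
n <- 500
d <- data.frame(wait = rnorm(n, 70, 11), clean = rnorm(n, 88, 5))
d$overall <- -80 + 0.55 * d$wait + 1.0 * d$clean + rnorm(n, sd = 10)
fit <- lm(overall ~ wait + clean, data = d)

# Point predictions for the first 20 observations.
pred <- predict(fit, newdata = d[1:20, ])
pred

# The same predictions with 95% prediction intervals (first three rows shown).
head(predict(fit, newdata = d[1:20, ], interval = "prediction"), 3)
```

Predicting on the rows used for fitting only reproduces the fitted values; to judge real predictive power, the model should be applied to data it has not seen.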
