Sunday, 8 October 2017

German Credit: Linear Regression Analysis


German Credit: Linear Regression Analysis

Introduction

we will try to understand the data using linear regression and perform feature selection and other test on the data and try to optimize the model.Lets import the data with the name mydata using choose.file and read.csv function

mydata<-read.csv("C:/Users/Sangmesh/Google Drive/Big Data using R/Kaggle/germancredit.csv",header = TRUE)

Now Lets view the structure of the data

str(mydata)
## 'data.frame':    1000 obs. of  21 variables:
##  $ Default        : int  0 1 0 0 1 0 0 0 0 1 ...
##  $ checkingstatus1: Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
##  $ duration       : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ history        : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
##  $ purpose        : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
##  $ amount         : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ savings        : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
##  $ employ         : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
##  $ installment    : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ status         : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
##  $ others         : Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
##  $ residence      : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ property       : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
##  $ age            : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ otherplans     : Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ housing        : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
##  $ cards          : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ job            : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
##  $ liable         : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ tele           : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
##  $ foreign        : Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...

the structure of the data reveals many things most of the list are integer or character. Now my next quest is there any missing values? Let’s check it using is.na function and View function. As we know the data is a treated data free from missing values and all the data in the lists are converted into coded characters.which help us in computation

Now let me show you a small sample how the data is distributed with a example plotting it on a barograph.

barplot(table(mydata$checkingstatus1),main ="Plot of checking Status")

you can see the data is distributed in 4 coded characters i.e A11 to A14. and with the help of data visualization I have shown you the ##frequency distribution of character

barplot(table(mydata$history))

now let’s see for history in mydata and plot a bar plot if I want to have a check the frequency distribution in tabular format. Please check the same below

table(mydata$history)
## 
## A30 A31 A32 A33 A34 
##  40  49 530  88 293

Now we came to know about the data. Its important to know what to do with this data? what are the various algorithms we can apply.

I want to start from basic lm function and let me try with 2 or 3 algorithms and test which fit the best among these 2 or 3 algos

1st I am interested applying linear models to understand the data and its significance and co-linearity

Data preparation

There is a small problem applying the linear model. the data are not numerical so we will convert the data into numerical format

So we will re code the vectors into numerical format

Now I will remove “A” form mydata and convert everything into numerical. So we will come up with a new dataset and keep the original mydata undisturbed

newdata<-mydata

After creating a new data

We will look at data wangling of newdata that we have recently

table(newdata$checkingstatus1)
## 
## A11 A12 A13 A14 
## 274 269  63 394

Now we will remove all A and convert it into a numaric format

newdata$checkingstatus1<-gsub("A11","11",newdata$checkingstatus1)
newdata$checkingstatus1<-gsub("A12","12",newdata$checkingstatus1)
newdata$checkingstatus1<-gsub("A13","13",newdata$checkingstatus1)
newdata$checkingstatus1<-gsub("A14","14",newdata$checkingstatus1)

now lets check it it done?

table(newdata$checkingstatus1)
## 
##  11  12  13  14 
## 274 269  63 394

now lets look at history

table(newdata$history)
## 
## A30 A31 A32 A33 A34 
##  40  49 530  88 293

and perfrom the same

newdata$history<-gsub("A30","30",newdata$history)
newdata$history<-gsub("A31","31",newdata$history)
newdata$history<-gsub("A32","32",newdata$history)
newdata$history<-gsub("A33","33",newdata$history)
newdata$history<-gsub("A34","34",newdata$history)
table(newdata$history)
## 
##  30  31  32  33  34 
##  40  49 530  88 293

repete the same process for purpose

table(newdata$purpose)
## 
##  A40  A41 A410  A42  A43  A44  A45  A46  A48  A49 
##  234  103   12  181  280   12   22   50    9   97
newdata$purpose<-gsub("A40","40",newdata$purpose)
newdata$purpose<-gsub("A41","41",newdata$purpose)
newdata$purpose<-gsub("A410","410",newdata$purpose)
newdata$purpose<-gsub("A42","42",newdata$purpose)
newdata$purpose<-gsub("A44","44",newdata$purpose)
newdata$purpose<-gsub("A45","45",newdata$purpose)
newdata$purpose<-gsub("A46","46",newdata$purpose)
newdata$purpose<-gsub("A48","48",newdata$purpose)
newdata$purpose<-gsub("A49","49",newdata$purpose)
newdata$purpose<-gsub("A43","43",newdata$purpose)
table(newdata$purpose)
## 
##  40  41 410  42  43  44  45  46  48  49 
## 234 103  12 181 280  12  22  50   9  97

Lets look at saving

table(newdata$savings)
## 
## A61 A62 A63 A64 A65 
## 603 103  63  48 183
newdata$savings<-gsub("A61","61",newdata$savings)
newdata$savings<-gsub("A62","62",newdata$savings)
newdata$savings<-gsub("A63","63",newdata$savings)
newdata$savings<-gsub("A64","64",newdata$savings)
newdata$savings<-gsub("A65","65",newdata$savings)
table(newdata$savings)
## 
##  61  62  63  64  65 
## 603 103  63  48 183

Lets look ar employ

table(newdata$employ)
## 
## A71 A72 A73 A74 A75 
##  62 172 339 174 253
newdata$employ<-gsub("A71","71",newdata$employ)
newdata$employ<-gsub("A72","72",newdata$employ)
newdata$employ<-gsub("A73","73",newdata$employ)
newdata$employ<-gsub("A74","74",newdata$employ)
newdata$employ<-gsub("A75","75",newdata$employ)
table(newdata$employ)
## 
##  71  72  73  74  75 
##  62 172 339 174 253

Perform the same with all the data list

table(newdata$status)
## 
## A91 A92 A93 A94 
##  50 310 548  92
newdata$status<-gsub("A91","91",newdata$status)
newdata$status<-gsub("A92","92",newdata$status)
newdata$status<-gsub("A93","93",newdata$status)
newdata$status<-gsub("A94","94",newdata$status)
table(newdata$status)
## 
##  91  92  93  94 
##  50 310 548  92
table(newdata$others)
## 
## A101 A102 A103 
##  907   41   52
newdata$others<-gsub("A101","101",newdata$others)
newdata$others<-gsub("A102","102",newdata$others)
newdata$others<-gsub("A103","103",newdata$others)
table(newdata$others)
## 
## 101 102 103 
## 907  41  52
table(newdata$property)
## 
## A121 A122 A123 A124 
##  282  232  332  154
newdata$property<-gsub("A121","121",newdata$property)
newdata$property<-gsub("A122","122",newdata$property)
newdata$property<-gsub("A123","123",newdata$property)
newdata$property<-gsub("A124","124",newdata$property)
table(newdata$property)
## 
## 121 122 123 124 
## 282 232 332 154
table(newdata$otherplans)
## 
## A141 A142 A143 
##  139   47  814
newdata$otherplans<-gsub("A141","141",newdata$otherplans)
newdata$otherplans<-gsub("A142","142",newdata$otherplans)
newdata$otherplans<-gsub("A143","143",newdata$otherplans)
table(newdata$otherplans)
## 
## 141 142 143 
## 139  47 814
table(newdata$housing)
## 
## A151 A152 A153 
##  179  713  108
newdata$housing<-gsub("A151","151",newdata$housing)
newdata$housing<-gsub("A152","152",newdata$housing)
newdata$housing<-gsub("A153","153",newdata$housing)
table(newdata$housing)
## 
## 151 152 153 
## 179 713 108
table(newdata$job)
## 
## A171 A172 A173 A174 
##   22  200  630  148
newdata$job<-gsub("A171","171",newdata$job)
newdata$job<-gsub("A172","172",newdata$job)
newdata$job<-gsub("A173","173",newdata$job)
newdata$job<-gsub("A174","174",newdata$job)
table(newdata$job)
## 
## 171 172 173 174 
##  22 200 630 148
table(newdata$tele)
## 
## A191 A192 
##  596  404
newdata$tele<-gsub("A191","191",newdata$tele)
newdata$tele<-gsub("A192","192",newdata$tele)
table(newdata$tele)
## 
## 191 192 
## 596 404
table(newdata$foreign)
## 
## A201 A202 
##  963   37
newdata$foreign<-gsub("A201","201",newdata$foreign)
newdata$foreign<-gsub("A202","202",newdata$foreign)
table(newdata$foreign)
## 
## 201 202 
## 963  37

After converting character to numeric the r is not going to read it as numeric so we will convert it to numeric

newdata$foreign<-as.numeric(newdata$foreign)
newdata$checkingstatus1<-as.numeric(newdata$checkingstatus1)
newdata$duration<-as.numeric(newdata$duration)
newdata$Default<-as.numeric(newdata$Default)
newdata$history<-as.numeric(newdata$history)
newdata$purpose<-as.numeric(newdata$purpose)
newdata$amount<-as.numeric(newdata$amount)
newdata$savings<-as.numeric(newdata$savings)
newdata$employ<-as.numeric(newdata$employ)
newdata$installment<-as.numeric(newdata$installment)
newdata$status<-as.numeric(newdata$status)
newdata$others<-as.numeric(newdata$others)
newdata$residence<-as.numeric(newdata$residence)
newdata$property<-as.numeric(newdata$property)
newdata$age<-as.numeric(newdata$age)
newdata$otherplans<-as.numeric(newdata$otherplans)
newdata$housing<-as.numeric(newdata$housing)
newdata$cards<-as.numeric(newdata$cards)
newdata$job<-as.numeric(newdata$job)
newdata$liable<-as.numeric(newdata$liable)
newdata$tele<-as.numeric(newdata$tele)
newdata$foreign<-as.numeric(newdata$foreign)
str(newdata)
## 'data.frame':    1000 obs. of  21 variables:
##  $ Default        : num  0 1 0 0 1 0 0 0 0 1 ...
##  $ checkingstatus1: num  11 12 14 11 11 14 14 12 14 12 ...
##  $ duration       : num  6 48 12 42 24 36 24 36 12 30 ...
##  $ history        : num  34 32 34 32 33 32 32 32 32 34 ...
##  $ purpose        : num  43 43 46 42 40 46 42 41 43 40 ...
##  $ amount         : num  1169 5951 2096 7882 4870 ...
##  $ savings        : num  65 61 61 61 61 65 63 61 64 61 ...
##  $ employ         : num  75 73 74 74 73 73 75 73 74 71 ...
##  $ installment    : num  4 2 2 2 3 2 3 2 2 4 ...
##  $ status         : num  93 92 93 93 93 93 93 93 91 94 ...
##  $ others         : num  101 101 101 103 101 101 101 101 101 101 ...
##  $ residence      : num  4 2 3 4 4 4 4 2 4 2 ...
##  $ property       : num  121 121 121 122 124 124 122 123 121 123 ...
##  $ age            : num  67 22 49 45 53 35 53 35 61 28 ...
##  $ otherplans     : num  143 143 143 143 143 143 143 143 143 143 ...
##  $ housing        : num  152 152 152 153 153 153 152 151 152 152 ...
##  $ cards          : num  2 1 1 1 2 1 1 1 1 2 ...
##  $ job            : num  173 173 172 173 173 172 173 174 172 174 ...
##  $ liable         : num  1 1 2 2 2 2 1 1 1 1 ...
##  $ tele           : num  192 191 191 191 191 192 191 192 191 191 ...
##  $ foreign        : num  201 201 201 201 201 201 201 201 201 201 ...

Let me apply linear model on to the data

fit_lm_1<-lm(Default~.,data = newdata)
summary(fit_lm_1)
## 
## Call:
## lm(formula = Default ~ ., data = newdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8203 -0.2974 -0.1150  0.3458  1.0573 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      6.078e+01  1.651e+01   3.680 0.000246 ***
## checkingstatus1 -9.972e-02  1.082e-02  -9.216  < 2e-16 ***
## duration         4.187e-03  1.471e-03   2.846 0.004518 ** 
## history         -6.426e-02  1.366e-02  -4.705 2.91e-06 ***
## purpose         -3.074e-04  3.321e-04  -0.926 0.354880    
## amount           1.631e-05  6.850e-06   2.381 0.017455 *  
## savings         -3.405e-02  8.473e-03  -4.019 6.29e-05 ***
## employ          -2.649e-02  1.157e-02  -2.290 0.022216 *  
## installment      4.704e-02  1.307e-02   3.600 0.000335 ***
## status          -4.465e-02  1.864e-02  -2.395 0.016824 *  
## others          -5.962e-02  2.777e-02  -2.147 0.032025 *  
## residence        3.507e-03  1.258e-02   0.279 0.780390    
## property         3.285e-02  1.434e-02   2.291 0.022169 *  
## age             -1.094e-03  1.283e-03  -0.853 0.393976    
## otherplans      -4.883e-02  1.873e-02  -2.607 0.009286 ** 
## housing         -4.866e-02  2.753e-02  -1.767 0.077463 .  
## cards            4.076e-02  2.518e-02   1.618 0.105894    
## job             -3.702e-03  2.257e-02  -0.164 0.869744    
## liable           2.895e-02  3.667e-02   0.789 0.430048    
## tele            -5.265e-02  2.941e-02  -1.790 0.073747 .  
## foreign         -1.103e-01  6.998e-02  -1.576 0.115267    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4038 on 979 degrees of freedom
## Multiple R-squared:   0.24,  Adjusted R-squared:  0.2244 
## F-statistic: 15.45 on 20 and 979 DF,  p-value: < 2.2e-16

Now lets check the VIF colinearity

library(car)
## Warning: package 'car' was built under R version 3.4.1
vif(fit_lm_1)
## checkingstatus1        duration         history         purpose 
##        1.134732        1.928471        1.341004        1.086295 
##          amount         savings          employ     installment 
##        2.290836        1.098126        1.197026        1.309721 
##          status          others       residence        property 
##        1.067936        1.078030        1.180578        1.389551 
##             age      otherplans         housing           cards 
##        1.305363        1.070639        1.311005        1.296714 
##             job          liable            tele         foreign 
##        1.333466        1.080431        1.277645        1.070195

There is no co-linearity in the model

By applying linear model we are able to identify that checkingstatus,history,savings and installment are highly significant

Duration and otherplans have moderate significance

Others are not hiving significance or have least significance

Lets load leaps packages to carry further analysis

library(leaps)
## Warning: package 'leaps' was built under R version 3.4.1

Now we will do subset and later we will do feature selection and try to find out the min and max features for the model.

We will use regsubset to create a subset

sub.fit<-regsubsets(Default~.,data = newdata)

now from the subset we will create summary to slelect the best fit

best.summary<-summary(sub.fit)

Now we will use which.min fuction to understand the min features to be involved in the model with the help of RSS

which.min(best.summary$rss)
## [1] 8

Now with the help of data data visuvilization let me show performence of Mallow’s Cp

which.min(best.summary$cp)
## [1] 8

Lets do some data Viz stuff

par(mfrow=c(1,2))
plot(best.summary$cp,xlab = "Number of Features",ylab = "Mallow's CP")
plot(sub.fit,scale = "Cp")

Let me try the same with BIC

which.min(best.summary$bic)
## [1] 5

Data visuvalisation of BIC

par(mfrow=c(1,2))
plot(best.summary$bic,xlab = "Number of Features",ylab = "BIC")
plot(sub.fit,scale = "bic")

let me plot adjusted R^2

plot(best.summary$adjr2)

Now my main objective for the best model feature selection is to select the model with highest adjt R^2 and low Cp

There is no need for stepwise i.e forward and backward test for optimizing feature selection we are clearly able to see checking status, history,duration,savings,installment, status, others and otherplans are significantly influencing the model

thus we have done possible feature selection. now we will create a model based on the analysis of feature selection

fit.lm.2<-lm(Default~checkingstatus1+duration+history+savings+installment+status+others+otherplans,data=newdata)
summary(fit.lm.2)
## 
## Call:
## lm(formula = Default ~ checkingstatus1 + duration + history + 
##     savings + installment + status + others + otherplans, data = newdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7322 -0.3043 -0.1158  0.3621  1.1285 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     24.770992   4.302522   5.757 1.14e-08 ***
## checkingstatus1 -0.103332   0.010801  -9.567  < 2e-16 ***
## duration         0.006787   0.001081   6.281 5.04e-10 ***
## history         -0.063290   0.012250  -5.166 2.89e-07 ***
## savings         -0.036733   0.008411  -4.367 1.39e-05 ***
## installment      0.030879   0.011649   2.651  0.00816 ** 
## status          -0.050493   0.018404  -2.744  0.00619 ** 
## others          -0.072948   0.027386  -2.664  0.00785 ** 
## otherplans      -0.049104   0.018465  -2.659  0.00796 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4073 on 991 degrees of freedom
## Multiple R-squared:  0.2172, Adjusted R-squared:  0.2109 
## F-statistic: 34.37 on 8 and 991 DF,  p-value: < 2.2e-16

Lets again check for the colineraity

vif(fit.lm.2)
## checkingstatus1        duration         history         savings 
##        1.111266        1.022862        1.060301        1.063681 
##     installment          status          others      otherplans 
##        1.022836        1.022688        1.030762        1.022322

The model looks good without significant colinerity now lets do Breush Pagan test on the new linear model that we have selected

library(lmtest)
## Warning: package 'lmtest' was built under R version 3.4.1
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bptest(fit.lm.2)
## 
##  studentized Breusch-Pagan test
## 
## data:  fit.lm.2
## BP = 96.231, df = 8, p-value < 2.2e-16

we got a more extreme P value we have non constant variance homoscedasticity is one of the assumptions for linear regression so we will drop this and let apply other model.

If you are interested applying linear regression. Kindly do the transformation of variables with log normal and carry further analysis. Based on these facts I am able to know what are the feature affecting the output.

Logistic regression will work fine for the same.So in our next article I will be covering the same.

Wednesday, 20 September 2017

New Online Video Series to Teach Data Science!!!

It's been little over a month. I really didn't know why I created this blog and its purpose. Later couple of days ago I got an idea. Why I don't start a blog/ video series to guide fellow aspirants!!

That's it!!


I started making a video on Youtube describing what are the parts I am willing to covering the video series. Currently, I have uploaded a video. Which you can find below.


I divided data science series into 3 parts!

  1. Programming
  2. Machine Learning 
  3. Data Science
On Initial Videos, We will know how to program later we will also look into machine learning. And will try to solve real-life problems. End to End data science projects live!!