German Credit: Linear Regression Analysis
Sangamesh K S
October 8, 2017
Introduction
we will try to understand the data using linear regression and perform feature selection and other test on the data and try to optimize the model.Lets import the data with the name mydata using choose.file and read.csv function
mydata<-read.csv("C:/Users/Sangmesh/Google Drive/Big Data using R/Kaggle/germancredit.csv",header = TRUE)
Now Lets view the structure of the data
str(mydata)
## 'data.frame': 1000 obs. of 21 variables:
## $ Default : int 0 1 0 0 1 0 0 0 0 1 ...
## $ checkingstatus1: Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
## $ duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ history : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
## $ purpose : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
## $ amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ savings : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
## $ employ : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
## $ installment : int 4 2 2 2 3 2 3 2 2 4 ...
## $ status : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
## $ others : Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
## $ residence : int 4 2 3 4 4 4 4 2 4 2 ...
## $ property : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
## $ age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ otherplans : Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ housing : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
## $ cards : int 2 1 1 1 2 1 1 1 1 2 ...
## $ job : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
## $ liable : int 1 1 2 2 2 2 1 1 1 1 ...
## $ tele : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
## $ foreign : Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...
the structure of the data reveals many things most of the list are integer or character. Now my next quest is there any missing values? Let’s check it using is.na function and View function. As we know the data is a treated data free from missing values and all the data in the lists are converted into coded characters.which help us in computation
Now let me show you a small sample how the data is distributed with a example plotting it on a barograph.
barplot(table(mydata$checkingstatus1),main ="Plot of checking Status")
you can see the data is distributed in 4 coded characters i.e A11 to A14. and with the help of data visualization I have shown you the ##frequency distribution of character
barplot(table(mydata$history))
now let’s see for history in mydata and plot a bar plot if I want to have a check the frequency distribution in tabular format. Please check the same below
table(mydata$history)
##
## A30 A31 A32 A33 A34
## 40 49 530 88 293
Now we came to know about the data. Its important to know what to do with this data? what are the various algorithms we can apply.
I want to start from basic lm function and let me try with 2 or 3 algorithms and test which fit the best among these 2 or 3 algos
1st I am interested applying linear models to understand the data and its significance and co-linearity
Data preparation
There is a small problem applying the linear model. the data are not numerical so we will convert the data into numerical format
So we will re code the vectors into numerical format
Now I will remove “A” form mydata and convert everything into numerical. So we will come up with a new dataset and keep the original mydata undisturbed
newdata<-mydata
After creating a new data
We will look at data wangling of newdata that we have recently
table(newdata$checkingstatus1)
##
## A11 A12 A13 A14
## 274 269 63 394
Now we will remove all A and convert it into a numaric format
newdata$checkingstatus1<-gsub("A11","11",newdata$checkingstatus1)
newdata$checkingstatus1<-gsub("A12","12",newdata$checkingstatus1)
newdata$checkingstatus1<-gsub("A13","13",newdata$checkingstatus1)
newdata$checkingstatus1<-gsub("A14","14",newdata$checkingstatus1)
now lets check it it done?
table(newdata$checkingstatus1)
##
## 11 12 13 14
## 274 269 63 394
now lets look at history
table(newdata$history)
##
## A30 A31 A32 A33 A34
## 40 49 530 88 293
and perfrom the same
newdata$history<-gsub("A30","30",newdata$history)
newdata$history<-gsub("A31","31",newdata$history)
newdata$history<-gsub("A32","32",newdata$history)
newdata$history<-gsub("A33","33",newdata$history)
newdata$history<-gsub("A34","34",newdata$history)
table(newdata$history)
##
## 30 31 32 33 34
## 40 49 530 88 293
repete the same process for purpose
table(newdata$purpose)
##
## A40 A41 A410 A42 A43 A44 A45 A46 A48 A49
## 234 103 12 181 280 12 22 50 9 97
newdata$purpose<-gsub("A40","40",newdata$purpose)
newdata$purpose<-gsub("A41","41",newdata$purpose)
newdata$purpose<-gsub("A410","410",newdata$purpose)
newdata$purpose<-gsub("A42","42",newdata$purpose)
newdata$purpose<-gsub("A44","44",newdata$purpose)
newdata$purpose<-gsub("A45","45",newdata$purpose)
newdata$purpose<-gsub("A46","46",newdata$purpose)
newdata$purpose<-gsub("A48","48",newdata$purpose)
newdata$purpose<-gsub("A49","49",newdata$purpose)
newdata$purpose<-gsub("A43","43",newdata$purpose)
table(newdata$purpose)
##
## 40 41 410 42 43 44 45 46 48 49
## 234 103 12 181 280 12 22 50 9 97
Lets look at saving
table(newdata$savings)
##
## A61 A62 A63 A64 A65
## 603 103 63 48 183
newdata$savings<-gsub("A61","61",newdata$savings)
newdata$savings<-gsub("A62","62",newdata$savings)
newdata$savings<-gsub("A63","63",newdata$savings)
newdata$savings<-gsub("A64","64",newdata$savings)
newdata$savings<-gsub("A65","65",newdata$savings)
table(newdata$savings)
##
## 61 62 63 64 65
## 603 103 63 48 183
Lets look ar employ
table(newdata$employ)
##
## A71 A72 A73 A74 A75
## 62 172 339 174 253
newdata$employ<-gsub("A71","71",newdata$employ)
newdata$employ<-gsub("A72","72",newdata$employ)
newdata$employ<-gsub("A73","73",newdata$employ)
newdata$employ<-gsub("A74","74",newdata$employ)
newdata$employ<-gsub("A75","75",newdata$employ)
table(newdata$employ)
##
## 71 72 73 74 75
## 62 172 339 174 253
Perform the same with all the data list
table(newdata$status)
##
## A91 A92 A93 A94
## 50 310 548 92
newdata$status<-gsub("A91","91",newdata$status)
newdata$status<-gsub("A92","92",newdata$status)
newdata$status<-gsub("A93","93",newdata$status)
newdata$status<-gsub("A94","94",newdata$status)
table(newdata$status)
##
## 91 92 93 94
## 50 310 548 92
table(newdata$others)
##
## A101 A102 A103
## 907 41 52
newdata$others<-gsub("A101","101",newdata$others)
newdata$others<-gsub("A102","102",newdata$others)
newdata$others<-gsub("A103","103",newdata$others)
table(newdata$others)
##
## 101 102 103
## 907 41 52
table(newdata$property)
##
## A121 A122 A123 A124
## 282 232 332 154
newdata$property<-gsub("A121","121",newdata$property)
newdata$property<-gsub("A122","122",newdata$property)
newdata$property<-gsub("A123","123",newdata$property)
newdata$property<-gsub("A124","124",newdata$property)
table(newdata$property)
##
## 121 122 123 124
## 282 232 332 154
table(newdata$otherplans)
##
## A141 A142 A143
## 139 47 814
newdata$otherplans<-gsub("A141","141",newdata$otherplans)
newdata$otherplans<-gsub("A142","142",newdata$otherplans)
newdata$otherplans<-gsub("A143","143",newdata$otherplans)
table(newdata$otherplans)
##
## 141 142 143
## 139 47 814
table(newdata$housing)
##
## A151 A152 A153
## 179 713 108
newdata$housing<-gsub("A151","151",newdata$housing)
newdata$housing<-gsub("A152","152",newdata$housing)
newdata$housing<-gsub("A153","153",newdata$housing)
table(newdata$housing)
##
## 151 152 153
## 179 713 108
table(newdata$job)
##
## A171 A172 A173 A174
## 22 200 630 148
newdata$job<-gsub("A171","171",newdata$job)
newdata$job<-gsub("A172","172",newdata$job)
newdata$job<-gsub("A173","173",newdata$job)
newdata$job<-gsub("A174","174",newdata$job)
table(newdata$job)
##
## 171 172 173 174
## 22 200 630 148
table(newdata$tele)
##
## A191 A192
## 596 404
newdata$tele<-gsub("A191","191",newdata$tele)
newdata$tele<-gsub("A192","192",newdata$tele)
table(newdata$tele)
##
## 191 192
## 596 404
table(newdata$foreign)
##
## A201 A202
## 963 37
newdata$foreign<-gsub("A201","201",newdata$foreign)
newdata$foreign<-gsub("A202","202",newdata$foreign)
table(newdata$foreign)
##
## 201 202
## 963 37
After converting character to numeric the r is not going to read it as numeric so we will convert it to numeric
newdata$foreign<-as.numeric(newdata$foreign)
newdata$checkingstatus1<-as.numeric(newdata$checkingstatus1)
newdata$duration<-as.numeric(newdata$duration)
newdata$Default<-as.numeric(newdata$Default)
newdata$history<-as.numeric(newdata$history)
newdata$purpose<-as.numeric(newdata$purpose)
newdata$amount<-as.numeric(newdata$amount)
newdata$savings<-as.numeric(newdata$savings)
newdata$employ<-as.numeric(newdata$employ)
newdata$installment<-as.numeric(newdata$installment)
newdata$status<-as.numeric(newdata$status)
newdata$others<-as.numeric(newdata$others)
newdata$residence<-as.numeric(newdata$residence)
newdata$property<-as.numeric(newdata$property)
newdata$age<-as.numeric(newdata$age)
newdata$otherplans<-as.numeric(newdata$otherplans)
newdata$housing<-as.numeric(newdata$housing)
newdata$cards<-as.numeric(newdata$cards)
newdata$job<-as.numeric(newdata$job)
newdata$liable<-as.numeric(newdata$liable)
newdata$tele<-as.numeric(newdata$tele)
newdata$foreign<-as.numeric(newdata$foreign)
str(newdata)
## 'data.frame': 1000 obs. of 21 variables:
## $ Default : num 0 1 0 0 1 0 0 0 0 1 ...
## $ checkingstatus1: num 11 12 14 11 11 14 14 12 14 12 ...
## $ duration : num 6 48 12 42 24 36 24 36 12 30 ...
## $ history : num 34 32 34 32 33 32 32 32 32 34 ...
## $ purpose : num 43 43 46 42 40 46 42 41 43 40 ...
## $ amount : num 1169 5951 2096 7882 4870 ...
## $ savings : num 65 61 61 61 61 65 63 61 64 61 ...
## $ employ : num 75 73 74 74 73 73 75 73 74 71 ...
## $ installment : num 4 2 2 2 3 2 3 2 2 4 ...
## $ status : num 93 92 93 93 93 93 93 93 91 94 ...
## $ others : num 101 101 101 103 101 101 101 101 101 101 ...
## $ residence : num 4 2 3 4 4 4 4 2 4 2 ...
## $ property : num 121 121 121 122 124 124 122 123 121 123 ...
## $ age : num 67 22 49 45 53 35 53 35 61 28 ...
## $ otherplans : num 143 143 143 143 143 143 143 143 143 143 ...
## $ housing : num 152 152 152 153 153 153 152 151 152 152 ...
## $ cards : num 2 1 1 1 2 1 1 1 1 2 ...
## $ job : num 173 173 172 173 173 172 173 174 172 174 ...
## $ liable : num 1 1 2 2 2 2 1 1 1 1 ...
## $ tele : num 192 191 191 191 191 192 191 192 191 191 ...
## $ foreign : num 201 201 201 201 201 201 201 201 201 201 ...
Let me apply linear model on to the data
fit_lm_1<-lm(Default~.,data = newdata)
summary(fit_lm_1)
##
## Call:
## lm(formula = Default ~ ., data = newdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8203 -0.2974 -0.1150 0.3458 1.0573
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.078e+01 1.651e+01 3.680 0.000246 ***
## checkingstatus1 -9.972e-02 1.082e-02 -9.216 < 2e-16 ***
## duration 4.187e-03 1.471e-03 2.846 0.004518 **
## history -6.426e-02 1.366e-02 -4.705 2.91e-06 ***
## purpose -3.074e-04 3.321e-04 -0.926 0.354880
## amount 1.631e-05 6.850e-06 2.381 0.017455 *
## savings -3.405e-02 8.473e-03 -4.019 6.29e-05 ***
## employ -2.649e-02 1.157e-02 -2.290 0.022216 *
## installment 4.704e-02 1.307e-02 3.600 0.000335 ***
## status -4.465e-02 1.864e-02 -2.395 0.016824 *
## others -5.962e-02 2.777e-02 -2.147 0.032025 *
## residence 3.507e-03 1.258e-02 0.279 0.780390
## property 3.285e-02 1.434e-02 2.291 0.022169 *
## age -1.094e-03 1.283e-03 -0.853 0.393976
## otherplans -4.883e-02 1.873e-02 -2.607 0.009286 **
## housing -4.866e-02 2.753e-02 -1.767 0.077463 .
## cards 4.076e-02 2.518e-02 1.618 0.105894
## job -3.702e-03 2.257e-02 -0.164 0.869744
## liable 2.895e-02 3.667e-02 0.789 0.430048
## tele -5.265e-02 2.941e-02 -1.790 0.073747 .
## foreign -1.103e-01 6.998e-02 -1.576 0.115267
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4038 on 979 degrees of freedom
## Multiple R-squared: 0.24, Adjusted R-squared: 0.2244
## F-statistic: 15.45 on 20 and 979 DF, p-value: < 2.2e-16
Now lets check the VIF colinearity
library(car)
## Warning: package 'car' was built under R version 3.4.1
vif(fit_lm_1)
## checkingstatus1 duration history purpose
## 1.134732 1.928471 1.341004 1.086295
## amount savings employ installment
## 2.290836 1.098126 1.197026 1.309721
## status others residence property
## 1.067936 1.078030 1.180578 1.389551
## age otherplans housing cards
## 1.305363 1.070639 1.311005 1.296714
## job liable tele foreign
## 1.333466 1.080431 1.277645 1.070195
There is no co-linearity in the model
By applying linear model we are able to identify that checkingstatus,history,savings and installment are highly significant
Duration and otherplans have moderate significance
Others are not hiving significance or have least significance
Lets load leaps packages to carry further analysis
library(leaps)
## Warning: package 'leaps' was built under R version 3.4.1
Now we will do subset and later we will do feature selection and try to find out the min and max features for the model.
We will use regsubset to create a subset
sub.fit<-regsubsets(Default~.,data = newdata)
now from the subset we will create summary to slelect the best fit
best.summary<-summary(sub.fit)
Now we will use which.min fuction to understand the min features to be involved in the model with the help of RSS
which.min(best.summary$rss)
## [1] 8
Now with the help of data data visuvilization let me show performence of Mallow’s Cp
which.min(best.summary$cp)
## [1] 8
Lets do some data Viz stuff
par(mfrow=c(1,2))
plot(best.summary$cp,xlab = "Number of Features",ylab = "Mallow's CP")
plot(sub.fit,scale = "Cp")
Let me try the same with BIC
which.min(best.summary$bic)
## [1] 5
Data visuvalisation of BIC
par(mfrow=c(1,2))
plot(best.summary$bic,xlab = "Number of Features",ylab = "BIC")
plot(sub.fit,scale = "bic")
let me plot adjusted R^2
plot(best.summary$adjr2)
Now my main objective for the best model feature selection is to select the model with highest adjt R^2 and low Cp
There is no need for stepwise i.e forward and backward test for optimizing feature selection we are clearly able to see checking status, history,duration,savings,installment, status, others and otherplans are significantly influencing the model
thus we have done possible feature selection. now we will create a model based on the analysis of feature selection
fit.lm.2<-lm(Default~checkingstatus1+duration+history+savings+installment+status+others+otherplans,data=newdata)
summary(fit.lm.2)
##
## Call:
## lm(formula = Default ~ checkingstatus1 + duration + history +
## savings + installment + status + others + otherplans, data = newdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7322 -0.3043 -0.1158 0.3621 1.1285
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.770992 4.302522 5.757 1.14e-08 ***
## checkingstatus1 -0.103332 0.010801 -9.567 < 2e-16 ***
## duration 0.006787 0.001081 6.281 5.04e-10 ***
## history -0.063290 0.012250 -5.166 2.89e-07 ***
## savings -0.036733 0.008411 -4.367 1.39e-05 ***
## installment 0.030879 0.011649 2.651 0.00816 **
## status -0.050493 0.018404 -2.744 0.00619 **
## others -0.072948 0.027386 -2.664 0.00785 **
## otherplans -0.049104 0.018465 -2.659 0.00796 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4073 on 991 degrees of freedom
## Multiple R-squared: 0.2172, Adjusted R-squared: 0.2109
## F-statistic: 34.37 on 8 and 991 DF, p-value: < 2.2e-16
Lets again check for the colineraity
vif(fit.lm.2)
## checkingstatus1 duration history savings
## 1.111266 1.022862 1.060301 1.063681
## installment status others otherplans
## 1.022836 1.022688 1.030762 1.022322
The model looks good without significant colinerity now lets do Breush Pagan test on the new linear model that we have selected
library(lmtest)
## Warning: package 'lmtest' was built under R version 3.4.1
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(fit.lm.2)
##
## studentized Breusch-Pagan test
##
## data: fit.lm.2
## BP = 96.231, df = 8, p-value < 2.2e-16
we got a more extreme P value we have non constant variance homoscedasticity is one of the assumptions for linear regression so we will drop this and let apply other model.
If you are interested applying linear regression. Kindly do the transformation of variables with log normal and carry further analysis. Based on these facts I am able to know what are the feature affecting the output.
Logistic regression will work fine for the same.So in our next article I will be covering the same.