Discriminate Analysis on German_credit
Sangamesh K S
October 23, 2017
In my previous article I have applied Linear Regression (Please Vist:https://experimentswithdatascience.blogspot.in/2017/10/german-credit-linear-regression-analysis_8.html) and tried to understand the data, The reason , I did not continue to predict is because of its lack of predictability. It’s very common that we cannot do classification using linear regression. Before I get into Logistic regression let me try Discriminate Analysis. Now here I will apply LDA and QDA for classification of the Default or no-Default.
Why Discriminate Analysis
It can be a alternative to logistic regression and can do well when the class are well separated unlike logistic which have a risk to underperform. But discriminate analysis may poorly perform when the relationship between the data and the output are high.
Let’s exclude everything and apply discriminant analysis on the data and check how it will perform. Lets assume DA hold good and apply it for classification
Let me start loading the the data into R. And have a look on the structure of the data.
mydata<-read.csv("C:/Users/Sangmesh/Google Drive/Big Data using R/Kaggle/germancredit.csv",header = TRUE)
str(mydata)
## 'data.frame': 1000 obs. of 21 variables:
## $ Default : int 0 1 0 0 1 0 0 0 0 1 ...
## $ checkingstatus1: Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
## $ duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ history : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
## $ purpose : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
## $ amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ savings : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
## $ employ : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
## $ installment : int 4 2 2 2 3 2 3 2 2 4 ...
## $ status : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
## $ others : Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
## $ residence : int 4 2 3 4 4 4 4 2 4 2 ...
## $ property : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
## $ age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ otherplans : Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ housing : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
## $ cards : int 2 1 1 1 2 1 1 1 1 2 ...
## $ job : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
## $ liable : int 1 1 2 2 2 2 1 1 1 1 ...
## $ tele : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
## $ foreign : Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...
As we look at the data and try to figure out the type of structure most of the list are in the factors or integer.
Let’s go ahead and plot it and have a better understanding of a feature called checking status and its type
barplot(table(mydata$checkingstatus1),main ="Plot of checking Status",ylab = "no of observation",xlab = "types of checking status",col = "red")
Now lets look at history
barplot(table(mydata$history),main ="Plot of History",ylab = "no of observation",xlab = "types of History",col = "green")
instead of a barplot if we are interested looking in a table format. then find below
table(mydata$history)
##
## A30 A31 A32 A33 A34
## 40 49 530 88 293
Now we have a small problem as we have most of the feature in factors we need to convert it to numeric and make it feasible to compute. So we will do transformation of the data and later do discriminate analysis.
Let’s keep the original data undisturbed and create a new data set called newdata
newdata<-mydata
After creating a new data
We will look at data wangling of newdata that we have recently
table(newdata$checkingstatus1)
##
## A11 A12 A13 A14
## 274 269 63 394
now we will remove all A and convert it into a numaric format
newdata$checkingstatus1<-gsub("A11","11",newdata$checkingstatus1)
newdata$checkingstatus1<-gsub("A12","12",newdata$checkingstatus1)
newdata$checkingstatus1<-gsub("A13","13",newdata$checkingstatus1)
newdata$checkingstatus1<-gsub("A14","14",newdata$checkingstatus1)
Now lets check it it done?
table(newdata$checkingstatus1)
##
## 11 12 13 14
## 274 269 63 394
now lets look at history
table(newdata$history)
##
## A30 A31 A32 A33 A34
## 40 49 530 88 293
And perfrom the same
newdata$history<-gsub("A30","30",newdata$history)
newdata$history<-gsub("A31","31",newdata$history)
newdata$history<-gsub("A32","32",newdata$history)
newdata$history<-gsub("A33","33",newdata$history)
newdata$history<-gsub("A34","34",newdata$history)
lets have a look at the transformed data
table(newdata$history)
##
## 30 31 32 33 34
## 40 49 530 88 293
Repete the same process for purpose
table(newdata$purpose)
##
## A40 A41 A410 A42 A43 A44 A45 A46 A48 A49
## 234 103 12 181 280 12 22 50 9 97
newdata$purpose<-gsub("A40","40",newdata$purpose)
newdata$purpose<-gsub("A41","41",newdata$purpose)
newdata$purpose<-gsub("A410","410",newdata$purpose)
newdata$purpose<-gsub("A42","42",newdata$purpose)
newdata$purpose<-gsub("A44","44",newdata$purpose)
newdata$purpose<-gsub("A45","45",newdata$purpose)
newdata$purpose<-gsub("A46","46",newdata$purpose)
newdata$purpose<-gsub("A48","48",newdata$purpose)
newdata$purpose<-gsub("A49","49",newdata$purpose)
newdata$purpose<-gsub("A43","43",newdata$purpose)
table(newdata$purpose)
##
## 40 41 410 42 43 44 45 46 48 49
## 234 103 12 181 280 12 22 50 9 97
Lets look at saving
table(newdata$savings)
##
## A61 A62 A63 A64 A65
## 603 103 63 48 183
newdata$savings<-gsub("A61","61",newdata$savings)
newdata$savings<-gsub("A62","62",newdata$savings)
newdata$savings<-gsub("A63","63",newdata$savings)
newdata$savings<-gsub("A64","64",newdata$savings)
newdata$savings<-gsub("A65","65",newdata$savings)
table(newdata$savings)
##
## 61 62 63 64 65
## 603 103 63 48 183
Lets look ar employ
table(newdata$employ)
##
## A71 A72 A73 A74 A75
## 62 172 339 174 253
newdata$employ<-gsub("A71","71",newdata$employ)
newdata$employ<-gsub("A72","72",newdata$employ)
newdata$employ<-gsub("A73","73",newdata$employ)
newdata$employ<-gsub("A74","74",newdata$employ)
newdata$employ<-gsub("A75","75",newdata$employ)
table(newdata$employ)
##
## 71 72 73 74 75
## 62 172 339 174 253
Perform the same with all the data list
table(newdata$status)
##
## A91 A92 A93 A94
## 50 310 548 92
newdata$status<-gsub("A91","91",newdata$status)
newdata$status<-gsub("A92","92",newdata$status)
newdata$status<-gsub("A93","93",newdata$status)
newdata$status<-gsub("A94","94",newdata$status)
table(newdata$status)
##
## 91 92 93 94
## 50 310 548 92
table(newdata$others)
##
## A101 A102 A103
## 907 41 52
newdata$others<-gsub("A101","101",newdata$others)
newdata$others<-gsub("A102","102",newdata$others)
newdata$others<-gsub("A103","103",newdata$others)
table(newdata$others)
##
## 101 102 103
## 907 41 52
table(newdata$property)
##
## A121 A122 A123 A124
## 282 232 332 154
newdata$property<-gsub("A121","121",newdata$property)
newdata$property<-gsub("A122","122",newdata$property)
newdata$property<-gsub("A123","123",newdata$property)
newdata$property<-gsub("A124","124",newdata$property)
table(newdata$property)
##
## 121 122 123 124
## 282 232 332 154
table(newdata$otherplans)
##
## A141 A142 A143
## 139 47 814
newdata$otherplans<-gsub("A141","141",newdata$otherplans)
newdata$otherplans<-gsub("A142","142",newdata$otherplans)
newdata$otherplans<-gsub("A143","143",newdata$otherplans)
table(newdata$otherplans)
##
## 141 142 143
## 139 47 814
table(newdata$housing)
##
## A151 A152 A153
## 179 713 108
newdata$housing<-gsub("A151","151",newdata$housing)
newdata$housing<-gsub("A152","152",newdata$housing)
newdata$housing<-gsub("A153","153",newdata$housing)
table(newdata$housing)
##
## 151 152 153
## 179 713 108
table(newdata$job)
##
## A171 A172 A173 A174
## 22 200 630 148
newdata$job<-gsub("A171","171",newdata$job)
newdata$job<-gsub("A172","172",newdata$job)
newdata$job<-gsub("A173","173",newdata$job)
newdata$job<-gsub("A174","174",newdata$job)
table(newdata$job)
##
## 171 172 173 174
## 22 200 630 148
table(newdata$tele)
##
## A191 A192
## 596 404
newdata$tele<-gsub("A191","191",newdata$tele)
newdata$tele<-gsub("A192","192",newdata$tele)
table(newdata$tele)
##
## 191 192
## 596 404
table(newdata$foreign)
##
## A201 A202
## 963 37
newdata$foreign<-gsub("A201","201",newdata$foreign)
newdata$foreign<-gsub("A202","202",newdata$foreign)
table(newdata$foreign)
##
## 201 202
## 963 37
After converting character to numeric the r is not going to read it as numeric so we will convert it to numeric
newdata$foreign<-as.numeric(newdata$foreign)
newdata$checkingstatus1<-as.numeric(newdata$checkingstatus1)
newdata$duration<-as.numeric(newdata$duration)
newdata$Default<-as.numeric(newdata$Default)
newdata$history<-as.numeric(newdata$history)
newdata$purpose<-as.numeric(newdata$purpose)
newdata$amount<-as.numeric(newdata$amount)
newdata$savings<-as.numeric(newdata$savings)
newdata$employ<-as.numeric(newdata$employ)
newdata$installment<-as.numeric(newdata$installment)
newdata$status<-as.numeric(newdata$status)
newdata$others<-as.numeric(newdata$others)
newdata$residence<-as.numeric(newdata$residence)
newdata$property<-as.numeric(newdata$property)
newdata$age<-as.numeric(newdata$age)
newdata$otherplans<-as.numeric(newdata$otherplans)
newdata$housing<-as.numeric(newdata$housing)
newdata$cards<-as.numeric(newdata$cards)
newdata$job<-as.numeric(newdata$job)
newdata$liable<-as.numeric(newdata$liable)
newdata$tele<-as.numeric(newdata$tele)
newdata$foreign<-as.numeric(newdata$foreign)
str(newdata)
## 'data.frame': 1000 obs. of 21 variables:
## $ Default : num 0 1 0 0 1 0 0 0 0 1 ...
## $ checkingstatus1: num 11 12 14 11 11 14 14 12 14 12 ...
## $ duration : num 6 48 12 42 24 36 24 36 12 30 ...
## $ history : num 34 32 34 32 33 32 32 32 32 34 ...
## $ purpose : num 43 43 46 42 40 46 42 41 43 40 ...
## $ amount : num 1169 5951 2096 7882 4870 ...
## $ savings : num 65 61 61 61 61 65 63 61 64 61 ...
## $ employ : num 75 73 74 74 73 73 75 73 74 71 ...
## $ installment : num 4 2 2 2 3 2 3 2 2 4 ...
## $ status : num 93 92 93 93 93 93 93 93 91 94 ...
## $ others : num 101 101 101 103 101 101 101 101 101 101 ...
## $ residence : num 4 2 3 4 4 4 4 2 4 2 ...
## $ property : num 121 121 121 122 124 124 122 123 121 123 ...
## $ age : num 67 22 49 45 53 35 53 35 61 28 ...
## $ otherplans : num 143 143 143 143 143 143 143 143 143 143 ...
## $ housing : num 152 152 152 153 153 153 152 151 152 152 ...
## $ cards : num 2 1 1 1 2 1 1 1 1 2 ...
## $ job : num 173 173 172 173 173 172 173 174 172 174 ...
## $ liable : num 1 1 2 2 2 2 1 1 1 1 ...
## $ tele : num 192 191 191 191 191 192 191 192 191 191 ...
## $ foreign : num 201 201 201 201 201 201 201 201 201 201 ...
Now we will perform discriminate analysis LDA and QDA methods based on probability method. Before that we will have a look at output and create training and testing set.
table(newdata$Default)
##
## 0 1
## 700 300
Everting looks fine. lets set the seed and group into train and test sets
set.seed(123)
ind<-sample(2,nrow(newdata),replace = TRUE,prob = c(0.7,0.3))
train<-newdata[ind==1,]
test<-newdata[ind==2,]
Now look at the structure of test
str(test)
## 'data.frame': 295 obs. of 21 variables:
## $ Default : num 1 0 1 0 1 1 0 0 0 0 ...
## $ checkingstatus1: num 12 11 11 12 12 11 14 14 12 11 ...
## $ duration : num 48 42 24 36 12 24 24 9 12 6 ...
## $ history : num 32 32 33 32 32 32 32 34 34 32 ...
## $ purpose : num 43 42 40 41 40 43 43 40 41 42 ...
## $ amount : num 5951 7882 4870 6948 1295 ...
## $ savings : num 61 61 61 61 61 62 63 61 62 61 ...
## $ employ : num 73 74 73 73 72 73 75 73 72 73 ...
## $ installment : num 2 2 3 2 3 4 3 4 3 1 ...
## $ status : num 92 93 93 93 92 92 93 93 93 93 ...
## $ others : num 101 103 101 101 101 101 101 101 101 101 ...
## $ residence : num 2 4 4 2 1 2 2 4 4 2 ...
## $ property : num 121 122 124 123 123 123 123 123 122 121 ...
## $ age : num 22 45 53 35 25 32 31 48 44 36 ...
## $ otherplans : num 143 143 143 143 143 143 143 143 143 141 ...
## $ housing : num 152 153 153 151 151 152 152 152 152 152 ...
## $ cards : num 1 1 2 1 1 1 1 3 1 1 ...
## $ job : num 173 173 173 174 173 172 173 173 173 172 ...
## $ liable : num 1 2 2 1 1 1 2 1 1 1 ...
## $ tele : num 191 191 191 192 191 191 192 192 191 192 ...
## $ foreign : num 201 201 201 201 201 201 201 201 201 201 ...
Lets load some required packages
library(caret)
## Warning: package 'caret' was built under R version 3.4.1
## Loading required package: lattice
## Loading required package: ggplot2
library(MASS)
lda.fit<-lda(Default~.,data = train)
lda.fit
## Call:
## lda(Default ~ ., data = train)
##
## Prior probabilities of groups:
## 0 1
## 0.706383 0.293617
##
## Group means:
## checkingstatus1 duration history purpose amount savings employ
## 0 12.89357 19.30321 32.68675 45.69679 3006.044 62.34337 73.50803
## 1 11.88889 24.26570 32.13043 48.10628 3914.710 61.70048 73.26570
## installment status others residence property age otherplans
## 0 2.933735 92.73896 101.1426 2.851406 122.2129 36.53414 142.7651
## 1 3.135266 92.62319 101.1014 2.835749 122.5604 34.23188 142.5749
## housing cards job liable tele foreign
## 0 151.9257 1.423695 172.8755 1.148594 191.4237 201.0502
## 1 151.9275 1.410628 172.9565 1.144928 191.3961 201.0097
##
## Coefficients of linear discriminants:
## LD1
## checkingstatus1 -5.438993e-01
## duration 1.515962e-02
## history -3.444869e-01
## purpose -1.943200e-03
## amount 7.735336e-05
## savings -1.721170e-01
## employ -6.383013e-02
## installment 2.107677e-01
## status -2.017602e-01
## others -3.740777e-01
## residence -3.105179e-02
## property 1.309194e-01
## age -1.047138e-02
## otherplans -2.758089e-01
## housing -6.983417e-02
## cards 2.798958e-01
## job 7.201059e-02
## liable 2.564522e-01
## tele -2.401415e-01
## foreign -5.716642e-01
After looking at this I can clearly estimate the accuracy of the model. As per the model I can clearly say that it has around 70% of the data “0” as per prior probability.
Now let’s predict and try to see, how it perform with the test data
test.lda.probs<-predict(lda.fit,newdata = test)$class
We will use a confusion matrix to see the accuracy of the model
confusionMatrix(test$Default,test.lda.probs)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 176 26
## 1 46 47
##
## Accuracy : 0.7559
## 95% CI : (0.7028, 0.8039)
## No Information Rate : 0.7525
## P-Value [Acc > NIR] : 0.47763
##
## Kappa : 0.3999
## Mcnemar's Test P-Value : 0.02514
##
## Sensitivity : 0.7928
## Specificity : 0.6438
## Pos Pred Value : 0.8713
## Neg Pred Value : 0.5054
## Prevalence : 0.7525
## Detection Rate : 0.5966
## Detection Prevalence : 0.6847
## Balanced Accuracy : 0.7183
##
## 'Positive' Class : 0
##
Now Lets apply QDA
qda.fit<-qda(Default~.,data = train)
qda.fit
## Call:
## qda(Default ~ ., data = train)
##
## Prior probabilities of groups:
## 0 1
## 0.706383 0.293617
##
## Group means:
## checkingstatus1 duration history purpose amount savings employ
## 0 12.89357 19.30321 32.68675 45.69679 3006.044 62.34337 73.50803
## 1 11.88889 24.26570 32.13043 48.10628 3914.710 61.70048 73.26570
## installment status others residence property age otherplans
## 0 2.933735 92.73896 101.1426 2.851406 122.2129 36.53414 142.7651
## 1 3.135266 92.62319 101.1014 2.835749 122.5604 34.23188 142.5749
## housing cards job liable tele foreign
## 0 151.9257 1.423695 172.8755 1.148594 191.4237 201.0502
## 1 151.9275 1.410628 172.9565 1.144928 191.3961 201.0097
As we are using prior probability the group of estimate probability will be same with LDA and QDA. But the prediction of both will obviously differ from each other. let’s go ahead and predict
test.qda.probs<-predict(qda.fit,newdata = test)$class
confusionMatrix(test$Default,test.qda.probs)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 153 49
## 1 41 52
##
## Accuracy : 0.6949
## 95% CI : (0.6389, 0.747)
## No Information Rate : 0.6576
## P-Value [Acc > NIR] : 0.0980
##
## Kappa : 0.3094
## Mcnemar's Test P-Value : 0.4606
##
## Sensitivity : 0.7887
## Specificity : 0.5149
## Pos Pred Value : 0.7574
## Neg Pred Value : 0.5591
## Prevalence : 0.6576
## Detection Rate : 0.5186
## Detection Prevalence : 0.6847
## Balanced Accuracy : 0.6518
##
## 'Positive' Class : 0
##
Conclusion:
I do not want to judge Discriminant Analysis is the best classification technique but it has its own charm and using it we have received LDA-75% and QDA-69% accuracy. I do not recommend deploying this until it is near to 100 % accuracy and other parameter .
Even LDA have performed well with the data set and sensitivity and specificity is also improved. If all the models are for section then I will judge based on accuracy, sensitivity, specificity and kappa