Monday, 23 October 2017

Discriminant Analysis on German Credit Data


Discriminate Analysis on German_credit

In my previous article I have applied Linear Regression (Please Vist:https://experimentswithdatascience.blogspot.in/2017/10/german-credit-linear-regression-analysis_8.html) and tried to understand the data, The reason , I did not continue to predict is because of its lack of predictability. It’s very common that we cannot do classification using linear regression. Before I get into Logistic regression let me try Discriminate Analysis. Now here I will apply LDA and QDA for classification of the Default or no-Default.

Why Discriminate Analysis

It can be a alternative to logistic regression and can do well when the class are well separated unlike logistic which have a risk to underperform. But discriminate analysis may poorly perform when the relationship between the data and the output are high.

Let’s exclude everything and apply discriminant analysis on the data and check how it will perform. Lets assume DA hold good and apply it for classification

Let me start loading the the data into R. And have a look on the structure of the data.

mydata<-read.csv("C:/Users/Sangmesh/Google Drive/Big Data using R/Kaggle/germancredit.csv",header = TRUE)
str(mydata)
## 'data.frame':    1000 obs. of  21 variables:
##  $ Default        : int  0 1 0 0 1 0 0 0 0 1 ...
##  $ checkingstatus1: Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
##  $ duration       : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ history        : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
##  $ purpose        : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
##  $ amount         : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ savings        : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
##  $ employ         : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
##  $ installment    : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ status         : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
##  $ others         : Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
##  $ residence      : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ property       : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
##  $ age            : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ otherplans     : Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ housing        : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
##  $ cards          : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ job            : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
##  $ liable         : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ tele           : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
##  $ foreign        : Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...

As we look at the data and try to figure out the type of structure most of the list are in the factors or integer.

Let’s go ahead and plot it and have a better understanding of a feature called checking status and its type

barplot(table(mydata$checkingstatus1),main ="Plot of checking Status",ylab = "no of observation",xlab = "types of checking status",col = "red")

Now lets look at history

barplot(table(mydata$history),main ="Plot of History",ylab = "no of observation",xlab = "types of History",col = "green")

instead of a barplot if we are interested looking in a table format. then find below

table(mydata$history)
## 
## A30 A31 A32 A33 A34 
##  40  49 530  88 293

Now we have a small problem as we have most of the feature in factors we need to convert it to numeric and make it feasible to compute. So we will do transformation of the data and later do discriminate analysis.

Let’s keep the original data undisturbed and create a new data set called newdata

newdata<-mydata

After creating a new data

We will look at data wangling of newdata that we have recently

table(newdata$checkingstatus1)
## 
## A11 A12 A13 A14 
## 274 269  63 394

now we will remove all A and convert it into a numaric format

newdata$checkingstatus1<-gsub("A11","11",newdata$checkingstatus1)
newdata$checkingstatus1<-gsub("A12","12",newdata$checkingstatus1)
newdata$checkingstatus1<-gsub("A13","13",newdata$checkingstatus1)
newdata$checkingstatus1<-gsub("A14","14",newdata$checkingstatus1)

Now lets check it it done?

table(newdata$checkingstatus1)
## 
##  11  12  13  14 
## 274 269  63 394

now lets look at history

table(newdata$history)
## 
## A30 A31 A32 A33 A34 
##  40  49 530  88 293

And perfrom the same

newdata$history<-gsub("A30","30",newdata$history)
newdata$history<-gsub("A31","31",newdata$history)
newdata$history<-gsub("A32","32",newdata$history)
newdata$history<-gsub("A33","33",newdata$history)
newdata$history<-gsub("A34","34",newdata$history)

lets have a look at the transformed data

table(newdata$history)
## 
##  30  31  32  33  34 
##  40  49 530  88 293

Repete the same process for purpose

table(newdata$purpose)
## 
##  A40  A41 A410  A42  A43  A44  A45  A46  A48  A49 
##  234  103   12  181  280   12   22   50    9   97
newdata$purpose<-gsub("A40","40",newdata$purpose)
newdata$purpose<-gsub("A41","41",newdata$purpose)
newdata$purpose<-gsub("A410","410",newdata$purpose)
newdata$purpose<-gsub("A42","42",newdata$purpose)
newdata$purpose<-gsub("A44","44",newdata$purpose)
newdata$purpose<-gsub("A45","45",newdata$purpose)
newdata$purpose<-gsub("A46","46",newdata$purpose)
newdata$purpose<-gsub("A48","48",newdata$purpose)
newdata$purpose<-gsub("A49","49",newdata$purpose)
newdata$purpose<-gsub("A43","43",newdata$purpose)
table(newdata$purpose)
## 
##  40  41 410  42  43  44  45  46  48  49 
## 234 103  12 181 280  12  22  50   9  97

Lets look at saving

table(newdata$savings)
## 
## A61 A62 A63 A64 A65 
## 603 103  63  48 183
newdata$savings<-gsub("A61","61",newdata$savings)
newdata$savings<-gsub("A62","62",newdata$savings)
newdata$savings<-gsub("A63","63",newdata$savings)
newdata$savings<-gsub("A64","64",newdata$savings)
newdata$savings<-gsub("A65","65",newdata$savings)
table(newdata$savings)
## 
##  61  62  63  64  65 
## 603 103  63  48 183

Lets look ar employ

table(newdata$employ)
## 
## A71 A72 A73 A74 A75 
##  62 172 339 174 253
newdata$employ<-gsub("A71","71",newdata$employ)
newdata$employ<-gsub("A72","72",newdata$employ)
newdata$employ<-gsub("A73","73",newdata$employ)
newdata$employ<-gsub("A74","74",newdata$employ)
newdata$employ<-gsub("A75","75",newdata$employ)
table(newdata$employ)
## 
##  71  72  73  74  75 
##  62 172 339 174 253

Perform the same with all the data list

table(newdata$status)
## 
## A91 A92 A93 A94 
##  50 310 548  92
newdata$status<-gsub("A91","91",newdata$status)
newdata$status<-gsub("A92","92",newdata$status)
newdata$status<-gsub("A93","93",newdata$status)
newdata$status<-gsub("A94","94",newdata$status)
table(newdata$status)
## 
##  91  92  93  94 
##  50 310 548  92
table(newdata$others)
## 
## A101 A102 A103 
##  907   41   52
newdata$others<-gsub("A101","101",newdata$others)
newdata$others<-gsub("A102","102",newdata$others)
newdata$others<-gsub("A103","103",newdata$others)
table(newdata$others)
## 
## 101 102 103 
## 907  41  52
table(newdata$property)
## 
## A121 A122 A123 A124 
##  282  232  332  154
newdata$property<-gsub("A121","121",newdata$property)
newdata$property<-gsub("A122","122",newdata$property)
newdata$property<-gsub("A123","123",newdata$property)
newdata$property<-gsub("A124","124",newdata$property)
table(newdata$property)
## 
## 121 122 123 124 
## 282 232 332 154
table(newdata$otherplans)
## 
## A141 A142 A143 
##  139   47  814
newdata$otherplans<-gsub("A141","141",newdata$otherplans)
newdata$otherplans<-gsub("A142","142",newdata$otherplans)
newdata$otherplans<-gsub("A143","143",newdata$otherplans)
table(newdata$otherplans)
## 
## 141 142 143 
## 139  47 814
table(newdata$housing)
## 
## A151 A152 A153 
##  179  713  108
newdata$housing<-gsub("A151","151",newdata$housing)
newdata$housing<-gsub("A152","152",newdata$housing)
newdata$housing<-gsub("A153","153",newdata$housing)
table(newdata$housing)
## 
## 151 152 153 
## 179 713 108
table(newdata$job)
## 
## A171 A172 A173 A174 
##   22  200  630  148
newdata$job<-gsub("A171","171",newdata$job)
newdata$job<-gsub("A172","172",newdata$job)
newdata$job<-gsub("A173","173",newdata$job)
newdata$job<-gsub("A174","174",newdata$job)
table(newdata$job)
## 
## 171 172 173 174 
##  22 200 630 148
table(newdata$tele)
## 
## A191 A192 
##  596  404
newdata$tele<-gsub("A191","191",newdata$tele)
newdata$tele<-gsub("A192","192",newdata$tele)
table(newdata$tele)
## 
## 191 192 
## 596 404
table(newdata$foreign)
## 
## A201 A202 
##  963   37
newdata$foreign<-gsub("A201","201",newdata$foreign)
newdata$foreign<-gsub("A202","202",newdata$foreign)
table(newdata$foreign)
## 
## 201 202 
## 963  37

After converting character to numeric the r is not going to read it as numeric so we will convert it to numeric

newdata$foreign<-as.numeric(newdata$foreign)
newdata$checkingstatus1<-as.numeric(newdata$checkingstatus1)
newdata$duration<-as.numeric(newdata$duration)
newdata$Default<-as.numeric(newdata$Default)
newdata$history<-as.numeric(newdata$history)
newdata$purpose<-as.numeric(newdata$purpose)
newdata$amount<-as.numeric(newdata$amount)
newdata$savings<-as.numeric(newdata$savings)
newdata$employ<-as.numeric(newdata$employ)
newdata$installment<-as.numeric(newdata$installment)
newdata$status<-as.numeric(newdata$status)
newdata$others<-as.numeric(newdata$others)
newdata$residence<-as.numeric(newdata$residence)
newdata$property<-as.numeric(newdata$property)
newdata$age<-as.numeric(newdata$age)
newdata$otherplans<-as.numeric(newdata$otherplans)
newdata$housing<-as.numeric(newdata$housing)
newdata$cards<-as.numeric(newdata$cards)
newdata$job<-as.numeric(newdata$job)
newdata$liable<-as.numeric(newdata$liable)
newdata$tele<-as.numeric(newdata$tele)
newdata$foreign<-as.numeric(newdata$foreign)
str(newdata)
## 'data.frame':    1000 obs. of  21 variables:
##  $ Default        : num  0 1 0 0 1 0 0 0 0 1 ...
##  $ checkingstatus1: num  11 12 14 11 11 14 14 12 14 12 ...
##  $ duration       : num  6 48 12 42 24 36 24 36 12 30 ...
##  $ history        : num  34 32 34 32 33 32 32 32 32 34 ...
##  $ purpose        : num  43 43 46 42 40 46 42 41 43 40 ...
##  $ amount         : num  1169 5951 2096 7882 4870 ...
##  $ savings        : num  65 61 61 61 61 65 63 61 64 61 ...
##  $ employ         : num  75 73 74 74 73 73 75 73 74 71 ...
##  $ installment    : num  4 2 2 2 3 2 3 2 2 4 ...
##  $ status         : num  93 92 93 93 93 93 93 93 91 94 ...
##  $ others         : num  101 101 101 103 101 101 101 101 101 101 ...
##  $ residence      : num  4 2 3 4 4 4 4 2 4 2 ...
##  $ property       : num  121 121 121 122 124 124 122 123 121 123 ...
##  $ age            : num  67 22 49 45 53 35 53 35 61 28 ...
##  $ otherplans     : num  143 143 143 143 143 143 143 143 143 143 ...
##  $ housing        : num  152 152 152 153 153 153 152 151 152 152 ...
##  $ cards          : num  2 1 1 1 2 1 1 1 1 2 ...
##  $ job            : num  173 173 172 173 173 172 173 174 172 174 ...
##  $ liable         : num  1 1 2 2 2 2 1 1 1 1 ...
##  $ tele           : num  192 191 191 191 191 192 191 192 191 191 ...
##  $ foreign        : num  201 201 201 201 201 201 201 201 201 201 ...

Now we will perform discriminate analysis LDA and QDA methods based on probability method. Before that we will have a look at output and create training and testing set.

table(newdata$Default)
## 
##   0   1 
## 700 300

Everting looks fine. lets set the seed and group into train and test sets

set.seed(123)
ind<-sample(2,nrow(newdata),replace = TRUE,prob = c(0.7,0.3))
train<-newdata[ind==1,]
test<-newdata[ind==2,]

Now look at the structure of test

str(test)
## 'data.frame':    295 obs. of  21 variables:
##  $ Default        : num  1 0 1 0 1 1 0 0 0 0 ...
##  $ checkingstatus1: num  12 11 11 12 12 11 14 14 12 11 ...
##  $ duration       : num  48 42 24 36 12 24 24 9 12 6 ...
##  $ history        : num  32 32 33 32 32 32 32 34 34 32 ...
##  $ purpose        : num  43 42 40 41 40 43 43 40 41 42 ...
##  $ amount         : num  5951 7882 4870 6948 1295 ...
##  $ savings        : num  61 61 61 61 61 62 63 61 62 61 ...
##  $ employ         : num  73 74 73 73 72 73 75 73 72 73 ...
##  $ installment    : num  2 2 3 2 3 4 3 4 3 1 ...
##  $ status         : num  92 93 93 93 92 92 93 93 93 93 ...
##  $ others         : num  101 103 101 101 101 101 101 101 101 101 ...
##  $ residence      : num  2 4 4 2 1 2 2 4 4 2 ...
##  $ property       : num  121 122 124 123 123 123 123 123 122 121 ...
##  $ age            : num  22 45 53 35 25 32 31 48 44 36 ...
##  $ otherplans     : num  143 143 143 143 143 143 143 143 143 141 ...
##  $ housing        : num  152 153 153 151 151 152 152 152 152 152 ...
##  $ cards          : num  1 1 2 1 1 1 1 3 1 1 ...
##  $ job            : num  173 173 173 174 173 172 173 173 173 172 ...
##  $ liable         : num  1 2 2 1 1 1 2 1 1 1 ...
##  $ tele           : num  191 191 191 192 191 191 192 192 191 192 ...
##  $ foreign        : num  201 201 201 201 201 201 201 201 201 201 ...

Lets load some required packages

library(caret)
## Warning: package 'caret' was built under R version 3.4.1
## Loading required package: lattice
## Loading required package: ggplot2
library(MASS)
lda.fit<-lda(Default~.,data = train)
lda.fit
## Call:
## lda(Default ~ ., data = train)
## 
## Prior probabilities of groups:
##        0        1 
## 0.706383 0.293617 
## 
## Group means:
##   checkingstatus1 duration  history  purpose   amount  savings   employ
## 0        12.89357 19.30321 32.68675 45.69679 3006.044 62.34337 73.50803
## 1        11.88889 24.26570 32.13043 48.10628 3914.710 61.70048 73.26570
##   installment   status   others residence property      age otherplans
## 0    2.933735 92.73896 101.1426  2.851406 122.2129 36.53414   142.7651
## 1    3.135266 92.62319 101.1014  2.835749 122.5604 34.23188   142.5749
##    housing    cards      job   liable     tele  foreign
## 0 151.9257 1.423695 172.8755 1.148594 191.4237 201.0502
## 1 151.9275 1.410628 172.9565 1.144928 191.3961 201.0097
## 
## Coefficients of linear discriminants:
##                           LD1
## checkingstatus1 -5.438993e-01
## duration         1.515962e-02
## history         -3.444869e-01
## purpose         -1.943200e-03
## amount           7.735336e-05
## savings         -1.721170e-01
## employ          -6.383013e-02
## installment      2.107677e-01
## status          -2.017602e-01
## others          -3.740777e-01
## residence       -3.105179e-02
## property         1.309194e-01
## age             -1.047138e-02
## otherplans      -2.758089e-01
## housing         -6.983417e-02
## cards            2.798958e-01
## job              7.201059e-02
## liable           2.564522e-01
## tele            -2.401415e-01
## foreign         -5.716642e-01

After looking at this I can clearly estimate the accuracy of the model. As per the model I can clearly say that it has around 70% of the data “0” as per prior probability.

Now let’s predict and try to see, how it perform with the test data

test.lda.probs<-predict(lda.fit,newdata = test)$class

We will use a confusion matrix to see the accuracy of the model

confusionMatrix(test$Default,test.lda.probs)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 176  26
##          1  46  47
##                                           
##                Accuracy : 0.7559          
##                  95% CI : (0.7028, 0.8039)
##     No Information Rate : 0.7525          
##     P-Value [Acc > NIR] : 0.47763         
##                                           
##                   Kappa : 0.3999          
##  Mcnemar's Test P-Value : 0.02514         
##                                           
##             Sensitivity : 0.7928          
##             Specificity : 0.6438          
##          Pos Pred Value : 0.8713          
##          Neg Pred Value : 0.5054          
##              Prevalence : 0.7525          
##          Detection Rate : 0.5966          
##    Detection Prevalence : 0.6847          
##       Balanced Accuracy : 0.7183          
##                                           
##        'Positive' Class : 0               
## 

Now Lets apply QDA

qda.fit<-qda(Default~.,data = train)
qda.fit
## Call:
## qda(Default ~ ., data = train)
## 
## Prior probabilities of groups:
##        0        1 
## 0.706383 0.293617 
## 
## Group means:
##   checkingstatus1 duration  history  purpose   amount  savings   employ
## 0        12.89357 19.30321 32.68675 45.69679 3006.044 62.34337 73.50803
## 1        11.88889 24.26570 32.13043 48.10628 3914.710 61.70048 73.26570
##   installment   status   others residence property      age otherplans
## 0    2.933735 92.73896 101.1426  2.851406 122.2129 36.53414   142.7651
## 1    3.135266 92.62319 101.1014  2.835749 122.5604 34.23188   142.5749
##    housing    cards      job   liable     tele  foreign
## 0 151.9257 1.423695 172.8755 1.148594 191.4237 201.0502
## 1 151.9275 1.410628 172.9565 1.144928 191.3961 201.0097

As we are using prior probability the group of estimate probability will be same with LDA and QDA. But the prediction of both will obviously differ from each other. let’s go ahead and predict

test.qda.probs<-predict(qda.fit,newdata = test)$class
confusionMatrix(test$Default,test.qda.probs)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 153  49
##          1  41  52
##                                          
##                Accuracy : 0.6949         
##                  95% CI : (0.6389, 0.747)
##     No Information Rate : 0.6576         
##     P-Value [Acc > NIR] : 0.0980         
##                                          
##                   Kappa : 0.3094         
##  Mcnemar's Test P-Value : 0.4606         
##                                          
##             Sensitivity : 0.7887         
##             Specificity : 0.5149         
##          Pos Pred Value : 0.7574         
##          Neg Pred Value : 0.5591         
##              Prevalence : 0.6576         
##          Detection Rate : 0.5186         
##    Detection Prevalence : 0.6847         
##       Balanced Accuracy : 0.6518         
##                                          
##        'Positive' Class : 0              
## 

Conclusion:

I do not want to judge Discriminant Analysis is the best classification technique but it has its own charm and using it we have received LDA-75% and QDA-69% accuracy. I do not recommend deploying this until it is near to 100 % accuracy and other parameter .

Even LDA have performed well with the data set and sensitivity and specificity is also improved. If all the models are for section then I will judge based on accuracy, sensitivity, specificity and kappa