Thursday, 25 January 2018

Churn Analysis using Logistic Regression, Decision Trees, the C5.0 Algorithm, Random Forest and others

In this article I will show how a machine learning approach can be used to identify, classify and predict customer churn in an organization.

I recently got my IBM Watson Analytics certification and got introduced to a churn analysis dataset. In this article I will perform Churn Analysis using R.

For dataset you can visit:

What is Churn Analysis?

Cohort or churn analysis is normally done on a weekly, monthly or yearly basis to measure the attrition of customers from a website, service or product. The analysis gives us an overview of customer behavior towards the product or service. It helps us to gauge customer satisfaction, quality of service, convenience, competitive pressure and others.

In this article I will apply several machine learning algorithms until I get **90%+ accuracy**.

Now let's look at the dataset.

##   customerID gender SeniorCitizen Partner Dependents tenure PhoneService
## 1 7590-VHVEG Female             0     Yes         No      1           No
## 2 5575-GNVDE   Male             0      No         No     34          Yes
## 3 3668-QPYBK   Male             0      No         No      2          Yes
## 4 7795-CFOCW   Male             0      No         No     45           No
## 5 9237-HQITU Female             0      No         No      2          Yes
## 6 9305-CDSKC Female             0      No         No      8          Yes
##      MultipleLines InternetService OnlineSecurity OnlineBackup
## 1 No phone service             DSL             No          Yes
## 2               No             DSL            Yes           No
## 3               No             DSL            Yes          Yes
## 4 No phone service             DSL            Yes           No
## 5               No     Fiber optic             No           No
## 6              Yes     Fiber optic             No           No
##   DeviceProtection TechSupport StreamingTV StreamingMovies       Contract
## 1               No          No          No              No Month-to-month
## 2              Yes          No          No              No       One year
## 3               No          No          No              No Month-to-month
## 4              Yes         Yes          No              No       One year
## 5               No          No          No              No Month-to-month
## 6              Yes          No         Yes             Yes Month-to-month
##   PaperlessBilling             PaymentMethod MonthlyCharges TotalCharges
## 1              Yes          Electronic check          29.85        29.85
## 2               No              Mailed check          56.95      1889.50
## 3              Yes              Mailed check          53.85       108.15
## 4               No Bank transfer (automatic)          42.30      1840.75
## 5              Yes          Electronic check          70.70       151.65
## 6              Yes          Electronic check          99.65       820.50
##   Churn
## 1    No
## 2    No
## 3   Yes
## 4    No
## 5   Yes
## 6   Yes

Now we will look for missing values and treat them, because the decision tree and random forest models we will apply need the data to be free of missing values.

sapply(df, function(x) sum(is.na(x)))
##       customerID           gender    SeniorCitizen          Partner 
##                0                0                0                0 
##       Dependents           tenure     PhoneService    MultipleLines 
##                0                0                0                0 
##  InternetService   OnlineSecurity     OnlineBackup DeviceProtection 
##                0                0                0                0 
##      TechSupport      StreamingTV  StreamingMovies         Contract 
##                0                0                0                0 
## PaperlessBilling    PaymentMethod   MonthlyCharges     TotalCharges 
##                0                0                0               11 
##            Churn 
##                0

TotalCharges has only 11 missing values, which is not many, so we will omit those rows. We will also remove customerID, as it has no significance for prediction.

##           gender    SeniorCitizen          Partner       Dependents 
##                0                0                0                0 
##           tenure     PhoneService    MultipleLines  InternetService 
##                0                0                0                0 
##   OnlineSecurity     OnlineBackup DeviceProtection      TechSupport 
##                0                0                0                0 
##      StreamingTV  StreamingMovies         Contract PaperlessBilling 
##                0                0                0                0 
##    PaymentMethod   MonthlyCharges     TotalCharges            Churn 
##                0                0                0                0
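The cleaning step described above is not shown in the article. A minimal sketch of how it could be done in base R follows; the small data frame is a made-up stand-in for the real data, and `na.omit` is one possible choice, not necessarily the one used here.

```r
# Toy stand-in for the churn data; the values are made up for illustration.
df <- data.frame(
  customerID   = c("A-1", "B-2", "C-3", "D-4"),
  tenure       = c(1, 34, 0, 45),
  TotalCharges = c(29.85, 1889.50, NA, 1840.75),
  Churn        = c("No", "No", "Yes", "No"),
  stringsAsFactors = TRUE
)

# Count missing values per column before cleaning.
na_before <- sapply(df, function(x) sum(is.na(x)))

# Drop every row that has a missing value, then drop the identifier column.
df <- na.omit(df)
df$customerID <- NULL

# Count again to confirm nothing is missing any more.
na_after <- sapply(df, function(x) sum(is.na(x)))
print(na_before)
print(na_after)
```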

Now split the data into training (75%) and testing (25%) sets and start applying the algorithms. We will start with classification and regression trees.

## [1] 7032
## [1] 5274
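The split itself is not shown. One way to get a 75/25 split with the row counts printed above is sketched below; the seed, the placeholder data frame and the use of `sample` are assumptions, not the article's actual code.

```r
set.seed(123)  # arbitrary seed, only so the sketch is reproducible

# Placeholder data frame with the same number of rows as the cleaned data.
df <- data.frame(id = seq_len(7032))

# Sample 75% of the row indices for training; the remaining rows form the test set.
train_idx <- sample(nrow(df), size = floor(0.75 * nrow(df)))
train <- df[train_idx, , drop = FALSE]
test  <- df[-train_idx, , drop = FALSE]

print(nrow(df))     # total rows
print(nrow(train))  # training rows
print(nrow(test))   # test rows
```

With 7032 rows this gives 5274 training rows and 1758 test rows, which matches the totals in the confusion matrices later on.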

Classification Tree

We will start with two splitting criteria, deviance and gini.

fit.tree_dev<-tree(Churn~.,data = train,split = "deviance")
fit.tree_gini<-tree(Churn~.,data = train,split = "gini")

Let's look at the output of the tree.

## Classification tree:
## tree(formula = Churn ~ ., data = train, split = "deviance")
## Variables actually used in tree construction:
## [1] "Contract"        "StreamingMovies" "InternetService" "tenure"         
## Number of terminal nodes:  7 
## Residual mean deviance:  0.8551 = 4504 / 5267 
## Misclassification error rate: 0.2078 = 1096 / 5274

Even though I fed in every feature, the algorithm found only 4 features to be valuable for prediction. The misclassification error is about 21%.

Let's plot the tree.

As per the tree, we can see that customers with a tenure below 5.5 are moving out of the service.

Now let's look at the gini split.

## Classification tree:
## tree(formula = Churn ~ ., data = train, split = "gini")
## Number of terminal nodes:  505 
## Residual mean deviance:  0.5416 = 2583 / 4769 
## Misclassification error rate: 0.1331 = 702 / 5274

The misclassification error is lower, about 13%, but the tree has 505 terminal nodes, so it is probably overfitted. Let's run cross-validation.

Starting with the deviance tree

Now with gini

We can see the gini tree is overfitted.

Now let's predict with the two trees, starting with deviance.

## Confusion Matrix and Statistics
##           Reference
## Prediction   No  Yes
##        No  1204  306
##        Yes   80  168
##                Accuracy : 0.7804          
##                  95% CI : (0.7603, 0.7996)
##     No Information Rate : 0.7304          
##     P-Value [Acc > NIR] : 7.939e-07       
##                   Kappa : 0.3438          
##  Mcnemar's Test P-Value : < 2.2e-16       
##             Sensitivity : 0.9377          
##             Specificity : 0.3544          
##          Pos Pred Value : 0.7974          
##          Neg Pred Value : 0.6774          
##              Prevalence : 0.7304          
##          Detection Rate : 0.6849          
##    Detection Prevalence : 0.8589          
##       Balanced Accuracy : 0.6461          
##        'Positive' Class : No              
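The headline statistics in this output can be reproduced by hand from the four counts. The sketch below does that in base R, treating "No" as the positive class as the output does; the variable names are illustrative only.

```r
# Counts copied from the deviance-tree confusion matrix above.
# Rows are predictions, columns are the reference (true) labels.
cm <- matrix(c(1204, 80, 306, 168), nrow = 2,
             dimnames = list(Prediction = c("No", "Yes"),
                             Reference  = c("No", "Yes")))

# Accuracy: correct predictions over all predictions.
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity: share of true "No" cases predicted as "No".
sensitivity <- cm["No", "No"] / sum(cm[, "No"])

# Specificity: share of true "Yes" cases predicted as "Yes".
specificity <- cm["Yes", "Yes"] / sum(cm[, "Yes"])

# No information rate: accuracy of always predicting the majority class.
nir <- max(colSums(cm)) / sum(cm)

# Balanced accuracy: mean of sensitivity and specificity.
balanced <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, nir = nir, balanced = balanced), 4))
```

The low specificity is the important number for churn: the tree identifies only about 35% of the customers who actually left.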

Now let's look at gini.

## Confusion Matrix and Statistics
##           Reference
## Prediction   No  Yes
##        No  1049  247
##        Yes  235  227
##                Accuracy : 0.7258          
##                  95% CI : (0.7043, 0.7466)
##     No Information Rate : 0.7304          
##     P-Value [Acc > NIR] : 0.6773          
##                   Kappa : 0.2983          
##  Mcnemar's Test P-Value : 0.6163          
##             Sensitivity : 0.8170          
##             Specificity : 0.4789          
##          Pos Pred Value : 0.8094          
##          Neg Pred Value : 0.4913          
##              Prevalence : 0.7304          
##          Detection Rate : 0.5967          
##    Detection Prevalence : 0.7372          
##       Balanced Accuracy : 0.6479          
##        'Positive' Class : No              

Because the gini tree is overfitted, it does not give good accuracy on the test set: 72.6%, which is below the no-information rate of 73.0%.

C5.0 Decision Tree-Rules and Rules based Model

Now we will look at rule-based models, where we can use Quinlan's C5.0 algorithm. It is a little more advanced than his earlier algorithms such as ID3 and C4.5.

fit_c50<-C5.0(Churn~.,data = train)
## Call:
## C5.0.formula(formula = Churn ~ ., data = train)
## Classification Tree
## Number of samples: 5274 
## Number of predictors: 19 
## Tree size: 18 
## Non-standard options: attempt to group attributes

Let's look at the summary.

## Call:
## C5.0.formula(formula = Churn ~ ., data = train)
## C5.0 [Release 2.07 GPL Edition]      Fri Jan 26 00:04:31 2018
## -------------------------------
## Class specified by attribute `outcome'
## Read 5274 cases (20 attributes) from
## Decision tree:
## Contract in {One year,Two year}: No (2392/154)
## Contract = Month-to-month:
## :...OnlineBackup = No internet service: No (392/65)
##     OnlineBackup in {No,Yes}:
##     :...tenure <= 7:
##         :...InternetService = No: Yes (0)
##         :   InternetService = Fiber optic:
##         :   :...OnlineSecurity in {No,No internet service}: Yes (476/114)
##         :   :   OnlineSecurity = Yes: No (38/16)
##         :   InternetService = DSL:
##         :   :...PaperlessBilling = No: No (183/71)
##         :       PaperlessBilling = Yes:
##         :       :...PhoneService = No: Yes (67/15)
##         :           PhoneService = Yes: No (162/76)
##         tenure > 7:
##         :...InternetService in {DSL,No}: No (508/106)
##             InternetService = Fiber optic:
##             :...tenure <= 17:
##                 :...MonthlyCharges > 80.05: Yes (203/61)
##                 :   MonthlyCharges <= 80.05:
##                 :   :...PaperlessBilling = No: No (24/6)
##                 :       PaperlessBilling = Yes: [S1]
##                 tenure > 17:
##                 :...TechSupport in {No internet service,Yes}: No (179/51)
##                     TechSupport = No:
##                     :...PaperlessBilling = No: No (111/32)
##                         PaperlessBilling = Yes:
##                         :...OnlineSecurity in {No internet service,
##                             :                  Yes}: No (97/36)
##                             OnlineSecurity = No: [S2]
## SubTree [S1]
## PaymentMethod in {Bank transfer (automatic),Credit card (automatic),
## :                 Mailed check}: No (40/15)
## PaymentMethod = Electronic check: Yes (45/17)
## SubTree [S2]
## StreamingMovies = No: No (139/57)
## StreamingMovies in {No internet service,Yes}: Yes (218/92)
## Evaluation on training data (5274 cases):
##      Decision Tree   
##    ----------------  
##    Size      Errors  
##      17  984(18.7%)   <<
##     (a)   (b)    <-classified as
##    ----  ----
##    3580   299    (a): class No
##     685   710    (b): class Yes
##  Attribute usage:
##  100.00% Contract
##   54.65% OnlineBackup
##   47.21% tenure
##   47.21% InternetService
##   20.59% PaperlessBilling
##   18.35% OnlineSecurity
##   14.11% TechSupport
##    6.77% StreamingMovies
##    5.92% MonthlyCharges
##    4.34% PhoneService
##    1.61% PaymentMethod
## Time: 0.1 secs

Let's predict with the tree.

## Confusion Matrix and Statistics
##           Reference
## Prediction   No  Yes
##        No  1148  260
##        Yes  136  214
##                Accuracy : 0.7747          
##                  95% CI : (0.7545, 0.7941)
##     No Information Rate : 0.7304          
##     P-Value [Acc > NIR] : 1.12e-05        
##                   Kappa : 0.3766          
##  Mcnemar's Test P-Value : 6.37e-10        
##             Sensitivity : 0.8941          
##             Specificity : 0.4515          
##          Pos Pred Value : 0.8153          
##          Neg Pred Value : 0.6114          
##              Prevalence : 0.7304          
##          Detection Rate : 0.6530          
##    Detection Prevalence : 0.8009          
##       Balanced Accuracy : 0.6728          
##        'Positive' Class : No              

Conditional Inference Tree

It performs recursive partitioning for continuous, censored, ordered, nominal and multivariate response variables in a conditional inference framework.

fit_ctree<-ctree(Churn~.,data = train)

The plot is a little oversized. Let's predict with the tree.

## Confusion Matrix and Statistics
##           Reference
## Prediction   No  Yes
##        No  1110  216
##        Yes  174  258
##                Accuracy : 0.7782         
##                  95% CI : (0.758, 0.7974)
##     No Information Rate : 0.7304         
##     P-Value [Acc > NIR] : 2.377e-06      
##                   Kappa : 0.4205         
##  Mcnemar's Test P-Value : 0.03788        
##             Sensitivity : 0.8645         
##             Specificity : 0.5443         
##          Pos Pred Value : 0.8371         
##          Neg Pred Value : 0.5972         
##              Prevalence : 0.7304         
##          Detection Rate : 0.6314         
##    Detection Prevalence : 0.7543         
##       Balanced Accuracy : 0.7044         
##        'Positive' Class : No             

Evolutionary Learning of Globally Optimal Trees

It learns a globally optimal classification and regression tree using an evolutionary algorithm.

From the plot we can see that customers with monthly charges below 95, and fiber optic customers with monthly charges below 96.8, have higher churn.

Let's predict.

## Confusion Matrix and Statistics
##           Reference
## Prediction   No  Yes
##        No  1178  271
##        Yes  106  203
##                Accuracy : 0.7856          
##                  95% CI : (0.7656, 0.8045)
##     No Information Rate : 0.7304          
##     P-Value [Acc > NIR] : 5.573e-08       
##                   Kappa : 0.3884          
##  Mcnemar's Test P-Value : < 2.2e-16       
##             Sensitivity : 0.9174          
##             Specificity : 0.4283          
##          Pos Pred Value : 0.8130          
##          Neg Pred Value : 0.6570          
##              Prevalence : 0.7304          
##          Detection Rate : 0.6701          
##    Detection Prevalence : 0.8242          
##       Balanced Accuracy : 0.6729          
##        'Positive' Class : No              

Recursive Partitioning and Regression Tree

rpart has many options, so I will only show a default fit.

rpart_tree<-rpart(Churn~.,data = train)

In the plot you can see how fiber optics, internet, monthly charges and tenure are affecting the churn.

Random Forest

Random forest is one of the most popular algorithms because of its predictive power.

fit_rfmodel<-randomForest(Churn ~., data =test)

You can see how the errors are reduced.

pred_rfmodel<-predict(fit_rfmodel,newdata = test)
## Confusion Matrix and Statistics
##           Reference
## Prediction   No  Yes
##        No  1282   15
##        Yes    2  459
##                Accuracy : 0.9903          
##                  95% CI : (0.9846, 0.9944)
##     No Information Rate : 0.7304          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                   Kappa : 0.9752          
##  Mcnemar's Test P-Value : 0.003609        
##             Sensitivity : 0.9984          
##             Specificity : 0.9684          
##          Pos Pred Value : 0.9884          
##          Neg Pred Value : 0.9957          
##              Prevalence : 0.7304          
##          Detection Rate : 0.7292          
##    Detection Prevalence : 0.7378          
##       Balanced Accuracy : 0.9834          
##        'Positive' Class : No              

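On these numbers random forest looks better than all the other algorithms. Note, however, that fit_rfmodel above was fitted on the test set and then evaluated on the same data, so the 99% accuracy is an in-sample figure and is not directly comparable with the tree results.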

Logistic Regression

I applied logistic regression in my previous article and explained it there.

So let's apply logistic regression and look at the summary.

## [1] 7032
## [1] 5274
## Call:
## glm(formula = Churn ~ ., family = binomial, data = test)
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8118  -0.6771  -0.2902   0.4384   3.2501  
## Coefficients: (7 not defined because of singularities)
##                                        Estimate Std. Error z value
## (Intercept)                           0.8398957  1.6496836   0.509
## genderMale                           -0.0981772  0.1312773  -0.748
## SeniorCitizen                         0.2784775  0.1724575   1.615
## PartnerYes                            0.0113398  0.1534852   0.074
## DependentsYes                        -0.0172424  0.1835659  -0.094
## tenure                               -0.0701558  0.0127746  -5.492
## PhoneServiceYes                       0.1590474  1.3091417   0.121
## MultipleLinesNo phone service                NA         NA      NA
## MultipleLinesYes                      0.4988053  0.3511528   1.420
## InternetServiceFiber optic            1.4332474  1.6038410   0.894
## InternetServiceNo                    -1.5826532  1.6283423  -0.972
## OnlineSecurityNo internet service            NA         NA      NA
## OnlineSecurityYes                    -0.1396124  0.3554801  -0.393
## OnlineBackupNo internet service              NA         NA      NA
## OnlineBackupYes                       0.0562857  0.3462607   0.163
## DeviceProtectionNo internet service          NA         NA      NA
## DeviceProtectionYes                   0.1411947  0.3614722   0.391
## TechSupportNo internet service               NA         NA      NA
## TechSupportYes                       -0.1558189  0.3708652  -0.420
## StreamingTVNo internet service               NA         NA      NA
## StreamingTVYes                        0.2806776  0.6651838   0.422
## StreamingMoviesNo internet service           NA         NA      NA
## StreamingMoviesYes                    0.4735604  0.6567606   0.721
## ContractOne year                     -0.5261040  0.2165376  -2.430
## ContractTwo year                     -1.4350971  0.3787758  -3.789
## PaperlessBillingYes                   0.6422651  0.1535070   4.184
## PaymentMethodCredit card (automatic)  0.1907930  0.2221700   0.859
## PaymentMethodElectronic check         0.3789448  0.1918550   1.975
## PaymentMethodMailed check            -0.0264665  0.2398540  -0.110
## MonthlyCharges                       -0.0356453  0.0639724  -0.557
## TotalCharges                          0.0003890  0.0001456   2.671
##                                      Pr(>|z|)    
## (Intercept)                          0.610664    
## genderMale                           0.454544    
## SeniorCitizen                        0.106363    
## PartnerYes                           0.941104    
## DependentsYes                        0.925164    
## tenure                               3.98e-08 ***
## PhoneServiceYes                      0.903303    
## MultipleLinesNo phone service              NA    
## MultipleLinesYes                     0.155468    
## InternetServiceFiber optic           0.371518    
## InternetServiceNo                    0.331080    
## OnlineSecurityNo internet service          NA    
## OnlineSecurityYes                    0.694509    
## OnlineBackupNo internet service            NA    
## OnlineBackupYes                      0.870871    
## DeviceProtectionNo internet service        NA    
## DeviceProtectionYes                  0.696086    
## TechSupportNo internet service             NA    
## TechSupportYes                       0.674376    
## StreamingTVNo internet service             NA    
## StreamingTVYes                       0.673058    
## StreamingMoviesNo internet service         NA    
## StreamingMoviesYes                   0.470876    
## ContractOne year                     0.015115 *  
## ContractTwo year                     0.000151 ***
## PaperlessBillingYes                  2.86e-05 ***
## PaymentMethodCredit card (automatic) 0.390467    
## PaymentMethodElectronic check        0.048250 *  
## PaymentMethodMailed check            0.912137    
## MonthlyCharges                       0.577392    
## TotalCharges                         0.007559 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Dispersion parameter for binomial family taken to be 1)
##     Null deviance: 1978.3  on 1757  degrees of freedom
## Residual deviance: 1431.5  on 1734  degrees of freedom
## AIC: 1479.5
## Number of Fisher Scoring iterations: 6

Now let's see the prediction.

## gg2    0    1
##   0 1202  220
##   1  116  220
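The article does not show how the predicted probabilities were turned into the 0/1 classes in this table. A minimal sketch of one common approach follows; the 0.5 cut-off and the synthetic one-predictor data are assumptions made for illustration. The last lines compute the accuracy implied by the table above.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Synthetic stand-in data: churn becomes less likely as tenure grows.
n      <- 500
tenure <- sample(1:72, n, replace = TRUE)
churn  <- rbinom(n, size = 1, prob = plogis(1 - 0.07 * tenure))
toy    <- data.frame(tenure = tenure, Churn = churn)

# Fit the logistic regression and get predicted churn probabilities.
fit  <- glm(Churn ~ tenure, family = binomial, data = toy)
prob <- predict(fit, newdata = toy, type = "response")

# Convert probabilities to 0/1 classes with a 0.5 cut-off and tabulate.
pred <- ifelse(prob > 0.5, 1, 0)
conf <- table(Predicted = pred, Actual = toy$Churn)
print(conf)

# Accuracy implied by the table in the article: diagonal over total.
article_tab <- matrix(c(1202, 116, 220, 220), nrow = 2)
article_acc <- sum(diag(article_tab)) / sum(article_tab)
print(round(article_acc, 4))
```

The table above works out to an accuracy of about 81%. Since the model in the summary was fitted with data = test, this is also an in-sample figure.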


Decision trees are easy to understand and have reasonable accuracy. Random forests have good predictive power but are harder to interpret.

By using these we can understand why customers are leaving the service and customer behaviour.