Churn Analysis using Logistic Regression, Decision Trees, C5.0, Random Forest and Others
Sangamesh K S
January 25, 2018
Introduction
In this article I will show how a machine learning approach can be used to identify, classify and predict customer churn in an organization.
I recently earned my IBM Watson Analytics certification and was introduced to a churn analysis dataset along the way. In this article I will perform churn analysis on that dataset using R.
The dataset is available at: https://www.ibm.com/communities/analytics/watson-analytics-blog/predictive-insights-in-the-telco-customer-churn-data-set/
What is Churn Analysis?
Churn analysis, often run as a cohort analysis, is normally done on a weekly, monthly or yearly basis to measure customer attrition from a website, service or product. It gives us an overview of customer behavior towards the product or service, and it helps us reason about customer satisfaction, quality of service, convenience, competitive pressure and other factors.
In this article I will keep applying machine learning algorithms until I get above 90% accuracy.
Now let's look at the dataset.
## customerID gender SeniorCitizen Partner Dependents tenure PhoneService
## 1 7590-VHVEG Female 0 Yes No 1 No
## 2 5575-GNVDE Male 0 No No 34 Yes
## 3 3668-QPYBK Male 0 No No 2 Yes
## 4 7795-CFOCW Male 0 No No 45 No
## 5 9237-HQITU Female 0 No No 2 Yes
## 6 9305-CDSKC Female 0 No No 8 Yes
## MultipleLines InternetService OnlineSecurity OnlineBackup
## 1 No phone service DSL No Yes
## 2 No DSL Yes No
## 3 No DSL Yes Yes
## 4 No phone service DSL Yes No
## 5 No Fiber optic No No
## 6 Yes Fiber optic No No
## DeviceProtection TechSupport StreamingTV StreamingMovies Contract
## 1 No No No No Month-to-month
## 2 Yes No No No One year
## 3 No No No No Month-to-month
## 4 Yes Yes No No One year
## 5 No No No No Month-to-month
## 6 Yes No Yes Yes Month-to-month
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## 1 Yes Electronic check 29.85 29.85
## 2 No Mailed check 56.95 1889.50
## 3 Yes Mailed check 53.85 108.15
## 4 No Bank transfer (automatic) 42.30 1840.75
## 5 Yes Electronic check 70.70 151.65
## 6 Yes Electronic check 99.65 820.50
## Churn
## 1 No
## 2 No
## 3 Yes
## 4 No
## 5 Yes
## 6 Yes
Next we look for missing values and treat them, because the decision tree and random forest models we will apply need the data to be free of missing values.
sapply(df, function(x) sum(is.na(x)))
## customerID gender SeniorCitizen Partner
## 0 0 0 0
## Dependents tenure PhoneService MultipleLines
## 0 0 0 0
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## 0 0 0 0
## TechSupport StreamingTV StreamingMovies Contract
## 0 0 0 0
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## 0 0 0 11
## Churn
## 0
TotalCharges has only 11 missing values, which is a small number, so we will drop those rows. We will also remove customerID, since it has no significance for prediction.
## gender SeniorCitizen Partner Dependents
## 0 0 0 0
## tenure PhoneService MultipleLines InternetService
## 0 0 0 0
## OnlineSecurity OnlineBackup DeviceProtection TechSupport
## 0 0 0 0
## StreamingTV StreamingMovies Contract PaperlessBilling
## 0 0 0 0
## PaymentMethod MonthlyCharges TotalCharges Churn
## 0 0 0 0
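The code for this clean-up step is not shown in the article. A minimal sketch of one way to do it in base R, using a small made-up data frame as a stand-in for the Telco data (the values below are illustrative, not taken from the dataset):

```r
# Toy stand-in for the Telco data: four columns, one missing TotalCharges.
df <- data.frame(
  customerID   = c("A-1", "B-2", "C-3", "D-4"),
  tenure       = c(1, 34, 0, 45),
  TotalCharges = c(29.85, 1889.50, NA, 1840.75),
  Churn        = factor(c("No", "No", "Yes", "No"))
)

# Drop every row that contains at least one missing value.
df <- na.omit(df)

# Drop the identifier column; it carries no predictive information.
df$customerID <- NULL

# Re-check: every remaining column should report zero missing values.
sapply(df, function(x) sum(is.na(x)))
```

Dropping rows is reasonable here only because so few are affected; with more missing data, imputation would be worth considering.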
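Now split the data into training (75%) and testing (25%) sets and start applying the algorithms. We will start with classification and regression trees. The two numbers below are the total number of rows and the number of rows in the training set.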
## [1] 7032
## [1] 5274
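The splitting code is hidden in the article. A minimal sketch of a random 75/25 split in base R that gives the same row counts; the seed and the placeholder data frame are assumptions made for illustration:

```r
set.seed(123)  # assumed seed, only so the split is reproducible

# Placeholder data frame with the same number of rows as the cleaned data.
df <- data.frame(id = seq_len(7032))

# Draw 75% of the row indices for training; the rest become the test set.
train_idx <- sample(nrow(df), size = floor(0.75 * nrow(df)))
train <- df[train_idx, , drop = FALSE]
test  <- df[-train_idx, , drop = FALSE]

nrow(df)     # 7032
nrow(train)  # 5274
nrow(test)   # 1758
```

The 1758 test rows match the totals of the confusion matrices reported for the tree models.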
Classification Tree
We will start with two splitting criteria, deviance and Gini.
library(tree)
# Fit one classification tree per splitting criterion on the training data
fit.tree_dev<-tree(Churn~.,data = train,split = "deviance")
fit.tree_gini<-tree(Churn~.,data = train,split = "gini")
Let's look at the summary of the deviance tree.
summary(fit.tree_dev)
##
## Classification tree:
## tree(formula = Churn ~ ., data = train, split = "deviance")
## Variables actually used in tree construction:
## [1] "Contract" "StreamingMovies" "InternetService" "tenure"
## Number of terminal nodes: 7
## Residual mean deviance: 0.8551 = 4504 / 5267
## Misclassification error rate: 0.2078 = 1096 / 5274
Even though I fed in every feature, the algorithm used only 4 of them to build the tree: Contract, StreamingMovies, InternetService and tenure. The misclassification error on the training data is about 21%.
Let's plot the tree.
According to the tree, customers with a tenure below 5.5 are the ones leaving the service.
Now let's look at the Gini split.
summary(fit.tree_gini)
##
## Classification tree:
## tree(formula = Churn ~ ., data = train, split = "gini")
## Number of terminal nodes: 505
## Residual mean deviance: 0.5416 = 2583 / 4769
## Misclassification error rate: 0.1331 = 702 / 5274
The misclassification error is lower, about 13%, but the tree has 505 terminal nodes, so it is very likely overfitted. Let's run cross-validation,
starting with the deviance tree.
Now with Gini.
We can see that the Gini tree is overfitted.
Now let's predict on the test set with the two trees, starting with the deviance tree.
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1204 306
## Yes 80 168
##
## Accuracy : 0.7804
## 95% CI : (0.7603, 0.7996)
## No Information Rate : 0.7304
## P-Value [Acc > NIR] : 7.939e-07
##
## Kappa : 0.3438
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9377
## Specificity : 0.3544
## Pos Pred Value : 0.7974
## Neg Pred Value : 0.6774
## Prevalence : 0.7304
## Detection Rate : 0.6849
## Detection Prevalence : 0.8589
## Balanced Accuracy : 0.6461
##
## 'Positive' Class : No
##
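The headline statistics in this output follow directly from the four cell counts. A short sketch in base R that recomputes them from the matrix above, keeping in mind that the reported positive class is "No":

```r
# Confusion matrix copied from the deviance tree output above
# (rows = predicted class, columns = reference class).
cm <- matrix(c(1204, 80, 306, 168), nrow = 2,
             dimnames = list(Prediction = c("No", "Yes"),
                             Reference  = c("No", "Yes")))

accuracy    <- sum(diag(cm)) / sum(cm)              # (1204 + 168) / 1758
sensitivity <- cm["No", "No"]   / sum(cm[, "No"])   # 1204 / 1284
specificity <- cm["Yes", "Yes"] / sum(cm[, "Yes"])  # 168 / 474
balanced    <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced = balanced), 4)
```

Because "No" is the positive class, the specificity is the share of actual churners the tree catches, roughly 35% here, which is worth keeping in mind next to the 78% accuracy.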
Now let's look at the Gini tree.
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1049 247
## Yes 235 227
##
## Accuracy : 0.7258
## 95% CI : (0.7043, 0.7466)
## No Information Rate : 0.7304
## P-Value [Acc > NIR] : 0.6773
##
## Kappa : 0.2983
## Mcnemar's Test P-Value : 0.6163
##
## Sensitivity : 0.8170
## Specificity : 0.4789
## Pos Pred Value : 0.8094
## Neg Pred Value : 0.4913
## Prevalence : 0.7304
## Detection Rate : 0.5967
## Detection Prevalence : 0.7372
## Balanced Accuracy : 0.6479
##
## 'Positive' Class : No
##
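Because the Gini tree is overfitted, its test accuracy (72.6%) is lower than that of the deviance tree (78.0%) and even slightly below the no-information rate (73.0%).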
C5.0 Decision Trees and Rule-Based Models
Now we will look at C5.0, Quinlan’s algorithm for decision trees and rule-based models. It is a more advanced successor to his earlier ID3 and C4.5 algorithms; other rule learners include RIPPER and OneR.
library(C50)
fit_c50<-C5.0(Churn~.,data = train)
fit_c50
##
## Call:
## C5.0.formula(formula = Churn ~ ., data = train)
##
## Classification Tree
## Number of samples: 5274
## Number of predictors: 19
##
## Tree size: 18
##
## Non-standard options: attempt to group attributes
Let's look at the summary.
summary(fit_c50)
##
## Call:
## C5.0.formula(formula = Churn ~ ., data = train)
##
##
## C5.0 [Release 2.07 GPL Edition] Fri Jan 26 00:04:31 2018
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 5274 cases (20 attributes) from undefined.data
##
## Decision tree:
##
## Contract in {One year,Two year}: No (2392/154)
## Contract = Month-to-month:
## :...OnlineBackup = No internet service: No (392/65)
## OnlineBackup in {No,Yes}:
## :...tenure <= 7:
## :...InternetService = No: Yes (0)
## : InternetService = Fiber optic:
## : :...OnlineSecurity in {No,No internet service}: Yes (476/114)
## : : OnlineSecurity = Yes: No (38/16)
## : InternetService = DSL:
## : :...PaperlessBilling = No: No (183/71)
## : PaperlessBilling = Yes:
## : :...PhoneService = No: Yes (67/15)
## : PhoneService = Yes: No (162/76)
## tenure > 7:
## :...InternetService in {DSL,No}: No (508/106)
## InternetService = Fiber optic:
## :...tenure <= 17:
## :...MonthlyCharges > 80.05: Yes (203/61)
## : MonthlyCharges <= 80.05:
## : :...PaperlessBilling = No: No (24/6)
## : PaperlessBilling = Yes: [S1]
## tenure > 17:
## :...TechSupport in {No internet service,Yes}: No (179/51)
## TechSupport = No:
## :...PaperlessBilling = No: No (111/32)
## PaperlessBilling = Yes:
## :...OnlineSecurity in {No internet service,
## : Yes}: No (97/36)
## OnlineSecurity = No: [S2]
##
## SubTree [S1]
##
## PaymentMethod in {Bank transfer (automatic),Credit card (automatic),
## : Mailed check}: No (40/15)
## PaymentMethod = Electronic check: Yes (45/17)
##
## SubTree [S2]
##
## StreamingMovies = No: No (139/57)
## StreamingMovies in {No internet service,Yes}: Yes (218/92)
##
##
## Evaluation on training data (5274 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 17 984(18.7%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 3580 299 (a): class No
## 685 710 (b): class Yes
##
##
## Attribute usage:
##
## 100.00% Contract
## 54.65% OnlineBackup
## 47.21% tenure
## 47.21% InternetService
## 20.59% PaperlessBilling
## 18.35% OnlineSecurity
## 14.11% TechSupport
## 6.77% StreamingMovies
## 5.92% MonthlyCharges
## 4.34% PhoneService
## 1.61% PaymentMethod
##
##
## Time: 0.1 secs
Let's predict on the test set with the tree.
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1148 260
## Yes 136 214
##
## Accuracy : 0.7747
## 95% CI : (0.7545, 0.7941)
## No Information Rate : 0.7304
## P-Value [Acc > NIR] : 1.12e-05
##
## Kappa : 0.3766
## Mcnemar's Test P-Value : 6.37e-10
##
## Sensitivity : 0.8941
## Specificity : 0.4515
## Pos Pred Value : 0.8153
## Neg Pred Value : 0.6114
## Prevalence : 0.7304
## Detection Rate : 0.6530
## Detection Prevalence : 0.8009
## Balanced Accuracy : 0.6728
##
## 'Positive' Class : No
##
Conditional Inference Tree
A conditional inference tree performs recursive partitioning for continuous, censored, ordered, nominal and multivariate response variables in a conditional inference framework.
fit_ctree<-ctree(Churn~.,data = train)
plot(fit_ctree)
The plot is a little oversized. Let's predict on the test set with the tree.
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1110 216
## Yes 174 258
##
## Accuracy : 0.7782
## 95% CI : (0.758, 0.7974)
## No Information Rate : 0.7304
## P-Value [Acc > NIR] : 2.377e-06
##
## Kappa : 0.4205
## Mcnemar's Test P-Value : 0.03788
##
## Sensitivity : 0.8645
## Specificity : 0.5443
## Pos Pred Value : 0.8371
## Neg Pred Value : 0.5972
## Prevalence : 0.7304
## Detection Rate : 0.6314
## Detection Prevalence : 0.7543
## Balanced Accuracy : 0.7044
##
## 'Positive' Class : No
##
Evolutionary Learning of Globally Optimal Trees
This method searches for a globally optimal classification and regression tree using an evolutionary algorithm.
From the plot we can see that customers with monthly charges below 95, and fiber optic customers with monthly charges below 96.8, have higher churn.
Let's predict on the test set.
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1178 271
## Yes 106 203
##
## Accuracy : 0.7856
## 95% CI : (0.7656, 0.8045)
## No Information Rate : 0.7304
## P-Value [Acc > NIR] : 5.573e-08
##
## Kappa : 0.3884
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9174
## Specificity : 0.4283
## Pos Pred Value : 0.8130
## Neg Pred Value : 0.6570
## Prevalence : 0.7304
## Detection Rate : 0.6701
## Detection Prevalence : 0.8242
## Balanced Accuracy : 0.6729
##
## 'Positive' Class : No
##
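Recursive Partitioning and Regression Trees
rpart has a large number of options, so I will only show a fit with the default settings.
library(rpart)
library(partykit)  # as.party() converts the rpart fit for plotting
rpart_tree<-rpart(Churn~.,data = train)
plot(as.party(rpart_tree))
In the plot you can see how fiber optic internet service, monthly charges and tenure affect churn.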
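The article does not score the rpart tree on the test set. A minimal sketch of how that could be done, together with the cross-validated complexity table that rpart computes while fitting. The toy data, the seed and the churn rule are all made up for illustration and are not the Telco data:

```r
library(rpart)  # ships with R as a recommended package

set.seed(1)  # assumed seed for a reproducible toy example
n <- 400
toy <- data.frame(
  tenure         = sample(1:72, n, replace = TRUE),
  MonthlyCharges = round(runif(n, 20, 110), 2)
)
# Hypothetical rule: short-tenure customers churn more often.
p_churn   <- ifelse(toy$tenure < 12, 0.7, 0.15)
toy$Churn <- factor(ifelse(runif(n) < p_churn, "Yes", "No"),
                    levels = c("No", "Yes"))

idx   <- sample(n, size = 0.75 * n)
train <- toy[idx, ]
test  <- toy[-idx, ]

rpart_tree <- rpart(Churn ~ ., data = train)

# Class predictions on held-out rows, cross-tabulated against the truth.
pred <- predict(rpart_tree, newdata = test, type = "class")
cm   <- table(Prediction = pred, Reference = test$Churn)
cm
sum(diag(cm)) / sum(cm)  # held-out accuracy

# xerror in this table is the cross-validated error for each tree size.
printcp(rpart_tree)
```

The cp value with the lowest xerror is a common starting point for pruning with prune().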
Random Forest
Random forest is one of the most popular algorithms because of its predictive power.
fit_rfmodel<-randomForest(Churn ~., data =test)
plot(fit_rfmodel)
You can see how the error rate falls as more trees are added.
pred_rfmodel<-predict(fit_rfmodel,newdata = test)
confusionMatrix(pred_rfmodel,test$Churn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1282 15
## Yes 2 459
##
## Accuracy : 0.9903
## 95% CI : (0.9846, 0.9944)
## No Information Rate : 0.7304
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9752
## Mcnemar's Test P-Value : 0.003609
##
## Sensitivity : 0.9984
## Specificity : 0.9684
## Pos Pred Value : 0.9884
## Neg Pred Value : 0.9957
## Prevalence : 0.7304
## Detection Rate : 0.7292
## Detection Prevalence : 0.7378
## Balanced Accuracy : 0.9834
##
## 'Positive' Class : No
##
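On the face of it, random forest beats all the other algorithms. However, the forest above was fit on test and then evaluated on the same test rows, so the 99% is a training (resubstitution) accuracy. It is not comparable with the test accuracies of the tree models, which were fit on train.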
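A minimal sketch of the two evaluation patterns side by side: fitting on train and scoring on test, versus fitting on the very rows being scored, as the randomForest() call above does with data = test. The toy data, the seed and the noisy churn rule are assumptions for illustration only:

```r
library(randomForest)

set.seed(7)  # assumed seed for a reproducible toy example
n <- 600
toy <- data.frame(
  tenure         = sample(1:72, n, replace = TRUE),
  MonthlyCharges = runif(n, 20, 110)
)
# Hypothetical noisy rule, so that no model can be perfect on new data.
p_churn   <- ifelse(toy$tenure < 12, 0.6, 0.2)
toy$Churn <- factor(ifelse(runif(n) < p_churn, "Yes", "No"),
                    levels = c("No", "Yes"))

idx   <- sample(n, size = 0.75 * n)
train <- toy[idx, ]
test  <- toy[-idx, ]

acc <- function(pred, truth) mean(pred == truth)

# Fit on train, score on test: an honest estimate.
fit_train   <- randomForest(Churn ~ ., data = train)
acc_heldout <- acc(predict(fit_train, newdata = test), test$Churn)

# Fit on test, score on the same rows: the pattern used above.
fit_test  <- randomForest(Churn ~ ., data = test)
acc_resub <- acc(predict(fit_test, newdata = test), test$Churn)

c(heldout = acc_heldout, resubstitution = acc_resub)
```

On data like this the resubstitution figure is typically far higher than the held-out one, because the forest has already seen every row it is asked to predict.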
Logistic Regression
I applied logistic regression in my previous article and explained it there.
So let's apply logistic regression and look at the summary.
## [1] 7032
## [1] 5274
##
## Call:
## glm(formula = Churn ~ ., family = binomial, data = test)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8118 -0.6771 -0.2902 0.4384 3.2501
##
## Coefficients: (7 not defined because of singularities)
## Estimate Std. Error z value
## (Intercept) 0.8398957 1.6496836 0.509
## genderMale -0.0981772 0.1312773 -0.748
## SeniorCitizen 0.2784775 0.1724575 1.615
## PartnerYes 0.0113398 0.1534852 0.074
## DependentsYes -0.0172424 0.1835659 -0.094
## tenure -0.0701558 0.0127746 -5.492
## PhoneServiceYes 0.1590474 1.3091417 0.121
## MultipleLinesNo phone service NA NA NA
## MultipleLinesYes 0.4988053 0.3511528 1.420
## InternetServiceFiber optic 1.4332474 1.6038410 0.894
## InternetServiceNo -1.5826532 1.6283423 -0.972
## OnlineSecurityNo internet service NA NA NA
## OnlineSecurityYes -0.1396124 0.3554801 -0.393
## OnlineBackupNo internet service NA NA NA
## OnlineBackupYes 0.0562857 0.3462607 0.163
## DeviceProtectionNo internet service NA NA NA
## DeviceProtectionYes 0.1411947 0.3614722 0.391
## TechSupportNo internet service NA NA NA
## TechSupportYes -0.1558189 0.3708652 -0.420
## StreamingTVNo internet service NA NA NA
## StreamingTVYes 0.2806776 0.6651838 0.422
## StreamingMoviesNo internet service NA NA NA
## StreamingMoviesYes 0.4735604 0.6567606 0.721
## ContractOne year -0.5261040 0.2165376 -2.430
## ContractTwo year -1.4350971 0.3787758 -3.789
## PaperlessBillingYes 0.6422651 0.1535070 4.184
## PaymentMethodCredit card (automatic) 0.1907930 0.2221700 0.859
## PaymentMethodElectronic check 0.3789448 0.1918550 1.975
## PaymentMethodMailed check -0.0264665 0.2398540 -0.110
## MonthlyCharges -0.0356453 0.0639724 -0.557
## TotalCharges 0.0003890 0.0001456 2.671
## Pr(>|z|)
## (Intercept) 0.610664
## genderMale 0.454544
## SeniorCitizen 0.106363
## PartnerYes 0.941104
## DependentsYes 0.925164
## tenure 3.98e-08 ***
## PhoneServiceYes 0.903303
## MultipleLinesNo phone service NA
## MultipleLinesYes 0.155468
## InternetServiceFiber optic 0.371518
## InternetServiceNo 0.331080
## OnlineSecurityNo internet service NA
## OnlineSecurityYes 0.694509
## OnlineBackupNo internet service NA
## OnlineBackupYes 0.870871
## DeviceProtectionNo internet service NA
## DeviceProtectionYes 0.696086
## TechSupportNo internet service NA
## TechSupportYes 0.674376
## StreamingTVNo internet service NA
## StreamingTVYes 0.673058
## StreamingMoviesNo internet service NA
## StreamingMoviesYes 0.470876
## ContractOne year 0.015115 *
## ContractTwo year 0.000151 ***
## PaperlessBillingYes 2.86e-05 ***
## PaymentMethodCredit card (automatic) 0.390467
## PaymentMethodElectronic check 0.048250 *
## PaymentMethodMailed check 0.912137
## MonthlyCharges 0.577392
## TotalCharges 0.007559 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1978.3 on 1757 degrees of freedom
## Residual deviance: 1431.5 on 1734 degrees of freedom
## AIC: 1479.5
##
## Number of Fisher Scoring iterations: 6
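The 7 coefficients reported as NA come from redundant dummy variables: every "No internet service" level marks exactly the same customers as InternetServiceNo, and "No phone service" mirrors PhoneService. A small sketch that reproduces the effect on made-up data (the seed and the random labels are assumptions for illustration):

```r
set.seed(3)  # assumed seed for a reproducible toy example
n <- 300
internet <- sample(c("DSL", "Fiber optic", "No"), n, replace = TRUE)

# OnlineSecurity is "No internet service" exactly when InternetService is "No".
security <- ifelse(internet == "No", "No internet service",
                   sample(c("No", "Yes"), n, replace = TRUE))

toy <- data.frame(
  InternetService = factor(internet),
  OnlineSecurity  = factor(security),
  Churn           = factor(sample(c("No", "Yes"), n, replace = TRUE))
)

fit <- glm(Churn ~ ., family = binomial, data = toy)

# The redundant dummy column cannot be estimated and comes back as NA.
coef(fit)
```

The NA entries are harmless for prediction, but recoding those levels to "No" before fitting gives a cleaner summary.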
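Now let's see the predictions. The diagonal of the table gives an accuracy of (1202 + 220) / 1758, or about 81%. Note that the glm call above shows data = test, so if these predictions are for the same rows, the figure is optimistic.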
##
## gg2 0 1
## 0 1202 220
## 1 116 220
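The code behind this table is not shown. A minimal sketch of how a 0/1 table like it could be produced: fit on train, predict probabilities on test and cut them at 0.5. The toy data, the seed, the 0.5 cut-off and the tenure effect are assumptions for illustration:

```r
set.seed(11)  # assumed seed for a reproducible toy example
n <- 400
toy <- data.frame(tenure = sample(1:72, n, replace = TRUE))
# Hypothetical relationship: churn probability falls with tenure.
toy$Churn <- factor(ifelse(runif(n) < plogis(1 - 0.07 * toy$tenure), "Yes", "No"),
                    levels = c("No", "Yes"))

idx   <- sample(n, size = 0.75 * n)
train <- toy[idx, ]
test  <- toy[-idx, ]

fit_glm <- glm(Churn ~ tenure, family = binomial, data = train)

# type = "response" returns P(Churn == "Yes"), the second factor level.
prob <- predict(fit_glm, newdata = test, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)  # assumed 0.5 cut-off

actual <- as.integer(test$Churn == "Yes")
tab <- table(Predicted = pred, Actual = actual)
tab
mean(pred == actual)  # accuracy on the held-out rows
```

Lowering the cut-off below 0.5 trades some accuracy for catching more churners, which is often the better trade in a retention setting.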
Conclusion
Decision trees are easy to understand and have reasonable accuracy, about 77% to 79% on the test set here. Random forests have good predictive power but are harder to interpret.
Using these models we can understand customer behaviour and why customers are leaving the service.