Tuesday, 2 January 2018

Segmentation: Cluster Analysis (Random Analysis)


Sermentation : An Cluster Analysis

We will do segmentation and cluster the transactions into groups using hierarchical clustering in R programming.

We will load the required packages

library(cluster) #For Clustering
library(dplyr) #For Data Wangling 
library(xlsx) #For Loading Data

we will load the data and rename the heading

df<-read.xlsx("C:/Users/Sangmesh/Downloads/InternChallenge/sample_data_test.xlsx",sheetName = "sample_data")
names(df)<-c("Cid_no","channel_id","transaction_id","transaction_datetime","transaction_amount","product","service_category","transaction_location","transaction_type","customer_type","sex","date_of_birth","education_status","income","minor","marital_status","dependents")
head(df)
##     Cid_no channel_id   transaction_id transaction_datetime
## 1 A0AA2953       ATM1 033ISAL171174163  2017-04-27 13:21:59
## 2 A2735A56       ATM1 033DBLC171091109  2017-04-20 03:56:00
## 3 A2735A56     Teller 071CHDP171090510  2017-04-19 00:00:00
## 4 A070599A        ATM 056CHDP171140032  2017-04-24 00:00:00
## 5 A06A3745       ATM1 045CHDP171140022  2017-04-24 00:00:00
## 6 A2866073       ATM1 008CHDP171030503  2017-04-13 00:00:00
##   transaction_amount product service_category transaction_location
## 1           209232.8    NULL       REMITTANCE                   19
## 2           275000.0    NULL       REMITTANCE                    3
## 3           240000.0    NULL     CASH DEPOSIT                    3
## 4           650000.0    NULL     CASH DEPOSIT                   45
## 5           300000.0    NULL     CASH DEPOSIT                   14
## 6           916500.0    NULL     CASH DEPOSIT                    8
##   transaction_type customer_type  sex date_of_birth education_status
## 1        FINANCIAL             I    M         23229      NON STUDENT
## 2        FINANCIAL             I    M         28151      NON STUDENT
## 3        FINANCIAL             I    M         28151      NON STUDENT
## 4        FINANCIAL             C NULL             ~             NULL
## 5        FINANCIAL             C NULL             ~             NULL
## 6        FINANCIAL             C NULL             ~             NULL
##     income minor marital_status dependents
## 1   SALARY     N              M          8
## 2 BUSINESS     N              M          2
## 3 BUSINESS     N              M          2
## 4     NULL  NULL           NULL          0
## 5     NULL     N           NULL          0
## 6     NULL     N           NULL          0

Lets start data wangling. 1st we will create a dublicate a database

df1<-df

I was able to identify that sex and date of birth are NULL rest of are null. so will rename null as NA

df1$sex<-na_if(df1$sex,"NULL")
df1$date_of_birth<-na_if(df1$date_of_birth,"NULL")
df1$date_of_birth<-na_if(df1$date_of_birth,"~")
head(df1$sex);head(df1$date_of_birth)
## [1] M    M    M    <NA> <NA> <NA>
## Levels: F M NULL
## [1] 23229 28151 28151 <NA>  <NA>  <NA> 
## 541 Levels: ~ 11156 11465 11505 12997 13634 13885 13957 14052 ... NULL

We will subset the data and remove the unwanted and name is df3

df3<-df1[complete.cases(df1[,11:12]),]
head(df3,n=15)
##      Cid_no channel_id   transaction_id transaction_datetime
## 1  A0AA2953       ATM1 033ISAL171174163  2017-04-27 13:21:59
## 2  A2735A56       ATM1 033DBLC171091109  2017-04-20 03:56:00
## 3  A2735A56     Teller 071CHDP171090510  2017-04-19 00:00:00
## 7  A2889968       ATM2 033DCRD171150139  2017-04-25 12:46:00
## 12 A2926537       ATM1 008CHDP170960010  2017-04-06 00:00:00
## 13 A053A344       ATM2 033ISAL171141064  2017-04-24 11:58:00
## 14 A03A9857       ATM1 033IWRT171150508  2017-04-25 09:27:00
## 15 A0002620       ATM1 020CHWL171150011  2017-04-25 00:00:00
## 25 A2900365        ATM 014CHDP171140016  2017-04-24 00:00:00
## 26 A0063037     Teller 033IWRT171000200  2017-04-10 08:47:00
## 27 A0A00623       ATM1 033CARD170912223  2017-04-01 13:09:59
## 28 A0A00623       ATM1 071CHWL170910023  2017-04-01 00:00:00
## 29 A0A00623       ATM1 004CHWL170910514  2017-04-01 00:00:00
## 30 A2432A25       ATM1 033IWRT171080882  2017-04-18 14:51:00
## 32 A2474A88       ATM1 033OWFR171080024  2017-04-18 19:43:00
##    transaction_amount product service_category transaction_location
## 1            209232.8    NULL       REMITTANCE                   19
## 2            275000.0    NULL       REMITTANCE                    3
## 3            240000.0    NULL     CASH DEPOSIT                    3
## 7            256050.0    NULL       REMITTANCE                   71
## 12           313530.0    NULL     CASH DEPOSIT                   91
## 13           227287.0    NULL       REMITTANCE                   12
## 14           253317.0    NULL       REMITTANCE                   17
## 15           700000.0    NULL  CASH WITHDRAWAL                   15
## 25           395000.0    NULL     CASH DEPOSIT                   24
## 26           294000.0    NULL       REMITTANCE                   24
## 27           860000.0    NULL       REMITTANCE                    3
## 28           300000.0    NULL  CASH WITHDRAWAL                    3
## 29           300000.0    NULL  CASH WITHDRAWAL                    3
## 30           300000.0    NULL       REMITTANCE                    3
## 32          1514000.0    NULL       REMITTANCE                   73
##    transaction_type customer_type sex date_of_birth education_status
## 1         FINANCIAL             I   M         23229      NON STUDENT
## 2         FINANCIAL             I   M         28151      NON STUDENT
## 3         FINANCIAL             I   M         28151      NON STUDENT
## 7         FINANCIAL             I   M         25109      NON STUDENT
## 12        FINANCIAL             I   M         25024      NON STUDENT
## 13        FINANCIAL             I   M         24269      NON STUDENT
## 14        FINANCIAL             I   M         23955      NON STUDENT
## 15        FINANCIAL             I   M         18264      NON STUDENT
## 25        FINANCIAL             I   M         21630             NULL
## 26        FINANCIAL             I   M         24772      NON STUDENT
## 27        FINANCIAL             I   M         26586      NON STUDENT
## 28        FINANCIAL             I   M         26586      NON STUDENT
## 29        FINANCIAL             I   M         26586      NON STUDENT
## 30        FINANCIAL             I   M         27111      NON STUDENT
## 32        FINANCIAL             I   M         19542      NON STUDENT
##        income minor marital_status dependents
## 1      SALARY     N              M          8
## 2    BUSINESS     N              M          2
## 3    BUSINESS     N              M          2
## 7      SALARY     N              M          2
## 12   BUSINESS     N              M          1
## 13     SALARY     N              M          0
## 14     SALARY     N              M          3
## 15     SALARY     N              M          0
## 25   BUSINESS     N              M          0
## 26     SALARY     N              S          0
## 27     RENTAL     N              M          3
## 28     RENTAL     N              M          3
## 29     RENTAL     N              M          3
## 30 INVESTMENT     N           NULL          0
## 32       NULL     N              M          0

After removing null there are few data which consist of null. lets rename it with the mean

df3$education_status<-na_if(df3$education_status,"NULL")
df3$education_status[is.na(df3$education_status)]="NON STUDENT"
df3$minor<-na_if(df3$minor,"NULL")
df3$minor[is.na(df3$minor)]="N"
df3$marital_status<-na_if(df3$marital_status,"NULL")
df3$marital_status[is.na(df3$marital_status)]="M"
df3$income<-na_if(df3$income,"NULL")
df3$income[is.na(df3$income)]="SALARY"
head(df3,n=15)
##      Cid_no channel_id   transaction_id transaction_datetime
## 1  A0AA2953       ATM1 033ISAL171174163  2017-04-27 13:21:59
## 2  A2735A56       ATM1 033DBLC171091109  2017-04-20 03:56:00
## 3  A2735A56     Teller 071CHDP171090510  2017-04-19 00:00:00
## 7  A2889968       ATM2 033DCRD171150139  2017-04-25 12:46:00
## 12 A2926537       ATM1 008CHDP170960010  2017-04-06 00:00:00
## 13 A053A344       ATM2 033ISAL171141064  2017-04-24 11:58:00
## 14 A03A9857       ATM1 033IWRT171150508  2017-04-25 09:27:00
## 15 A0002620       ATM1 020CHWL171150011  2017-04-25 00:00:00
## 25 A2900365        ATM 014CHDP171140016  2017-04-24 00:00:00
## 26 A0063037     Teller 033IWRT171000200  2017-04-10 08:47:00
## 27 A0A00623       ATM1 033CARD170912223  2017-04-01 13:09:59
## 28 A0A00623       ATM1 071CHWL170910023  2017-04-01 00:00:00
## 29 A0A00623       ATM1 004CHWL170910514  2017-04-01 00:00:00
## 30 A2432A25       ATM1 033IWRT171080882  2017-04-18 14:51:00
## 32 A2474A88       ATM1 033OWFR171080024  2017-04-18 19:43:00
##    transaction_amount product service_category transaction_location
## 1            209232.8    NULL       REMITTANCE                   19
## 2            275000.0    NULL       REMITTANCE                    3
## 3            240000.0    NULL     CASH DEPOSIT                    3
## 7            256050.0    NULL       REMITTANCE                   71
## 12           313530.0    NULL     CASH DEPOSIT                   91
## 13           227287.0    NULL       REMITTANCE                   12
## 14           253317.0    NULL       REMITTANCE                   17
## 15           700000.0    NULL  CASH WITHDRAWAL                   15
## 25           395000.0    NULL     CASH DEPOSIT                   24
## 26           294000.0    NULL       REMITTANCE                   24
## 27           860000.0    NULL       REMITTANCE                    3
## 28           300000.0    NULL  CASH WITHDRAWAL                    3
## 29           300000.0    NULL  CASH WITHDRAWAL                    3
## 30           300000.0    NULL       REMITTANCE                    3
## 32          1514000.0    NULL       REMITTANCE                   73
##    transaction_type customer_type sex date_of_birth education_status
## 1         FINANCIAL             I   M         23229      NON STUDENT
## 2         FINANCIAL             I   M         28151      NON STUDENT
## 3         FINANCIAL             I   M         28151      NON STUDENT
## 7         FINANCIAL             I   M         25109      NON STUDENT
## 12        FINANCIAL             I   M         25024      NON STUDENT
## 13        FINANCIAL             I   M         24269      NON STUDENT
## 14        FINANCIAL             I   M         23955      NON STUDENT
## 15        FINANCIAL             I   M         18264      NON STUDENT
## 25        FINANCIAL             I   M         21630      NON STUDENT
## 26        FINANCIAL             I   M         24772      NON STUDENT
## 27        FINANCIAL             I   M         26586      NON STUDENT
## 28        FINANCIAL             I   M         26586      NON STUDENT
## 29        FINANCIAL             I   M         26586      NON STUDENT
## 30        FINANCIAL             I   M         27111      NON STUDENT
## 32        FINANCIAL             I   M         19542      NON STUDENT
##        income minor marital_status dependents
## 1      SALARY     N              M          8
## 2    BUSINESS     N              M          2
## 3    BUSINESS     N              M          2
## 7      SALARY     N              M          2
## 12   BUSINESS     N              M          1
## 13     SALARY     N              M          0
## 14     SALARY     N              M          3
## 15     SALARY     N              M          0
## 25   BUSINESS     N              M          0
## 26     SALARY     N              S          0
## 27     RENTAL     N              M          3
## 28     RENTAL     N              M          3
## 29     RENTAL     N              M          3
## 30 INVESTMENT     N              M          0
## 32     SALARY     N              M          0

We will remove the unwanted colums which are genric and do not contribute to the model

df3<-df3[-c(1,3,4,6,9)]
names(df3)
##  [1] "channel_id"           "transaction_amount"   "service_category"    
##  [4] "transaction_location" "customer_type"        "sex"                 
##  [7] "date_of_birth"        "education_status"     "income"              
## [10] "minor"                "marital_status"       "dependents"

Lets apply hclust and plot the data

set.seed(1000)
d_daisy<-daisy(df3)
h_c<-hclust(d_daisy,method = "complete")
plot(h_c)

Lets see the CPCC of the model

cor(cophenetic(h_c),d_daisy)
## [1] 0.6644664

Using rect.clust fucntion we can specify the number of K and cluster and aggregate function to view the group and view what are the observation within the data.