Sermentation : An Cluster Analysis
Sangamesh K S
January 2, 2018
We will do segmentation and cluster the transactions into groups using hierarchical clustering in R programming.
We will load the required packages
library(cluster) #For Clustering
library(dplyr) #For Data Wangling
library(xlsx) #For Loading Data
we will load the data and rename the heading
df<-read.xlsx("C:/Users/Sangmesh/Downloads/InternChallenge/sample_data_test.xlsx",sheetName = "sample_data")
names(df)<-c("Cid_no","channel_id","transaction_id","transaction_datetime","transaction_amount","product","service_category","transaction_location","transaction_type","customer_type","sex","date_of_birth","education_status","income","minor","marital_status","dependents")
head(df)
## Cid_no channel_id transaction_id transaction_datetime
## 1 A0AA2953 ATM1 033ISAL171174163 2017-04-27 13:21:59
## 2 A2735A56 ATM1 033DBLC171091109 2017-04-20 03:56:00
## 3 A2735A56 Teller 071CHDP171090510 2017-04-19 00:00:00
## 4 A070599A ATM 056CHDP171140032 2017-04-24 00:00:00
## 5 A06A3745 ATM1 045CHDP171140022 2017-04-24 00:00:00
## 6 A2866073 ATM1 008CHDP171030503 2017-04-13 00:00:00
## transaction_amount product service_category transaction_location
## 1 209232.8 NULL REMITTANCE 19
## 2 275000.0 NULL REMITTANCE 3
## 3 240000.0 NULL CASH DEPOSIT 3
## 4 650000.0 NULL CASH DEPOSIT 45
## 5 300000.0 NULL CASH DEPOSIT 14
## 6 916500.0 NULL CASH DEPOSIT 8
## transaction_type customer_type sex date_of_birth education_status
## 1 FINANCIAL I M 23229 NON STUDENT
## 2 FINANCIAL I M 28151 NON STUDENT
## 3 FINANCIAL I M 28151 NON STUDENT
## 4 FINANCIAL C NULL ~ NULL
## 5 FINANCIAL C NULL ~ NULL
## 6 FINANCIAL C NULL ~ NULL
## income minor marital_status dependents
## 1 SALARY N M 8
## 2 BUSINESS N M 2
## 3 BUSINESS N M 2
## 4 NULL NULL NULL 0
## 5 NULL N NULL 0
## 6 NULL N NULL 0
Lets start data wangling. 1st we will create a dublicate a database
df1<-df
I was able to identify that sex and date of birth are NULL rest of are null. so will rename null as NA
df1$sex<-na_if(df1$sex,"NULL")
df1$date_of_birth<-na_if(df1$date_of_birth,"NULL")
df1$date_of_birth<-na_if(df1$date_of_birth,"~")
head(df1$sex);head(df1$date_of_birth)
## [1] M M M <NA> <NA> <NA>
## Levels: F M NULL
## [1] 23229 28151 28151 <NA> <NA> <NA>
## 541 Levels: ~ 11156 11465 11505 12997 13634 13885 13957 14052 ... NULL
We will subset the data and remove the unwanted and name is df3
df3<-df1[complete.cases(df1[,11:12]),]
head(df3,n=15)
## Cid_no channel_id transaction_id transaction_datetime
## 1 A0AA2953 ATM1 033ISAL171174163 2017-04-27 13:21:59
## 2 A2735A56 ATM1 033DBLC171091109 2017-04-20 03:56:00
## 3 A2735A56 Teller 071CHDP171090510 2017-04-19 00:00:00
## 7 A2889968 ATM2 033DCRD171150139 2017-04-25 12:46:00
## 12 A2926537 ATM1 008CHDP170960010 2017-04-06 00:00:00
## 13 A053A344 ATM2 033ISAL171141064 2017-04-24 11:58:00
## 14 A03A9857 ATM1 033IWRT171150508 2017-04-25 09:27:00
## 15 A0002620 ATM1 020CHWL171150011 2017-04-25 00:00:00
## 25 A2900365 ATM 014CHDP171140016 2017-04-24 00:00:00
## 26 A0063037 Teller 033IWRT171000200 2017-04-10 08:47:00
## 27 A0A00623 ATM1 033CARD170912223 2017-04-01 13:09:59
## 28 A0A00623 ATM1 071CHWL170910023 2017-04-01 00:00:00
## 29 A0A00623 ATM1 004CHWL170910514 2017-04-01 00:00:00
## 30 A2432A25 ATM1 033IWRT171080882 2017-04-18 14:51:00
## 32 A2474A88 ATM1 033OWFR171080024 2017-04-18 19:43:00
## transaction_amount product service_category transaction_location
## 1 209232.8 NULL REMITTANCE 19
## 2 275000.0 NULL REMITTANCE 3
## 3 240000.0 NULL CASH DEPOSIT 3
## 7 256050.0 NULL REMITTANCE 71
## 12 313530.0 NULL CASH DEPOSIT 91
## 13 227287.0 NULL REMITTANCE 12
## 14 253317.0 NULL REMITTANCE 17
## 15 700000.0 NULL CASH WITHDRAWAL 15
## 25 395000.0 NULL CASH DEPOSIT 24
## 26 294000.0 NULL REMITTANCE 24
## 27 860000.0 NULL REMITTANCE 3
## 28 300000.0 NULL CASH WITHDRAWAL 3
## 29 300000.0 NULL CASH WITHDRAWAL 3
## 30 300000.0 NULL REMITTANCE 3
## 32 1514000.0 NULL REMITTANCE 73
## transaction_type customer_type sex date_of_birth education_status
## 1 FINANCIAL I M 23229 NON STUDENT
## 2 FINANCIAL I M 28151 NON STUDENT
## 3 FINANCIAL I M 28151 NON STUDENT
## 7 FINANCIAL I M 25109 NON STUDENT
## 12 FINANCIAL I M 25024 NON STUDENT
## 13 FINANCIAL I M 24269 NON STUDENT
## 14 FINANCIAL I M 23955 NON STUDENT
## 15 FINANCIAL I M 18264 NON STUDENT
## 25 FINANCIAL I M 21630 NULL
## 26 FINANCIAL I M 24772 NON STUDENT
## 27 FINANCIAL I M 26586 NON STUDENT
## 28 FINANCIAL I M 26586 NON STUDENT
## 29 FINANCIAL I M 26586 NON STUDENT
## 30 FINANCIAL I M 27111 NON STUDENT
## 32 FINANCIAL I M 19542 NON STUDENT
## income minor marital_status dependents
## 1 SALARY N M 8
## 2 BUSINESS N M 2
## 3 BUSINESS N M 2
## 7 SALARY N M 2
## 12 BUSINESS N M 1
## 13 SALARY N M 0
## 14 SALARY N M 3
## 15 SALARY N M 0
## 25 BUSINESS N M 0
## 26 SALARY N S 0
## 27 RENTAL N M 3
## 28 RENTAL N M 3
## 29 RENTAL N M 3
## 30 INVESTMENT N NULL 0
## 32 NULL N M 0
After removing null there are few data which consist of null. lets rename it with the mean
df3$education_status<-na_if(df3$education_status,"NULL")
df3$education_status[is.na(df3$education_status)]="NON STUDENT"
df3$minor<-na_if(df3$minor,"NULL")
df3$minor[is.na(df3$minor)]="N"
df3$marital_status<-na_if(df3$marital_status,"NULL")
df3$marital_status[is.na(df3$marital_status)]="M"
df3$income<-na_if(df3$income,"NULL")
df3$income[is.na(df3$income)]="SALARY"
head(df3,n=15)
## Cid_no channel_id transaction_id transaction_datetime
## 1 A0AA2953 ATM1 033ISAL171174163 2017-04-27 13:21:59
## 2 A2735A56 ATM1 033DBLC171091109 2017-04-20 03:56:00
## 3 A2735A56 Teller 071CHDP171090510 2017-04-19 00:00:00
## 7 A2889968 ATM2 033DCRD171150139 2017-04-25 12:46:00
## 12 A2926537 ATM1 008CHDP170960010 2017-04-06 00:00:00
## 13 A053A344 ATM2 033ISAL171141064 2017-04-24 11:58:00
## 14 A03A9857 ATM1 033IWRT171150508 2017-04-25 09:27:00
## 15 A0002620 ATM1 020CHWL171150011 2017-04-25 00:00:00
## 25 A2900365 ATM 014CHDP171140016 2017-04-24 00:00:00
## 26 A0063037 Teller 033IWRT171000200 2017-04-10 08:47:00
## 27 A0A00623 ATM1 033CARD170912223 2017-04-01 13:09:59
## 28 A0A00623 ATM1 071CHWL170910023 2017-04-01 00:00:00
## 29 A0A00623 ATM1 004CHWL170910514 2017-04-01 00:00:00
## 30 A2432A25 ATM1 033IWRT171080882 2017-04-18 14:51:00
## 32 A2474A88 ATM1 033OWFR171080024 2017-04-18 19:43:00
## transaction_amount product service_category transaction_location
## 1 209232.8 NULL REMITTANCE 19
## 2 275000.0 NULL REMITTANCE 3
## 3 240000.0 NULL CASH DEPOSIT 3
## 7 256050.0 NULL REMITTANCE 71
## 12 313530.0 NULL CASH DEPOSIT 91
## 13 227287.0 NULL REMITTANCE 12
## 14 253317.0 NULL REMITTANCE 17
## 15 700000.0 NULL CASH WITHDRAWAL 15
## 25 395000.0 NULL CASH DEPOSIT 24
## 26 294000.0 NULL REMITTANCE 24
## 27 860000.0 NULL REMITTANCE 3
## 28 300000.0 NULL CASH WITHDRAWAL 3
## 29 300000.0 NULL CASH WITHDRAWAL 3
## 30 300000.0 NULL REMITTANCE 3
## 32 1514000.0 NULL REMITTANCE 73
## transaction_type customer_type sex date_of_birth education_status
## 1 FINANCIAL I M 23229 NON STUDENT
## 2 FINANCIAL I M 28151 NON STUDENT
## 3 FINANCIAL I M 28151 NON STUDENT
## 7 FINANCIAL I M 25109 NON STUDENT
## 12 FINANCIAL I M 25024 NON STUDENT
## 13 FINANCIAL I M 24269 NON STUDENT
## 14 FINANCIAL I M 23955 NON STUDENT
## 15 FINANCIAL I M 18264 NON STUDENT
## 25 FINANCIAL I M 21630 NON STUDENT
## 26 FINANCIAL I M 24772 NON STUDENT
## 27 FINANCIAL I M 26586 NON STUDENT
## 28 FINANCIAL I M 26586 NON STUDENT
## 29 FINANCIAL I M 26586 NON STUDENT
## 30 FINANCIAL I M 27111 NON STUDENT
## 32 FINANCIAL I M 19542 NON STUDENT
## income minor marital_status dependents
## 1 SALARY N M 8
## 2 BUSINESS N M 2
## 3 BUSINESS N M 2
## 7 SALARY N M 2
## 12 BUSINESS N M 1
## 13 SALARY N M 0
## 14 SALARY N M 3
## 15 SALARY N M 0
## 25 BUSINESS N M 0
## 26 SALARY N S 0
## 27 RENTAL N M 3
## 28 RENTAL N M 3
## 29 RENTAL N M 3
## 30 INVESTMENT N M 0
## 32 SALARY N M 0
We will remove the unwanted colums which are genric and do not contribute to the model
df3<-df3[-c(1,3,4,6,9)]
names(df3)
## [1] "channel_id" "transaction_amount" "service_category"
## [4] "transaction_location" "customer_type" "sex"
## [7] "date_of_birth" "education_status" "income"
## [10] "minor" "marital_status" "dependents"
Lets apply hclust and plot the data
set.seed(1000)
d_daisy<-daisy(df3)
h_c<-hclust(d_daisy,method = "complete")
plot(h_c)
Lets see the CPCC of the model
cor(cophenetic(h_c),d_daisy)
## [1] 0.6644664
Using rect.clust fucntion we can specify the number of K and cluster and aggregate function to view the group and view what are the observation within the data.