Friday, 17 November 2017

Segmentation using Cluster Analysis: Model Based Technique


Introduction

In my previous article I showed you clustering with the bottom-up hierarchical (distance-based) approach and the mean-based k-means approach. Both of these are non-parametric approaches to clustering.

Now in this article we will look at how we can use the EM algorithm with a Gaussian Mixture Model for a model-based approach, where we assume the mean and variance are the parameters to be estimated and that the dataset follows an underlying probability distribution. The clustering happens based on probability, and the goodness of the clustering can be assessed using the BIC (Bayesian Information Criterion).

What is EM Algorithm?

In statistics, an expectation–maximization (EM) algorithm is an iterative method to find maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate of the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found in the E step. These parameter estimates are then used to determine the distribution of the latent variables in the next E step.
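
To make the E and M steps concrete, here is a minimal, hand-rolled sketch for a one-dimensional, two-component Gaussian mixture. This is only an illustration of the idea; mclust does all of this (and much more, in multiple dimensions) internally.

```r
# Simulated data from two Gaussians (true means 0 and 5)
set.seed(42)
x <- c(rnorm(100, mean = 0), rnorm(100, mean = 5))

# Initial guesses for means, standard deviations, and mixing weights
mu <- c(-1, 1); sigma <- c(1, 1); pi_k <- c(0.5, 0.5)

for (iter in 1:50) {
  # E step: responsibility of each component for each point
  d1 <- pi_k[1] * dnorm(x, mu[1], sigma[1])
  d2 <- pi_k[2] * dnorm(x, mu[2], sigma[2])
  r1 <- d1 / (d1 + d2); r2 <- 1 - r1
  # M step: responsibility-weighted parameter updates
  mu    <- c(sum(r1 * x) / sum(r1), sum(r2 * x) / sum(r2))
  sigma <- c(sqrt(sum(r1 * (x - mu[1])^2) / sum(r1)),
             sqrt(sum(r2 * (x - mu[2])^2) / sum(r2)))
  pi_k  <- c(mean(r1), mean(r2))
}
round(mu, 1)  # the estimated means converge toward the true values
```

Each point is assigned a *probability* of belonging to each component rather than a hard label, which is the key difference from k-means.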

Now we will use the same data that we used in the previous models, and run the algorithm.

Before we run the algorithm, we have to load the mclust package, and the cluster package for plotting.

library(mclust)
## Package 'mclust' version 5.3
## Type 'citation("mclust")' for citing this R package in publications.
library(cluster)

Please note, I have transformed the data into numeric before running the algorithm.
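One way to do that conversion is sketched below. The toy data frame and its column names are purely illustrative; the real seg.df used in this series may have different columns.

```r
# Toy data standing in for seg.df (column names are hypothetical)
seg.df <- data.frame(age    = c(25, 40, 31),
                     gender = factor(c("M", "F", "F")),
                     income = c(49000, 72000, 58000))

# Convert every factor column to its numeric codes; leave numerics alone
seg.df.num <- as.data.frame(lapply(seg.df, function(col) {
  if (is.factor(col)) as.numeric(col) else col
}))

str(seg.df.num)  # all columns are now numeric
```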

cust_num<-Mclust(seg.df.num)
summary(cust_num)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm 
## ----------------------------------------------------
## 
## Mclust EEV (ellipsoidal, equal volume and shape) model with 3 components:
## 
##  log.likelihood   n df       BIC       ICL
##       -5256.222 300 71 -10917.41 -10955.48
## 
## Clustering table:
##   1   2   3 
## 111 115  74

Unlike the previous algorithms, where we had to set the number of clusters "k" ourselves, here that is not required: Mclust selects the number of components as part of the clustering. On our data the algorithm grouped the observations into 3 clusters, and it also reports the BIC. Please note: if you go on to try any model improvements, keep an eye on the BIC when comparing models. mclust reports BIC on a scale where larger is better, so it keeps the model with the highest BIC (under the traditional definition of BIC, the lowest value would be best).
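If you want to see the comparison Mclust performs behind the scenes, mclust exposes it directly through mclustBIC. The sketch below assumes seg.df.num is the numeric data frame fitted above.

```r
library(mclust)

# BIC for every covariance model over a range of component counts
bic <- mclustBIC(seg.df.num)

summary(bic)  # the top-ranked models by mclust's BIC
plot(bic)     # visual comparison across G and covariance structures
```

The model and number of components returned by Mclust(seg.df.num) correspond to the best entry in this table.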

Now we will plot the clusters.

clusplot(seg.df,cust_num$classification,color = TRUE,labels = 3,shade = FALSE,main = "Mclust with k=3")