Wednesday, 8 November 2017

Brand Rating Analysis using PCA Algorithm


PCA Analysis on Brand Rating Analysis

Introduction

Hi, in this article I will walk through a multi-dimensional analysis technique for metric data called PCA (Principal Component Analysis). It may sound difficult, but in practice it comes down to decomposing the covariance matrix of the data and plotting the result on a chart.

What is PCA?

In simple words: you are taking a group photo, and your objective is to fit everyone in the frame, using a wide angle so that every person is clearly visible. We normally also want the background, standing angle, dress, shoes and so on to be covered. PCA applies the same idea to data: it is all about viewing the data from a wide angle and understanding the relationships and similarities in it.

Where can PCA be applied?

PCA helps when you have too many variables or comments and you do not know who the leader in a segment, product, or brand is, or what the areas for improvement are. Specifically, it can be applied to product ratings, brand ratings, positioning of customer segments, ratings of political candidates, evaluation of advertisements, etc.

Even a linear regression model can struggle when a data set has too many features and parameters. In that situation we try to group the variables based on their similarities and come up with a reduced set of variables, which also helps in identifying the similarities in the data and the relationships between them.

One thing we have to understand is that PCA is not the only technique for this; Exploratory Factor Analysis and Multidimensional Scaling are also used, and I will cover them in a later article. One special thing about PCA is that it looks for uncorrelated linear dimensions (the principal components), each capturing as much of the remaining variance as possible, so that similar variables end up grouped along the same component. When plotted, the variables appear as lines pointing at different angles.
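To make the covariance-matrix view of PCA a bit more concrete, here is a minimal sketch on made-up data (the `toy` matrix and its columns are purely illustrative and are not the brand data). It assumes nothing beyond base R and shows that the component variances from `prcomp()` match the eigenvalues of the covariance matrix, and that the resulting components are uncorrelated.

```r
# Toy data: 200 observations of three numeric variables (made up for illustration)
set.seed(42)
x <- rnorm(200)
toy <- cbind(a = x + rnorm(200, sd = 0.5),
             b = x + rnorm(200, sd = 0.5),
             c = rnorm(200))

# PCA via prcomp() (centres the data by default)
pca.toy <- prcomp(toy)

# The same thing by hand: eigen-decomposition of the covariance matrix
eig <- eigen(cov(toy))

# Component variances equal the eigenvalues of the covariance matrix
all.equal(pca.toy$sdev^2, eig$values)

# The component scores are uncorrelated with each other
round(cor(pca.toy$x), 3)
```

Columns `a` and `b` are built from the same underlying signal, so most of their shared variation ends up in the first component.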

## 'data.frame':    1000 obs. of  10 variables:
##  $ perform: int  2 1 2 1 1 2 1 2 2 3 ...
##  $ leader : int  4 1 3 6 1 8 1 1 1 1 ...
##  $ latest : int  8 4 5 10 5 9 5 7 8 9 ...
##  $ fun    : int  8 7 9 8 8 5 7 5 10 8 ...
##  $ serious: int  2 1 2 3 1 3 1 2 1 1 ...
##  $ bargain: int  9 1 9 4 9 8 5 8 7 3 ...
##  $ value  : int  7 1 5 5 9 7 1 7 7 3 ...
##  $ trendy : int  4 2 1 2 1 1 1 7 5 4 ...
##  $ rebuy  : int  6 2 6 1 1 2 1 1 1 1 ...
##  $ brand  : Factor w/ 10 levels "a","b","c","d",..: 1 1 1 1 1 1 1 1 1 1 ...

If you look at this data closely, each row rates a brand on various features, performance among them. There are 1000 observations and 9 rating features describing the brand. All of them are integers, and there are no factors in the data apart from brand.

Now we will look at the summary to check for outliers, missing values, and abnormalities.

##     perform           leader           latest            fun        
##  Min.   : 1.000   Min.   : 1.000   Min.   : 1.000   Min.   : 1.000  
##  1st Qu.: 1.000   1st Qu.: 2.000   1st Qu.: 4.000   1st Qu.: 4.000  
##  Median : 4.000   Median : 4.000   Median : 7.000   Median : 6.000  
##  Mean   : 4.488   Mean   : 4.417   Mean   : 6.195   Mean   : 6.068  
##  3rd Qu.: 7.000   3rd Qu.: 6.000   3rd Qu.: 9.000   3rd Qu.: 8.000  
##  Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.000  
##                                                                     
##     serious          bargain           value            trendy     
##  Min.   : 1.000   Min.   : 1.000   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.: 2.000   1st Qu.: 2.000   1st Qu.: 2.000   1st Qu.: 3.00  
##  Median : 4.000   Median : 4.000   Median : 4.000   Median : 5.00  
##  Mean   : 4.323   Mean   : 4.259   Mean   : 4.337   Mean   : 5.22  
##  3rd Qu.: 6.000   3rd Qu.: 6.000   3rd Qu.: 6.000   3rd Qu.: 7.00  
##  Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.00  
##                                                                    
##      rebuy            brand    
##  Min.   : 1.000   a      :100  
##  1st Qu.: 1.000   b      :100  
##  Median : 3.000   c      :100  
##  Mean   : 3.727   d      :100  
##  3rd Qu.: 5.000   e      :100  
##  Max.   :10.000   f      :100  
##                   (Other):400

The data looks fine: it is free from outliers and missing values, and the means and medians are reasonably close to each other.
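The loading code is not shown in this post, so as a rough sketch of how these checks could be run, the snippet below builds a small stand-in data frame (the name `ratings` and its random values are hypothetical, not the real data) and uses base R to inspect structure, summary statistics, missing values, and out-of-range ratings.

```r
# Hypothetical stand-in for the ratings data: 1-10 integer ratings plus a brand factor
set.seed(1)
ratings <- data.frame(
  perform = sample(1:10, 50, replace = TRUE),
  leader  = sample(1:10, 50, replace = TRUE),
  brand   = factor(rep(c("a", "b"), each = 25))
)

str(ratings)       # variable types and first few values
summary(ratings)   # quartiles, mean, and factor level counts

# Count missing values per column
na.counts <- colSums(is.na(ratings))
na.counts

# Count values outside the expected 1-10 range in the numeric columns
num.cols <- sapply(ratings, is.numeric)
out.of.range <- sapply(ratings[num.cols], function(v) sum(v < 1 | v > 10))
out.of.range
```

A count of zero everywhere is what we would hope to see before moving on.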

Now we will scale the data, because comparison is difficult when one variable has large values and another has small ones; scaling brings all variables onto a common scale. Strictly speaking, this data does not require scaling, as every rating is already on the same 1 to 10 scale, but scaling is a standard step. So let’s scale the data.

##         perform     leader     latest        fun    serious     bargain
## [1,] -0.7766617 -0.1598662  0.5864084  0.7040175 -0.8361532  1.77763492
## [2,] -1.0888247 -1.3099826 -0.7131117  0.3396192 -1.1960986 -1.22195997
## [3,] -0.7766617 -0.5432383 -0.3882316  1.0684159 -0.8361532  1.77763492
## [4,] -1.0888247  0.6068781  1.2361685  0.7040175 -0.4762078 -0.09711188
## [5,] -1.0888247 -1.3099826 -0.3882316  0.7040175 -1.1960986  1.77763492
## [6,] -0.7766617  1.3736224  0.9112885 -0.3891774 -0.4762078  1.40268556
##           value     trendy      rebuy
## [1,]  1.1102404 -0.4449143  0.8932671
## [2,] -1.3912400 -1.1742820 -0.6786944
## [3,]  0.2764136 -1.5389658  0.8932671
## [4,]  0.2764136 -1.1742820 -1.0716848
## [5,]  1.9440672 -1.5389658 -1.0716848
## [6,]  1.1102404 -1.5389658 -0.6786944

Here I have removed brand, which is a factor and therefore something we cannot scale.
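The scaling call itself is not shown above, so here is one way it might look, assuming the ratings sit in a data frame (called `ratings` here, a hypothetical name filled with random stand-in values). The sketch drops the factor column, standardises the rest with `scale()`, and confirms that each column ends up with mean 0 and standard deviation 1.

```r
# Hypothetical stand-in data: three rating columns plus the brand factor
set.seed(2)
ratings <- data.frame(
  perform = sample(1:10, 100, replace = TRUE),
  leader  = sample(1:10, 100, replace = TRUE),
  latest  = sample(1:10, 100, replace = TRUE),
  brand   = factor(rep(letters[1:10], each = 10))
)

# Keep only the numeric columns (this drops brand) and standardise them
num.cols <- sapply(ratings, is.numeric)
brandscale <- scale(ratings[, num.cols])

head(brandscale)
round(colMeans(brandscale), 10)   # each column now has mean 0
apply(brandscale, 2, sd)          # and standard deviation 1
```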

We can see similarities and dissimilarities between the variables using a correlation plot.

## Warning: package 'corrplot' was built under R version 3.4.1
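The plot itself is not reproduced here. As a rough sketch of how such a plot could be drawn with the corrplot package named in the warning above, the snippet below computes a correlation matrix on synthetic data (the `toy` data frame is made up and only borrows three column names for readability) and plots it if corrplot is installed. The `order = "hclust"` argument is my choice for grouping similar variables together; the original call may have used different options.

```r
# Synthetic data: two positively related columns and one negatively related one
set.seed(3)
base <- rnorm(100)
toy <- data.frame(
  value   = base + rnorm(100, sd = 0.5),
  bargain = base + rnorm(100, sd = 0.5),
  fun     = -base + rnorm(100, sd = 0.5)
)

# Correlation matrix of the scaled data
cor.mat <- cor(scale(toy))
round(cor.mat, 2)

# Draw the plot only if the corrplot package is available
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor.mat, order = "hclust")
}
```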

Now we will apply the PCA algorithm to the scaled data.

# Run PCA on the scaled ratings and summarise the variance per component
pca.algo <- prcomp(brandscale)
summary(pca.algo)
## Importance of components:
##                          PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.726 1.4479 1.0389 0.8528 0.79846 0.73133 0.62458
## Proportion of Variance 0.331 0.2329 0.1199 0.0808 0.07084 0.05943 0.04334
## Cumulative Proportion  0.331 0.5640 0.6839 0.7647 0.83554 0.89497 0.93831
##                            PC8     PC9
## Standard deviation     0.55861 0.49310
## Proportion of Variance 0.03467 0.02702
## Cumulative Proportion  0.97298 1.00000
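The "Proportion of Variance" and "Cumulative Proportion" rows can be reproduced by hand from the standard deviations. The sketch below copies the (rounded) standard deviations from the summary above and squares them; the variable names are mine. Because the nine variables were standardised, the variances add up to roughly 9.

```r
# Standard deviations copied from the summary output above (rounded values)
sdev <- c(1.726, 1.4479, 1.0389, 0.8528, 0.79846,
          0.73133, 0.62458, 0.55861, 0.49310)

variances <- sdev^2                       # variance of each component
prop.var  <- variances / sum(variances)   # proportion of total variance
cum.prop  <- cumsum(prop.var)             # running total

sum(variances)                            # close to 9, the number of variables
round(rbind(prop.var, cum.prop), 3)
```

Small differences in the last decimal place are expected, since the inputs are already rounded.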

Now we have the PCA summary and its results. If required, we can also print the rotation matrix to see how each variable relates to each component.

pca.algo
## Standard deviations (1, .., p=9):
## [1] 1.7260636 1.4479474 1.0388719 0.8527667 0.7984647 0.7313298 0.6245834
## [8] 0.5586112 0.4930993
## 
## Rotation (n x k) = (9 x 9):
##                PC1         PC2         PC3         PC4         PC5
## perform  0.2374679  0.41991179  0.03854006 -0.52630873  0.46793435
## leader   0.2058257  0.52381901 -0.09512739 -0.08923461 -0.29452974
## latest  -0.3703806  0.20145317 -0.53273054  0.21410754  0.10586676
## fun     -0.2510601 -0.25037973 -0.41781346 -0.75063952 -0.33149429
## serious  0.1597402  0.51047254 -0.04067075  0.09893394 -0.55515540
## bargain  0.3991731 -0.21849698 -0.48989756  0.16734345 -0.01257429
## value    0.4474562 -0.18980822 -0.36924507  0.15118500 -0.06327757
## trendy  -0.3510292  0.31849032 -0.37090530  0.16764432  0.36649697
## rebuy    0.4390184  0.01509832 -0.12461593 -0.13031231  0.35568769
##                PC6          PC7         PC8         PC9
## perform -0.3370676  0.364179109 -0.14444718  0.05223384
## leader  -0.2968860 -0.613674301  0.28766118 -0.17889453
## latest  -0.1742059 -0.185480310 -0.64290436  0.05750244
## fun      0.1405367 -0.007114761  0.07461259  0.03153306
## serious  0.3924874  0.445302862 -0.18354764  0.09072231
## bargain -0.1393966  0.288264900  0.05789194 -0.64720849
## value   -0.2195327  0.017163011  0.14829295  0.72806108
## trendy   0.2658186  0.153572108  0.61450289  0.05907022
## rebuy    0.6751400 -0.388656160 -0.20210688 -0.01720236

We can see which variables relate positively (+ve) and which negatively (-ve) to each principal component.
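As a small illustration of reading a rotation matrix like this one, the sketch below runs PCA on synthetic data (the `toy` data frame and its `v1` to `v4` columns are invented for the example) and pulls out the signs and sizes of the loadings on the first component. Note that the overall sign of a component is arbitrary; what matters is which variables share a sign and which oppose each other.

```r
# Synthetic data: v1 and v2 move together, v3 moves against them, v4 is unrelated
set.seed(4)
base <- rnorm(200)
toy <- data.frame(
  v1 = base + rnorm(200, sd = 0.4),
  v2 = base + rnorm(200, sd = 0.4),
  v3 = -base + rnorm(200, sd = 0.4),
  v4 = rnorm(200)
)
pca.toy <- prcomp(scale(toy))

# The rotation matrix holds the loadings of each variable on each component
loadings <- pca.toy$rotation
round(loadings[, 1:2], 2)

# Variables whose PC1 loadings share a sign move together along PC1
sign(loadings[, "PC1"])

# Variables ranked by how strongly they load on PC1
sort(abs(loadings[, "PC1"]), decreasing = TRUE)

# The loading vectors are orthonormal: t(R) %*% R is the identity matrix
round(crossprod(loadings), 10)
```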

Now we can go ahead with a scree plot.

On the X axis we have the principal components and on the Y axis we have their variance. We look for a kink (elbow) in the curve, and here we can see one at component 4. So we will go ahead and draw our PCA plot.
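The scree plot call is not shown above; one way to draw it in base R is `screeplot()` with `type = "lines"`. The sketch below uses synthetic data (the `toy` data frame is made up, with two pairs of related columns and one independent column), so the position of its kink says nothing about the brand data.

```r
# Synthetic data: two pairs of related variables plus one independent variable
set.seed(5)
base1 <- rnorm(300)
base2 <- rnorm(300)
toy <- data.frame(
  a = base1 + rnorm(300, sd = 0.5),
  b = base1 + rnorm(300, sd = 0.5),
  c = base2 + rnorm(300, sd = 0.5),
  d = base2 + rnorm(300, sd = 0.5),
  e = rnorm(300)
)
pca.toy <- prcomp(scale(toy))

# Scree plot: component number on X, variance on Y
screeplot(pca.toy, type = "lines", main = "Scree plot (toy data)")

# The same numbers as a vector, to read off where the curve flattens
variances <- pca.toy$sdev^2
round(variances, 2)
```

Printing the variances next to the plot makes it easier to judge where the curve stops dropping sharply.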

biplot(pca.algo)

Now we can clearly see that the 9 features point in different directions, but with a lot of background noise from the individual observations. To reduce this noise, we can take the mean rating per brand and run PCA on those means, which gives the same overall picture with less clutter, or change the colour to see where the observations concentrate.
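A minimal sketch of the "mean data" idea, assuming the ratings are in a data frame with a brand factor (the `ratings` data frame below is a random stand-in, and names such as `brand.mean` and `pca.mean` are mine): aggregate to one row of mean ratings per brand, then run PCA and the biplot on that much smaller table.

```r
# Hypothetical stand-in data: four rating columns and ten brands
set.seed(6)
ratings <- data.frame(
  perform = sample(1:10, 300, replace = TRUE),
  leader  = sample(1:10, 300, replace = TRUE),
  fun     = sample(1:10, 300, replace = TRUE),
  value   = sample(1:10, 300, replace = TRUE),
  brand   = factor(rep(letters[1:10], each = 30))
)

# Mean rating per brand: one row per brand instead of one row per respondent
brand.mean <- aggregate(. ~ brand, data = ratings, FUN = mean)
rownames(brand.mean) <- as.character(brand.mean$brand)
brand.mean <- brand.mean[, -1]   # drop the brand column, keep it as row names

# PCA and biplot on the aggregated means; points are now labelled by brand
pca.mean <- prcomp(brand.mean, scale. = TRUE)
biplot(pca.mean, main = "Brand means (toy data)")
```

With only one point per brand, the biplot shows brand labels instead of a cloud of 1000 row numbers, which is what makes the positions readable.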