PCA Analysis on Brand Rating Analysis
Sangamesh K S
November 8, 2017
Introduction
In this article I will walk through a multi-dimensional analysis of metric data called PCA (Principal Component Analysis). The name may sound difficult, but in reality it comes down to decomposing a covariance matrix and plotting the result on a chart.
What is PCA?
In simple words: you are taking a group photo, and your objective is to fit everyone into the frame, using a wide angle so that each person is clearly visible. We normally also want the background, the way people stand, their clothes, shoes and so on to be captured. PCA uses the same kind of technique: it is all about seeing the data through a wide angle and understanding the relationships and similarities in it.
Where can PCA be applied?
When you have too many variables or comments and you do not know who the leader in a segment, product or brand is, or where the areas for improvement are, you can apply PCA to find out. Specifically, it can be applied to product ratings, brand ratings, positioning of customer segments, ratings of political candidates, evaluation of advertisements, etc.
Even a linear regression model can break down when a data set has too many features and parameters. At that point we try to group the variables by similarity and arrive at a reduced set of variables, which also helps in identifying the similarities in the data and the relationships among them.
One thing to understand is that PCA is not the only technique for this; Exploratory Factor Analysis and Multidimensional Scaling are also used, and I will cover them in later articles. What is special about PCA is that it looks for uncorrelated linear dimensions (the components), ordered so that each one captures as much of the remaining variance as possible. When plotted, the variables appear as lines pointing at different angles.
## 'data.frame': 1000 obs. of 10 variables:
## $ perform: int 2 1 2 1 1 2 1 2 2 3 ...
## $ leader : int 4 1 3 6 1 8 1 1 1 1 ...
## $ latest : int 8 4 5 10 5 9 5 7 8 9 ...
## $ fun : int 8 7 9 8 8 5 7 5 10 8 ...
## $ serious: int 2 1 2 3 1 3 1 2 1 1 ...
## $ bargain: int 9 1 9 4 9 8 5 8 7 3 ...
## $ value : int 7 1 5 5 9 7 1 7 7 3 ...
## $ trendy : int 4 2 1 2 1 1 1 7 5 4 ...
## $ rebuy : int 6 2 6 1 1 2 1 1 1 1 ...
## $ brand : Factor w/ 10 levels "a","b","c","d",..: 1 1 1 1 1 1 1 1 1 1 ...
Looking at this data, we can see various rated features of the brand, including its performance. There are 1000 observations and 9 features describing each brand. All of them are integers, and the only factor in the data is brand.
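The structure check above can be sketched as follows. The original data set is not included in this article, so the block builds a simulated stand-in with the same shape; the name `brand.data` and all values in it are assumptions made purely for illustration.

```r
# Simulated stand-in for the brand-rating data (hypothetical values,
# NOT the data used in this article): 9 integer ratings plus a brand factor.
set.seed(42)
features <- c("perform", "leader", "latest", "fun", "serious",
              "bargain", "value", "trendy", "rebuy")
brand.data <- as.data.frame(
  matrix(sample(1:10, 1000 * 9, replace = TRUE), ncol = 9,
         dimnames = list(NULL, features))
)
brand.data$brand <- factor(rep(letters[1:10], each = 100))

# Inspect the structure: column types and the first few values
str(brand.data)

# Class of every column: nine integer columns and one factor
col.classes <- sapply(brand.data, class)
print(col.classes)
```

Running `str()` first is a cheap way to confirm that only the rating columns will go into the PCA and that the factor column needs to be set aside.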
Now we will look at the summary to check for outliers, missing values and abnormalities.
## perform leader latest fun
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.000
## 1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.: 4.000 1st Qu.: 4.000
## Median : 4.000 Median : 4.000 Median : 7.000 Median : 6.000
## Mean : 4.488 Mean : 4.417 Mean : 6.195 Mean : 6.068
## 3rd Qu.: 7.000 3rd Qu.: 6.000 3rd Qu.: 9.000 3rd Qu.: 8.000
## Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000
##
## serious bargain value trendy
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.00
## 1st Qu.: 2.000 1st Qu.: 2.000 1st Qu.: 2.000 1st Qu.: 3.00
## Median : 4.000 Median : 4.000 Median : 4.000 Median : 5.00
## Mean : 4.323 Mean : 4.259 Mean : 4.337 Mean : 5.22
## 3rd Qu.: 6.000 3rd Qu.: 6.000 3rd Qu.: 6.000 3rd Qu.: 7.00
## Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.00
##
## rebuy brand
## Min. : 1.000 a :100
## 1st Qu.: 1.000 b :100
## Median : 3.000 c :100
## Mean : 3.727 d :100
## 3rd Qu.: 5.000 e :100
## Max. :10.000 f :100
## (Other):400
The data looks fine: every rating lies within the 1 to 10 range, no missing values are reported, and the means and medians are reasonably close to each other.
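The same checks can also be made explicit in code rather than read off the summary table. This is a minimal sketch on simulated ratings; the name `ratings` and its values are assumptions, not the article's data.

```r
# Simulated 1-10 ratings standing in for the nine rating columns (hypothetical)
set.seed(42)
ratings <- as.data.frame(
  matrix(sample(1:10, 1000 * 9, replace = TRUE), ncol = 9)
)

# Five-number summary plus mean for every column
summary(ratings)

# Missing values per column (all zeros means nothing is missing)
na.count <- colSums(is.na(ratings))
print(na.count)

# Does every column stay inside the expected 1-10 rating range?
in.range <- sapply(ratings, function(x) all(x >= 1 & x <= 10))
print(in.range)

# Gap between mean and median, a rough indicator of skew
mean.median.gap <- sapply(ratings, function(x) abs(mean(x) - median(x)))
print(round(mean.median.gap, 3))
```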
Now we will scale the data, because comparison becomes difficult when one variable takes large values and another takes small ones. Scaling brings all variables onto a common scale. Strictly speaking, this data does not require scaling, since every rating is already on the same 1 to 10 scale, but scaling is standard practice, so let's scale the data.
## perform leader latest fun serious bargain
## [1,] -0.7766617 -0.1598662 0.5864084 0.7040175 -0.8361532 1.77763492
## [2,] -1.0888247 -1.3099826 -0.7131117 0.3396192 -1.1960986 -1.22195997
## [3,] -0.7766617 -0.5432383 -0.3882316 1.0684159 -0.8361532 1.77763492
## [4,] -1.0888247 0.6068781 1.2361685 0.7040175 -0.4762078 -0.09711188
## [5,] -1.0888247 -1.3099826 -0.3882316 0.7040175 -1.1960986 1.77763492
## [6,] -0.7766617 1.3736224 0.9112885 -0.3891774 -0.4762078 1.40268556
## value trendy rebuy
## [1,] 1.1102404 -0.4449143 0.8932671
## [2,] -1.3912400 -1.1742820 -0.6786944
## [3,] 0.2764136 -1.5389658 0.8932671
## [4,] 0.2764136 -1.1742820 -1.0716848
## [5,] 1.9440672 -1.5389658 -1.0716848
## [6,] 1.1102404 -1.5389658 -0.6786944
Here I have removed brand, which is a factor and therefore cannot be scaled.
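The scaling step can be sketched like this, assuming the nine ratings sit in the first nine columns of a data frame. The data is simulated and `brand.data` is a hypothetical name; `brandscale` is the name used later in this article.

```r
# Simulated stand-in for the brand-rating data (hypothetical values)
set.seed(42)
features <- c("perform", "leader", "latest", "fun", "serious",
              "bargain", "value", "trendy", "rebuy")
brand.data <- data.frame(
  matrix(sample(1:10, 1000 * 9, replace = TRUE), ncol = 9,
         dimnames = list(NULL, features)),
  brand = factor(rep(letters[1:10], each = 100))
)

# Drop the factor column (column 10) and standardise the nine ratings:
# each column is centred to mean 0 and divided by its standard deviation.
brandscale <- scale(brand.data[, 1:9])
head(brandscale)

# The same result by hand for one column, to show what scale() does
manual.perform <- (brand.data$perform - mean(brand.data$perform)) /
  sd(brand.data$perform)
all.equal(manual.perform, as.numeric(brandscale[, "perform"]))
```

After scaling, every column has mean 0 and standard deviation 1, so no single rating dominates the components just because of its spread.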
We can see the similarities and dissimilarities between the variables using a correlation plot.
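A correlation plot of this kind could be drawn roughly as follows. The exact call used for the original plot is not shown in this article, so the `corrplot()` arguments here are an assumption, and the data is simulated.

```r
# Simulated, already-scaled ratings (hypothetical stand-in for brandscale)
set.seed(42)
features <- c("perform", "leader", "latest", "fun", "serious",
              "bargain", "value", "trendy", "rebuy")
brandscale <- scale(matrix(rnorm(1000 * 9), ncol = 9,
                           dimnames = list(NULL, features)))

# Pairwise correlations between the nine ratings
cor.mat <- cor(brandscale)

# Draw the plot if the corrplot package is installed;
# order = "hclust" places similar variables next to each other.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor.mat, order = "hclust")
} else {
  print(round(cor.mat, 2))  # fall back to the plain matrix
}
```

Variables that end up in the same cluster of the plot are the ones PCA is likely to place on the same component.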
Now we will apply the PCA algorithm to the scaled data.
pca.algo<-prcomp(brandscale)
summary(pca.algo)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.726 1.4479 1.0389 0.8528 0.79846 0.73133 0.62458
## Proportion of Variance 0.331 0.2329 0.1199 0.0808 0.07084 0.05943 0.04334
## Cumulative Proportion 0.331 0.5640 0.6839 0.7647 0.83554 0.89497 0.93831
## PC8 PC9
## Standard deviation 0.55861 0.49310
## Proportion of Variance 0.03467 0.02702
## Cumulative Proportion 0.97298 1.00000
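Now we have the PCA summary. The first two components explain about 56% of the variance and the first four about 76%. If required, we can also look at how each variable relates to each component.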
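The "Proportion of Variance" row in the summary is simple arithmetic on the standard deviations: square each one to get the component's variance, then divide by the total. The sketch below redoes that calculation using the standard deviations that `prcomp` reports for `pca.algo`, typed in by hand.

```r
# Standard deviations of the nine components, as printed for pca.algo
sdev <- c(1.7260636, 1.4479474, 1.0388719, 0.8527667, 0.7984647,
          0.7313298, 0.6245834, 0.5586112, 0.4930993)

# Variance of each component (the eigenvalues of the correlation matrix)
eigenvalues <- sdev^2

# With 9 standardised variables the total variance is (about) 9
total.var <- sum(eigenvalues)

# Share of the total explained by each component, and the running total
prop.var <- eigenvalues / total.var
cum.var  <- cumsum(prop.var)

round(rbind(prop.var, cum.var), 4)
```

Because the variables were scaled, each one contributes a variance of 1, which is why the eigenvalues add up to the number of variables.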
pca.algo
## Standard deviations (1, .., p=9):
## [1] 1.7260636 1.4479474 1.0388719 0.8527667 0.7984647 0.7313298 0.6245834
## [8] 0.5586112 0.4930993
##
## Rotation (n x k) = (9 x 9):
## PC1 PC2 PC3 PC4 PC5
## perform 0.2374679 0.41991179 0.03854006 -0.52630873 0.46793435
## leader 0.2058257 0.52381901 -0.09512739 -0.08923461 -0.29452974
## latest -0.3703806 0.20145317 -0.53273054 0.21410754 0.10586676
## fun -0.2510601 -0.25037973 -0.41781346 -0.75063952 -0.33149429
## serious 0.1597402 0.51047254 -0.04067075 0.09893394 -0.55515540
## bargain 0.3991731 -0.21849698 -0.48989756 0.16734345 -0.01257429
## value 0.4474562 -0.18980822 -0.36924507 0.15118500 -0.06327757
## trendy -0.3510292 0.31849032 -0.37090530 0.16764432 0.36649697
## rebuy 0.4390184 0.01509832 -0.12461593 -0.13031231 0.35568769
## PC6 PC7 PC8 PC9
## perform -0.3370676 0.364179109 -0.14444718 0.05223384
## leader -0.2968860 -0.613674301 0.28766118 -0.17889453
## latest -0.1742059 -0.185480310 -0.64290436 0.05750244
## fun 0.1405367 -0.007114761 0.07461259 0.03153306
## serious 0.3924874 0.445302862 -0.18354764 0.09072231
## bargain -0.1393966 0.288264900 0.05789194 -0.64720849
## value -0.2195327 0.017163011 0.14829295 0.72806108
## trendy 0.2658186 0.153572108 0.61450289 0.05907022
## rebuy 0.6751400 -0.388656160 -0.20210688 -0.01720236
The rotation matrix shows which variables load positively and which negatively on each principal component.
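To make the role of the rotation matrix concrete, here is a small sketch on simulated data (the matrix `x` and its four column names are chosen only for illustration). It shows that the loadings form an orthonormal set of directions and that the component scores are just the scaled data multiplied by those loadings.

```r
# Small simulated, scaled data set (hypothetical values)
set.seed(42)
x <- scale(matrix(rnorm(200 * 4), ncol = 4,
                  dimnames = list(NULL, c("bargain", "value",
                                          "latest", "trendy"))))

pca <- prcomp(x)
loadings <- pca$rotation   # one column of weights per component

# The columns are orthonormal: t(loadings) %*% loadings is the identity
ortho.check <- crossprod(loadings)
print(round(ortho.check, 10))

# Scores are the scaled data projected onto the loadings
scores <- x %*% loadings
all.equal(scores, pca$x, check.attributes = FALSE)

# Sign of each weight on the first component
print(sign(loadings[, "PC1"]))
```

Variables with the same sign on a component move together along it, while opposite signs indicate a contrast. The overall sign of a component is arbitrary, so only the relative signs carry meaning.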
Now we can go ahead with a scree plot.
On the X axis we have the principal components and on the Y axis their variances. We look for a kink (elbow) in the curve, and here we can see one at component 4. So we will go ahead and draw our PCA plot.
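A scree plot like the one described can be produced directly from a `prcomp` object. The call used for the original figure is not shown in this article, so the following is a sketch on simulated data rather than the exact code behind it.

```r
# Simulated, scaled data with nine variables (hypothetical stand-in)
set.seed(42)
x <- scale(matrix(rnorm(1000 * 9), ncol = 9))
pca <- prcomp(x)

# Scree plot: component number on the X axis, variance on the Y axis.
# plot() on a prcomp object draws a screeplot; type = "l" uses lines.
plot(pca, type = "l", main = "Scree plot")

# The values being plotted are the component variances
component.var <- pca$sdev^2
print(round(component.var, 3))
```

The components are always sorted by decreasing variance, so the curve can only fall; the elbow is the point after which it flattens out.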
biplot(pca.algo)
Now we can clearly see that the 9 features point in different directions, but the 1000 individual observations add a lot of noise. To reduce this background noise, we can run the same analysis on the mean rating per brand, which gives the same picture with less clutter, or change the point colour to see where the observations concentrate.
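The mean-per-brand idea could look roughly like this. It is a sketch on simulated data: `brand.data`, `brand.mean` and `pca.mean` are hypothetical names, and the plot title is only a suggestion.

```r
# Simulated, scaled ratings with a brand label (hypothetical values)
set.seed(42)
features <- c("perform", "leader", "latest", "fun", "serious",
              "bargain", "value", "trendy", "rebuy")
brand.data <- data.frame(
  scale(matrix(sample(1:10, 1000 * 9, replace = TRUE), ncol = 9,
               dimnames = list(NULL, features))),
  brand = factor(rep(letters[1:10], each = 100))
)

# Average every rating within each brand: 1000 rows collapse to 10
brand.mean <- aggregate(. ~ brand, data = brand.data, FUN = mean)
rownames(brand.mean) <- brand.mean$brand   # label rows by brand
brand.mean <- brand.mean[, -1]             # drop the brand column

# PCA on the brand means, rescaled because the means have unequal spread
pca.mean <- prcomp(brand.mean, scale. = TRUE)

# Biplot with one point per brand instead of one per respondent
biplot(pca.mean, main = "Brand positioning", cex = c(1.5, 1))
```

With only ten labelled points, brands that sit close together on the biplot are rated similarly, and the feature arrows show what sets each group apart.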