Exploratory Data Analysis (EDA): Marketing Data
Sangamesh K S
November 6, 2017
Introduction
Today we will do an Exploratory Data Analysis using R. The data set we are working with is market research data consisting of 300 observations that have already been classified into segments. We will use this segmented data and explore it further. Before we get into the data analysis, it is necessary to know:
What is Exploratory Data Analysis?
In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.
EDA is an effective form of analysis, usually carried out to understand the data before applying machine learning algorithms.
It is normally taught to beginner data analysts, as it does not require hardcore statistical knowledge. Even the outputs are simple enough that a non-statistician can easily understand them.
The main objective of this analysis is to explore and identify facts and figures that normally cannot be seen just by looking at the raw data.
Interestingly, you do not need any sophisticated tools to carry out this analysis; even Excel can be used for similar tasks.
Before going further, we will look at the structure of the data:
str(mydata)
## 'data.frame': 300 obs. of 7 variables:
## $ age : num 47.3 31.4 43.2 37.3 41 ...
## $ gender : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 2 2 2 1 1 ...
## $ income : num 49483 35546 44169 81042 79353 ...
## $ kids : int 2 1 0 1 3 4 3 0 1 0 ...
## $ ownHome : Factor w/ 2 levels "ownNo","ownYes": 1 2 2 1 2 2 1 1 1 2 ...
## $ subscribe: Factor w/ 2 levels "subNo","subYes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Segment : Factor w/ 4 levels "Moving up","Suburb mix",..: 2 2 2 2 2 2 2 2 2 2 ...
head(mydata);tail(mydata)
## age gender income kids ownHome subscribe Segment
## 1 47.31613 Male 49482.81 2 ownNo subNo Suburb mix
## 2 31.38684 Male 35546.29 1 ownYes subNo Suburb mix
## 3 43.20034 Male 44169.19 0 ownYes subNo Suburb mix
## 4 37.31700 Female 81041.99 1 ownNo subNo Suburb mix
## 5 40.95439 Female 79353.01 3 ownYes subNo Suburb mix
## 6 43.03387 Male 58143.36 4 ownYes subNo Suburb mix
## age gender income kids ownHome subscribe Segment
## 295 36.14964 Male 40522.39 0 ownYes subNo Moving up
## 296 32.95227 Female 43882.43 0 ownYes subNo Moving up
## 297 40.96255 Female 64197.09 2 ownNo subNo Moving up
## 298 38.22980 Male 47580.93 0 ownNo subYes Moving up
## 299 33.17036 Male 60747.34 1 ownNo subNo Moving up
## 300 34.38388 Male 53674.93 5 ownYes subNo Moving up
The data consists of 300 observations with one output and six features (6 + 1 = 7 variables). Segment is the output, and age, gender, income, kids, ownHome, and subscribe are the features. In this analysis we will look at how Segment relates to the features, along with a few other comparisons.
We will start with the first feature, age, and use a histogram to check how the 300 observations are distributed.
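The histogram itself is not reproduced in this text, so here is a minimal sketch of how one could be drawn with base R's `hist()`. It uses only the 12 ages printed by `head(mydata)` and `tail(mydata)` above; the 5-year bins, the title, and the axis label are assumptions chosen for illustration, not the settings of the original plot.

```r
# The 12 ages printed by head(mydata) and tail(mydata) above.
# With the full data set, mydata$age would be used instead.
age <- c(47.31613, 31.38684, 43.20034, 37.31700, 40.95439, 43.03387,
         36.14964, 32.95227, 40.96255, 38.22980, 33.17036, 34.38388)

# Base R histogram with 5-year bins (bin width is an assumption).
age_hist <- hist(age,
                 breaks = seq(30, 50, by = 5),
                 main   = "Distribution of age",
                 xlab   = "Age (years)",
                 col    = "grey80")

# The bin counts behind the plot, handy for checking what it shows.
age_hist$counts
```

Replacing `age` with `mydata$age` should give the picture for all 300 observations, possibly with a wider range of `breaks`.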
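Now let's find the relationship between gender and Segment using the dplyr package. In the output, count is the number of respondents in each group and rfreq is that count as a percentage of the respondents of the same gender.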
## Source: local data frame [8 x 4]
## Groups: gender [2]
##
## # A tibble: 8 x 4
## gender Segment count rfreq
## <fctr> <fctr> <int> <dbl>
## 1 Female Moving up 49 31.21019
## 2 Female Suburb mix 48 30.57325
## 3 Female Travelers 40 25.47771
## 4 Female Urban hip 20 12.73885
## 5 Male Moving up 21 14.68531
## 6 Male Suburb mix 52 36.36364
## 7 Male Travelers 40 27.97203
## 8 Male Urban hip 30 20.97902
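The code behind this table is not shown, so the following is one possible way to produce it with dplyr, not necessarily the original code. Because `mydata` is not available here, the sketch rebuilds one row per respondent from the printed counts; `cells`, `respondents`, and `seg_by_gender` are hypothetical names. It assumes `rfreq` is the percentage within each gender, which is consistent with the printed values.

```r
library(dplyr)

# Cell counts copied from the table above.
cells <- data.frame(
  gender  = rep(c("Female", "Male"), each = 4),
  Segment = rep(c("Moving up", "Suburb mix", "Travelers", "Urban hip"),
                times = 2),
  n       = c(49, 48, 40, 20, 21, 52, 40, 30)
)

# Expand to one row per respondent so the pipeline looks like it would
# on the real mydata (stand-in for mydata[, c("gender", "Segment")]).
respondents <- cells[rep(seq_len(nrow(cells)), times = cells$n),
                     c("gender", "Segment")]

seg_by_gender <- respondents %>%
  count(gender, Segment, name = "count") %>%   # respondents per group
  group_by(gender) %>%
  mutate(rfreq = 100 * count / sum(count)) %>% # percent within gender
  ungroup()

seg_by_gender
```

Grouping by `gender` before the `mutate()` is what makes the percentages sum to 100 within each gender rather than across the whole table.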
This gives the gender distribution within each segment. Overall, females (157) outnumber males (143), but the difference comes from the Moving up segment (49 females vs. 21 males); Suburb mix and Urban hip have more males, and Travelers is evenly split.
Now we will find which gender subscribes to the product/service.
##
## Female Male
## subNo 136 124
## subYes 21 19
As per the table, 21 of the 40 subscribers are female and 19 are male. Females are a slight majority, but the male share is far from negligible, and the subscription rates are nearly identical (about 13% for each gender). From these proportions we can conclude that the product/service appeals to both genders.
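The subscription rate per gender can be checked directly from the printed counts. This is a small sketch with base R's `prop.table()`; the object names are hypothetical, and with the full data the table would presumably come from `table(mydata$subscribe, mydata$gender)`.

```r
# Counts copied from the table above (rows: subscribe, columns: gender).
sub_by_gender <- as.table(matrix(
  c(136, 21,    # Female: subNo, subYes
    124, 19),   # Male:   subNo, subYes
  nrow = 2,
  dimnames = list(subscribe = c("subNo", "subYes"),
                  gender    = c("Female", "Male"))
))

# margin = 2 divides each cell by its column total,
# giving the share of subscribers within each gender.
sub_rate <- 100 * prop.table(sub_by_gender, margin = 2)["subYes", ]
round(sub_rate, 1)
```

Both rates come out at roughly 13%, which supports reading the product as equally attractive to both genders.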
Having looked at the relationship between subscription and gender, we now turn to income. We will start by plotting a histogram of income.
As per the histogram, the data shows no obvious skew, and the majority of incomes are concentrated between 20,000 and 80,000, with few outliers.
Now that we have an understanding of income, we will look at the relationship between income and gender.
As per the bar plot, female income is on the higher end. Now let's look at the totals separately.
## Male Female Total
## [1,] 6981973 8298988 15280961
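Now we can see that total female income is somewhat higher than total male income. Note that the sample also contains more females (157) than males (143), which contributes to the difference in totals. Expressed as a percentage of total income: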
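# m, f and Total hold the male, female and overall income sums shown above
m <- m / Total * 100   # male share of total income, in percent
f <- f / Total * 100   # female share of total income, in percent
s <- cbind(m, f)
colnames(s) <- c("Male", "Female")
s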
## Male Female
## [1,] 45.69066 54.30934
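Totals depend on group size, so it can help to look at income per respondent as well. The sketch below works only from numbers printed in this document: the two income totals, and the group sizes obtained by summing the count column of the gender/Segment table. The variable names are hypothetical.

```r
# Income totals printed in the output above.
total_income <- c(Male = 6981973, Female = 8298988)

# Group sizes from the gender/Segment table
# (Male: 21 + 52 + 40 + 30, Female: 49 + 48 + 40 + 20).
n_respondents <- c(Male = 143, Female = 157)

# Share of total income, matching the percentages printed above.
share_pct <- 100 * total_income / sum(total_income)

# Average income per respondent in each group.
mean_income <- total_income / n_respondents

round(share_pct, 2)
round(mean_income)
```

The per-respondent averages are about 48,800 for males and 52,900 for females, so the female figure is higher even after accounting for the larger number of female respondents.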
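Now we will find the relationship between segment and income, first as total income per segment, then the overall total, and finally each segment's percentage of that total.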
## Segment Income
## 1 Moving up 3716368
## 2 Suburb mix 5503382
## 3 Travelers 4977115
## 4 Urban hip 1084096
## [1] 15280961
## Segment Income Percent
## 1 Moving up 3716368 24.320251
## 2 Suburb mix 5503382 36.014633
## 3 Travelers 4977115 32.570694
## 4 Urban hip 1084096 7.094423
As per the income analysis, Suburb mix and Travelers together account for the majority of total income (about 69%). Suburb mix has the highest total and Urban hip the lowest.
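As a check on the Percent column, and to separate the effect of segment size from the effect of income level, here is a short sketch built from the printed totals. The segment sizes (70, 100, 80, 50) are the row sums of the home-ownership table further down; `seg_income` and `MeanIncome` are hypothetical names.

```r
# Totals copied from the table above; n is the number of respondents
# per segment (row sums of the ownHome-by-Segment table).
seg_income <- data.frame(
  Segment = c("Moving up", "Suburb mix", "Travelers", "Urban hip"),
  Income  = c(3716368, 5503382, 4977115, 1084096),
  n       = c(70, 100, 80, 50)
)

# Share of total income per segment (reproduces the Percent column).
seg_income$Percent <- 100 * seg_income$Income / sum(seg_income$Income)

# Average income per respondent, which removes the segment-size effect.
seg_income$MeanIncome <- seg_income$Income / seg_income$n

# Sort from highest to lowest average income.
seg_income[order(-seg_income$MeanIncome), ]
```

On a per-respondent basis Travelers rank first, ahead of Suburb mix, while Urban hip remains clearly the lowest.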
Now we will look at how many respondents own a home in each segment.
##
## ownNo ownYes
## Moving up 47 23
## Suburb mix 52 48
## Travelers 20 60
## Urban hip 40 10
Interestingly, Travelers are the most likely to own a home (60 of 80, or 75%), and Urban hip the least likely (10 of 50, or 20%).
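Because the segments differ in size, ownership is easier to compare as a percentage of each segment. A minimal sketch using the printed counts and row-wise `prop.table()` follows; the object names are hypothetical.

```r
# Counts copied from the table above (rows: Segment, columns: ownHome).
own_by_seg <- as.table(matrix(
  c(47, 23,
    52, 48,
    20, 60,
    40, 10),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    Segment = c("Moving up", "Suburb mix", "Travelers", "Urban hip"),
    ownHome = c("ownNo", "ownYes"))
))

# margin = 1 divides each cell by its row total,
# so every segment's two percentages sum to 100.
own_pct <- round(100 * prop.table(own_by_seg, margin = 1), 1)
own_pct
```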
Now we will look at how many kids there are in each segment.
## Segment Kids Percent
## 1 Moving up 134 35.1706
## 2 Suburb mix 192 50.3937
## 3 Travelers 0 0.0000
## 4 Urban hip 55 14.4357
We can see that Travelers do not have kids and Suburb mix has the highest number of kids.
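The code for this summary is not shown; one plausible way to build such a table is `aggregate()` with a formula. The sketch below runs on the 12 rows printed by `head(mydata)` and `tail(mydata)` only, so its numbers differ from the full-data table above and cover just two segments. `sample_rows` and `kids_by_seg` are hypothetical names.

```r
# Segment and kids for the 12 rows printed earlier
# (rows 1-6 are Suburb mix, rows 295-300 are Moving up).
sample_rows <- data.frame(
  Segment = rep(c("Suburb mix", "Moving up"), each = 6),
  kids    = c(2, 1, 0, 1, 3, 4,
              0, 0, 2, 0, 1, 5)
)

# Total number of kids per segment.
kids_by_seg <- aggregate(kids ~ Segment, data = sample_rows, FUN = sum)
names(kids_by_seg)[2] <- "Kids"

# Each segment's share of all kids in the sample.
kids_by_seg$Percent <- 100 * kids_by_seg$Kids / sum(kids_by_seg$Kids)
kids_by_seg
```

Running the same two steps on `mydata` instead of `sample_rows` should give a table in the Segment, Kids, Percent layout shown above.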
Now we will look at market penetration by segment, based on subscribers.
The same data in tabular format:
##
## Moving up Suburb mix Travelers Urban hip
## subNo 56 94 70 40
## subYes 14 6 10 10
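Subscribers come mainly from the Moving up, Urban hip, and Travelers segments, while Suburb mix, the largest segment, has only 6 subscribers out of 100. With this understanding of the data we can do proper STP (Segmentation, Targeting, and Positioning) for the product. Moving up and Urban hip show the highest subscription rates (20% each), so these two segments look the most promising to target.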
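The subscriber plot is not reproduced in this text, so here is a sketch of how the counts could be drawn as a grouped bar plot and turned into a subscription rate per segment. The numbers are copied from the table above; the plot title, axis label, and object names are assumptions for illustration.

```r
# Counts copied from the table above (rows: subscribe, columns: Segment).
sub_by_seg <- matrix(
  c(56, 14,    # Moving up:  subNo, subYes
    94,  6,    # Suburb mix
    70, 10,    # Travelers
    40, 10),   # Urban hip
  nrow = 2,
  dimnames = list(
    subscribe = c("subNo", "subYes"),
    Segment   = c("Moving up", "Suburb mix", "Travelers", "Urban hip"))
)

# Side-by-side bars for non-subscribers and subscribers in each segment.
barplot(sub_by_seg, beside = TRUE, legend.text = TRUE,
        main = "Subscribers by segment", ylab = "Respondents")

# Subscribers as a percentage of each segment's respondents.
sub_rate <- 100 * sub_by_seg["subYes", ] / colSums(sub_by_seg)
sub_rate
```

Comparing rates rather than raw counts matters here because the segments range from 50 to 100 respondents.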