Monday, 4 December 2017

A Simple Approach for Data Wrangling and Data Exploration of Sales Value



Introduction

Here I will be addressing three things which together make up a short Exploratory Data Analysis. The areas covered are as follows:

Performing Data Wrangling and Exploratory Analysis of the given Data.

Plotting a histogram of the Data.

Commenting on the Distribution of Data.

Now let me load the dataset and the required package into R. (The dataset is stored in a data frame called data1; only the package load is shown here.)

library(psych)

After loading the data, let's look at its structure.

str(data1)
## 'data.frame':    1000 obs. of  1 variable:
##  $ values: num  -1.41 8.91 8.05 8.71 12.84 ...

The structure shows that data1 is a data frame with 1000 observations of a single variable, values. Now let's check whether the object itself is numeric:

is.numeric(data1)
## [1] FALSE

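is.numeric() returns FALSE because data1 is a data frame (a list of columns), not a numeric vector, even though its values column is already numeric. So we convert it to a plain numeric vector: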

data4<-as.numeric(unlist(data1))
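To make the FALSE result less surprising, here is a minimal sketch of the same check on a small stand-in data frame. The name df and its five values (taken from the first values printed by str() above) are only for illustration and are not the full dataset.

```r
# Hypothetical stand-in for data1: a data frame with one numeric column.
df <- data.frame(values = c(-1.41, 8.91, 8.05, 8.71, 12.84))

df_is_numeric  <- is.numeric(df)         # FALSE: a data frame is a list of columns
col_is_numeric <- is.numeric(df$values)  # TRUE: the column itself is numeric

# Two ways to get a plain numeric vector out of a one-column data frame
v1 <- as.numeric(unlist(df))  # the approach used in this post
v2 <- df$values               # direct column access

print(c(df_is_numeric, col_is_numeric))
print(identical(v1, v2))
```

For a single-column data frame both routes give the same vector; with several columns, unlist() would concatenate all of them, so direct column access is usually the safer habit.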

Now we have a numeric vector, data4. Next, we will check for missing values.

sum(is.na(data1$values))
## [1] 0

It has no missing values.
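For comparison, this is roughly what the same check looks like when values are missing. The vector x below is made up for illustration and has nothing to do with the sales data.

```r
# Hypothetical vector with two missing values
x <- c(5.2, NA, 8.1, 7.4, NA, 9.0)

n_missing   <- sum(is.na(x))  # count of NA entries
any_missing <- anyNA(x)       # quick yes/no check

mean_all <- mean(x)                # NA, because missing values propagate
mean_obs <- mean(x, na.rm = TRUE)  # mean of the observed values only

print(n_missing)
print(mean_obs)
```

If the count had been greater than zero here, functions such as mean() and summary() would need na.rm = TRUE or some other handling of the gaps.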

summary(data4)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -6.461   5.373   8.172   8.046  10.896  20.504

After looking at the summary, the main thing to compare is the mean and the median. If the two are far apart, an outlier (or strong skew) may be pulling the mean away from the median. Here the mean (8.046) and the median (8.172) are close, so the summary gives no sign of extreme outliers.
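A tiny sketch shows why this comparison works. The numbers are invented for illustration: adding one extreme value moves the mean a lot and leaves the median where it was.

```r
# Hypothetical values, not the sales data
clean        <- c(6, 7, 8, 8, 9, 10)
with_outlier <- c(clean, 100)   # same data plus one extreme value

mean_clean   <- mean(clean)
median_clean <- median(clean)

mean_out   <- mean(with_outlier)
median_out <- median(with_outlier)

# Gap between mean and median, before and after adding the outlier
gap_clean <- abs(mean_clean - median_clean)
gap_out   <- abs(mean_out - median_out)

print(c(gap_clean = gap_clean, gap_out = gap_out))
```

A small gap, as in our summary, is only a hint and not a proof, which is why the plots that follow are still worth doing.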

There are various techniques for finding outliers, such as a boxplot, a frequency table or a histogram. Here we will use a histogram. Before plotting it, let's plot the values in order and see how they trend.
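As a sketch of the boxplot route mentioned above, base R's boxplot.stats() returns the points that a boxplot would draw beyond its whiskers (by default, more than 1.5 times the box length away from the box). The vector below is hypothetical.

```r
# Hypothetical values with one obvious extreme point
x <- c(6, 7, 8, 8, 9, 10, 100)

stats    <- boxplot.stats(x)  # same statistics boxplot() uses
outliers <- stats$out         # points beyond the whiskers

print(outliers)
```

The same call could be applied to data4 to list any flagged points without having to read them off a picture.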

plot(data4,type = "l",main = "Values distribution", ylab = "Values", xlab = "Observation")

As per the plot, the data is moving sideways, with no clear upward or downward trend, and shows a great amount of noise and variation.

Now let's plot the histogram.

hist(data4,breaks = 10,col = "green", main = "Histogram of Values", xlab = "Values")

The histogram is not perfectly bell-shaped; it is tilted slightly towards the negative side, i.e. the data is negatively skewed.
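The bins behind a histogram can also be inspected as numbers instead of being judged by eye. The sketch below uses simulated data (a seeded rnorm() draw chosen only to resemble the scale of the summary above; it is not the sales data) and asks hist() for its bins without drawing anything.

```r
# Simulated stand-in data, NOT the sales data
set.seed(1)
x <- rnorm(1000, mean = 8, sd = 4)

# plot = FALSE returns the histogram object without drawing it
h <- hist(x, breaks = 10, plot = FALSE)

n_bins      <- length(h$counts)
total_count <- sum(h$counts)   # every observation falls in exactly one bin

print(h$breaks)
print(h$counts)
```

Note that a single number passed to breaks is treated by hist() as a suggestion, so the number of bins actually used may differ from 10.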

Now let's cross-check this using the psych package.

describe(data4)
##    vars    n mean   sd median trimmed  mad   min  max range  skew kurtosis
## X1    1 1000 8.05 4.09   8.17    8.11 4.11 -6.46 20.5 26.97 -0.15     0.04
##      se
## X1 0.13

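You can see that the data is slightly negatively skewed (skew = -0.15). The kurtosis of 0.04 is close to 0, the value expected for a normal distribution on the excess-kurtosis scale, so the tails are about as heavy as a normal distribution's and any outliers are negligible.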
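To show what these two numbers measure, here is a minimal sketch of the plain moment-based definitions of skewness and excess kurtosis in base R. The function name skew_kurt and the test vectors are hypothetical, and describe() may apply a slightly different small-sample adjustment, so its output need not match this to the last decimal.

```r
# Moment-based skewness and excess kurtosis (illustrative helper, not from psych)
skew_kurt <- function(x) {
  x  <- x[!is.na(x)]
  n  <- length(x)
  d  <- x - mean(x)      # deviations from the mean
  m2 <- sum(d^2) / n     # second central moment
  m3 <- sum(d^3) / n     # third central moment
  m4 <- sum(d^4) / n     # fourth central moment
  c(skew = m3 / m2^1.5, kurtosis = m4 / m2^2 - 3)
}

sym  <- c(1, 2, 3, 4, 5)    # symmetric: skew is 0
left <- c(1, 7, 8, 9, 10)   # long tail on the low side: negative skew

res_sym  <- skew_kurt(sym)
res_left <- skew_kurt(left)

print(res_sym)
print(res_left)
```

A negative skew means the longer tail is on the low side, and an excess kurtosis near 0 means the tails are roughly as heavy as those of a normal distribution.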
