A Simple Approach for Data Wrangling and Data Exploration of Sales Value
Sangamesh K S
November 24, 2017
Introduction
This post walks through a short exploratory data analysis of a set of sales values. It covers three things:
Performing Data Wrangling and Exploratory Analysis of given Data.
Plotting a histogram of the Data.
Commenting on the Distribution of Data.
Now let me load the dataset and the required package into R. The dataset is stored in a data frame called data1.
library(psych)
After loading the data, let's look at its structure.
str(data1)
## 'data.frame': 1000 obs. of 1 variable:
## $ values: num -1.41 8.91 8.05 8.71 12.84 ...
The structure shows that data1 is a data frame with 1000 observations of a single column, values, which str() reports as num. Next, let's check whether the object as a whole is treated as numeric.
is.numeric(data1)
## [1] FALSE
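The result is FALSE because data1 is a data frame, which R stores as a list of columns, not because the values themselves are non-numeric. For the analysis it is easier to work with a plain numeric vector, so we extract the values with as.numeric(unlist()).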
data4<-as.numeric(unlist(data1))
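The reason is.numeric() returns FALSE for a data frame is easier to see in isolation. The following is a minimal sketch, assuming a small stand-in data frame called df (a hypothetical name, not the sales data) with a single numeric column named values:

```r
# Hypothetical stand-in for data1: one numeric column named "values"
df <- data.frame(values = c(-1.41, 8.91, 8.05, 8.71, 12.84))

is.numeric(df)          # FALSE: a data frame is a list of columns
is.numeric(df$values)   # TRUE: the column itself is already numeric

# Two ways to get a plain numeric vector out of a one-column data frame
v1 <- df$values
v2 <- as.numeric(unlist(df))   # as.numeric() also drops the names unlist() adds
identical(v1, v2)              # TRUE
```

For a data frame with a single column, taking the column directly with $ is probably the simpler of the two.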
The values are now in a numeric vector, data4. Next we check for missing values.
sum(is.na(data1$values))
## [1] 0
There are no missing values. Now let's look at the summary statistics.
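For comparison, here is a small sketch of what the same check looks like when values are missing. The vector x is made up for illustration and is not part of the sales data:

```r
# Made-up vector with two missing values
x <- c(5.2, NA, 7.9, 10.1, NA)

anyNA(x)                 # TRUE: at least one value is missing
sum(is.na(x))            # 2: number of missing values
which(is.na(x))          # positions of the missing values: 2 and 5

mean(x)                  # NA: most summaries propagate missing values
mean(x, na.rm = TRUE)    # mean of the non-missing values only
```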
summary(data4)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -6.461 5.373 8.172 8.046 10.896 20.504
In the summary, the main thing to look at is the gap between the mean and the median. If the two are far apart, the distribution is likely skewed or influenced by outliers. Here the mean (8.046) and the median (8.172) are close, so there is no sign of extreme outliers pulling the mean, though this check alone cannot rule them out.
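The mean versus median comparison works because a single extreme value shifts the mean much more than the median. A minimal sketch with made-up numbers (clean and with_outlier are hypothetical names):

```r
clean        <- c(6, 7, 8, 9, 10)   # made-up values, symmetric around 8
with_outlier <- c(clean, 100)       # same values plus one extreme point

mean(clean)            # 8
median(clean)          # 8

mean(with_outlier)     # 140 / 6, about 23.33: pulled up by the outlier
median(with_outlier)   # 8.5: barely moves
```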
There are various techniques for finding outliers, such as a boxplot, a frequency table, or a histogram. Here we will use a histogram. Before that, let's plot the values in order of observation and see how they trend.
plot(data4,type = "l",main = "Values distribution", ylab = "Values", xlab = "Observation")
As per the plot, the values move sideways around a roughly constant level, with no visible upward or downward trend and a large amount of noise and variation.
Now let's plot the histogram.
hist(data4,breaks = 10,col = "green", main = "Histogram of Values", xlab = "Values")
The histogram is not perfectly bell shaped. It leans slightly towards the negative side, with a somewhat longer left tail, which is known as negatively skewed data.
Now let's cross-check this using the psych package.
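The sign of the skewness can also be checked numerically. The sketch below uses a plain moment-based formula and made-up vectors. moment_skew is a hypothetical helper written for illustration, and packages may apply a small-sample correction, so their values can differ slightly from this one:

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2
moment_skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

left_tail  <- c(1, 7, 8, 9, 10)   # made-up: long tail to the left
right_tail <- c(1, 2, 3, 4, 10)   # made-up: long tail to the right

moment_skew(left_tail)    # negative
moment_skew(right_tail)   # positive
```

A value near zero points to a roughly symmetric distribution, and the sign shows which tail is longer.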
describe(data4)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1000 8.05 4.09 8.17 8.11 4.11 -6.46 20.5 26.97 -0.15 0.04
## se
## X1 0.13
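You can see that the data is slightly negatively skewed (skew = -0.15). The kurtosis of 0.04 is close to 0, the value expected for a normal distribution on the excess-kurtosis scale that describe() reports, so the tails are about as heavy as a normal distribution's and there is no sign of more than a negligible number of outliers.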
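Another way to look for outliers directly is the boxplot rule, which flags points lying more than 1.5 times the interquartile range beyond the quartiles. This is a sketch on made-up values, not the sales data, and the names x, q, fence, and flagged are hypothetical:

```r
x <- c(4, 6, 7, 8, 8, 9, 10, 12, 40)   # made-up values; 40 is extreme

q     <- quantile(x, c(0.25, 0.75))    # first and third quartiles
fence <- 1.5 * IQR(x)                  # width of the boxplot "fence"

# Points outside [Q1 - fence, Q3 + fence] are flagged
flagged <- x[x < q[1] - fence | x > q[2] + fence]
flagged                                # 40

# boxplot.stats() applies the same idea (using hinges instead of quantiles)
boxplot.stats(x)$out                   # 40
```

The same calls could be applied to data4 to list any flagged values.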