Thursday 17 August 2017

Text Analysis: An Easy Way to Do Web Scraping (Part 1)

The basic first step for a data scientist is importing data from sources such as SQL, Excel, or Apache Spark. But when the data lives on a website, you need something called web scraping (the programs that do it are also known as web crawlers or spiders).

What is web scraping?

Web scraping is a technique for extracting the desired information from a website and storing it in the desired database.

In this article, we will scrape an article from our own blog and examine the output using R.

Requirements for web scraping

To do web scraping, we require:

  1. The rvest package
  2. An HTML XPath or CSS selector

If you don't have the rvest package installed in R, there is no need to worry: install it using the code below.

 install.packages("rvest")

If you don't know how to use HTML XPath, CSS selectors can be a great help.
To know more about the installation process, visit here.
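As a minimal sketch of how the two selector styles relate, the snippet below picks the same node once with a CSS selector and once with XPath. The HTML is a hypothetical page inlined as a string (not taken from the blog) so that it runs without a network connection, and it uses read_html() from rvest to parse it.

```r
library(rvest)

# Hypothetical page, inlined as a string so the example runs offline.
page <- read_html('<html><body>
  <div class="post"><h3>Hello</h3><p>First paragraph.</p></div>
  <div class="sidebar"><p>Archive</p></div>
</body></html>')

# CSS selector: a <p> inside the element with class "post".
css_text <- page %>% html_node(".post p") %>% html_text()

# Equivalent XPath: it must be passed with the named argument xpath =,
# because the first unnamed selector argument is treated as CSS.
xpath_text <- page %>%
  html_node(xpath = "//div[@class='post']/p") %>%
  html_text()

identical(css_text, xpath_text)
```

For simple class or id lookups like this one, the CSS form is usually shorter to write.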

Process of Web Scraping


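library(rvest)

# Read and parse the page (read_html() replaces the deprecated html())
article <- read_html("desired URL")

# Select one node by CSS selector (or xpath = "...") and extract its text
article <- article %>%
  html_node("Xpath/CSS") %>% html_text()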

We create an object called "article" that holds the HTML document read from the URL. Then, using %>% (the pipe), we select the HTML node that matches the XPath or CSS selector and extract its content as text.
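To make each stage of the pipe visible, here is a small sketch that runs the same steps one at a time and inspects what each returns. The inline HTML and the "#main" selector are hypothetical stand-ins for a real URL and selector, chosen so the example runs offline.

```r
library(rvest)

# Hypothetical inline page standing in for "desired URL".
doc <- read_html(
  '<html><body><div id="main"><p>Some article text.</p></div></body></html>'
)
class(doc)    # the parsed page, an xml_document

# First element matching the CSS selector.
node <- html_node(doc, "#main")
class(node)   # a single xml_node

# Text content of that node.
txt <- html_text(node)
class(txt)    # a plain character vector
txt
```

The piped version does exactly this, only without naming the intermediate objects.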

Example:
article <- read_html("https://experimentswithdatascience.blogspot.in/")

article <- article %>%
  html_node(".column-center-inner") %>% html_text()


Output

> article
[1] "\n\n\n\n          \n        \nMonday, 14 August 2017\n\n          \n        \n\n\n\n\nAbout the writter\n\n\n\n\n\n\nAbout Sangamesh K S\n\n\nSangamesh is an MBA Finance with a non-technical background. He specializes in Technical Analysis, Statistics, Machine Learning and Data Science .\n\n\n\nPreviously, Sangamesh had applied his skills on Financial Markets to forecast price of Indian stocks and commodities. He is also a Technical Analysis Blogger who write articles for Experiments With Stocks.  \n\n\nSangamesh use R for Data Analysis. R is as potential tool as its competitor Python . As I know both language, R is more elegant and do contain specific libraries and more inclined to statistics. Python in another hand is easy and flexible.\nHere in this blog you will learn everything about data science form intermediate to advance level.\n\n\n\n\n\n\n\n\nPosted by\n\n\nSangamesh KS\n\n\n\n\nat\n04:25\n\n\n\n\n\nNo comments:\n    \n\n\n\n\n\n\n\n\n\n\nEmail ThisBlogThis!Share to Twi... <truncated>

Note: The scraped text is not clean, so we need data mining to sanitize the article and remove unwanted words.
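As a first step in that direction, the runs of newlines and spaces can be collapsed with base R alone. This is only a sketch of whitespace cleanup, not the full sanitizing the note refers to; the raw string is a shortened stand-in built from the output above, and the variable names are illustrative.

```r
# A short stand-in for the scraped text, with the same kind of clutter.
raw <- "\n\n\n   \nMonday, 14 August 2017\n\n\n\nAbout the writter\n\n\nPosted by\n\n\nSangamesh KS\n\n"

# Collapse every run of whitespace (spaces, tabs, newlines) into a single
# space, then trim the leading and trailing space.
clean <- trimws(gsub("\\s+", " ", raw))
clean
```

Removing unwanted words such as "Posted by" would still need a separate step.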
