Web scraping is a technique for extracting information from websites. It mostly focuses on transforming unstructured data on the web (HTML) into structured data.
We will use a Python library named BeautifulSoup for this purpose. Do not worry about it right now; we will have program examples in the next tutorial. Our web scraper program will use this library to parse the website's HTML and extract the data. We will also use the Requests library to open the URL, download the HTML, and pass it to BeautifulSoup.
NOTE: Many websites do not allow web scraping, and it might get you into legal trouble. Hence, we advise you to use this only for learning purposes and not to steal or copy data from websites.
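To install the required Python modules, follow the instructions below. Note that the current BeautifulSoup release is published on PyPI as beautifulsoup4, and the Requests package name is lowercase.
If a system-wide install needs administrator rights (typically on Linux or macOS):
$ sudo pip install beautifulsoup4
$ sudo pip install requests
Otherwise:
$ pip install beautifulsoup4
$ pip install requests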
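Once both modules are installed, the flow described earlier (Requests downloads the HTML, BeautifulSoup parses it) can be sketched roughly as follows. This is a minimal sketch, not a complete scraper: the helper names fetch_html, parse_html, and page_title are hypothetical and chosen only for illustration. BeautifulSoup is imported from the bs4 module, which is what the beautifulsoup4 package installs.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text


def parse_html(html):
    """Turn raw HTML into a searchable BeautifulSoup tree (hypothetical helper)."""
    # "html.parser" is Python's built-in parser, so nothing extra is needed.
    return BeautifulSoup(html, "html.parser")


def page_title(soup):
    """Return the text of the <title> tag, or None if the page has none."""
    if soup.title is None:
        return None
    return soup.title.get_text(strip=True)
```

For a page you are allowed to scrape, the pieces would be combined as page_title(parse_html(fetch_html(url))), where url is the address of that page.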
If you have ever visited a website and looked at its source code (Right Click → View Page Source), you must have seen lots of cluttered, hard-to-read markup there. Unless we turn it into something understandable and well structured, it is of no use. So to scrape data from a website (say we want to get the prices of all the products on a particular page of an e-commerce website), we first need to uniquely identify the HTML tags that hold the data. The question is how?
If you know the basics of HTML, you already know about HTML tags and attributes. This is the trick: we use HTML tags, attributes, or both to uniquely identify any piece of data on a website. Let's see an example.
Suppose we want to uniquely identify the price on a product page. The price 449.00 can be identified uniquely by the tag that wraps it:
<span id="priceblock_saleprice" class="a-size-medium a-color-price">
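As a rough illustration of how such a tag can be used, the sketch below parses a small HTML fragment and pulls out the price. The surrounding div and the fragment as a whole are hypothetical, made up for this example; only the span's id and class values come from the tag shown above.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment built around the <span> shown above.
html = """
<div id="price">
  <span id="priceblock_saleprice" class="a-size-medium a-color-price">449.00</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Look the element up by its id, which should be unique within a page.
by_id = soup.find("span", id="priceblock_saleprice")

# The same element can also be matched by one of its CSS classes.
by_class = soup.find("span", class_="a-color-price")

# Strip surrounding whitespace and convert the text to a number.
price = float(by_id.get_text(strip=True))
print(price)
```

Searching by id is usually the safer choice, since a class such as a-color-price may be shared by several elements on a real page.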
Now, let's say we want to collect this data and compare or store it alongside data gathered from other websites. This is where scraping comes into play: it can be used to mine data from multiple websites. The technique is widely used nowadays. Many websites like Trivago, which compare prices for the same product across different platforms, use the same technique.
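A price comparison of this kind might look roughly like the sketch below. Everything in it is an assumption made for illustration: the site names, the HTML fragments, and the selectors are hypothetical, and a real comparison site would download each page rather than use inline strings.

```python
from bs4 import BeautifulSoup

# Hypothetical pages: each site marks up its price differently.
pages = {
    "site_a": '<span id="priceblock_saleprice">449.00</span>',
    "site_b": '<div class="product"><b class="price">439.50</b></div>',
}

# One CSS selector per site says where that site keeps its price.
selectors = {
    "site_a": "span#priceblock_saleprice",
    "site_b": "div.product b.price",
}

prices = {}
for site, html in pages.items():
    soup = BeautifulSoup(html, "html.parser")
    element = soup.select_one(selectors[site])
    prices[site] = float(element.get_text(strip=True))

# Pick the site whose scraped price is lowest.
cheapest = min(prices, key=prices.get)
print(cheapest, prices[cheapest])
```

Keeping the selectors in a separate dictionary means a new site can be added without touching the loop, as long as its price can be reached with a single CSS selector.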