WEB SCRAPING USING BEAUTIFUL SOUP

Introduction to requests Module

The requests module is used to send an HTTP request to a server and receive the HTTP response back. So we will use the requests module to send an HTTP request using a website's URL and get the response in return, and then use the Beautiful Soup module to extract the useful data/information/content of the website from that response.

So let's learn how to send an HTTP request and receive the response from the server using the requests module.
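If you don't have the requests module installed yet, you can install it using pip:

pip install requests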


Some Useful requests Module Methods

Following are some of the commonly used methods available in the requests module for making HTTP requests.

  1. requests.get()
  2. requests.post()
  3. requests.put()
  4. requests.delete()
  5. requests.head()
  6. requests.options()

In this tutorial, we will be using requests.get() and requests.post() methods to make HTTP requests for web scraping.

If you are new to HTTP requests and wondering what GET and POST requests are, here is a simple explanation:

  1. GET: It is used to retrieve information (for example, a webpage) from a URL.
  2. POST: It is used to send information to a URL.

Making a Request using requests.get

The requests.get(URL) method is used to send an HTTP GET request and receive the data back as a response. It takes the URL of a website or an API endpoint.

response.content is an attribute of the Response object returned by the get() method, in which the raw content of the response is stored.

Let's take an example,

## import requests module
import requests

## send a request and receive the information from https://www.google.com
response = requests.get("https://www.google.com")

## printing the response
print(response.content)

The output for the above script would be the entire page source (or source code) for the specified URL, which is too long to include here.

You must be wondering how we can read anything from this, as it looks too complicated. Well, to make the response content readable we use the Beautiful Soup module, which we will cover in detail in the coming tutorials.
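As a quick preview (assuming Beautiful Soup is installed, e.g. via pip install beautifulsoup4), here is how the raw response content can be turned into something readable:

## import requests module and BeautifulSoup from the bs4 module
import requests
from bs4 import BeautifulSoup

## send a request and receive the information from https://www.google.com
response = requests.get("https://www.google.com")

## parse the raw HTML using Python's built-in parser
soup = BeautifulSoup(response.content, 'html.parser')

## print just the page title instead of the entire source
print(soup.title.string)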

We can print the header information sent by the website in the response using the response.headers attribute.

For newbies, the header information contains general metadata about the HTTP response along with some connection properties.

Let's print headers for the above get request:

## import requests module
import requests

## send a request and receive the information from https://www.google.com
response = requests.get("https://www.google.com")

## headers of the website
print(response.headers)

{'Date': 'Wed, 07 Nov 2018 08:56:29 GMT', 'Expires': '-1', 'Cache-Control': 'private, max-age=0', 'Content-Type': 'text/html; charset=ISO-8859-1', 'P3P': 'CP="This is not a P3P policy! See g.co/p3phelp for more info."', 'Content-Encoding': 'gzip', 'Server': 'gws', 'X-XSS-Protection': '1; mode=block', 'X-Frame-Options': 'SAMEORIGIN', 'Set-Cookie': '1P_JAR=2018-11-07-08; expires=Fri, 07-Dec-2018 08:56:29 GMT; path=/; domain=.google.com, NID=144=cPBAw4RAx5TZoBZ3WtDNN54qgUt198oVTvdyWYx0iFIPo-MLX_qcQ8DjZXQNkO7WqRD4KOGnXShYh9TFmmZKtOZ0OoNBu-9Nlw50ocpoGMxvt9SNRZgXPUJgMv0D5A7URfeSV0BLihLp24UPNWhOQjMO5sbZNndc0Dvd3DHVR5s; expires=Thu, 09-May-2019 08:56:29 GMT; path=/; domain=.google.com; HttpOnly', 'Alt-Svc': 'quic=":443"; ma=2592000; v="44,43,39,35"', 'Transfer-Encoding': 'chunked'}

To print the values in a more readable format, we can access each key-value pair separately using the response.headers.items() method and then use a for loop to print each pair.

## import requests module
import requests

## send a request and receive the information from https://www.google.com
response = requests.get("https://www.google.com")

## headers of the website
for key, value in response.headers.items():
    print(key, '\t\t', value)

Date 		 Wed, 07 Nov 2018 08:56:29 GMT
Expires 		 -1
Cache-Control 		 private, max-age=0
Content-Type 		 text/html; charset=ISO-8859-1
P3P 		 CP="This is not a P3P policy! See g.co/p3phelp for more info."
Content-Encoding 		 gzip
Server 		 gws
X-XSS-Protection 		 1; mode=block
X-Frame-Options 		 SAMEORIGIN
Set-Cookie 		 1P_JAR=2018-11-07-08; expires=Fri, 07-Dec-2018 08:56:29 GMT; path=/; domain=.google.com, NID=144=cPBAw4RAx5TZoBZ3WtDNN54qgUt198oVTvdyWYx0iFIPo-MLX_qcQ8DjZXQNkO7WqRD4KOGnXShYh9TFmmZKtOZ0OoNBu-9Nlw50ocpoGMxvt9SNRZgXPUJgMv0D5A7URfeSV0BLihLp24UPNWhOQjMO5sbZNndc0Dvd3DHVR5s; expires=Thu, 09-May-2019 08:56:29 GMT; path=/; domain=.google.com; HttpOnly
Alt-Svc 		 quic=":443"; ma=2592000; v="44,43,39,35"
Transfer-Encoding 		 chunked
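We can also read a single header directly, since response.headers behaves like a case-insensitive dictionary:

## import requests module
import requests

## send a request and receive the information from https://www.google.com
response = requests.get("https://www.google.com")

## access a single header; the lookup is case-insensitive
print(response.headers['Content-Type'])

## use get() to avoid a KeyError if a header is missing
print(response.headers.get('Server'))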


Status of Request

When we make a GET request using the requests.get() method, the request might fail, get redirected to some other URL, fail on the client side or the server side, or complete successfully.

To know the status of the request, we can check the status code of the response received.

This can be done using the response.status_code value. It's very simple,

## import requests module
import requests

## send a request and receive the information from https://www.google.com
response = requests.get("https://www.google.com")

## status of request
print(response.status_code)

200

Following are the different status values that you may get in the response:

Status Code 	 Description
1XX 		 Informational
2XX 		 Success
3XX 		 Redirection
4XX 		 Client Error
5XX 		 Server Error

For example, the 200 status code means success, whereas the 201 status code means created (returned when we send a request to create some resource), and so on.
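In a script we usually want to act on the status code rather than just print it. Here is a minimal sketch of two common approaches:

## import requests module
import requests

response = requests.get("https://www.google.com")

## proceed only if the request was successful (status code 200)
if response.status_code == 200:
    print("Request succeeded!")
else:
    print("Request failed with status code:", response.status_code)

## alternatively, raise_for_status() raises an exception for 4XX/5XX responses
response.raise_for_status()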

Just like a GET request, we can make a POST request using the requests.post(URL) method, and handling the response is the same.
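For example, a minimal POST request might look like this (https://httpbin.org/post is a public testing endpoint, used here just for illustration):

## import requests module
import requests

## data to be sent in the body of the POST request
data = {'username': 'test', 'score': 10}

## send a POST request; the response is handled just like a GET response
response = requests.post("https://httpbin.org/post", data=data)

print(response.status_code)
print(response.content)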

For web scraping we will mostly use GET requests.
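One more thing that often comes in handy with GET requests: query parameters can be passed through the params argument instead of building the URL by hand. A small sketch (the search URL below is just illustrative; some sites may block automated queries):

## import requests module
import requests

## query parameters are encoded into the URL as ?q=web+scraping
params = {'q': 'web scraping'}

response = requests.get("https://www.google.com/search", params=params)

## the final URL that was actually requested
print(response.url)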


Setting Up a User Agent

When we try to access websites using a program, some websites don't allow it for security reasons, as that makes a website susceptible to unnecessary requests generated by programs, which in extreme cases can even burden the website's server with a large number of requests.

To overcome this, we will use the fake_useragent module, which makes a request look to the server as if it was initiated by a user's browser and not by a program.

To install the module fake_useragent, run the following command:

pip install fake_useragent

Once it is installed, we can use it to generate a fake user request like this:

## import requests module and UserAgent from the fake_useragent module
import requests
from fake_useragent import UserAgent

## create an instance of the 'UserAgent' class
obj = UserAgent()

## create a dictionary with key 'user-agent' and value 'obj.chrome'
header = {'user-agent': obj.chrome}

## send request by passing 'header' to the 'headers' parameter in 'get' method
r = requests.get('https://google.com', headers=header)

print(r.content)

The output for this request will be the source code of the webpage https://google.com as if it was opened by a user using the Chrome browser.
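If you are curious what the faked user-agent actually looks like, you can print it. The exact string varies between runs and module versions; the one in the comment below is just an example:

## import UserAgent from the fake_useragent module
from fake_useragent import UserAgent

obj = UserAgent()

## prints a Chrome user-agent string, something like:
## Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/... Safari/537.36
print(obj.chrome)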

So now we know how to send an HTTP request to any URL and receive the response using the requests module. In the next tutorial we will learn how to extract really useful content from the HTTP response using the Beautiful Soup module.