WEB SCRAPING USING BEAUTIFUL SOUP

Exploring BeautifulSoup Methods

In this tutorial we will look at several ways to access HTML tags using the navigation attributes and methods of the BeautifulSoup module. For a basic introduction to the BeautifulSoup module, start with the previous tutorial.


BeautifulSoup: Accessing HTML Tags

The features covered in this section let us move from one HTML tag to another by treating the HTML document as a tree.

Create a file sample_webpage.html and copy the following HTML code in it:

<!DOCTYPE html>
<html>
    
    <head>
        <title> Sample HTML Page</title>
        <style>
            * {
                margin: 0;
                padding: 0;
            }

            div {
                width: 95%;
                height: 75px;
                margin: 10px 2.5%;
                border: 1px dotted grey;
                text-align: center;
            }
              
            p {
                font-family: sans-serif;
                font-size: 18px;
                color: #000;
                line-height: 75px;
            }

            a {
                position: relative;
                top: 25px;
            }
        </style>
    </head>
    
    <body>
        <div id="first-div">
            <p class="first">First Paragraph</p>
        </div>

        <div id="second-div">
            <p class="second">Second Paragraph</p>
        </div>

        <div id="third-div">
            <a href="https://www.studytonight.com">Studytonight</a>
            <p class="third">Third Paragraph</p>        
        </div>

        <div id="fourth-div">
            <p class="fourth">Fourth Paragraph</p>        
        </div>

        <div id="fifth-div">
            <p class="fifth">Fifth Paragraph</p>        
        </div>
    </body>
</html>

Now, to read the content of the above HTML file, use the following Python code to store the content in a variable:

## reading content from the file
with open("sample_webpage.html") as html_file:
    html = html_file.read()

Now we will use different methods of the BeautifulSoup module and see how they work.

As a warm-up, let's start with the prettify() method, which returns the parsed document as an indented string.

import bs4

## reading content from the file
with open("sample_webpage.html") as html_file:
    html = html_file.read()
    
## creating a BeautifulSoup object
soup = bs4.BeautifulSoup(html, "html.parser")

## prettify is a method, so it must be called with parentheses
print(soup.prettify())

<!DOCTYPE html> <html> <head> <title> Sample HTML Page</title> <style> * { margin: 0; padding: 0; } div { width: 95%; height: 75px; margin: 10px 2.5%; border: 1px dotted grey; text-align: center; } p { font-family: sans-serif; font-size: 18px; color: #000; line-height: 75px; } a { position: relative; top: 25px; } </style> </head> <body> <div id="first-div"> <p class="first">First Paragraph</p> </div> <div id="second-div"> <p class="second">Second Paragraph</p> </div> <div id="third-div"> <a href="https://www.studytonight.com">Studytonight</a> <p class="third">Third Paragraph</p> </div> <div id="fourth-div"> <p class="fourth">Fourth Paragraph</p> </div> <div id="fifth-div"> <p class="fifth">Fifth Paragraph</p> </div> </body> </html>
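
The short sketch below illustrates why the parentheses matter when using prettify. The one-line HTML string and the variable names are assumptions made only for this illustration; they are not part of sample_webpage.html.

```python
import bs4

# Hypothetical one-line snippet, used only for this illustration.
html = "<div><p>Hi</p></div>"
soup = bs4.BeautifulSoup(html, "html.parser")

# Without parentheses we only get the method object, not the markup.
method_object = soup.prettify

# With parentheses the method runs and returns the markup as a string,
# with nested tags placed on their own indented lines.
pretty = soup.prettify()

print(callable(method_object))
print(pretty)
```

Printing method_object directly would show a description of the method rather than the HTML, which is an easy mistake to make.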


BeautifulSoup: Accessing HTML Tag Attributes

We can retrieve the attributes of any HTML tag using the following syntax:

TagName["AttributeName"]

Let's extract the href attribute from the anchor tag in our HTML code.

import bs4

## reading content from the file
with open("sample_webpage.html") as html_file:
    html = html_file.read()
    
## creating a BeautifulSoup object
soup = bs4.BeautifulSoup(html, "html.parser")

## getting anchor tag
link = soup.a

## printing the 'href' attribute of anchor tag
print(link["href"])

https://www.studytonight.com
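
As a small sketch of a few related ways to read attributes, the example below uses a hypothetical anchor tag (the class and the inline HTML string are assumptions for illustration). It relies on the square-bracket syntax shown above, plus the get() method and the attrs dictionary that every tag provides.

```python
import bs4

# Hypothetical anchor tag, used only for this illustration.
html = '<a class="nav external" href="https://www.studytonight.com">Studytonight</a>'
soup = bs4.BeautifulSoup(html, "html.parser")
link = soup.a

# Square brackets raise a KeyError if the attribute does not exist.
href = link["href"]

# get() returns None instead of raising when the attribute is missing.
title = link.get("title")

# attrs holds every attribute of the tag as a dictionary.
all_attrs = link.attrs

# 'class' can hold several values, so it comes back as a list.
classes = link["class"]

print(href)
print(title)
print(sorted(all_attrs))
print(classes)
```

Using get() is usually the safer choice when scraping pages where an attribute may or may not be present.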


BeautifulSoup: contents attribute

contents is an attribute that returns a list of all the direct children of a tag. The list includes child tags as well as the text between them, such as line breaks. Let's list all the children of the body tag using contents.

body = soup.body

## getting all the children of 'body' using 'contents'
content_list = body.contents

## printing every child, skipping whitespace-only text between tags
for tag in content_list:
    if str(tag).strip():
        print(tag)
        print("\n")

<div id="first-div"> <p class="first">First Paragraph</p> </div> <div id="second-div"> <p class="second">Second Paragraph</p> </div> <div id="third-div"> <a href="https://www.studytonight.com">Studytonight</a> <p class="third">Third Paragraph</p> </div> <div id="fourth-div"> <p class="fourth">Fourth Paragraph</p> </div> <div id="fifth-div"> <p class="fifth">Fifth Paragraph</p> </div>
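
To make the role of the whitespace entries concrete, here is a minimal sketch on a smaller, hypothetical body (the ids "a" and "b" are assumptions for illustration). It shows that contents is a plain list holding both tags and text, and one way to keep only the tags.

```python
import bs4
from bs4 import Tag

# Hypothetical markup with a line break between each tag.
html = "<body>\n<div id='a'>A</div>\n<div id='b'>B</div>\n</body>"
soup = bs4.BeautifulSoup(html, "html.parser")

# contents is a list of direct children: tags and text nodes alike.
children = soup.body.contents

# Keep only real tags, dropping the line breaks between them.
tag_children = [child for child in children if isinstance(child, Tag)]

print(len(children))
print([tag["id"] for tag in tag_children])
```

Checking the type with isinstance avoids depending on exactly how much whitespace sits between the tags.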


BeautifulSoup: children attribute

children is similar to contents, but it returns an iterator over the direct children, while contents returns them as a list. Let's see an example:

body = soup.body

## 'children' is an iterator; use list(body.children) if a list is needed
## whitespace-only text between tags is skipped
for tag in body.children:
    if str(tag).strip():
        print(tag)
        print("\n")

<div id="first-div"> <p class="first">First Paragraph</p> </div> <div id="second-div"> <p class="second">Second Paragraph</p> </div> <div id="third-div"> <a href="https://www.studytonight.com">Studytonight</a> <p class="third">Third Paragraph</p> </div> <div id="fourth-div"> <p class="fourth">Fourth Paragraph</p> </div> <div id="fifth-div"> <p class="fifth">Fifth Paragraph</p> </div>
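
The difference between the iterator and the list can be sketched as follows. The list markup is a hypothetical example chosen to keep the output short.

```python
import bs4

# Hypothetical markup, used only for this illustration.
html = "<ul><li>one</li><li>two</li></ul>"
soup = bs4.BeautifulSoup(html, "html.parser")
ul = soup.ul

# children hands out one child at a time.
kids = ul.children
first = next(kids)

# Whatever has not been consumed yet can still be collected.
remaining = list(kids)

# Converting a fresh iterator to a list gives the same items as contents.
same_items = list(ul.children) == ul.contents

print(first)
print(remaining)
print(same_items)
```

An iterator is handy when only the first few children are needed, while contents is more convenient for indexing or counting.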


BeautifulSoup: descendants attribute

descendants retrieves every node nested inside a tag, not just its direct children. This is the difference from contents and children, which stop at the first level. descendants walks the tree depth first: applied to the body tag, it yields the first div tag, then the child tags of that div and their text, and only then moves on to the next div tag, and so on.

It returns a generator. Let's see an example:

body = soup.body

## walking every descendant of 'body' (tags and text) using 'descendants'
## whitespace-only text between tags is skipped
for tag in body.descendants:
    if str(tag).strip():
        print(tag)
        print("\n")

<div id="first-div"> <p class="first">First Paragraph</p> </div> <p class="first">First Paragraph</p> First Paragraph <div id="second-div"> <p class="second">Second Paragraph</p> </div> <p class="second">Second Paragraph</p> Second Paragraph <div id="third-div"> <a href="https://www.studytonight.com">Studytonight</a> <p class="third">Third Paragraph</p> </div> <a href="https://www.studytonight.com">Studytonight</a> Studytonight <p class="third">Third Paragraph</p> Third Paragraph <div id="fourth-div"> <p class="fourth">Fourth Paragraph</p> </div> <p class="fourth">Fourth Paragraph</p> Fourth Paragraph <div id="fifth-div"> <p class="fifth">Fifth Paragraph</p> </div> <p class="fifth">Fifth Paragraph</p> Fifth Paragraph

As you can see in the output above the descendants method keeps entering inside the tag it reads until it reaches the end, and then it moves onto the next HTML tag.
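
A compact way to compare the two levels of traversal is sketched below. It uses a cut-down, single-div version of the sample page written on one line, which is an assumption made to keep the node counts easy to follow.

```python
import bs4
from bs4 import Tag

# Cut-down, single-line version of the sample page (illustration only).
html = '<body><div id="first-div"><p class="first">First Paragraph</p></div></body>'
soup = bs4.BeautifulSoup(html, "html.parser")
body = soup.body

# Direct children only: just the div.
direct = body.contents

# Every nested node: the div, the p inside it, and the text inside the p.
everything = list(body.descendants)

# Tag names in the order descendants visits them.
tag_names = [node.name for node in everything if isinstance(node, Tag)]

print(len(direct))
print(len(everything))
print(tag_names)
```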


BeautifulSoup: parent attribute

parent returns the parent tag of a given tag. Let's see an example:

body = soup.body

## getting parent of 'body'
body_parent = body.parent

## use the 'name' attribute to print only the tag name of the parent
print(body_parent.name)

html
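
The sketch below follows parent a little further, using a shortened single-line version of the sample page (an assumption for illustration). It shows that the parent of the html tag is the BeautifulSoup object itself, and that the BeautifulSoup object has no parent.

```python
import bs4

# Shortened, single-line version of the sample page (illustration only).
html = (
    '<html><body><div id="third-div">'
    '<a href="https://www.studytonight.com">Studytonight</a>'
    "</div></body></html>"
)
soup = bs4.BeautifulSoup(html, "html.parser")
link = soup.a

# The anchor sits directly inside the div.
parent_name = link.parent.name
parent_id = link.parent["id"]

# The html tag's parent is the BeautifulSoup object, named "[document]".
top_is_soup = soup.html.parent is soup

# The BeautifulSoup object is the root, so it has no parent.
root_parent = soup.parent

print(parent_name, parent_id)
print(top_is_soup, soup.name)
print(root_parent)
```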


BeautifulSoup: parents attribute

parents yields all the ancestors of a tag, from its immediate parent up to the document itself. It returns a generator. Let's see an example:

body = soup.body

## getting parents of 'body'
body_parents = body.parents

## prints the name of every ancestor, starting with the nearest one
for parent in body_parents:
    print(parent.name)
    print("\n")

html [document]
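
As an illustrative sketch, the code below collects the ancestor names of a deeply nested tag and checks that following parent repeatedly visits the same chain. The single-line HTML is a shortened, hypothetical version of the sample page.

```python
import bs4

# Shortened, single-line version of the sample page (illustration only).
html = (
    '<html><body><div id="third-div">'
    '<a href="https://www.studytonight.com">Studytonight</a>'
    "</div></body></html>"
)
soup = bs4.BeautifulSoup(html, "html.parser")
link = soup.a

# parents walks upward, nearest ancestor first.
names = [ancestor.name for ancestor in link.parents]

# The same chain, built by following 'parent' one step at a time.
manual = []
node = link.parent
while node is not None:
    manual.append(node.name)
    node = node.parent

print(names)
print(manual == names)
```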


BeautifulSoup: next_sibling attribute

next_sibling returns the next node that shares the same parent as the given tag. Now let's print the sibling tag of the anchor tag in our HTML code:

anchor_tag = soup.a

print(anchor_tag)

## getting the third paragraph from the anchor tag
## 'next_sibling' is used twice because the whitespace (line break)
## between the two tags is itself a sibling node:
## anchor_tag.next_sibling gives that line break,
## and the node after it is the third paragraph
third_para = anchor_tag.next_sibling.next_sibling

print(third_para)

<a href="https://www.studytonight.com">Studytonight</a> <p class="third">Third Paragraph</p>
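
Chaining next_sibling twice works only when exactly one text node sits between the tags. One possible way around that is sketched below; next_tag_sibling is a hypothetical helper written for this illustration, not part of BeautifulSoup, and the HTML string is a simplified stand-in for the third div.

```python
import bs4
from bs4 import Tag

# Simplified stand-in for the third div (illustration only).
html = '<div>\n<a href="#">Link</a>\n<p class="third">Third Paragraph</p>\n</div>'
soup = bs4.BeautifulSoup(html, "html.parser")
anchor = soup.a


def next_tag_sibling(tag):
    """Hypothetical helper: return the next sibling that is a tag, or None."""
    for sibling in tag.next_siblings:
        if isinstance(sibling, Tag):
            return sibling
    return None


# The immediate sibling is the line break between the two tags.
raw = anchor.next_sibling

# The helper skips text nodes and lands on the paragraph.
para = next_tag_sibling(anchor)

print(repr(raw))
print(para)
print(next_tag_sibling(para))
```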


BeautifulSoup: previous_sibling attribute

previous_sibling works like next_sibling, but returns the previous node instead of the next one. Let's see an example (this continues the code snippet above):

## getting anchor tag from the third_para
print(third_para.previous_sibling.previous_sibling)

<a href="https://www.studytonight.com">Studytonight</a>


BeautifulSoup: next_siblings attribute

next_siblings returns a generator over all the nodes that follow the given tag under the same parent. Let's see an example (this continues the code snippet above):

## using anchor_tag variable here
a_siblings = anchor_tag.next_siblings

print(list(a_siblings))

['\n', <p class="third">Third Paragraph</p>, '\n']
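
Because next_siblings is a generator, it can be consumed only once. The sketch below demonstrates this on a hypothetical snippet written without whitespace between the tags, so that every sibling is a tag.

```python
import bs4

# Hypothetical markup with no whitespace between tags (illustration only).
html = "<div><a>1</a><p>2</p><span>3</span></div>"
soup = bs4.BeautifulSoup(html, "html.parser")
anchor = soup.a

# Keep a reference to one generator and read it twice.
siblings = anchor.next_siblings
first_pass = [tag.name for tag in siblings]
second_pass = list(siblings)  # already exhausted

# Accessing the attribute again creates a fresh generator.
fresh = [tag.name for tag in anchor.next_siblings]

print(first_pass)
print(second_pass)
print(fresh)
```

If the siblings are needed more than once, store them in a list as soon as they are read.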


BeautifulSoup: previous_siblings attribute

previous_siblings returns a generator over all the nodes that come before the given tag under the same parent, starting with the nearest one. Let's see an example (this continues the code snippet above):

## using third_para variable here
p_siblings = third_para.previous_siblings

print(list(p_siblings))

['\n', <a href="https://www.studytonight.com">Studytonight</a>, '\n']
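
The order in which previous_siblings yields nodes is easy to overlook, so here is a minimal sketch using a hypothetical three-item list. It shows that the nearest sibling comes first, and how to restore document order.

```python
import bs4

# Hypothetical three-item list (illustration only).
html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = bs4.BeautifulSoup(html, "html.parser")

# Start from the last list item.
last = soup.ul.contents[-1]

# previous_siblings walks backwards, nearest sibling first.
backwards = [li.get_text() for li in last.previous_siblings]

# Reverse the result to get the items in document order.
in_document_order = list(reversed(backwards))

print(backwards)
print(in_document_order)
```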


Now you are familiar with the main attributes used to move around an HTML document in web scraping. In the following tutorial, we will learn how to find a specific tag from a bunch of similar tags.