Find HTML Tags using BeautifulSoup

In this tutorial, we will learn how to search for any tag using the BeautifulSoup module. We suggest going through the previous tutorials first: the basic introduction to the BeautifulSoup module, and the tutorial covering its most useful methods.

We have already learned different ways to traverse the HTML tree, such as parent, parents, next_sibling and previous_sibling. But it is difficult to find all the similar tags using those alone. So now we will learn how to find any particular HTML tag using the find and find_all methods of the BeautifulSoup module.

If you are coming from the last tutorial, we will be using the same HTML code. If you are new here, please create a file named sample_webpage.html and copy the following HTML code into it:

<!DOCTYPE html>
<html>
    
    <head>
        <title> Sample HTML Page</title>
        <style>
            * {
                margin: 0;
                padding: 0;
            }

            div {
                width: 95%;
                height: 75px;
                margin: 10px 2.5%;
                border: 1px dotted grey;
                text-align: center;
            }
              
            p {
                font-family: sans-serif;
                font-size: 18px;
                color: #000;
                line-height: 75px;
            }

            a {
                position: relative;
                top: 25px;
            }
        </style>
    </head>
    
    <body>
        <div id="first-div">
            <p class="first">First Paragraph</p>
        </div>

        <div id="second-div">
            <p class="second">Second Paragraph</p>
        </div>

        <div id="third-div">
            <a href="https://www.studytonight.com">Studytonight</a>
            <p class="third">Third Paragraph</p>        
        </div>

        <div id="fourth-div">
            <p class="fourth">Fourth Paragraph</p>        
        </div>

        <div id="fifth-div">
            <p class="fifth">Fifth Paragraph</p>        
        </div>
    </body>
</html>

To read the content of the above HTML file, use the following Python code to store the content in a variable:

## reading content from the file
with open("sample_webpage.html") as html_file:
    html = html_file.read()

Once we have read the file, we create the BeautifulSoup object:

import bs4

## reading content from the file
with open("sample_webpage.html") as html_file:
    html = html_file.read()
    
## creating a BeautifulSoup object
soup = bs4.BeautifulSoup(html, "html.parser")

And the process of web scraping begins...
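Before searching, it can help to confirm that the soup object was parsed as expected. The following is a minimal sketch, assuming an abbreviated inline copy of the sample page (the html string here is a stand-in for the file, so that the snippet runs on its own):

```python
import bs4

# Abbreviated stand-in for sample_webpage.html (assumption: same structure)
html = """<html>
<head><title> Sample HTML Page</title></head>
<body>
    <div id="first-div"><p class="first">First Paragraph</p></div>
</body>
</html>"""

soup = bs4.BeautifulSoup(html, "html.parser")

# the title text has a leading space in the sample page, so strip it
page_title = soup.title.string.strip()

# soup.body.div is the first div tag found inside body
first_div_id = soup.body.div["id"]

print(page_title)
print(first_div_id)
```

If both values print correctly, the file was read and parsed, and the soup object is ready to be searched.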


BeautifulSoup: find_all method

The find_all method is used to find all the tags that match our search, for example by providing the name of the tag as an argument to the method. It returns a list (a bs4 ResultSet, which behaves like a normal Python list) containing all the HTML elements that are found. Following is the syntax:

find_all(name, attrs, recursive, string, limit, **kwargs)

We will cover the most commonly used parameters of the find_all method one by one. Let's start with the name parameter.


find_all: name Parameter

Let's find all the p tags from the HTML code:

import bs4

## reading content from the file
with open("sample_webpage.html") as html_file:
    html = html_file.read()
    
## creating a BeautifulSoup object
soup = bs4.BeautifulSoup(html, "html.parser")

## finding all p tags
p_tags = soup.find_all("p")     

print(p_tags)

print("\n-----Class Names Of All Paragraphs-----\n")

for tag in p_tags:
    print(tag['class'][0])
    
print("\n-----Content Of All Paragraphs-----\n")

for tag in p_tags:
    print(tag.text)

[<p class="first">First Paragraph</p>, <p class="second">Second Paragraph</p>, <p class="third">Third Paragraph</p>, <p class="fourth">Fourth Paragraph</p>, <p class="fifth">Fifth Paragraph</p>]

-----Class Names Of All Paragraphs-----

first
second
third
fourth
fifth

-----Content Of All Paragraphs-----

First Paragraph
Second Paragraph
Third Paragraph
Fourth Paragraph
Fifth Paragraph

As you can see, we can not only find the tags, but also access all the information related to those tags, such as their attributes and text.
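The name parameter is not limited to a single tag name; it also accepts a list of names. The sketch below assumes an abbreviated inline copy of the sample page, and shows why tag.get("class") can be safer than tag['class'] when some of the matched tags have no class attribute:

```python
import bs4

# Abbreviated stand-in for sample_webpage.html (assumption: same structure)
html = """<body>
    <div id="first-div"><p class="first">First Paragraph</p></div>
    <div id="second-div"><p class="second">Second Paragraph</p></div>
    <div id="third-div">
        <a href="https://www.studytonight.com">Studytonight</a>
        <p class="third">Third Paragraph</p>
    </div>
</body>"""

soup = bs4.BeautifulSoup(html, "html.parser")

# passing a list matches any of the given tag names, in document order
a_and_p_tags = soup.find_all(["a", "p"])
names = [tag.name for tag in a_and_p_tags]

# tag['class'] would raise KeyError for the a tag, which has no class;
# tag.get('class') returns None instead
classes = [tag.get("class") for tag in a_and_p_tags]

print(names)
print(classes)
```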


find_all: attribute Parameter

Let's find all the tags in the HTML code whose class attribute is equal to third (this code comes after we have created the soup object in the above code snippet):

## finding using class name
third_class_tags = soup.find_all(class_="third")

print(third_class_tags)

print("----------")

for tag in third_class_tags:
    print(tag.name)

[<p class="third">Third Paragraph</p>]
----------
p

Note the trailing underscore (_) in class_. Because class is a reserved keyword in Python, it cannot be used directly as an argument name, so you must follow this syntax.
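Other attributes can be searched in the same way as class. The following sketch, again assuming an abbreviated inline copy of the sample page, shows three forms: a keyword argument such as id, the attrs dictionary, and the value True to match any tag that has a given attribute at all:

```python
import bs4

# Abbreviated stand-in for sample_webpage.html (assumption: same structure)
html = """<body>
    <div id="first-div"><p class="first">First Paragraph</p></div>
    <div id="third-div">
        <a href="https://www.studytonight.com">Studytonight</a>
        <p class="third">Third Paragraph</p>
    </div>
</body>"""

soup = bs4.BeautifulSoup(html, "html.parser")

# any attribute can be given as a keyword argument
by_id = soup.find_all(id="third-div")

# the attrs dictionary is an alternative, combined here with a tag name
by_attrs = soup.find_all("p", attrs={"class": "third"})

# True matches every tag that has the attribute, whatever its value
with_href = soup.find_all(href=True)

print(by_id[0].name)
print(by_attrs[0].text)
print(with_href[0]["href"])
```

The attrs dictionary is handy for attribute names that are not valid Python argument names, for example those containing a hyphen.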


find_all: Tags containing any string

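We can use the find_all method to find all the strings in the HTML that contain a given text, using its string argument. A plain string passed to this argument only matches text that is exactly equal to it, so to match a part of the text, we have used the re module of Python in the code example below to create a regular expression. Note that the result contains the matching strings, not the tags around them.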

## finding tags using a string

## importing regular expression module to find all the strings
import re

## defining an re variable which contains "Paragraph" text
s = re.compile("Paragraph")

## finding all the content of the tags which contains "Paragraph"
tags_containing_paragraph = soup.find_all(string=s)

print(tags_containing_paragraph)

['First Paragraph', 'Second Paragraph', 'Third Paragraph', 'Fourth Paragraph', 'Fifth Paragraph']

When writing the above code in your own script, keep the import re statement at the top, along with the import bs4 statement.
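To see how the string argument behaves with a plain string versus a regular expression, consider the sketch below. It assumes an abbreviated inline copy of the sample page. It also shows that giving a tag name together with string returns the tags themselves rather than their text:

```python
import re

import bs4

# Abbreviated stand-in for sample_webpage.html (assumption: same structure)
html = """<body>
    <div id="first-div"><p class="first">First Paragraph</p></div>
    <div id="second-div"><p class="second">Second Paragraph</p></div>
    <div id="third-div">
        <a href="https://www.studytonight.com">Studytonight</a>
        <p class="third">Third Paragraph</p>
    </div>
</body>"""

soup = bs4.BeautifulSoup(html, "html.parser")

# a plain string must equal the whole text, so nothing matches here
partial_text = soup.find_all(string="Paragraph")

# the complete text of the first paragraph does match
full_text = soup.find_all(string="First Paragraph")

# a tag name plus string returns the p tags whose text matches the pattern
p_tags = soup.find_all("p", string=re.compile("Paragraph"))

print(partial_text)
print(full_text)
print([tag["class"][0] for tag in p_tags])
```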


find_all: limit Parameter

The limit parameter is used to limit the result set. When a limit is provided, the find_all method returns at most that many tags and stops searching once the limit is reached; any other qualifying tags are not included in the list returned.

## finding the first p tag using limit parameter
first_p_tag = soup.find_all("p", limit=1)

print(first_p_tag)

[<p class="first">First Paragraph</p>]

You can use multiple parameters together, like we did in this example with the name and limit parameters.
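As a further illustration of combining parameters, the sketch below uses limit=2 and also the recursive parameter from the find_all syntax. With recursive=False, only the direct children of a tag are searched. The inline HTML is an assumed, abbreviated copy of the sample page:

```python
import bs4

# Abbreviated stand-in for sample_webpage.html (assumption: same structure)
html = """<body>
    <div id="first-div"><p class="first">First Paragraph</p></div>
    <div id="second-div"><p class="second">Second Paragraph</p></div>
    <div id="third-div"><p class="third">Third Paragraph</p></div>
</body>"""

soup = bs4.BeautifulSoup(html, "html.parser")

# at most two p tags are returned, in document order
first_two = soup.find_all("p", limit=2)

# recursive=False searches only the direct children of body
top_level_divs = soup.body.find_all("div", recursive=False)

# the p tags sit inside the divs, so none is a direct child of body
top_level_ps = soup.body.find_all("p", recursive=False)

print([tag["class"][0] for tag in first_two])
print([tag["id"] for tag in top_level_divs])
print(top_level_ps)
```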


BeautifulSoup: find method

The find method is used to find the first matching tag. It is similar to passing limit=1 to the find_all method, except that find returns the tag itself instead of a list, and returns None when no tag matches.

Let's take an example:

p_tag = soup.find("p")

print(p_tag)
print("----------")
print(p_tag.text)

<p class="first">First Paragraph</p>
----------
First Paragraph


One more example:

a_tag = soup.find("a")

print(a_tag)
print("----------")
print(a_tag.text)
print("\n")
print(a_tag['href'])

<a href="https://www.studytonight.com">Studytonight</a>
----------
Studytonight


https://www.studytonight.com
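Since find returns None when nothing matches, it is worth checking the result before using it. The sketch below assumes an abbreviated inline copy of the sample page, and a span tag that is deliberately absent from it. It also shows that find accepts the same filters as find_all:

```python
import bs4

# Abbreviated stand-in for sample_webpage.html (assumption: same structure)
html = """<body>
    <div id="first-div"><p class="first">First Paragraph</p></div>
    <div id="third-div">
        <a href="https://www.studytonight.com">Studytonight</a>
        <p class="third">Third Paragraph</p>
    </div>
</body>"""

soup = bs4.BeautifulSoup(html, "html.parser")

# there is no span tag in the page, so find returns None
span_tag = soup.find("span")
if span_tag is None:
    print("No span tag found")
else:
    print(span_tag.text)

# find accepts the same filters as find_all
third_p = soup.find("p", class_="third")
print(third_p.text)

# find gives the same tag as the single item in find_all(..., limit=1)
same_tag = soup.find("p") is soup.find_all("p", limit=1)[0]
print(same_tag)
```

Without the None check, an expression like span_tag.text would raise an AttributeError when the tag is missing.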


And with that, we have learned how to search for tags using the BeautifulSoup module. We have covered the most important and useful methods, but there are many more. If you want to dig deeper, check the BeautifulSoup documentation.

In the next tutorial we will scrape a website.