
How to Scrape Content from a Website using BeautifulSoup (Python)

Posted in Programming   LAST UPDATED: SEPTEMBER 10, 2021

The internet is a massive ocean of data. Most of this data is not well organized or available in a ready-to-use format like a downloadable CSV dataset. If we want particular data from a website, we may need to employ web scraping.

Web scraping is the technique of using a program to grab the data we want from a website, instead of manually copying and pasting it.

    How Web Scraping Works

• We send a request to the server hosting the page we specified.

• We download the page's HTML and filter for the elements we specified.

• Finally, we extract the text/content from those HTML elements.

As seen above, we only fetch what we have already specified, and this specification is done through code.

    Ethical Web Scraping

Some websites explicitly allow web scraping, some do not, and others do not declare their stance at all. It is good practice to take this into account, since scraping consumes server resources on the host website. We should therefore be considerate about how frequently we scrape a page.
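For example, here is a minimal sketch of a considerate scraper that checks the site's robots.txt with Python's built-in urllib.robotparser and pauses between requests (the example.com URLs below are placeholders, not real targets):

import time
from urllib import robotparser

import requests

# Check the site's robots.txt to see whether scraping a path is allowed
rp = robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')  # placeholder site
rp.read()

urls = ['https://example.com/page1', 'https://example.com/page2']  # placeholder pages

for url in urls:
    if rp.can_fetch('*', url):   # only fetch pages the site permits
        response = requests.get(url)
        # ... process the response here ...
        time.sleep(2)            # pause between requests to go easy on the server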

1. Using the Python Requests Library

Before we scrape a webpage, we need to download it first. We download pages using the Python requests library. First, we send a GET request to the web server to download the contents of the webpage we require. The downloaded content is usually in HTML format. HyperText Markup Language (HTML) is the language with which most websites are built. You can familiarise yourself with it through our HTML course.

    We can start by scraping a simple website:

# Import the requests library
import requests

# Send a GET request to download the page
scrapedPage = requests.get('https://www.studytonight.com/python/')
scrapedPage
    

You should expect the following output:


    <Response [200]>

The output is a response object, and its status_code property indicates that the page was downloaded successfully. 200 is the HTTP status code for a successful request.
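As a minimal sketch, we can branch on this property before going any further:

# Proceed only if the request succeeded
if scrapedPage.status_code == 200:
    print("Page downloaded successfully")
else:
    print("Request failed with status", scrapedPage.status_code)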

The response object has another property known as content, which holds the scraped page's HTML content.

scrapedPage.content
    

After the above line of code, you should see our scraped page's HTML content. Since we are scraping the entire page, we should expect a lot of output. In this case, we can save it to a file instead, or directly access what we need from the website.
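If we just want a quick look rather than the full dump, slicing the content keeps the output manageable:

# Print only the first 300 bytes of the downloaded HTML
print(scrapedPage.content[:300])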

    2. Saving to a File

We are now going to save the HTML contents of our scraped page to a text file named output.txt within our program's directory.

# We are going to initialize a file object

""" This is accomplished in the following format
fileObject = open('filename', 'accessmode')
In our case, the access mode 'w' means we are writing to the file """

pageContent = open('output.txt', 'w', encoding='utf-8')

# scrapedPage.content is raw bytes, so it needs to be
# decoded to a string before we can write it to a text file
pageContent.write(scrapedPage.content.decode('utf-8'))

# It is paramount to close the file: closing flushes the
# buffered data to disk and releases the file handle so
# it is available to other programs
pageContent.close()

The output in our file output.txt is not in the most pleasing format, and this is where Beautiful Soup comes in to spice up our scraping process.

    We use Beautiful Soup to parse the contents of an HTML document.

    3. Using BeautifulSoup

    """ we will import the library and create an instance 
        of the BeautifulSoup class to parse our document """
    
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(scrappedPage.content, 'html.parser')
    
    # We can print out the contents of our HTML document to a new file using BeautifulSoup's -
    # - prettify method and compare with our previous output
    
    pageContent2 = open('output2.txt','w',encoding="utf-8") # we include the encoding to avoid UnicodeEncodeError
    
    pageContent2.write(soup.prettify()) # prettify method formats the output
    pageContent2.close

In our working directory, a new file output2.txt will be created. If we compare our initial output to this new one, it is clear which one is more legible and bears a greater resemblance to a well-indented HTML document. This is one of the features that make BeautifulSoup pleasant to work with.
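To see what prettify does in isolation, here is a minimal sketch on a tiny hand-written document:

from bs4 import BeautifulSoup

# A one-line HTML snippet, parsed and then prettified
snippet = BeautifulSoup("<ul><li>one</li><li>two</li></ul>", "html.parser")
print(snippet.prettify())

Each tag lands on its own line, indented to match its depth in the document tree.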

4. Directly Accessing What We Need

We can use the find_all method to display all the instances of a specific HTML tag on a page.

This method, however, returns a list, so we will need to employ list indexing or loop through it to display the text we need.

# soup.find_all('p')  # returns a list of elements on our page referenced by the HTML tag 'p'
""" Uncomment the above line to see the full list; it is a long list """

soup.find_all('p')[0].get_text()  # gets the text from the first instance of the tag

    You should get the following output:


    'PROGRAMMING'
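To take the looping route mentioned above instead of indexing, a quick sketch:

# Print the text of every paragraph tag on the page
for paragraph in soup.find_all('p'):
    print(paragraph.get_text())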

Classes and ids are used by CSS to decide which HTML elements to apply specific styles to. We can also use them to pinpoint the particular elements we want to scrape.
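For instance, a minimal sketch using hypothetical class and id names (note the trailing underscore in class_, since class is a reserved word in Python):

# Find every element with a given CSS class (hypothetical class name)
soup.find_all('div', class_='article')

# Find the single element with a specific id (hypothetical id)
soup.find(id='main-navigation')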

scrapsfromtheloft.com is a multilingual cultural magazine featuring movie reviews and essays, stand-up comedy transcripts, interviews, great literature, history, and more. We are going to scrape its stand-up comedy transcripts.

We will need to explore (inspect) the page structure using Chrome DevTools. Other browsers have their equivalents, but in our case we'll use Chrome. Ctrl + Shift + I (F12 on Windows; Command + Option + I on macOS) will open the inspect element window as shown below.

By hovering over different sections of the page, we can tell their specific classes and IDs from the inspect element window, which in turn informs our next scraping move.

[Image: web scraping using BeautifulSoup Python]

We realize that the transcripts are in the div class "post-content". With this information, we can then scrape all paragraphs within that class.

We are now going to build our web scraper, which we can then use to compare comedians or for any other purpose.

import requests
from bs4 import BeautifulSoup

# Scrapes transcript data from scrapsfromtheloft.com
def url_to_transcript(url):
    '''Returns transcript data specifically from scrapsfromtheloft.com.'''

    page = requests.get(url).text
    # html5lib is a lenient third-party parser (pip install html5lib)
    soup = BeautifulSoup(page, "html5lib")
    # Convert to text all paragraphs in the class (post-content)
    text = [p.text for p in soup.find(class_="post-content").find_all('p')]
    # Once done, print the specific URL with "DONE" after it
    print(url + " DONE")
    return text
    
    # A list of URLs to transcripts in our scope
    urls = ['http://scrapsfromtheloft.com/2017/05/06/louis-ck-oh-my-god-full-transcript/',
            'http://scrapsfromtheloft.com/2017/04/11/dave-chappelle-age-spin-2017-full-transcript/',
            'http://scrapsfromtheloft.com/2018/03/15/ricky-gervais-humanity-transcript/',
            'http://scrapsfromtheloft.com/2017/08/07/bo-burnham-2013-full-transcript/',
            'http://scrapsfromtheloft.com/2017/05/24/bill-burr-im-sorry-feel-way-2014-full-transcript/',
            'http://scrapsfromtheloft.com/2017/04/21/jim-jefferies-bare-2014-full-transcript/',
            'http://scrapsfromtheloft.com/2017/08/02/john-mulaney-comeback-kid-2015-full-transcript/',
            'http://scrapsfromtheloft.com/2017/10/21/hasan-minhaj-homecoming-king-2017-full-transcript/',
            'http://scrapsfromtheloft.com/2017/09/19/ali-wong-baby-cobra-2016-full-transcript/',
            'http://scrapsfromtheloft.com/2017/08/03/anthony-jeselnik-thoughts-prayers-2015-full-transcript/',
            'http://scrapsfromtheloft.com/2018/03/03/mike-birbiglia-my-girlfriends-boyfriend-2013-full-transcript/',
            'http://scrapsfromtheloft.com/2017/08/19/joe-rogan-triggered-2016-full-transcript/',
           ]
    
    # Comedian names in order
    comedians = ['louis', 'dave', 'ricky', 'bo', 'bill', 'jim', 'john', 'hasan', 'ali', 'anthony', 'mike', 'joe']
    
    # Actually request transcripts
    transcripts = [url_to_transcript(u) for u in urls]
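As a quick sanity check once the requests finish, we can confirm that we got one transcript per comedian (a minimal sketch):

# We should have one transcript per comedian
print(len(transcripts))    # expected: 12
print(transcripts[0][:2])  # the first two paragraphs of the first transcript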

5. Using pickle to Store Our Data

Pickling is when we convert a Python object hierarchy into a byte stream, and "unpickling" is the inverse operation, where we convert a byte stream (from a binary file or bytes-like object) back into an object hierarchy. This is generally known as serialization, which is beyond the scope of our current task. However, we will see its use in storing and retrieving our scraped data.

import pickle

# Save each comedian's transcript to its own pickle file
for i, c in enumerate(comedians):
    with open(c + ".txt", "wb") as file:
        pickle.dump(transcripts[i], file)

The code above writes each transcript to a file named after the corresponding comedian and pickles the data as it goes.

We should now have each comedian's transcript saved separately in our working directory. With this scraped data we are free to perform any operation, but that is beyond the scope of this project.
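If we need the data back later, we simply unpickle it; a minimal sketch, assuming the files created above are in the working directory:

import pickle

# Load one comedian's transcript back into a Python list
with open("dave.txt", "rb") as file:
    transcript = pickle.load(file)

print(transcript[:2])  # the first two paragraphs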

Have fun scraping ethically!


    About the author:
    Fabian Omoke is a highly skilled technical author specializing in writing on the Python language. With a background in Computer Science and extensive experience in the field, Fabian is highly knowledgeable about Python and its applications.
Tags: BeautifulSoup, Python, Web Scraping