The internet is a massive ocean of data. Most of this data is not well organized or available in a ready-to-use format such as a downloadable CSV dataset. If we want particular data from a website, we may need to employ web scraping.
Web scraping is the technique of using a program to grab the data we want from a website instead of manually copying and pasting it.
We send a request to the server hosting the page we specified.
We download and filter for the HTML elements of the page we specified.
Finally, we extract the text/content from the HTML elements.
As the steps above show, we fetch only what we specify, and that specification is expressed in code.
Some websites explicitly allow web scraping, some forbid it, and others do not state a position either way. It is good practice to check this before scraping, because scraping consumes the host website's server resources. For the same reason, we should keep the frequency of our requests considerate.
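One common way a site states its position is a robots.txt file. The sketch below is only an illustration of checking such rules with Python's standard urllib.robotparser module. The robots.txt content, the example.com URLs and the user-agent name are all made up for this example, and the rules are parsed from a string so the code runs without touching any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "my-tutorial-scraper"  # hypothetical scraper name

parser = RobotFileParser()
parser.parse(EXAMPLE_ROBOTS_TXT.splitlines())


def is_allowed(url):
    """Return True if the parsed rules permit our user agent to fetch url."""
    return parser.can_fetch(USER_AGENT, url)


def polite_delay(default_seconds=1.0):
    """Seconds to wait between requests: the site's Crawl-delay if given."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default_seconds


for url in ["https://example.com/python/", "https://example.com/private/data"]:
    if is_allowed(url):
        print("OK to fetch:", url)
    else:
        print("Skipping (disallowed):", url)

print("Wait between requests:", polite_delay(), "seconds")
```

In a real scraper we would call time.sleep(polite_delay()) between requests. For a live site, the parser's set_url and read methods can load the actual robots.txt instead of the inlined string.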
Before we can scrape a webpage, we need to download it. We download pages using the Python requests library. First, we send a GET request to the web server to download the contents of the webpage we need. The downloaded content is usually HTML. HyperText Markup Language (HTML) is the language in which most websites are written. You can familiarise yourself with it in our HTML course.
We can start by scraping a simple website:
    # Import the requests library
    import requests

    scrappedPage = requests.get('https://www.studytonight.com/python/')
    scrappedPage
You should expect the following output:
    <Response [200]>
The output is a Response object whose status_code property indicates that the page was downloaded successfully. 200 is the status code for a successful request.
The response object has another property, content, which holds the scraped page's HTML as bytes. Evaluating scrappedPage.content displays it.
Since this is the entire page, we should expect some heavy output. We can instead save it to a file or directly access just what we need from the page.
We are now going to save the HTML contents of our scraped page to a text file named output.txt in our program's directory.
    # We are going to initialize a file object
    """
    This is accomplished in the following format
    fileObject = open('filename', 'accessmode')
    In our case, the access mode 'w' means we are writing to the file
    """
    pageContent = open('output.txt', 'w', encoding='utf-8')
    # scrappedPage.content is binary (bytes), so we write scrappedPage.text,
    # which is the same content decoded to a string
    pageContent.write(scrappedPage.text)
    pageContent.close()
    # it is paramount to close the file because once it is open
    # for a particular access mode in our program it might
    # not be available for other access modes outside our program
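To make the binary-versus-string point concrete, here is a small offline sketch. The bytes value and the temporary file location are made up for illustration and stand in for real downloaded content. It shows that calling str() on bytes gives the printable representation (with the b'...' wrapper), while decode() gives the actual text, and that a with block closes the file for us.

```python
import os
import tempfile

# Hypothetical chunk of downloaded bytes, standing in for real page content.
raw_chunk = b"<p>Hello, scraper</p>"

as_repr = str(raw_chunk)             # representation of the bytes object
as_text = raw_chunk.decode("utf-8")  # the decoded characters

print(as_repr)  # b'<p>Hello, scraper</p>'
print(as_text)  # <p>Hello, scraper</p>

# Write the decoded text to a file in a temporary directory.
# The with block closes the file automatically, even if an error occurs.
out_path = os.path.join(tempfile.mkdtemp(), "output.txt")
with open(out_path, "w", encoding="utf-8") as page_file:
    page_file.write(as_text)

file_was_closed = page_file.closed

with open(out_path, encoding="utf-8") as page_file:
    saved = page_file.read()

print(saved == as_text)  # True
```

Using with is a matter of preference here. Calling close() explicitly works just as well, as long as the call is not forgotten.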
The output in our file output.txt is not in the most pleasing format, and this is where Beautiful Soup comes in to spice up our scraping process.
We use Beautiful Soup to parse the contents of an HTML document.
    """
    We will import the library and create an instance of the
    BeautifulSoup class to parse our document
    """
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(scrappedPage.content, 'html.parser')

    # We can print out the contents of our HTML document to a new file using
    # BeautifulSoup's prettify method and compare with our previous output
    pageContent2 = open('output2.txt', 'w', encoding="utf-8")
    # we include the encoding to avoid UnicodeEncodeError
    pageContent2.write(soup.prettify())  # prettify method formats the output
    pageContent2.close()
In our working directory, a new file output2.txt will be created. If we compare our initial output to this new one, it is clear which one is more legible and more closely resembles an HTML document. This is one of the conveniences that make BeautifulSoup pleasant to work with.
We can use the find_all method to display all the instances of a specific HTML tag on a page.
This method, however, returns a list, so we need to use list indexing or loop through it to display the text we need.
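As a small offline sketch of both approaches, the example below parses a made-up HTML snippet (not the real page) so that it runs without a network call. The variable names are chosen only for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, inlined so the sketch runs without a network call.
SAMPLE_HTML = """
<html><body>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
  <p>Third paragraph.</p>
</body></html>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns a list-like collection of every matching tag.
paragraphs = sample_soup.find_all("p")

# List indexing: the text of the first match only.
first_text = paragraphs[0].get_text()

# Looping: the text of every match.
all_texts = [p.get_text() for p in paragraphs]

print(first_text)  # First paragraph.
print(all_texts)   # ['First paragraph.', 'Second paragraph.', 'Third paragraph.']
```

Calling get_text() directly on the result of find_all does not work, because the list itself is not a tag. We have to pick an element or loop first.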
    # soup.find_all('p')  # returns a list of elements in our page referenced by the HTML tag 'p'
    """ Uncomment the above line to see the full list, it is a long list """
    soup.find_all('p')[0].get_text()  # Gets the text from the first instance of the tag
You should get the following output:
Classes and ids are used by CSS to decide which HTML elements to apply specific styles to. We can also use them to specify the particular elements we want to scrape.
scrapsfromtheloft.com is a cultural multilingual magazine featuring movie reviews and essays, stand-up comedy transcripts, interviews, great literature, history, and much more. We are going to scrape stand-up comedy transcripts.
We will need to explore (inspect) the page structure using Chrome DevTools. Other browsers have their equivalents, but in our case we'll use Chrome. Ctrl + Shift + I (F12 on Windows and Command + Option + I on macOS can also be used) will open the inspect element window.
By hovering over different sections of the page, we can tell their specific classes and IDs from the inspect element window, which in turn informs our next scraping move.
We find that the transcripts are in the div with class "post-content". With this information, we can then scrape all paragraphs within that div.
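The idea can be tried offline first. The sketch below uses a made-up page with a "post-content" div and an unrelated sidebar. The HTML and the variable names are assumptions for illustration only, and the real site's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical page: only the paragraphs inside "post-content" are wanted.
SAMPLE_PAGE = """
<html><body>
  <div class="sidebar"><p>Related posts</p></div>
  <div class="post-content">
    <p>Opening line of the transcript.</p>
    <p>Second line of the transcript.</p>
  </div>
</body></html>
"""

page_soup = BeautifulSoup(SAMPLE_PAGE, "html.parser")

# class_ (with a trailing underscore) is used because class is a Python keyword.
post = page_soup.find(class_="post-content")

# find returns None when nothing matches, so guard before searching inside it.
lines = [p.get_text() for p in post.find_all("p")] if post is not None else []

print(lines)
# ['Opening line of the transcript.', 'Second line of the transcript.']
```

Searching inside the matched div, rather than across the whole page, is what keeps the sidebar paragraph out of the result.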
We are now going to build our web scraper, which we can then use to compare comedians or for any other purpose.
    import requests
    from bs4 import BeautifulSoup

    # Scrapes transcript data from scrapsfromtheloft.com
    def url_to_transcript(url):
        '''Returns transcript data specifically from scrapsfromtheloft.com.'''
        page = requests.get(url).text
        # the "html5lib" parser must be installed separately
        soup = BeautifulSoup(page, "html5lib")
        # convert to text all paragraphs in the class (post-content)
        text = [p.text for p in soup.find(class_="post-content").find_all('p')]
        # Once done print the specific url and the "DONE" after it
        print(url + " DONE")
        return text

    # A list of URLs to transcripts in our scope
    urls = ['http://scrapsfromtheloft.com/2017/05/06/louis-ck-oh-my-god-full-transcript/',
            'http://scrapsfromtheloft.com/2017/04/11/dave-chappelle-age-spin-2017-full-transcript/',
            'http://scrapsfromtheloft.com/2018/03/15/ricky-gervais-humanity-transcript/',
            'http://scrapsfromtheloft.com/2017/08/07/bo-burnham-2013-full-transcript/',
            'http://scrapsfromtheloft.com/2017/05/24/bill-burr-im-sorry-feel-way-2014-full-transcript/',
            'http://scrapsfromtheloft.com/2017/04/21/jim-jefferies-bare-2014-full-transcript/',
            'http://scrapsfromtheloft.com/2017/08/02/john-mulaney-comeback-kid-2015-full-transcript/',
            'http://scrapsfromtheloft.com/2017/10/21/hasan-minhaj-homecoming-king-2017-full-transcript/',
            'http://scrapsfromtheloft.com/2017/09/19/ali-wong-baby-cobra-2016-full-transcript/',
            'http://scrapsfromtheloft.com/2017/08/03/anthony-jeselnik-thoughts-prayers-2015-full-transcript/',
            'http://scrapsfromtheloft.com/2018/03/03/mike-birbiglia-my-girlfriends-boyfriend-2013-full-transcript/',
            'http://scrapsfromtheloft.com/2017/08/19/joe-rogan-triggered-2016-full-transcript/',
            ]

    # Comedian names in order
    comedians = ['louis', 'dave', 'ricky', 'bo', 'bill', 'jim', 'john', 'hasan', 'ali', 'anthony', 'mike', 'joe']

    # Actually request transcripts
    transcripts = [url_to_transcript(u) for u in urls]
Pickling is when we convert a Python object hierarchy into a byte stream, and "unpickling" is the inverse operation, where we convert a byte stream (from a binary file or bytes-like object) back into an object hierarchy. This is generally known as serialization, the details of which are beyond the scope of our current task. However, we will see its use in storing and retrieving our scraped data.
    import pickle

    for i, c in enumerate(comedians):
        with open(c + ".txt", "wb") as file:
            pickle.dump(transcripts[i], file)
The code above names each file after its comedian and pickles that comedian's transcript into it.
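Retrieving the data is the mirror image of storing it. The following sketch shows a full round trip with pickle.dump and pickle.load. The names, the transcript lines and the temporary directory are placeholders invented for this example, so it runs without any scraped data.

```python
import os
import pickle
import tempfile

# Hypothetical stand-ins for the comedians and transcripts lists.
sample_names = ["alice", "bob"]
sample_transcripts = [["Hello.", "Good night."], ["One line only."]]

# A temporary directory keeps the example files out of the working directory.
work_dir = tempfile.mkdtemp()

# Store: one pickled file per name, opened in binary write mode ("wb").
for name, transcript in zip(sample_names, sample_transcripts):
    with open(os.path.join(work_dir, name + ".txt"), "wb") as file:
        pickle.dump(transcript, file)

# Retrieve: open in binary read mode ("rb") and unpickle into a dictionary.
loaded = {}
for name in sample_names:
    with open(os.path.join(work_dir, name + ".txt"), "rb") as file:
        loaded[name] = pickle.load(file)

print(loaded["alice"])  # ['Hello.', 'Good night.']
print(loaded["bob"])    # ['One line only.']
```

Note that pickled files are binary even when they carry a .txt extension, and that pickle.load should only be used on files we trust, since unpickling can execute arbitrary code.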
We should now have each comedian's transcript in a separate file in our working directory. We are free to perform any operation on this scraped data, but that is beyond the scope of this project.
Have fun Scraping ethically!