Generally, a user can view a website either in a browser or by retrieving its source code with one of a number of methods and tools; the Linux program wget is a popular choice. To open a website from Python, you retrieve and parse the website's HTML source code. In this tutorial, we'll learn how to use the mechanize library for this purpose.
To use the mechanize library, download its tar.gz file (it is available, for example, from the Python Package Index). Extract the tar file and install it using
python setup.py install
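Once the install finishes, a quick check can confirm that Python can find the package. This is a minimal sketch that assumes a Python 3 interpreter; the helper name is_installed is made up for illustration and is not part of mechanize.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name is on the import path."""
    # find_spec returns None when no such top-level module can be located.
    return importlib.util.find_spec(module_name) is not None


# Hypothetical usage: report whether the mechanize package is importable.
print("mechanize installed:", is_installed("mechanize"))
```

If this reports False, the install step above most likely ran against a different Python interpreter than the one you are using now.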
Mechanize's primary class, Browser, allows the manipulation of anything that can be manipulated inside a browser. Let's look at an example that views the source code of a website using the mechanize library:
#!/usr/bin/env python
# Program to view source code using mechanize
import mechanize


def page_view(url):
    try:
        # create browser object
        browser = mechanize.Browser()
        #browser.set_handle_robots(False)
        page = browser.open(url)
        src_code = page.read()
        # print source code
        print(src_code)
    except Exception:
        print("Error in browsing...")


url = "http://www.syngress.com/"
page_view(url)
Now, in the script above (mech1.py), change the url to https://www.google.com. What do you see? "Error in browsing..." Let's analyse the error closely. Remove the try and except statements from the code and execute it again. Oops! It still doesn't work, but this time you will see the detailed error: a traceback ending in an HTTP 403 error stating that the request was disallowed by robots.txt.
As we can see in the error, it has something to do with the robots.txt file. Do you know what a robots.txt file is? Using this file, a website can tell search engine crawlers such as Google's and Bing's which webpages they may or may not crawl. Hence, if you have a website and you don't want Google to crawl a particular webpage (it might be for internal use), you can specify that in the robots.txt file.
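To make the idea concrete, here is a minimal sketch of how a crawler might consult robots.txt rules. It uses the standard library's urllib.robotparser (Python 3) rather than mechanize, and the sample rules and example.com URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every crawler is asked to stay out of /internal/.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /internal/
"""

parser = RobotFileParser()
# parse() accepts the file's lines, so no network request is needed here.
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# Ask whether a generic crawler ("*") may fetch each URL under these rules.
internal_allowed = parser.can_fetch("*", "http://example.com/internal/report.html")
public_allowed = parser.can_fetch("*", "http://example.com/index.html")

print("internal page allowed:", internal_allowed)
print("public page allowed:", public_allowed)
```

Under these sample rules the internal page is reported as not allowed and the public page as allowed, which is the same kind of decision a well-behaved crawler makes before requesting a page.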
Now, coming back to the problem. The above error is raised because mechanize honours robots.txt by default, and the website's robots.txt disallows our browser from visiting its webpages. So, what should we do? We instruct our browser object to ignore the website's robots.txt file. To do that, simply uncomment the following line in mech1.py:
browser.set_handle_robots(False)
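With that line active, the page-fetching logic might look like the sketch below. The function name fetch_source and its ignore_robots and browser parameters are assumptions added for illustration; only mechanize.Browser(), set_handle_robots(), open() and read() come from the script above. The optional browser argument exists only so the helper can be tried with a stand-in object.

```python
def fetch_source(url, ignore_robots=True, browser=None):
    """Return the raw source of `url` using a mechanize-style browser object."""
    if browser is None:
        # Imported lazily so the helper can also be exercised with a stand-in browser.
        import mechanize
        browser = mechanize.Browser()
    if ignore_robots:
        # The line that was commented out in mech1.py: skip robots.txt handling.
        browser.set_handle_robots(False)
    response = browser.open(url)
    return response.read()
```

Calling print(fetch_source("https://www.google.com")) would then behave like the modified mech1.py, assuming mechanize is installed and the network is reachable.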
Now, if you visit Google.com, the script prints the HTML source code of the page.