
Python's mechanize Library

Generally, a user can view a website either in a browser or by fetching its source code with one of several tools; the Linux program wget is a popular choice. In Python, browsing the Internet from a script means retrieving and parsing a website's HTML source code. In this tutorial, we'll learn how to use the mechanize library for this purpose.

To use the mechanize library, download its tar.gz file from here. Extract the archive and install it by running python setup.py install

Mechanize's primary class, Browser, allows the manipulation of anything that can be manipulated inside a browser. Let's look at an example that prints the source code of a website using the mechanize library:

mech1.py

#!/usr/bin/env python
# Program to view source code using mechanize

import mechanize

def page_view(url):
    try:
        # create browser object
        browser = mechanize.Browser()

        #browser.set_handle_robots(False)
        # open the URL and read the response body
        page = browser.open(url)
        src_code = page.read()
        # print source code
        print(src_code)
    except Exception:
        # any failure while opening or reading the page ends up here
        print("Error in browsing...")

url = "http://www.syngress.com/"
page_view(url)

Output:

[Image: Mechanize Library]


Now, in the script mech1.py, change the URL to https://www.google.com. What do you see? "Error in browsing..." Let's analyse the error more closely. Remove the try and except statements from the above code and execute it again. It still doesn't work, but this time you will see the detailed error message:

[Image: error with robots.txt while using the mechanize library]
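
Instead of deleting the try and except statements, another option is to keep them and report the full traceback. The following is only a minimal sketch using Python's standard traceback module. The names run_and_report and failing_fetch are made up for illustration and are not part of mechanize; failing_fetch simply stands in for a browser.open(url) call that fails.

```python
import traceback


def run_and_report(action):
    """Call action(); if it raises, return the full traceback as text."""
    try:
        return action()
    except Exception:
        # format_exc() returns the same text Python would print
        # if the exception had not been caught
        return traceback.format_exc()


def failing_fetch():
    # Hypothetical stand-in for a browser.open(url) call that fails
    raise RuntimeError("simulated browsing failure")


report = run_and_report(failing_fetch)
print(report)
```

With this approach the script keeps running after a failure, but the exception type and message are no longer hidden behind a generic "Error in browsing..." line.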


As the error shows, the problem involves the robots.txt file. Do you know what a robots.txt file is? With this file, a website can tell search engines such as Google and Bing which pages they may or may not crawl. Hence, if you have a website and you don't want Google to crawl a particular webpage (perhaps one meant for internal use), you can specify that in the robots.txt file.
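
To make the idea concrete, here is a small sketch of how a well-behaved crawler might check robots.txt rules before fetching a page. It uses the urllib.robotparser module from the Python 3 standard library rather than mechanize, and the robots.txt content, the crawler name "MyCrawler" and the example.com URLs are all invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, made up for this example:
# every crawler is asked to stay out of /internal/
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /internal/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


public_ok = is_allowed(SAMPLE_ROBOTS_TXT, "MyCrawler",
                       "http://example.com/index.html")
internal_ok = is_allowed(SAMPLE_ROBOTS_TXT, "MyCrawler",
                         "http://example.com/internal/report.html")
print(public_ok, internal_ok)
```

Here the public page is allowed while the page under /internal/ is not, which mirrors the "internal usage" case described above.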

Now, back to the problem. The above error is raised because the website's robots.txt file disallows our browser from visiting its webpages, and mechanize honours that file by default. So, what should we do? We instruct our mechanize browser object to ignore the robots.txt file. To do that, simply uncomment the following line in mech1.py: browser.set_handle_robots(False)

Now, if you visit Google.com, you will see something like the following:

[Image: accessing google.com using the mechanize library]