Scraping with Mechanize and BeautifulSoup

Swizec Teller, August 10, 2012

Scraping is one of those annoying little things that will never be solved for the general case. Sometimes you want to extract articles, other times you're looking for data in organized tables ... and sometimes it's all hidden behind a form with cross-site request forgery (csrf) protection.

And it's never actually organized. Even with the best of websites, I don't think I've ever encountered a scraping job that couldn't be described as "A small and simple general model with heaps upon piles of annoying little exceptions".

At best, scraping a bunch of data from a website is a somewhat fiddly job; at worst, you'll wish you'd done it manually and been done with it.

When you're done, you will lie awake at night, praying to the gods of the internets, hoping nobody sneezes at the HTML of that page.

Mechanize

You won't get away from the fiddliness, but there's a lot you can do to make the job more palatable.

For starters - ditch manually taking care of submitting forms, hauling cookies around, holding history, sending referrers, using a good user-agent, following redirects and so on and on.

Submitting a form usually goes like this:

Go to page
View source
Find form
Note the action url
Make a note of the field names
Make sure honeypot fields will be handled properly
Write a few lines of code to prepare data for submission
Submit to the correct url

Then you discover the website uses csrf protection and you have to make a script that will go to the form address, parse the form, find the csrf field, hold the proper cookie and so on ...

Pain in the ass.

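# Python 2; in Python 3 these functions live in urllib.request and urllib.parse
import urllib, urllib2

# build the POST by hand: url, encoded fields and every header the site expects
req = urllib2.Request("http://example.com/form/submit/url",
                      data=urllib.urlencode({'field1': 'value',
                                             'field2': 'value',
                                             'field3': 'value'}),
                      headers={'User-Agent': 'Mozilla something',
                               'Cookie': 'name=value; name2=value2'})
response = urllib2.urlopen(req)
# do something with response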
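
And that snippet still skips the csrf round trip. Here is a rough sketch of what that extra step tends to look like with nothing but the standard library, written for Python 3. The names hidden_fields and submit_with_csrf are made up for illustration, and the sketch assumes the token sits in a hidden input and that grabbing every hidden input on the page is good enough, which won't hold on pages with more than one form.

```python
from html.parser import HTMLParser
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener


class HiddenFieldParser(HTMLParser):
    """Collect the name/value pair of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        attrs = dict(attrs)
        if (attrs.get('type') or '').lower() == 'hidden' and attrs.get('name'):
            self.hidden[attrs['name']] = attrs.get('value') or ''


def hidden_fields(html):
    """Return a dict of hidden form fields found in the given HTML string."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.hidden


def submit_with_csrf(form_url, submit_url, fields):
    """Fetch the form, copy its hidden fields, then POST them with our values."""
    # One opener for both requests, so cookies set by the GET go back with the POST.
    opener = build_opener(HTTPCookieProcessor(CookieJar()))
    opener.addheaders = [('User-Agent', 'Mozilla something')]

    page = opener.open(form_url).read().decode('utf-8', 'replace')
    payload = hidden_fields(page)  # csrf token and any other hidden inputs
    payload.update(fields)         # our own values win on a name clash
    data = urlencode(payload).encode('ascii')
    return opener.open(Request(submit_url, data=data))
```

Calling submit_with_csrf('http://example.com/form/', 'http://example.com/form/submit/url', {'field1': 'value'}) would do the whole dance, and that is a lot of plumbing for one form.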

With Mechanize the process is much simpler:

Go to page
View source
Find form
Note an identifier for the form
Decide which fields you want to manipulate
Write some code

The beautiful thing is that Mechanize submits the form the way it found it, hidden fields included, and carries the cookies along for you. That takes care of csrf fields and most other popular ways of keeping bots from doing their dirty business all over a website.

import mechanize

# the browser object keeps cookies, history and headers between requests
browser = mechanize.Browser()

browser.open('http://example.com/form/')
browser.select_form(name='the_form')
browser['field1'] = 'value'
browser['field2'] = 'value'
browser['field3'] = 'value'
browser.submit()

# use browser to click on stuff
# or browser.response() to get the raw response

Now isn't that much easier and cleaner?

The cool thing about Mechanize is that it also lets you do a lot of browsing around. browser.links() gives you all the links on a page, browser.forms() all the forms and so on. You can even use browser.follow_link() to naturally walk around the whole website like a user might.

Which is very useful when you're handling websites that either don't want or don't expect bots.
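
To give a feel for that browsing API, here is a small sketch of a helper that lists what the current page offers. The name describe_page is hypothetical, and the helper only assumes that the object it receives behaves like a mechanize.Browser: links() yields objects with text and url, forms() yields forms with name, action and controls.

```python
def describe_page(browser):
    """Summarize the links and forms on the page the browser currently shows."""
    links = [(link.text, link.url) for link in browser.links()]
    forms = [{'name': form.name,
              'action': form.action,
              # unnamed controls (plain submit buttons, mostly) are skipped
              'fields': [control.name for control in form.controls
                         if control.name]}
             for form in browser.forms()]
    return {'links': links, 'forms': forms}


# Usage against a live site (needs mechanize installed and network access):
#   import mechanize
#   browser = mechanize.Browser()
#   browser.open('http://example.com/')
#   print(describe_page(browser))
#   browser.follow_link(text_regex=r'Next')  # 'Next' is a made-up link text
```

Dumping this once per page is usually enough to decide which form to select and which link to follow next.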

BeautifulSoup

Ok, now we can get to the page. But how do we get the data itself?

Unfortunately this is the fiddly part of the process and there isn't much you can do about that. Your best bet is using BeautifulSoup, which at least handles poorly written HTML without a big fuss. It will even make sure everything is unicode. Win!

At least BeautifulSoup makes browsing DOM a breeze:

from bs4 import BeautifulSoup

# name the parser explicitly so results don't depend on what's installed
soup = BeautifulSoup(browser.response().read(), 'html.parser')

body_tag = soup.body
all_paragraphs = soup.find_all('p')
logo_img = soup.find('header').find('div', id="logo").img

# and so on depending on what you need
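
For the "data in organized tables" case, the fiddly part usually ends up looking something like the sketch below. The HTML, the table id and the table_rows name are all invented for illustration; the point is the shape: a small general model (one dict per row) plus a guard for the rows that don't fit it.

```python
from bs4 import BeautifulSoup

# Made-up page fragment with the usual mess: stray whitespace and a spacer row.
HTML = """
<table id="prices">
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td> Apples </td><td>1.20</td></tr>
  <tr><td colspan="2">Special offers</td></tr>
  <tr><td>Pears</td><td>
      0.90 </td></tr>
</table>
"""


def table_rows(html, table_id):
    """Turn the rows of one table into dicts keyed by the header cells."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', id=table_id)
    headers = [th.get_text(strip=True) for th in table.find_all('th')]

    rows = []
    for tr in table.find_all('tr'):
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        if len(cells) != len(headers):
            continue  # header row, spacer row, or some other exception
        rows.append(dict(zip(headers, cells)))
    return rows


rows = table_rows(HTML, 'prices')
```

Here rows holds one dict for Apples and one for Pears, and the "Special offers" spacer is dropped. On a real site that guard is where the heaps of little exceptions pile up.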

That's it.

The next time you have to scrape some data off a website I suggest using Mechanize and BeautifulSoup. That way you can worry about the fiddly bits, not the infrastructure.

Filed under: BeautifulSoup, Clients, Cross-site request forgery, HTML, HTTP cookie, Uncategorized, User agent
