Swizec Teller - a geek with a hat | swizec.com

    Scraping with Mechanize and BeautifulSoup

    Scraping

    Scraping is one of those annoying little things that will never be solved for the general case. Sometimes you want to extract articles, other times you're looking for data in organized tables ... and sometimes it's all hidden behind a form with cross-site request forgery (CSRF) protection.

    And it's never actually organized. Even with the best of websites, I don't think I've ever encountered a scraping job that couldn't be described as "a small and simple general model with heaps upon piles of annoying little exceptions".

    At best, scraping a bunch of data from a website is a somewhat fiddly job; at worst, you'll wish you'd done it manually and been done with it.

    When you're done, you will lie awake at night, praying to the gods of the internets, hoping nobody sneezes at the HTML of that page.

    Mechanize

    You won't get away from the fiddliness, but there's a lot you can do to make the job more palatable.

    For starters - ditch manually taking care of submitting forms, hauling cookies around, holding history, sending referrers, using a good user-agent, following redirects and so on and on.

    Submitting a form usually goes like this:

    1. Go to page
    2. View source
    3. Find form
    4. Note the action url
    5. Make a note of the field names
    6. Make sure honeypot fields will be handled properly
    7. Write a few lines of code to prepare data for submission
    8. Submit to the correct url

    Then you discover the website uses CSRF protection, and you have to write a script that goes to the form's address, parses the form, finds the CSRF field, holds the proper cookie, and so on ...

    Pain in the ass.

    import urllib, urllib2

    req = urllib2.Request("http://example.com/form/submit/url",
                          data=urllib.urlencode({'field1': 'value',
                                                 'field2': 'value',
                                                 'field3': 'value'}),
                          headers={'User-Agent': 'Mozilla something',
                                   'Cookie': 'name=value; name2=value2'})
    response = urllib2.urlopen(req)
    # do something with response
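The manual CSRF dance adds yet another step on top of that. Roughly, it looks like this (a sketch; the `csrfmiddlewaretoken` field name and the form markup are made up for illustration):

```python
import re

# pretend this HTML came from fetching the form page first
# (a real script would also hold on to the session cookie that page set)
html = '<form action="/form/submit/url"><input type="hidden" name="csrfmiddlewaretoken" value="abc123" /></form>'

# dig the hidden token out of the form ...
token = re.search(r'name="csrfmiddlewaretoken"\s+value="([^"]+)"', html).group(1)

# ... and send it back along with your actual data, plus the cookie
payload = {'csrfmiddlewaretoken': token, 'field1': 'value'}
print(payload['csrfmiddlewaretoken'])
```

All of that bookkeeping just to submit one form.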

    With Mechanize the process is much simpler:

    1. Go to page
    2. View source
    3. Find form
    4. Note an identifier for the form
    5. Decide which fields you want to manipulate
    6. Write some code

    The beautiful thing is, Mechanize automatically handles CSRF fields and most other popular ways of preventing bots from doing their dirty business all over a website.

    import mechanize

    browser = mechanize.Browser()
    browser.open('http://example.com/form/')
    browser.select_form(name='the_form')
    browser['field1'] = 'value'
    browser['field2'] = 'value'
    browser['field3'] = 'value'
    browser.submit()
    # use browser to click on stuff
    # or browser.response() to get the raw response

    Now isn't that much easier and cleaner?

    The cool thing about Mechanize is that it also lets you do a lot of browsing around. browser.links() gives you all the links on a page, browser.forms() all the forms, and so on. You can even use browser.follow_link() to naturally walk around the whole website like a user might.

    Which is very useful when you're handling websites that either don't want or don't expect bots.

    BeautifulSoup

    Ok, now we can get to the pages that hold the data. But how do we extract the data itself?

    Unfortunately this is the fiddly part of the process and there isn't much you can do about that. Your best bet is BeautifulSoup, which at least lets you handle poorly written HTML without a big fuss. It will even make sure everything is unicode. Win!

    At least BeautifulSoup makes browsing the DOM a breeze:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(browser.response().read(), 'html.parser')
    body_tag = soup.body
    all_paragraphs = soup.find_all('p')
    logo_img = soup.find('header').find('div', id="logo").img
    # and so on, depending on what you need
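For instance, pulling rows out of one of those organized tables mentioned earlier goes something like this (a self-contained sketch; the table markup, its id, and the cell values are made up for illustration):

```python
from bs4 import BeautifulSoup

# stand-in for HTML you'd normally get from browser.response().read()
html = """
<table id="prices">
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Foo</td><td>10</td></tr>
  <tr><td>Bar</td><td>20</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
rows = []
# skip the header row, then collect the text of every cell
for tr in soup.find('table', id='prices').find_all('tr')[1:]:
    rows.append([td.get_text(strip=True) for td in tr.find_all('td')])

print(rows)  # [['Foo', '10'], ['Bar', '20']]
```

The find/find_all combination is where all those annoying little exceptions end up living, but at least the HTML parsing itself is solved for you.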

    That's it.

    The next time you have to scrape some data off a website I suggest using Mechanize and BeautifulSoup. That way you can worry about the fiddly bits, not the infrastructure.



    Published on August 10th, 2012 in BeautifulSoup, Clients, Cross-site request forgery, HTML, HTTP cookie, Uncategorized, User agent
