Scrape the Web: Strategies for programming websites that don't expect it

Summary

We'll discuss the basics of web scraping, and then dive into the details of different methods and where they are most applicable. You'll leave with an understanding of when to apply different tools, and you'll learn about automating a full web browser, a "heavy hammer" that I picked up on a project for the Electronic Frontier Foundation.

Description

Presented by Asheesh Laroia

Do you find yourself faced with websites that have data you need to extract? Would your life be simpler if you could programmatically input data into web applications, even those tuned to resist interaction by bots?

Year by year, the web is becoming a stronger force. Learn how to get the best of it.

We'll discuss the basics of web scraping, and then dive into the details of different methods and where they are most applicable. You'll leave with an understanding of when to apply different tools, and you'll learn about automating a full web browser, a "heavy hammer" that I picked up on a project for the Electronic Frontier Foundation.

Attendees should bring a laptop, if possible, to try the examples we discuss and optionally take notes. Code samples will be made available after class with no restrictions.

Intended Audience

Intermediate (or better) Python programmers, probably without extensive web testing experience

Class Outline

  • My motto: "The website is the API."
  • Choosing a parser: BeautifulSoup, lxml, HTMLParser, and html5lib.
  • Extracting information, even in the face of bad HTML: Regular expressions, BeautifulSoup, SAX, and XPath.
  • Automatic template reverse-engineering tools.
  • Submitting to forms.
  • Playing with XML-RPC.
  • DO NOT BECOME AN EVIL COMMENT SPAMMER.
  • Countermeasures, and circumventing them:
    • IP address limits
    • Hidden form fields
    • User-agent detection
    • JavaScript
    • CAPTCHAs
  • Plenty of full source code to working examples:
    • Submitting to forms for text-to-speech.
    • Downloading music from web stores.
    • Automating Firefox with Selenium RC to navigate a pure-JavaScript service.
  • Q&A and workshopping
  • Use your power for good, not evil.
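
To give a flavor of the "Submitting to forms", "Hidden form fields", and "User-agent detection" items above, here is a minimal sketch. It is not taken from the class materials. It uses only the standard library's HTMLParser, and the sample page, the URL, the field names, and the helper names (FormFieldCollector, build_form_request) are all hypothetical, chosen for illustration. The request is built but never sent, so the snippet runs without a network connection.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import Request


class FormFieldCollector(HTMLParser):
    """Collect name/value pairs from every named <input> tag, hidden ones included."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            # An input with no value attribute is treated as an empty string.
            self.fields[name] = attributes.get("value") or ""


def build_form_request(page_html, action_url, overrides, user_agent):
    """Return an unsent POST Request carrying the page's fields plus our overrides."""
    collector = FormFieldCollector()
    collector.feed(page_html)
    fields = dict(collector.fields)
    fields.update(overrides)  # our own values win over the page defaults
    body = urlencode(fields).encode("utf-8")
    return Request(action_url, data=body, headers={"User-Agent": user_agent})


# Hypothetical page and URL, for illustration only.
SAMPLE_PAGE = """
<form action="/speak" method="post">
  <input type="hidden" name="token" value="abc123">
  <input type="text" name="text">
  <input type="submit" value="Go">
</form>
"""

request = build_form_request(
    SAMPLE_PAGE,
    "http://example.com/speak",
    {"text": "hello world"},
    "Mozilla/5.0 (compatible; example-scraper)",
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

The point of the sketch is that a site which checks a hidden token or the browser's User-Agent can often be satisfied by echoing back what the page itself supplied. Passing the request to urllib.request.urlopen would actually submit it; a real site may need cookies or JavaScript on top of this, which is where heavier tools come in.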