Scraping content with readability and Python
This is only going to be a short one. If you want to scrape just the main content of a page (ignoring the menus, sidebars and so on), you can use the Python port of the readability library.
We'll fetch each page using Kenneth Reitz's requests library, then extract the content with readability. I'll be using Redis to store the results, simply because I didn't want the extra overhead of installing an RDBMS for this project.
pip install redis requests readability-lxml
First we'll create a generator that yields the URLs we want to fetch from a text file. We'll only yield a URL if we don't already have its content stored, though.
```python
import redis

r = redis.StrictRedis()

def urls():
    for url in open('urls.txt', 'r'):
        url = url.strip()  # drop the trailing newline, or hexists will never match
        if not r.hexists('url-content', url):
            yield url
```
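To see what that `hexists` check buys us, here's a minimal sketch of the same skip-if-seen logic with a plain dict standing in for the Redis hash (the URLs and cached entry below are made up for illustration):

```python
# A dict standing in for the 'url-content' Redis hash, with one
# hypothetical cached entry already in it.
seen = {'http://example.com/a': '<p>cached</p>'}

def urls_from(lines, cache):
    """Yield each URL that isn't already a field in the cache."""
    for line in lines:
        url = line.strip()
        if url and url not in cache:  # mirrors r.hexists('url-content', url)
            yield url

todo = list(urls_from(['http://example.com/a\n', 'http://example.com/b\n'], seen))
print(todo)  # only the un-cached URL remains
```

Because the check happens in the generator, re-running the script is cheap: already-scraped URLs never even reach the fetch step.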
Now we can iterate over this, fetching the content, running it through readability and then storing the results in redis. It's as simple as it sounds.
```python
import redis
import requests
from readability.readability import Document

r = redis.StrictRedis()

def urls():
    for url in open('urls.txt', 'r'):
        url = url.strip()
        if not r.hexists('url-content', url):
            yield url

if __name__ == '__main__':
    for url in urls():
        try:
            response = requests.get(url, timeout=10)
            # Only store the content if the page load was successful
            if response.ok:
                page_content = Document(response.content).summary()
                r.hset('url-content', url, page_content)
        except requests.exceptions.RequestException:
            print('Error processing URL: %s' % url)
    print('Processed all URLs')
```
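One robustness note: rather than a bare `except`, it's safer to catch requests' `RequestException` base class. All of the library's network-level errors (timeouts, connection failures, invalid URLs) derive from it, so genuine bugs in your own code still surface instead of being silently swallowed. A quick check:

```python
import requests

# Timeout and ConnectionError both derive from RequestException,
# so a single except clause covers the common network failures.
print(issubclass(requests.exceptions.Timeout, requests.exceptions.RequestException))
print(issubclass(requests.exceptions.ConnectionError, requests.exceptions.RequestException))
```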
And there you have it: scraping and storing the content of URLs loaded from a text file, all in under 25 lines of Python.