Scraping content with readability and Python

This is only going to be a short one. If you want to scrape only the main content of a page (ignoring the menu, sidebar and so on), you can use readability-lxml, the Python port of the readability library.


We'll fetch each page using Kenneth Reitz's requests library, then extract the content with readability. I'll be using Redis to store the results, simply because I didn't want the extra overhead of installing an RDBMS for this project.

Installing libraries

    pip install redis requests readability-lxml

First we'll create a generator that yields the URLs we want to fetch from a text file. We'll only yield a URL if we don't already have its content stored.

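    import redis

    r = redis.StrictRedis()

    def urls():
        # Yield each URL from urls.txt whose content isn't already in Redis
        with open('urls.txt', 'r') as url_file:
            for line in url_file:
                # Strip the trailing newline so the hash key matches the real URL
                url = line.strip()
                if url and not r.hexists('url-content', url):
                    yield url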
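
The filtering step can be tried without a running Redis server. The sketch below is a minimal, dependency-free version of the same idea, assuming a plain Python set stands in for the `url-content` hash. The names `pending_urls` and `already_stored` are hypothetical and not part of the original script.

```python
def pending_urls(lines, already_stored):
    """Yield cleaned URLs from `lines` that are not in `already_stored`.

    `lines` is any iterable of strings (an open file works), and
    `already_stored` is a hypothetical stand-in for the Redis hash:
    anything that supports the `in` operator, such as a set.
    """
    for line in lines:
        # Lines read from a file end with a newline; strip it so lookups match
        url = line.strip()
        if url and url not in already_stored:
            yield url


# Simulated contents of urls.txt, including a blank line
sample_lines = [
    'http://example.com/a\n',
    '\n',
    'http://example.com/b\n',
]
# Pretend the content for /a was fetched on a previous run
stored = {'http://example.com/a'}

remaining = list(pending_urls(sample_lines, stored))
print(remaining)
```

Because this is a generator, nothing is read or checked until the caller starts iterating, so a long URL list is handled one line at a time.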

Now we can iterate over this, fetching each page, running it through readability and then storing the result in Redis. It's as simple as it sounds.

Final Code

    import redis
    import requests
    from readability.readability import Document

    r = redis.StrictRedis()

    def urls():
        # Yield each URL from urls.txt whose content isn't already in Redis
        with open('urls.txt', 'r') as url_file:
            for line in url_file:
                # Strip the trailing newline so the hash key matches the real URL
                url = line.strip()
                if url and not r.hexists('url-content', url):
                    yield url

    if __name__ == '__main__':
        for url in urls():
            try:
                response = requests.get(url, timeout=10)

                # Only store the content if the page load was successful
                if response.ok:
                    page_content = Document(response.content).summary()
                    r.hset('url-content', url, page_content)
            except Exception:
                # Report the failure and carry on with the next URL
                print('Error processing URL: %s' % url)

        print('Processed all URLs')

And there you have it: scraping and storing the content of URLs loaded from a text file, all in about 25 lines of Python.
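
Once the script has run, the extracted HTML can be read back from the same hash. The following is a rough sketch rather than part of the original script: `stored_pages` and `summarize` are hypothetical helpers, and the Redis client is passed in as an argument so the code only relies on redis-py's `hscan_iter` method.

```python
def stored_pages(client, key='url-content'):
    """Yield (url, html) pairs saved under `key`.

    `client` is assumed to be a redis-py client (for example
    redis.StrictRedis()); only its hscan_iter() method is used.
    """
    for url, html in client.hscan_iter(key):
        # redis-py returns bytes on Python 3 unless the client was
        # created with decode_responses=True, so convert to text here
        if isinstance(url, bytes):
            url = url.decode('utf-8')
        if isinstance(html, bytes):
            html = html.decode('utf-8')
        yield url, html


def summarize(client):
    """Return one line per stored page: the URL and the HTML length."""
    return ['%s (%d characters)' % (url, len(html))
            for url, html in stored_pages(client)]
```

With a Redis server running, passing `redis.StrictRedis()` to `summarize` and printing each returned line would list every URL scraped so far.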

By Jake Austwick

21-year-old self-taught programmer living in San Diego, California. Proficient in Python and Ruby, and has dabbled in Go. Main interests are web scraping and web applications.
