Extending the requests Response class
Requests is a fantastic Python library, one of the most enjoyable I have used to this day. I use it daily for most of my scraping work.
Chances are you have some convenience functions that you use in all of your scraping projects, but so far you may have just been copying them around and passing your response objects into them as arguments. We can do better.
I'm going to show you how to add a few simple methods to the Response class, so that you can use this technique in your own projects with your own methods.
We'll start by defining a Response class with a few convenience methods. The important one is doc(): it caches the parsed HTML tree, so the other convenience methods don't re-parse the whole document on every call.
```python
import inspect

import requests
from lxml import html


class Response:
    def doc(self):
        # Parse the HTML once and cache the tree on the instance.
        if not hasattr(self, '_doc'):
            self._doc = html.fromstring(self.text)
        return self._doc

    def links(self):
        return self.doc().xpath('//a/@href')

    def images(self, filter_extensions=('jpg', 'jpeg', 'gif', 'png')):
        return [link for link in self.doc().xpath('//img/@src')
                if link.endswith(tuple(filter_extensions))]

    def title(self):
        title = self.doc().xpath('//title/text()')
        if title:
            return title[0].strip()
        return None


# Attach every function defined above to requests' own Response class.
for method_name, method in inspect.getmembers(Response, inspect.isfunction):
    setattr(requests.models.Response, method_name, method)
```

We're all done. You can now access these convenience methods on any response object, as in the following example:

```python
r = requests.get('http://imgur.com/')
print(r.title())
print(r.images(filter_extensions=['png']))
```
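If you'd like to check that the patch works without hitting the network, you can construct a Response by hand and set its `_content` attribute directly. Note that `_content` is an internal of requests, so treat this as a testing trick rather than a public API. A minimal, self-contained sketch (repeating the patch so the snippet runs on its own):

```python
import inspect

import requests
from lxml import html


# Repeat the patch from above so this snippet runs on its own.
class Helpers:
    def doc(self):
        if not hasattr(self, '_doc'):
            self._doc = html.fromstring(self.text)
        return self._doc

    def title(self):
        title = self.doc().xpath('//title/text()')
        return title[0].strip() if title else None

    def links(self):
        return self.doc().xpath('//a/@href')


for name, func in inspect.getmembers(Helpers, inspect.isfunction):
    setattr(requests.models.Response, name, func)

# Build a Response offline; normally requests fills _content in for you.
r = requests.models.Response()
r._content = (b'<html><head><title> Offline test </title></head>'
              b'<body><a href="/a">a</a><a href="/b">b</a></body></html>')

print(r.title())   # -> Offline test
print(r.links())   # -> ['/a', '/b']
```

This also makes the patched methods easy to exercise in a test suite, since no real HTTP request is involved.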
Now go ahead and make your response objects as powerful as you desire. If you're interested in other scraping-related hints and tips, check out my Python web scraping resource.
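As one example of rolling your own, here is a sketch of a method (the name `absolute_links` is my own, not part of the code above) that resolves every relative href against the response's URL, so the scraped links can be fetched directly:

```python
from urllib.parse import urljoin

import requests
from lxml import html


def absolute_links(self):
    """Return every <a href> resolved against the page's own URL."""
    tree = html.fromstring(self.text)
    return [urljoin(self.url, href) for href in tree.xpath('//a/@href')]

# Attach it directly; for a single function there's no need for the
# inspect-based loop used earlier.
requests.models.Response.absolute_links = absolute_links

# Demonstrate offline; _content and url are normally set by requests itself.
r = requests.models.Response()
r.url = 'http://example.com/articles/'
r._content = b'<a href="one.html">1</a><a href="/two.html">2</a>'
print(r.absolute_links())
# -> ['http://example.com/articles/one.html', 'http://example.com/two.html']
```

Assigning the function straight onto the class works because Python 3 functions act as descriptors, so `self` is bound automatically when the method is looked up on an instance.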