Scraping requires some flexible tools.
The backstory
I have a Django project which uses django-dynamic-scraper, a Scrapy wrapper, to gather data from various websites. I found myself creating a series of nearly identical scrapers to get data from different parts of these sites. This was rather inefficient, since each scraper had a dozen or so fields and there is no built-in way to copy a scraper.
I explored form autofill solutions before realizing I was overthinking the problem. All I needed to do was create a small script taking advantage of Django model access. This could probably be worked into manage.py, but I just wanted something to run in a hurry.
The script
clone-scraper.py:
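import sys
from dynamic_scraper.models import Scraper, ScraperElem

# Load the template scraper and rename the in-memory copy.
s = Scraper.objects.get(name='My Template')
s.name = sys.argv[1]
# Clearing the keys makes save() insert a new row instead of updating the template.
s.id = None
s.pk = None
s.save()

# Copy each of the template's elements and attach the copy to the new scraper.
se = ScraperElem.objects.filter(scraper__name='My Template')
for item in se:
    item.scraper = s
    item.id = None
    item.pk = None
    item.save()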
Usage
python clone-scraper.py "New Scraper Name"
I run this script from within my virtualenv where Django lives. That gives it access to the models in my apps. If you're not sure which python executable you're using, run "which python" to find out. You want the one from the virtualenv.
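A standalone script also needs to tell Django which settings module to load, usually through the DJANGO_SETTINGS_MODULE environment variable. If that variable is not already exported in your shell, a minimal sketch like the following could go at the very top of the script, before the model import. The name "myproject.settings" is a placeholder I made up, so substitute your own project's settings module.

```python
import os

# Assumption: "myproject.settings" is a hypothetical module name.
# setdefault() leaves the variable alone if the shell already exported it.
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")
```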
How it works
The script first imports the django-dynamic-scraper models. It then fetches the "template" Scraper, which must already exist, and sets the name of this in-memory copy to the first command line argument. Clearing id and pk makes Django treat the object as unsaved, so save() inserts a new Scraper with a new ID instead of updating the template.
Then the ScraperElem objects belonging to the template are fetched. Each one is pointed at the new Scraper and has its own keys cleared, so saving it creates a copy attached to the new Scraper and leaves the template's elements untouched.
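The same steps can be wrapped in a function, which makes the copy logic easier to reuse. This is only a sketch: clone_scraper and its parameters are names I made up, and the model classes are passed in as arguments so the function can be tried out without a database. It assumes the models behave like Scraper and ScraperElem above, with a name field and a scraper foreign key.

```python
def clone_scraper(scraper_model, elem_model, template_name, new_name):
    """Copy the template scraper and its elements under a new name.

    Hypothetical helper: scraper_model and elem_model are expected to be
    Django model classes such as Scraper and ScraperElem.
    """
    # Fetch the template's elements up front, before anything is saved.
    elems = list(elem_model.objects.filter(scraper__name=template_name))

    # With the keys cleared, save() inserts a new row and the
    # template itself is left unchanged in the database.
    clone = scraper_model.objects.get(name=template_name)
    clone.name = new_name
    clone.id = None
    clone.pk = None
    clone.save()

    # Re-point each element at the clone and save it as a new row.
    for elem in elems:
        elem.scraper = clone
        elem.id = None
        elem.pk = None
        elem.save()
    return clone
```

In the script, this would be called as clone_scraper(Scraper, ScraperElem, 'My Template', sys.argv[1]).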
This is tested on Django 1.5 and 1.5.2, but the technique should work with older versions.
Homework
Add a second argument for template name.
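If you want a head start, here is one possible way to read both names from the command line using the standard library's argparse. It is a sketch, not the only approach: parse_args, new_name and template_name are names I chose for illustration, and the template name falls back to 'My Template' when the second argument is left out.

```python
import argparse


def parse_args(argv=None):
    """Parse the new scraper name and an optional template name."""
    parser = argparse.ArgumentParser(
        description="Clone a scraper from a template.")
    parser.add_argument("new_name", help="name for the cloned scraper")
    # nargs="?" makes the second positional argument optional.
    parser.add_argument("template_name", nargs="?", default="My Template",
                        help="name of the scraper to copy from")
    return parser.parse_args(argv)
```

The script would then use args = parse_args() and read args.new_name and args.template_name in place of sys.argv[1] and the hard-coded 'My Template'.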