Showing posts with label django. Show all posts

Cloning Scrapy scrapers easily with django-dynamic-scraper

Scraping requires some flexible tools.

The backstory


I have a Django project which uses django-dynamic-scraper, a Scrapy wrapper, to gather data from various websites. I found myself creating a series of nearly identical scrapers to get data from different parts of these sites. This was rather inefficient, since each scraper had a dozen fields or so and there is no built-in solution for copying a scraper.

I explored form autofill solutions before realizing I was overthinking the problem. All I needed to do was create a small script taking advantage of Django model access. This could probably be worked into manage.py, but I just wanted something to run in a hurry.

The script


clone-scraper.py:

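import sys
from dynamic_scraper.models import Scraper, ScraperElem

TEMPLATE_NAME = 'My Template'

if len(sys.argv) < 2:
    sys.exit('Usage: python clone-scraper.py "New Scraper Name"')

# Load the template scraper, then clear its primary key so that
# save() inserts a new row instead of updating the template.
s = Scraper.objects.get(name=TEMPLATE_NAME)
s.name = sys.argv[1]
s.id = None
s.pk = None
s.save()

# The template row is untouched in the database, so its elements
# can still be looked up by the template's name.
se = ScraperElem.objects.filter(scraper__name=TEMPLATE_NAME)

# Copy each element the same way, pointing it at the new scraper.
for item in se:
    item.scraper = s
    item.id = None
    item.pk = None
    item.save()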

Usage


python clone-scraper.py "New Scraper Name"

I run this script from within the virtualenv where Django lives. That gives it access to the models in my apps. If you're not sure which Python executable you're using, run "which python" to find out. You want the one from the virtualenv.
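
If you'd rather ask Python itself, here is a rough sketch of the same check. The helper name in_virtualenv is my own invention, not part of Django or virtualenv. It relies on two conventions: the classic virtualenv tool sets sys.real_prefix, while the built-in venv module makes sys.base_prefix differ from sys.prefix.

```python
import sys


def in_virtualenv():
    """Best-effort guess at whether this interpreter runs inside a virtualenv.

    in_virtualenv is a hypothetical helper name used for illustration.
    """
    # Older virtualenv releases set sys.real_prefix on the interpreter.
    if hasattr(sys, "real_prefix"):
        return True
    # The venv module (and newer virtualenv) leave sys.base_prefix pointing
    # at the system install while sys.prefix points at the environment.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return base_prefix != sys.prefix


# Same information as "which python", plus the virtualenv guess.
print(sys.executable)
print(in_virtualenv())
```

If the second line prints False, you are probably running the system Python and the import of dynamic_scraper is likely to fail.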


How it works


The script first imports the django-dynamic-scraper models. Then it gets the "template" Scraper to be copied from, which has already been created. This working copy of the Scraper has its name set to the first command line argument. It also has its keys reset, so that saving it creates a new Scraper with a new ID instead of overwriting the template.

Then the ScraperElem objects belonging to the template are fetched. Each one is pointed at the new scraper and has its keys reset in the same way. Finally, those ScraperElem objects are saved, which creates new rows for the copy and leaves the template's elements untouched.

This is tested on Django 1.5 and 1.5.2, but the technique should work with older versions.

Homework


Add a second argument for template name.
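
If you want a head start, here is one possible sketch of the argument handling using the standard library's argparse. The function name parse_args and the argument names new_name and template_name are my own choices for illustration; the Django part of the script would stay as it is and read the two values from the result.

```python
import argparse


def parse_args(argv=None):
    """Parse the new scraper name and an optional template name.

    parse_args, new_name and template_name are hypothetical names
    chosen for this sketch.
    """
    parser = argparse.ArgumentParser(
        description="Clone a django-dynamic-scraper Scraper.")
    parser.add_argument("new_name", help="name for the copied scraper")
    # nargs="?" makes the second argument optional, so the old
    # one-argument usage keeps working.
    parser.add_argument("template_name", nargs="?", default="My Template",
                        help="name of the scraper to copy from")
    return parser.parse_args(argv)


# Passing a list instead of reading sys.argv makes this easy to try out.
args = parse_args(["New Scraper Name", "Other Template"])
print(args.new_name)
print(args.template_name)
```

In the real script you would call parse_args() with no arguments, so that it reads the command line, and use args.template_name wherever 'My Template' is hard-coded.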

Checking Python style with pep8

A solid foundation in Python Style



If you've been coding with the Python language for a while, you have probably heard of PEP 8, the authoritative Python coding style guide co-authored by Guido van Rossum. While PEP 8 makes for a thrilling and inspiring read, something like a cross between Uncle John's Bathroom Reader and the US Constitution, it never occurred to me that there was a living application of the principles embodied in its chapters.

Today I learned that 'pep8' is a Python app! If you have pip installed, you can just 'pip install pep8'. If you don't have pip installed, install pip first with 'sudo apt-get install python-pip'. Pip is like apt-get, but for Python.

Pay attention, because this is the cool part. Now that you have pep8 you can use it to check your actual code against the actual PEP 8! It's like Guido himself was standing next to your desk, calmly eating cantaloupe and critiquing your code for Pythonicity.

Here is the actual output I got from pep8 when I ran it for the first time on my Django project's settings.py:

$ pep8 myproject/settings.py
myproject/settings.py:6:80: E501 line too long (82 > 79 characters)
myproject/settings.py:110:1: E122 continuation line missing indentation or outdented
myproject/settings.py:133:1: W293 blank line contains whitespace
myproject/settings.py:138:1: W293 blank line contains whitespace
myproject/settings.py:143:13: W291 trailing whitespace
myproject/settings.py:156:80: E501 line too long (90 > 79 characters)
myproject/settings.py:130:1: E122 continuation line missing indentation or outdented
myproject/settings.py:151:14: E261 at least two spaces before inline comment
myproject/settings.py:152:22: E261 at least two spaces before inline comment
myproject/settings.py:153:16: E261 at least two spaces before inline comment
myproject/settings.py:231:1: E303 too many blank lines (3)

Ouch, that's a lot of errors! These are all blunders in Python style that don't actually prevent the code from running when it goes through the interpreter.
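
To get a feel for what a few of those codes mean, here is a toy sketch that flags three of them: E501, W291 and W293. This is not how pep8 is implemented; the function name check_lines and the tuple format are my own assumptions, and only the codes and messages are borrowed from the output above.

```python
def check_lines(source, max_length=79):
    """Return (line_number, code, message) tuples for a few style problems.

    A toy illustration only; check_lines is a hypothetical name and this
    covers just three of the many checks pep8 performs.
    """
    problems = []
    for number, line in enumerate(source.splitlines(), start=1):
        # E501: the line is longer than the limit PEP 8 recommends.
        if len(line) > max_length:
            problems.append((number, "E501",
                             "line too long (%d > %d characters)"
                             % (len(line), max_length)))
        stripped = line.rstrip()
        if line != stripped:
            if stripped:
                # W291: real content followed by stray spaces or tabs.
                problems.append((number, "W291", "trailing whitespace"))
            else:
                # W293: a line made up of nothing but whitespace.
                problems.append((number, "W293",
                                 "blank line contains whitespace"))
    return problems


sample = "x = 1   \n    \ny = 2\n"
for number, code, message in check_lines(sample):
    print("%d: %s %s" % (number, code, message))
```

The real tool also reports the column and covers far more cases, such as the E122, E261 and E303 entries in my output.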

I was able to clear up all the errors, after which pep8 ran with no output:

$ pep8 myproject/settings.py

Joy!