Scraping with Python3 and Scrapy

Scrapy is one of the most popular Python frameworks for large-scale web scraping. It gives you all the tools you need to efficiently extract data from websites, process it as you want, and store it in your preferred structure and format. It has good documentation and lots of “get started quick” tutorials all over the web.

In a previous article, we built a very simple web crawler for scraping using urllib. If we were to expand it with proper error handling, the ability to pause and resume, and reliable connections to websites, we would end up building a scraping framework similar to Scrapy.

Scrapy takes care of error handling and parallel processing, which is already much more than what our simple script did.

The first step is to install the Scrapy library (install guide):

pip install scrapy

The Scrapy Shell

Scrapy offers a shell that is useful to test the effect of a method or to explore a webpage. For instance:

> scrapy shell

In  [1]: fetch("https://medium.com/s/journalists-are-wrong-about-health/health-is-more-complicated-than-correlations-5436cee51b0c") 
Out [1]:

In  [2]: response.css("h1::text, h2::text, h3::text").extract()
Out [2]:
['Health Is More Complicated Than Correlations',
 'The Big Scary Study',
 'Ice Cream Doesn’t Cause Drownings',
 'Scary Studies And Correlations',
 'Observational Causation',
 'Spotting the Scary',
 'Gid M-K; Health Nerd']

When using the shell, you might want to define custom functions and reuse them across sessions. This can be achieved by putting them in a file myfunctions.py and then sourcing that file from the Scrapy shell using run myfunctions.py. After that, you can call your custom functions right away.
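For example, a hypothetical myfunctions.py (the helper name is purely illustrative) could wrap the CSS query used above:

# file: myfunctions.py (hypothetical example)

def headings(response):
    """Return the text of all h1, h2 and h3 elements in a response."""
    return response.css("h1::text, h2::text, h3::text").extract()

After run myfunctions.py in the shell, headings(response) returns the same list as the CSS query above.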

To exit the shell, type ctrl+d.

Learn more about this feature in the shell documentation.

A Scrapy project

Let’s see how to create a real project using Scrapy (link to the official doc).

First, open your terminal, move to the directory where you want to work and run the following commands to create a new project called googleScraper and move into it:

scrapy startproject googleScraper
cd googleScraper

Useful settings

In order to scrape websites regardless of their robots.txt file (which is bad practice but convenient), edit the file settings.py and change the line ROBOTSTXT_OBEY = True to:

ROBOTSTXT_OBEY = False

You’ll notice that the logs are pretty verbose and difficult to read while the code is running. That’s because the default LOG_LEVEL is set to DEBUG. My advice is to add the following line to settings.py:

LOG_LEVEL = 'INFO'

To change the user-agent of your spider, add a line to settings.py. For instance:

USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36"

First spider

Then, you need to write a spider. A spider is a class that you define and that Scrapy uses to scrape information from a website. A custom spider must subclass scrapy.Spider and define the following attributes and methods:

  • name: uniquely identifies the spider;
  • start_urls: a list of URLs the crawler will begin crawling from; alternatively, you can define start_requests() (see the doc and the sketch after this list);
  • parse(): the method that handles the response to each request; this is where you extract data and find new URLs to explore.
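For reference, here is a minimal sketch of the start_requests() alternative; the spider name and start URL are placeholders:

import scrapy

class StartRequestsSpider(scrapy.Spider):
    name = "start_requests_example"

    def start_requests(self):
        # Yield one Request per starting URL instead of listing them in start_urls
        urls = ['https://www.google.ch']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Handle the server's response here
        self.logger.info('Visited %s', response.url)

The spider we build next uses start_urls instead, since its starting point is a single static URL.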

Navigate to googleScraper/spiders and create a new file google_spider.py. Before the code of the spider itself, I copied the filter_urls function from my previous article, Scraping with Python and urllib.

#
# file googleScraper/spiders/google_spider.py
#

#-----------------------------------------------------
# Code to parse URL (as used in previous article)

from urllib.parse import urlparse

netloc = 'www.google.ch'

def has_bad_format(url):
    exts = ['.gif', '.png', '.jpg']
    return any(url.find(ext) >= 0 for ext in exts)

def filter_urls(urls, netloc):
    urls = [url for url in urls if urlparse(url).netloc == netloc]
    urls = [url for url in urls if not has_bad_format(url)]
    return urls

#-----------------------------------------------------
# Code for scrapy's spider

import scrapy

class GoogleSpider(scrapy.Spider):
    # This will be used to run the spider
    name = "google"

    # Keep our spider on google's domain
    allowed_domains = [netloc]

    # The crawler will start with those URL's
    start_urls = [
        'https://' + netloc,
    ]

    # The function to handle the server's responses
    def parse(self, response):
        self.logger.info('Processing {}'.format(response.request.url))
        if response.url != response.request.url:
            self.logger.info('\t==> Redirected to {}'.format(response.url))

        # Use CSS selector to extract href urls
        urls = response.css('a::attr(href)').extract()
        urls = filter_urls(urls, netloc)

        # Follow the links
        for url in urls:
            yield response.follow(url, callback=self.parse)

Notice that the code never deduplicates URLs: Scrapy takes care of duplicate requests for us. We still filter the URLs manually, though. allowed_domains already prevents the spider from wandering too far, but it does not cover everything (image links, for instance), so it’s a good idea to filter manually as well.
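To make the manual filtering concrete, here is a quick check with a few made-up URLs (assuming you run it from the project root so the spider module can be imported):

from googleScraper.spiders.google_spider import filter_urls

urls = [
    'https://www.google.ch/search?q=scrapy',  # kept: same netloc, not an image
    'https://maps.google.ch/',                # dropped: different netloc
    'https://www.google.ch/logo.png',         # dropped: image extension
]
print(filter_urls(urls, 'www.google.ch'))
# ['https://www.google.ch/search?q=scrapy']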

If you don’t know what the yield keyword is, there is an excellent explanation here.

Run this spider using the shell command:

scrapy crawl google

where google is the name of the spider we created.

The Scrapy doc is well written, so you can tackle it without apprehension. There is a spider subclass called CrawlSpider that will perform the link extraction and filtering for you.
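As a minimal sketch (the class name and the rule are illustrative, not copied from the official docs), a CrawlSpider version of the spider above could look roughly like this:

# file googleScraper/spiders/google_crawl_spider.py (illustrative sketch)

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class GoogleCrawlSpider(CrawlSpider):
    name = "google_crawl"
    allowed_domains = ['www.google.ch']
    start_urls = ['https://www.google.ch']

    rules = (
        # LinkExtractor plays the role of our filter_urls: it extracts links,
        # restricts them to the given domain and skips common image
        # extensions by default.
        Rule(LinkExtractor(allow_domains=['www.google.ch']),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        self.logger.info('Processing %s', response.url)

Note that a CrawlSpider must not override parse() itself, which is why the callback has a different name.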