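Tutorial

This tutorial assumes that Scrapy is already installed on your system. If it is not, see the installation guide.

We are going to scrape quotes.toscrape.com, a website that lists quotes from famous authors.

This tutorial walks you through the following tasks:

Creating a new Scrapy project
Writing a Spider to crawl a site and extract data
Exporting the scraped data using the command line
Changing the Spider to recursively follow links
Using Spider arguments

Scrapy is written in Python. If you are new to the language, you may want to get a feel for what Python can do first, so that you can get the most out of Scrapy.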

If you're already familiar with other languages, and want to learn Python quickly, the Python Tutorial is a good resource.

If you're new to programming and want to start with Python, the following books may be useful to you:

You can also take a look at this list of Python resources for non-programmers, as well as the suggested resources in the learnpython-subreddit.

Creating a project

Before you start scraping, you need to set up a new Scrapy project. Change into the directory where you want to store your code and run:

scrapy startproject tutorial

This creates a tutorial directory with the following contents:

tutorial/
    scrapy.cfg            # deploy configuration file

    tutorial/             # project's Python module, you'll import your code from here
        __init__.py

        items.py          # project items definition file

        middlewares.py    # project middlewares file

        pipelines.py      # project pipelines file

        settings.py       # project settings file

        spiders/          # a directory where you'll later put your spiders
            __init__.py

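Our first Spider

Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass scrapy.Spider and define the initial requests to make, how to follow links between pages, and how to parse the downloaded page content to extract data.

This is the code for our first Spider. Save it in a file named quotes_spider.py under the tutorial/spiders directory: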

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

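As you can see, our Spider subclasses scrapy.Spider and defines some attributes and methods:

name: identifies the Spider. It must be unique within a project; that is, you cannot set the same name for different Spiders.
start_requests(): must return an iterable of Requests (a list of requests, or a generator function) from which the Spider will begin to crawl. Subsequent requests are generated successively from these initial requests.
parse(): a method that is called to handle the response downloaded for each request. The response parameter is an instance of TextResponse that holds the page content and has helpful methods for handling it.

The parse() method usually parses the response, extracts the scraped data as dicts, finds new URLs to follow, and creates new requests (Request) from them.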
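
The filename logic in parse() above depends on how the URL splits on "/". The following is a small standalone sketch of that one step; the helper name page_filename is hypothetical and not part of Scrapy.

```python
def page_filename(url):
    """Build the output filename the same way parse() does above."""
    # 'http://quotes.toscrape.com/page/1/'.split("/") ends with
    # [..., 'page', '1', ''], so index -2 is the page number.
    page = url.split("/")[-2]
    return 'quotes-%s.html' % page


print(page_filename('http://quotes.toscrape.com/page/1/'))  # quotes-1.html
print(page_filename('http://quotes.toscrape.com/page/2/'))  # quotes-2.html
```

Note that this relies on the trailing slash: for a URL ending in /page/1 (no slash), index -2 would be 'page' rather than '1'.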

How to run our Spider

To put our Spider to work, go to the project's top-level directory and run:

scrapy crawl quotes

This command runs the Spider named quotes that we just added, which sends some requests to the quotes.toscrape.com domain. You will get output similar to this:

... (omitted for brevity)
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Spider opened
2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-1.html
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-2.html
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)
...

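Now check the files in the current directory. You should find two new files, quotes-1.html and quotes-2.html, written by our parse method.

Note

If you are wondering why we haven't parsed the HTML yet, hold on; we will cover that soon.

What just happened under the hood?

Scrapy schedules the scrapy.Request objects returned by the Spider's start_requests method. Upon receiving a response for each one, it instantiates a Response object and calls the callback method associated with the request (in this case, the parse method), passing the response as an argument.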
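
To make the request/callback cycle concrete, here is a deliberately simplified model in plain Python. It is only a sketch: ToyRequest, ToyResponse, run and fake_site are hypothetical names invented for this illustration, the "download" is a dictionary lookup, and requests are handled one at a time in FIFO order, whereas Scrapy itself works asynchronously over the network.

```python
from collections import deque


# Toy stand-ins for illustration only; these are NOT Scrapy classes.
class ToyRequest:
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


class ToyResponse:
    def __init__(self, url, body):
        self.url = url
        self.body = body


def run(start_requests, fake_site):
    """Drain a FIFO queue of requests, calling each request's callback."""
    queue = deque(start_requests)
    results = []
    while queue:
        request = queue.popleft()
        # "Download" the page from an in-memory dict instead of the network.
        response = ToyResponse(request.url, fake_site[request.url])
        # A callback may yield more requests (scheduled) or data (collected).
        for output in request.callback(response):
            if isinstance(output, ToyRequest):
                queue.append(output)
            else:
                results.append(output)
    return results


def parse(response):
    yield {'url': response.url, 'size': len(response.body)}


fake_site = {
    'http://quotes.toscrape.com/page/1/': b'<html>one</html>',
    'http://quotes.toscrape.com/page/2/': b'<html>two!</html>',
}
requests = [ToyRequest(url, parse) for url in fake_site]
items = run(requests, fake_site)
print(items)
```

The point of the sketch is the division of labour: the Spider only describes requests and callbacks, and a separate loop decides when each callback runs.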

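A shortcut to the start_requests method

Instead of implementing a start_requests() method that generates scrapy.Request objects from URLs, you can define a start_urls class attribute with a list of URLs. The default implementation of start_requests() then uses this list to create the initial requests for your Spider: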

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)

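The parse() method is called to handle each of the requests for those URLs, even though we haven't explicitly told Scrapy to do so. This happens because parse() is Scrapy's default callback method, which is called for requests without an explicitly assigned callback.

Extracting data

The best way to learn how to extract data with Scrapy is to try out selectors in the Scrapy shell. Run: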

scrapy shell 'http://quotes.toscrape.com/page/1/'

Note

Remember to always enclose URLs in quotes when running the Scrapy shell from the command line; otherwise URLs containing query arguments (the & character) will not work.

On Windows, use double quotes instead:

scrapy shell "http://quotes.toscrape.com/page/1/"

You will see something like:

[ ... Scrapy log here ... ]
2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fa91d888c90>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7fa91d888c10>
[s]   spider     <DefaultSpider 'default' at 0x7fa91c8af990>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>>

Using the shell, you can try selecting elements using CSS with the response object:

>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

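The result of running response.css('title') is a list-like object called SelectorList, which represents a list of Selector objects that wrap XML/HTML elements and allow you to run further queries to refine the selection or extract the data.

To extract the text from the title above, you can do: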

>>> response.css('title::text').getall()
['Quotes to Scrape']

There are two things to note here. One is that we added ::text to the CSS query, to select only the text inside the <title> element. If we don't specify ::text, we get the full title element, including its tags:

>>> response.css('title').getall()
['<title>Quotes to Scrape</title>']

The other thing is that the result of calling .getall() is a list: it is possible that a selector returns more than one result, so we extract them all. When you know you just want the first result, as in this case, you can do:

>>> response.css('title::text').get()
'Quotes to Scrape'

Alternatively, you could have written:

>>> response.css('title::text')[0].get()
'Quotes to Scrape'

However, using .get() directly on a SelectorList instance avoids an IndexError and returns None when it doesn't find any element matching the selection.

There is a lesson here: scraping code should be resilient to errors caused by things not being found on a page, so that even if some parts fail to be scraped, you can at least get some data.
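
The difference between the two styles can be imitated with a plain Python list standing in for the selector results. This is only an analogy under that assumption: first_or_default is a hypothetical helper, not a Scrapy function, and the lists are not SelectorList objects.

```python
def first_or_default(results, default=None):
    """Return the first result, or `default` when nothing matched."""
    return results[0] if results else default


matches = ['Quotes to Scrape']   # pretend the selector matched one node
no_matches = []                  # pretend the selector matched nothing

print(first_or_default(matches))                  # Quotes to Scrape
print(first_or_default(no_matches))               # None
print(first_or_default(no_matches, default=''))   # empty string

# Indexing the empty result directly raises instead of returning None.
try:
    no_matches[0]
except IndexError as exc:
    print('indexing failed:', exc)
```

Returning None (or a chosen default) lets the rest of the callback carry on and still emit the fields that were found.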

Besides the getall() and get() methods, you can also use the re() method to extract using regular expressions:

>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']
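
The three results above can be reproduced with the standard library's re module. The sketch below is an approximation written for this tutorial, not the actual implementation of the re() method; re_flat is a hypothetical helper name.

```python
import re

title = 'Quotes to Scrape'


def re_flat(pattern, text):
    """Return all matches as a flat list of strings, flattening groups."""
    flat = []
    for match in re.finditer(pattern, text):
        groups = match.groups()
        # With capture groups, keep the groups; otherwise the whole match.
        flat.extend(groups if groups else [match.group(0)])
    return flat


print(re_flat(r'Quotes.*', title))         # ['Quotes to Scrape']
print(re_flat(r'Q\w+', title))             # ['Quotes']
print(re_flat(r'(\w+) to (\w+)', title))   # ['Quotes', 'Scrape']
```

The last call shows why a pattern with two groups produces a list of two strings rather than one matched string.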

To find the proper CSS selectors to use, you might find it useful to open the response page from the shell in your web browser using view(response). You can use your browser's developer tools to inspect the HTML and come up with a selector (see the section about using your browser's Developer Tools for scraping).

Selector Gadget is also a nice tool for quickly finding the CSS selector for visually selected elements. It works in many browsers.

XPath: a brief intro

Besides CSS, Scrapy selectors also support XPath expressions:

>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').get()
'Quotes to Scrape'

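XPath expressions are very powerful and are the foundation of Scrapy selectors. In fact, CSS selectors are converted to XPath under the hood. You can see this if you read the text representation of the selector objects in the shell closely.

While perhaps not as popular as CSS selectors, XPath expressions offer more power because, besides navigating the structure, they can also look at the content. Using XPath, you can for example select the link that contains the text "Next Page". This makes XPath very well suited to scraping, and we encourage you to learn XPath even if you already know how to construct CSS selectors.

We won't cover much of XPath here, but you can read more in the section on using XPath with Scrapy selectors. To learn more about XPath itself, we recommend the tutorial that teaches XPath through examples and the tutorial on how to "think in XPath".

Extracting quotes and authors

Now that you know a bit about selection and extraction, let's complete our Spider by writing the code to extract the quotes from the web page.

Each quote on http://quotes.toscrape.com is represented by HTML elements that look like this: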

<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>

Let's open the Scrapy shell and experiment a bit to find out how to extract the data we want:

$ scrapy shell 'http://quotes.toscrape.com'

We get a list of selectors for the quote HTML elements with:

>>> response.css("div.quote")

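Each of the selectors returned by the query above allows us to run further queries over its sub-elements. Let's assign the first selector to a variable, so that we can run our CSS selectors directly on a particular quote: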

>>> quote = response.css("div.quote")[0]

Now, let's extract the title, author and tags from that quote using the quote object we just created:

>>> title = quote.css("span.text::text").get()
>>> title
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> author = quote.css("small.author::text").get()
>>> author
'Albert Einstein'

Given that the tags are a list of strings, we can use the .getall() method to get all of them:

>>> tags = quote.css("div.tags a.tag::text").getall()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']

Having figured out how to extract each piece, we can now iterate over all the quote elements and put them together into a Python dictionary:

>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").get()
...     author = quote.css("small.author::text").get()
...     tags = quote.css("div.tags a.tag::text").getall()
...     print(dict(text=text, author=author, tags=tags))
{'tags': ['change', 'deep-thoughts', 'thinking', 'world'], 'author': 'Albert Einstein', 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
{'tags': ['abilities', 'choices'], 'author': 'J.K. Rowling', 'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'}
    ... a few more of these, omitted for brevity
>>>

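Extracting data in our Spider

Let's get back to our Spider. Until now, it hasn't extracted any data; it just saves the whole HTML page to a local file. Let's integrate the extraction logic above into our Spider.

A Scrapy Spider typically generates many dictionaries containing the data extracted from the page. To do that, we use the Python yield keyword in the callback: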

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

If you run this Spider, it will output the extracted data in the log:

2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}

Storing the scraped data

The simplest way to store the scraped data is by using Feed exports, with the following command:

scrapy crawl quotes -o quotes.json

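That will generate a quotes.json file containing all scraped items, serialized in JSON.

For historic reasons, Scrapy appends to a given file instead of overwriting its contents. If you run this command twice without removing the file before the second run, you will end up with a broken JSON file.

You can also use other formats, like JSON Lines: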

scrapy crawl quotes -o quotes.jl

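The JSON Lines format is useful because it is stream-like: you can easily append new records to it. It doesn't have the same problem as JSON when you run the command twice. Also, as each record is a separate line, you can process big files without having to fit everything in memory, and there are command-line tools like JQ to help with that.

In small projects (like the one in this tutorial), that should be enough. However, if you want to perform more complex things with the scraped items, you can write an Item Pipeline. A placeholder file for Item Pipelines was set up for you when the project was created, in tutorial/pipelines.py. You don't need to implement any Item Pipelines if you just want to store the scraped items.

Following links

Let's say that, instead of just scraping data from the first two pages of http://quotes.toscrape.com, you want quotes from all the pages on the website.

Now that you know how to extract data from pages, let's see how to follow links from them.

The first thing to do is extract the link to the page we want to follow. Examining our page, we can see there is a link to the next page with the following markup: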
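
The append behaviour can be demonstrated with the standard json module alone. The sketch below uses in-memory strings as simplified stand-ins for the two output files (the exact bytes Scrapy writes may differ), and the two sample items are illustrative.

```python
import json

items = [{'author': 'Albert Einstein'}, {'author': 'J.K. Rowling'}]

# Two runs appended to one .json file: two arrays back to back.
json_twice = json.dumps(items) + json.dumps(items)
try:
    json.loads(json_twice)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# Two runs appended to one .jl file: simply more lines.
one_run = ''.join(json.dumps(item) + '\n' for item in items)
jl_twice = one_run + one_run
# Each line is parsed on its own, so records can be handled one at a time.
records = [json.loads(line) for line in jl_twice.splitlines() if line.strip()]

print(json_ok)        # False
print(len(records))   # 4
```

Because every line is a complete JSON document, a reader never needs the whole file in memory and an appended run cannot invalidate earlier records.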

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>

We can try extracting it in the shell:

>>> response.css('li.next a').get()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'

This gets the anchor element, but we want the href attribute. For that, Scrapy supports a CSS extension that lets you select the attribute contents, like this:

>>> response.css('li.next a::attr(href)').get()
'/page/2/'

There is also an attrib property available (see Selecting element attributes for more):

>>> response.css('li.next a').attrib['href']
'/page/2/'

Let's now see our Spider, modified to recursively follow the link to the next page and extract data from it:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

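After extracting the data, the parse() method looks for the link to the next page, builds an absolute URL using the urljoin() method (since the links can be relative), and yields a new request to the next page, registering itself as the callback. This repeats until all the pages have been crawled.

What you see here is Scrapy's mechanism for following links: when you yield a Request in a callback method, Scrapy schedules that request to be sent and registers a callback method to be executed when that request finishes.

Using this, you can build complex crawlers that follow links according to rules you define, and extract different kinds of data depending on the page being visited.

In our example, this creates a sort of loop, following the link to the next page until it doesn't find one. This is handy for crawling blogs, forums and other sites with pagination.

A shortcut for creating Requests

As a shortcut for creating Request objects, you can use response.follow: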
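
How a relative link turns into an absolute URL can be checked with the standard library. This sketch assumes that response.urljoin resolves links the way urllib.parse.urljoin does when given the page URL as the base; the relative path next.html and the example.com URL are made up for illustration.

```python
from urllib.parse import urljoin

# The URL of the page the link was found on.
base = 'http://quotes.toscrape.com/page/1/'

# A root-relative link replaces the whole path.
print(urljoin(base, '/page/2/'))   # http://quotes.toscrape.com/page/2/

# A path-relative link is resolved against the current directory.
print(urljoin(base, 'next.html'))  # http://quotes.toscrape.com/page/1/next.html

# An absolute link is returned unchanged.
print(urljoin(base, 'http://example.com/x'))  # http://example.com/x
```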

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

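Unlike scrapy.Request, response.follow supports relative URLs directly, so there is no need to call urljoin. Note that response.follow just returns a Request instance; you still have to yield this Request.

You can also pass a selector to response.follow instead of a string; this selector should extract the necessary attribute: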

for href in response.css('li.next a::attr(href)'):
    yield response.follow(href, callback=self.parse)

For <a> elements there is a shortcut: response.follow uses their href attribute automatically. So the code can be shortened further:

for a in response.css('li.next a'):
    yield response.follow(a, callback=self.parse)

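Note

response.follow(response.css('li.next a')) is not valid, because response.css returns a list-like object with selectors for all results, not a single selector. A for loop like in the example above, or response.follow(response.css('li.next a')[0]), is fine.

More examples and patterns

Here is another Spider that illustrates callbacks and following links, this time for scraping author information: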

import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # follow pagination links
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

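This Spider starts from the main page. It follows all the links to the author pages, calling the parse_author callback for each of them, and also follows the pagination links with the parse callback, as we saw before.

Here we pass callbacks to response.follow as positional arguments to make the code shorter; this also works for scrapy.Request.

The parse_author callback defines a helper function to extract and clean up the data from a CSS query, and yields a Python dict with the author data.

Another interesting thing this Spider demonstrates is that, even if there are many quotes from the same author, we don't need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicated requests to URLs already visited, avoiding the problem of hitting servers too much because of a programming mistake. This can be configured with the DUPEFILTER_CLASS setting.

Hopefully by now you have a good understanding of how following links and callbacks work in Scrapy.

As yet another example that leverages the mechanism of following links, check out the CrawlSpider class, a generic Spider that implements a small rules engine.

Also, a common pattern is to build an item with data from more than one page, using a trick to pass additional data to the callbacks.

Using Spider arguments

You can provide command-line arguments to your Spiders by using the -a option when running them: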
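
The idea behind duplicate filtering can be sketched with a set of URLs already seen. This is a toy model only: filter_duplicates is a hypothetical helper, it compares plain URL strings, and Scrapy's actual filter (selected by DUPEFILTER_CLASS) is more elaborate. The link list is illustrative.

```python
def filter_duplicates(urls):
    """Yield each URL the first time it appears and skip later repeats."""
    seen = set()
    for url in urls:
        if url in seen:
            continue  # already scheduled once, so drop the duplicate
        seen.add(url)
        yield url


# Two quotes by the same author produce the same author link twice.
author_links = [
    '/author/Albert-Einstein',
    '/author/J-K-Rowling',
    '/author/Albert-Einstein',
]
print(list(filter_duplicates(author_links)))
# ['/author/Albert-Einstein', '/author/J-K-Rowling']
```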

scrapy crawl quotes -o quotes-humor.json -a tag=humor

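These arguments are passed to the Spider's __init__ method and become Spider attributes by default.

In this example, the value provided for the tag argument will be available via self.tag. You can use this to make your Spider fetch only quotes with a specific tag, building the URL based on the argument: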

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

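If you pass the tag=humor argument to this Spider, you'll notice that it only visits URLs for the humor tag, such as http://quotes.toscrape.com/tag/humor.

See the documentation on Spider arguments to learn more about handling them.

Next steps

This tutorial covered only the basics of Scrapy, but there are many other features not mentioned here. Check the "What else?" section in the Scrapy at a glance chapter for a quick overview of the most important ones.

You can continue from the Basic concepts section to learn more about the command-line tool, Spiders, selectors and other things the tutorial hasn't covered, like modeling the scraped data. If you prefer to play with an example project, check the Examples section.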
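
The way an argument ends up on self and shapes the start URL can be tried without running a crawl. The class below is a minimal stand-in written for this illustration: ToySpider and start_url are hypothetical names, and copying keyword arguments onto the instance is an assumption modelled on the behaviour described above.

```python
class ToySpider:
    """Minimal stand-in for a Spider; NOT a scrapy.Spider subclass."""

    def __init__(self, **kwargs):
        # Copy every keyword argument (e.g. tag='humor') onto the instance.
        self.__dict__.update(kwargs)

    def start_url(self):
        url = 'http://quotes.toscrape.com/'
        # getattr with a default keeps the Spider usable without -a tag=...
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        return url


print(ToySpider().start_url())             # http://quotes.toscrape.com/
print(ToySpider(tag='humor').start_url())  # http://quotes.toscrape.com/tag/humor
```

Using getattr with a default rather than reading self.tag directly means the same Spider works both with and without the argument.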