Things I looked into when I wanted to crawl with Python 3

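Overview

I had a job that needed some light crawling in Python 3, so I looked into doing it with requests, beautifulsoup, and reppy.

requests is a convenient library for making HTTP requests from Python 3. beautifulsoup is for scraping. reppy takes care of robots.txt handling.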

@CreatedDate 2016/08/13
@Versions python3.5, beautifulsoup4.5.1, reppy0.3.0, requests2.9.1

Installation

Install with pip.

$ pip install requests
$ pip install beautifulsoup4
$ pip install reppy

Making requests with requests

First, send a request to some site and get the content back as a string.

import requests

# Send a request to my own site
resp = requests.get('http://www.mwsoft.jp/')

# Get the status code
resp.status_code
  #=> 200

# Get the content (bytes)
resp.content
  #=> b'<!DOCTYPE html>\n<html>\n\n<head>\n  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n  <title>Top Page | mwSoft</title>\n ...

# Get the content (str)
resp.text
  #=> <!DOCTYPE html>\n<html>\n\n<head>\n  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n  <title>Top Page | mwSoft</title>\n  ...

When I made the request to my own site, the character encoding was not detected correctly, and resp.text came back garbled.

# ISO-8859-1 is reported even though the page is UTF-8
resp.encoding
  #=> ISO-8859-1

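I think this happens because my site declares its encoding in a meta tag and does not specify a charset in the HTTP header.

Encoding detection in requests

requests does come with helpers for reading the encoding from both the header and the meta tag. They live in requests.utils, and resp.encoding itself is filled in from the header only.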

# Have it pull the charset from the meta tag in the content (returns a list)
requests.utils.get_encodings_from_content(resp.text)
  #=> ['utf-8']

# Taken from the header, it is ISO-8859-1
requests.utils.get_encoding_from_headers(resp.headers)
  #=> 'ISO-8859-1'

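The above is the result for my site. I do not set a charset in the header, but apparently when the Content-Type is a text/* type with no charset, requests falls back to ISO-8859-1 on its own.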
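The fallback described above can be sketched without requests. The function below is a hypothetical helper (the name encoding_from_content_type is mine, not part of requests) that mirrors the rule as I understand it: use the charset parameter if the header carries one, otherwise assume ISO-8859-1 for text/* types.

```python
from email.message import Message


def encoding_from_content_type(content_type):
    ''' Hypothetical helper: derive an encoding from a Content-Type header value. '''
    if not content_type:
        return None
    # The stdlib email parser understands "type/subtype; param=value" headers
    msg = Message()
    msg['Content-Type'] = content_type
    charset = msg.get_param('charset')
    if charset:
        return charset
    # No explicit charset: text/* falls back to the old HTTP default
    if msg.get_content_maintype() == 'text':
        return 'ISO-8859-1'
    return None
```

With this, a bare text/html header yields ISO-8859-1 while text/html; charset=utf-8 yields utf-8, which matches what I saw for my site.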

So I tried writing code like the following, which prefers the meta tag only when the header gives ISO-8859-1 and an encoding could also be read from the meta tag.

content_encodings = requests.utils.get_encodings_from_content(resp.text)
header_encoding = requests.utils.get_encoding_from_headers(resp.headers)
if header_encoding == 'ISO-8859-1' and len(content_encodings) > 0 and content_encodings[0] != 'ISO-8859-1':
    resp.encoding = content_encodings[0]

# After setting resp.encoding, text comes back without garbling
resp.text

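With that the encoding was picked up correctly, or so I thought, but get_encodings_from_content is deprecated and is going to be removed at some point.

beautifulsoup takes care of this as well, so for encodings it looks more accurate to skip the handling in requests and pass the bytes straight to beautifulsoup.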
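If you still want the meta charset without the deprecated helper, a small regex over the raw bytes is one way to do it. This is only a rough sketch: sniff_meta_charset, decode_with_meta, and the 2048-byte scan window are my own choices, and the pattern is much looser than a real HTML parser.

```python
import re

# Matches <meta charset="..."> as well as the http-equiv form,
# where "charset=" appears inside the content attribute
META_CHARSET_RE = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def sniff_meta_charset(content, scan_bytes=2048):
    ''' Return the charset named in a meta tag near the top of the bytes, or None. '''
    match = META_CHARSET_RE.search(content[:scan_bytes])
    return match.group(1).decode('ascii') if match else None


def decode_with_meta(content, default='utf-8'):
    ''' Decode bytes using the meta charset when present, else the default. '''
    encoding = sniff_meta_charset(content) or default
    try:
        return content.decode(encoding, errors='replace')
    except LookupError:
        # The meta tag named an encoding Python does not know
        return content.decode(default, errors='replace')
```

Here the input would be resp.content (bytes), not resp.text, so a wrong guess from the header never gets involved.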

Guessing with chardet

If you want to guess the encoding from the content bytes, chardet can be used.

pip install chardet

Try running it.

import chardet
chardet.detect(resp.content)
  #=> {'confidence': 0.99, 'encoding': 'utf-8'}

It guessed utf-8 with a confidence of 0.99.

Now try it on google.com.

resp = requests.get('http://www.google.com/')
chardet.detect(resp.content)
  #=> {'confidence': 0.99, 'encoding': 'SHIFT_JIS'}

Huh, it says SHIFT_JIS. Does it become SHIFT_JIS when the content contains only ASCII characters?

That was my first thought, but the example below reports ascii.

chardet.detect("foobar".encode())
  #=> {'confidence': 1.0, 'encoding': 'ascii'}

So it seems that www.google.com responds in SHIFT_JIS when requested via requests from my home IP.

If you have computing resources to spare, using chardet is the safer option.
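When resources are tight, one way to keep the cost down is to try a strict UTF-8 decode first and only call chardet when that fails. The sketch below assumes most crawled pages are UTF-8, which is my assumption rather than something measured here; guess_encoding is a made-up name.

```python
def guess_encoding(content):
    ''' Return a best-guess encoding name for a bytes object, or None. '''
    try:
        # Cheap check first: a strict decode fails quickly on non-UTF-8 bytes
        content.decode('utf-8')
        return 'utf-8'
    except UnicodeDecodeError:
        pass
    try:
        import chardet  # third-party; only needed for the fallback path
    except ImportError:
        return None
    return chardet.detect(content)['encoding']
```

Note that pure ASCII content is reported as utf-8 by this function, since ASCII is valid UTF-8.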

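When you want to cut off responses that are too large

While crawling you occasionally run into pages that are several MB in size. In that case you can stream the response, collect it chunk by chunk, and stop once a given size is reached, which avoids fetching a large amount of data.

The code below reads the content 10 bytes at a time. Because the check is count > 10 and count starts at 0, the loop stops after the 12th chunk, so 120 bytes are collected and the rest is not read.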

chunk_size = 10
with requests.Session() as sess:
    resp = sess.get('http://www.mwsoft.jp/', stream=True)
    chunks = []
    for count, chunk in enumerate(resp.iter_content(chunk_size)):
        print(chunk)
        chunks.append(chunk)
        if count > 10:
            break
    content = b"".join(chunks)
print(content)
    #=> b'<!DOCTYPE html>\n<html>\n\n<head>\n  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n  <title>Top Page'

10 bytes at a time is exaggerated; setting something like 1024 * 10 (the CONTENT_CHUNK_SIZE of requests) and looping about 100 times gives a routine that stops fetching at around 1 MB.
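The same idea can be written as a small helper that takes any iterable of byte chunks and a byte limit, instead of counting loop iterations. read_capped is a hypothetical name; in real use the chunks would come from resp.iter_content(10 * 1024) on a response opened with stream=True.

```python
def read_capped(chunks, max_bytes):
    ''' Join byte chunks until max_bytes is reached, then stop reading. '''
    collected = []
    total = 0
    for chunk in chunks:
        collected.append(chunk)
        total += len(chunk)
        if total >= max_bytes:
            # Stop pulling from the iterator; later chunks are never requested
            break
    # The last chunk may overshoot, so trim to the exact limit
    return b''.join(collected)[:max_bytes]
```

For example, read_capped(resp.iter_content(10 * 1024), 1000000) would give at most 1,000,000 bytes.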

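The annoying part of this approach is that the bytes get cut off in the middle at the tail end. Using iter_lines splits at line breaks instead, but then data without any line breaks becomes the problem.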
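For the specific case where the cut lands in the middle of a multibyte character, an incremental decoder from the standard library can drop the incomplete tail instead of producing a broken character. This is a sketch with a made-up name (decode_truncated), and it assumes the encoding is already known.

```python
import codecs


def decode_truncated(content, encoding='utf-8'):
    ''' Decode bytes that may end in the middle of a multibyte character. '''
    decoder = codecs.getincrementaldecoder(encoding)(errors='replace')
    # final=False leaves an incomplete trailing sequence in the decoder's
    # buffer instead of emitting it, so the partial character is dropped
    return decoder.decode(content, final=False)
```

It does nothing about HTML tags being cut in half, so the parser still has to cope with an unfinished document.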

The code below reads one line at a time and stops once 100 bytes have been reached.

byte_len = 0
with requests.Session() as sess:
    resp = sess.get('http://www.mwsoft.jp/', stream=True)
    chunks = []
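    for chunk in resp.iter_lines():
        print(chunk)
        # iter_lines() strips the line terminator, so count one byte for it
        byte_len += len(chunk) + 1
        chunks.append(chunk)
        if byte_len > 100:
            break
    # Re-insert the newlines that iter_lines() removed
    content = b"\n".join(chunks)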
print(content)

Checking robots.txt with reppy

requests on its own does not seem to consult robots.txt (for robobrowser this had been raised as an issue), so I put reppy in front of it to make the check.

from reppy.cache import RobotsCache
robots = RobotsCache()

# Check whether robots.txt allows the request
robots.allowed('http://www.mwsoft.jp/', 'python program')
  #=> True

# Confirm that it has been cached
robots._cache
  #=> {'www.mwsoft.jp': }

For testing robots.txt, my site is set up to reject an agent named MyRoboXYZ.

robots.allowed('http://www.mwsoft.jp/', 'MyRoboXYZ')
  #=> False

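If you always put reppy in front of every request, you can avoid crawling sites that disallow it.

Handling robots.txt with the Python standard library

Python 3's urllib includes a robotparser module. Code like the following checks whether a request is allowed.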

import urllib.robotparser
robot_parser = urllib.robotparser.RobotFileParser()

# Load a robots.txt that disallows every request
robot_parser.set_url('http://www.mwsoft.jp/test/robots/disallow_all/robots.txt')
robot_parser.read()

# Check whether the request is allowed -> disallowed
robot_parser.can_fetch('*', 'http://www.mwsoft.jp/test/robots/disallow_all/robots.txt')
  #=> False

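Compared with reppy, you have to specify the robots.txt path yourself and write the caching yourself, but that is not much trouble, so implementing it yourself to cut down on dependencies is a perfectly reasonable option.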

Handling nofollow/noarchive in meta tags

I would also like to respect nofollow (do not follow links) and noarchive (do not archive).

BeautifulSoup is used for this.

import requests
resp = requests.get('http://www.mwsoft.jp/test/robots/meta_nofollow/no_follow.html')

from bs4 import BeautifulSoup
soup = BeautifulSoup(resp.content, 'html.parser')
nofollows = soup.find_all('meta', attrs={'content': 'nofollow'})
  #=> [<meta content="nofollow" name="robots"/>]

len(nofollows) > 0 and nofollows[0]['name'].lower() == 'robots'
  #=> True
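The check above only matches when the content attribute is exactly nofollow. Pages often write several directives in one tag, such as content="noindex, nofollow", so a slightly more forgiving version splits on commas. To keep this sketch dependency-free it uses the standard library's html.parser instead of BeautifulSoup; RobotsMetaParser and robots_meta_directives are hypothetical names.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    ''' Collects the directives of every <meta name="robots"> tag. '''

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != 'meta':
            return
        attr = dict(attrs)
        if (attr.get('name') or '').lower() != 'robots':
            return
        # content may hold several comma-separated directives
        for token in (attr.get('content') or '').split(','):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_meta_directives(html_text):
    ''' Return the set of robots meta directives found in an HTML string. '''
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.directives
```

The nofollow test then becomes 'nofollow' in robots_meta_directives(html_text), and noarchive works the same way.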

Following links

findAll('a', href=True) lets you walk only the anchors that have an href set.

Links are sometimes relative paths, so urljoin can be used to get the URL in absolute form.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'http://www.mwsoft.jp/'
resp = requests.get(url)

soup = BeautifulSoup(resp.content, 'html.parser')

for a in soup.findAll('a', href=True):
    print(a['href'])
    print(urljoin(url, a['href']))
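Before queueing the links for crawling, it usually helps to drop fragments, skip non-HTTP schemes such as mailto:, and remove duplicates. The helper below is a sketch of that; normalize_links and the same_host_only flag are my own invention, not something from requests or BeautifulSoup.

```python
from urllib.parse import urldefrag, urljoin, urlsplit


def normalize_links(base_url, hrefs, same_host_only=True):
    ''' Turn raw href values into a de-duplicated list of absolute URLs. '''
    base_host = urlsplit(base_url).netloc
    seen = set()
    result = []
    for href in hrefs:
        # Resolve relative paths, then drop the #fragment part
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parts = urlsplit(absolute)
        if parts.scheme not in ('http', 'https'):
            continue  # mailto:, javascript:, and so on
        if same_host_only and parts.netloc != base_host:
            continue
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result
```

With the soup from above, the input would be [a['href'] for a in soup.findAll('a', href=True)].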

A request routine that takes all of this into account

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from reppy.cache import RobotsCache

# Pick some user agent
DEFAULT_USER_AGENT = 'requests'

# Decide the maximum length of content to fetch
MAX_CONTENT_LENGTH = 1000000 # about 1 MB

# Exception for robots.txt
class RobotsException(requests.RequestException):
    ''' robots.txt not allowed. '''

# Exception for when meta content=noarchive name=robots is set
class MetaNoArchiveException(requests.RequestException):
    ''' meta noarchive robots. '''

# Shared across calls so that fetched robots.txt files stay cached
ROBOTS_CACHE = RobotsCache()

def check_robots(url, user_agent=DEFAULT_USER_AGENT):
    ''' Check robots.txt.
    '''
    return ROBOTS_CACHE.allowed(url, user_agent)
        
def check_noarchive(soup):
    ''' Check whether noarchive is set in a meta tag.
    '''
    noarchive = soup.find_all('meta', attrs={'content': 'noarchive'})
    # .get() avoids a KeyError when the meta tag has no name attribute
    return len(noarchive) > 0 and noarchive[0].get('name', '').lower() == 'robots'

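def get(url, max_content_len=MAX_CONTENT_LENGTH):
    ''' Request the given url with requests.
        Caps the fetched size and checks robots.txt. (noarchive is checked by the caller.)
    '''
    # Check robots.txt
    if not check_robots(url):
        raise RobotsException(url)

    # Read content up to the given number of bytes
    byte_len = 0
    with requests.Session() as sess:
        # Send the same user agent that was checked against robots.txt
        resp = sess.get(url, stream=True, headers={'User-Agent': DEFAULT_USER_AGENT})
        chunks = []
        for chunk in resp.iter_lines():
            # iter_lines() strips the line terminator, so count one byte for it
            byte_len += len(chunk) + 1
            chunks.append(chunk)
            if byte_len > max_content_len:
                break
        # Re-insert the newlines that iter_lines() removed
        content = b"\n".join(chunks)

    return resp, content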

def store(url, content):
    # do something
    pass

url = 'http://www.mwsoft.jp/'
resp, content = get(url)
soup = BeautifulSoup(content, 'html.parser')
if check_noarchive(soup):
    raise MetaNoArchiveException(url)
store(url, content)