Twisted failure when using Scrapy to crawl a bbs

I'm new to python scrapy and wrote a simple script to crawl posts from my school's bbs. However, when my spider runs, it gets error messages like these:

2015-03-28 11:16:52+0800 [nju_spider] DEBUG: Retrying <GET http://bbs.nju.edu.cn/bbstcon?board=WarAndPeace&file=M.1427299332.A> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ResponseFailed'>>]
2015-03-28 11:16:52+0800 [nju_spider] DEBUG: Gave up retrying <GET http://bbs.nju.edu.cn/bbstcon?board=WarAndPeace&file=M.1427281812.A> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ResponseFailed'>>]
2015-03-28 11:16:52+0800 [nju_spider] ERROR: Error downloading <GET http://bbs.nju.edu.cn/bbstcon?board=WarAndPeace&file=M.1427281812.A>: [<twisted.python.failure.Failure <class 'twisted.web._newclient.ResponseFailed'>>]

2015-03-28 11:16:56+0800 [nju_spider] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 99,
 'downloader/exception_type_count/twisted.web._newclient.ResponseFailed': 99,
 'downloader/request_bytes': 36236,
 'downloader/request_count': 113,
 'downloader/request_method_count/GET': 113,
 'downloader/response_bytes': 31135,
 'downloader/response_count': 14,
 'downloader/response_status_count/200': 14,
 'dupefilter/filtered': 25,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 3, 28, 3, 16, 56, 677065),
 'item_scraped_count': 11,
 'log_count/DEBUG': 127,
 'log_count/ERROR': 32,
 'log_count/INFO': 8,
 'request_depth_max': 3,
 'response_received_count': 14,
 'scheduler/dequeued': 113,
 'scheduler/dequeued/memory': 113,
 'scheduler/enqueued': 113,
 'scheduler/enqueued/memory': 113,
 'start_time': datetime.datetime(2015, 3, 28, 3, 16, 41, 874807)}
2015-03-28 11:16:56+0800 [nju_spider] INFO: Spider closed (finished)

It seems the spider tried these urls and failed, but the urls really do exist. And there are roughly thousands of posts on the bbs, yet every time I run my spider it only fetches a random few of them. My code is below; thanks a lot for your help.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

from ScrapyTest.items import NjuPostItem


class NjuSpider(CrawlSpider):
    name = 'nju_spider'
    allowed_domains = ['bbs.nju.edu.cn']
    start_urls = ['http://bbs.nju.edu.cn/bbstdoc?board=WarAndPeace']
    rules = [Rule(LinkExtractor(allow=[r'bbstcon\?board=WarAndPeace&file=M\.\d+\.A']),
                  callback='parse_post'),
             Rule(LinkExtractor(allow=[r'bbstdoc\?board=WarAndPeace&start=\d+']),
                  follow=True)]

    def parse_post(self, response):
        # self.log('A response from %s just arrived!' % response.url)
        post = NjuPostItem()
        post['url'] = response.url
        post['title'] = 'to_do'
        post['content'] = 'to_do'
        return post

First of all, please make sure your web-scraping approach does not violate the web-site's Terms of Use. Be a good web-scraping citizen.

Next, you can pretend to be a browser by setting the User-Agent header. Provide a User-Agent in the DEFAULT_REQUEST_HEADERS setting:

DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.104 Safari/537.36'
}
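
As a side note, if a single browser string is all you need, Scrapy's standalone USER_AGENT setting does the same job:

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.104 Safari/537.36'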

Alternatively, you can rotate user agents with a downloader middleware based on the fake-useragent package:
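A minimal sketch of such a middleware, assuming fake-useragent is installed and the class lives in a hypothetical ScrapyTest/middlewares.py:

# ScrapyTest/middlewares.py (assumed location)
from fake_useragent import UserAgent


class RandomUserAgentMiddleware(object):
    """Set a fresh, real-world User-Agent header on every request."""

    def __init__(self):
        self.ua = UserAgent()

    def process_request(self, request, spider):
        # Pick a random real browser User-Agent for this request
        request.headers.setdefault('User-Agent', self.ua.random)

Then enable it in settings.py, disabling the built-in UserAgentMiddleware so it doesn't clobber the header:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'ScrapyTest.middlewares.RandomUserAgentMiddleware': 400,
}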
Another possible problem could be that you are hitting the web-site too often; consider tweaking the DOWNLOAD_DELAY setting:

The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard.
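
For example (the value here is illustrative; tune it to what the site tolerates):

DOWNLOAD_DELAY = 2  # wait 2 seconds between requests to the same site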

There is one more relevant setting that can have a positive impact: CONCURRENT_REQUESTS:

The maximum number of concurrent (ie. simultaneous) requests that will be performed by the Scrapy downloader.
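
Lowering it from the default makes the crawl gentler on the server, e.g.:

CONCURRENT_REQUESTS = 4  # default is 16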