Scrape Pinboard recursively with Scrapy - “Spider must return Request” error

To sharpen my Python and Spark GraphX skills, I have been trying to build a graph of Pinboard users and bookmarks. To do this, I scrape Pinboard bookmarks recursively in the following way:

  1. Start with a user and scrape all of their bookmarks.
  2. For each bookmark, identified by its url_slug, find all users who have also saved that bookmark.
  3. For each user from step 2, repeat the process (go to 1, ...).
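The recursion above can be sketched in plain Python, with hypothetical stub fetchers standing in for the real Pinboard requests and seen-sets to avoid revisiting users or slugs (the data and function names here are illustrative, not part of the actual spider):

```python
from collections import deque

# Hypothetical stubs standing in for the real HTTP fetches.
def bookmarks_of(user):
    data = {'notiv': ['slug1', 'slug2'], 'ronert': ['slug1']}
    return data.get(user, [])

def users_of(slug):
    data = {'slug1': ['notiv', 'ronert'], 'slug2': ['notiv']}
    return data.get(slug, [])

def crawl(start_user):
    seen_users, seen_slugs = set(), set()
    queue = deque([start_user])
    while queue:
        user = queue.popleft()
        if user in seen_users:
            continue
        seen_users.add(user)
        for slug in bookmarks_of(user):    # step 1: all bookmarks of the user
            if slug in seen_slugs:
                continue
            seen_slugs.add(slug)
            for other in users_of(slug):   # step 2: other users of that slug
                queue.append(other)        # step 3: repeat for each of them
    return seen_users

print(sorted(crawl('notiv')))  # → ['notiv', 'ronert']
```

Scrapy effectively runs this same traversal for me, with its scheduler playing the role of the queue.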

Despite trying suggestions from several threads here (including using Rules), I get the following error when I try to implement this logic:

ERROR: Spider must return Request, BaseItem, dict or None, got 'generator'

I strongly suspect this has to do with the mix of yield/return in my code.

Here is a brief description of my code:

My main parse method finds all bookmark items of a user (also following any earlier pages of that same user's bookmarks) and yields the parse_bookmark method to scrape those bookmarks.

class PinSpider(scrapy.Spider):
    name = 'pinboard'

    # Before = datetime after 1970-01-01 in seconds, used to separate the bookmark pages of a user
    def __init__(self, user='notiv', before='3000000000', *args, **kwargs):
        super(PinSpider, self).__init__(*args, **kwargs)
        self.start_urls = ['https://pinboard.in/u:%s/before:%s' % (user, before)]
        self.before = before

    def parse(self, response):
        # fetches json representation of bookmarks instead of using css or xpath
        bookmarks = re.findall('bmarks\[\d+\] = (\{.*?\});', response.body.decode('utf-8'), re.DOTALL | re.MULTILINE)

        for b in bookmarks:
            bookmark = json.loads(b)
            yield self.parse_bookmark(bookmark)

        # Get bookmarks in previous pages
        previous_page = response.css('a#top_earlier::attr(href)').extract_first()
        if previous_page:
            previous_page = response.urljoin(previous_page)
            yield scrapy.Request(previous_page, callback=self.parse)

This method scrapes the bookmark information, including the corresponding url_slug, stores it in a PinscrapyItem, and then yields a scrapy.Request to parse the url_slug:

def parse_bookmark(self, bookmark):
    pin = PinscrapyItem()

    pin['url_slug'] = bookmark['url_slug']
    pin['title'] = bookmark['title']
    pin['author'] = bookmark['author']

    # IF I REMOVE THE FOLLOWING LINE THE PARSING OF ONE USER WORKS (STEP 1) BUT NO STEP 2 IS PERFORMED  
    yield scrapy.Request('https://pinboard.in/url:' + pin['url_slug'], callback=self.parse_url_slug)

    return pin

Finally, the parse_url_slug method finds other users who saved this bookmark and recursively yields a scrapy.Request to parse each of them.

def parse_url_slug(self, response):
    url_slug = UrlSlugItem()

    if response.body:
        soup = BeautifulSoup(response.body, 'html.parser')

        users = soup.find_all("div", class_="bookmark")
        user_list = [re.findall('/u:(.*)/t:', element.a['href'], re.DOTALL) for element in users]
        user_list_flat = sum(user_list, []) # Change from list of lists to list

        url_slug['user_list'] = user_list_flat

        for user in user_list:
            yield scrapy.Request('https://pinboard.in/u:%s/before:%s' % (user, self.before), callback=self.parse)

    return url_slug

(To present the code more concisely, I removed the parts that store other interesting fields, check for duplicates, etc.)

Any help is greatly appreciated!

The problem is in the following block of your code:

yield self.parse_bookmark(bookmark)

because in your parse_bookmark you have these two lines:

# IF I REMOVE THE FOLLOWING LINE THE PARSING OF ONE USER WORKS (STEP 1) BUT NO STEP 2 IS PERFORMED  
yield scrapy.Request('https://pinboard.in/url:' + pin['url_slug'], callback=self.parse_url_slug)

return pin

Since the function contains a yield, its return value is a generator. You hand that generator back to Scrapy, and it doesn't know what to do with it.
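This is easy to verify outside Scrapy: calling any function whose body contains yield produces a generator object, regardless of what it yields (a minimal sketch with a stripped-down parse_bookmark):

```python
def parse_bookmark(bookmark):
    # The body does not run at call time; `yield` makes this a
    # generator function, so the call just builds a generator object.
    yield {'url_slug': bookmark}

result = parse_bookmark('abc123')
print(type(result).__name__)  # → generator
```

That generator object is exactly what ends up in `yield self.parse_bookmark(bookmark)`, which triggers the "got 'generator'" error.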

The fix is simple. Change your code to the following:

yield from self.parse_bookmark(bookmark)

This yields the values from the generator one at a time, instead of the generator itself. Alternatively, you can do this:

for ret in self.parse_bookmark(bookmark):
    yield ret
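The difference between the two shapes can be seen with plain generators (the names here are illustrative, not taken from the spider):

```python
def inner():
    yield 'request'
    yield 'item'

def outer_wrong():
    yield inner()        # hands back the generator object itself

def outer_right():
    yield from inner()   # delegates: yields 'request', then 'item'

print(list(outer_wrong()))  # a single element: a generator object
print(list(outer_right()))  # → ['request', 'item']
```

`yield from` (or the equivalent for-loop) is what lets Scrapy receive the individual Requests and items rather than an opaque generator.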

编辑-1

Change the function to yield the item first:

yield pin
yield scrapy.Request('https://pinboard.in/url:' + pin['url_slug'], callback=self.parse_url_slug)

And the same in the other method:

    url_slug['user_list'] = user_list_flat
    yield url_slug
    for user in user_list:
        yield scrapy.Request('https://pinboard.in/u:%s/before:%s' % (user, self.before), callback=self.parse)

Yielding the items later means a lot of other requests get scheduled first, so it takes a while before you start to see scraped items. I ran the code with the changes above and it works fine for me:

2017-08-20 14:02:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pinboard.in/u:%5B'semanticdreamer'%5D/before:3000000000>
{'url_slug': 'e1ff3a9fb18873e494ec47d806349d90fec33c66', 'title': 'Flair Conky Offers Dark & Light Version For All Linux Distributions - NoobsLab | Ubuntu/Linux News, Reviews, Tutorials, Apps', 'author': 'semanticdreamer'}
2017-08-20 14:02:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pinboard.in/url:d9c16292ec9019fdc8411e02fe4f3d6046185c58>
{'user_list': ['ronert', 'notiv']}