List populated with Scrapy is returned before actually filled

Question

This involves almost the same code as a different question I asked just this morning, so if it looks familiar, that's because it is.

class LbcSubtopicSpider(scrapy.Spider):

...irrelevant/sensitive code...

    rawTranscripts = []
    rawTranslations = []

    def parse(self, response):
        rawTitles = []
        rawVideos = []
        for sel in response.xpath('//ul[1]'): #only scrape the first list

            ...irrelevant code...

            index = 0
            for sub in sel.xpath('li/ul/li/a'): #scrape the sublist items
                index += 1
                if index%2!=0: #odd numbered entries are the transcripts
                    transcriptLink = sub.xpath('@href').extract()
                    #url = response.urljoin(transcriptLink[0])
                    #yield scrapy.Request(url, callback=self.parse_transcript)
                else: #even numbered entries are the translations
                    translationLink = sub.xpath('@href').extract()
                    url = response.urljoin(translationLink[0])
                    yield scrapy.Request(url, callback=self.parse_translation)

        print rawTitles
        print rawVideos
        print "translations:" 
        print self.rawTranslations

    def parse_translation(self, response):
        for sel in response.xpath('//p[not(@class)]'):
            rawTranslation = sel.xpath('text()').extract()
            rawTranslation = ''.join(rawTranslation)
            #print rawTranslation
            self.rawTranslations.append(rawTranslation)
            #print self.rawTranslations

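My problem is that "print self.rawTranslations" in the parse(...) method prints only "[]". That could mean one of two things: either the list is being reset right before it is printed, or it is being printed before the parse_translation(...) calls that populate it (from the links that parse(...) follows) have finished. I'm inclined to suspect the latter, since I can't see any code that would reset the list, unless "rawTranslations = []" in the class body is somehow run multiple times.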

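It's worth noting that if I uncomment the same line in parse_translation(...), it prints the desired output, which means the text is being extracted correctly and the problem seems to be unique to the main parse(...) method.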

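My attempts to fix what I believe is a synchronization problem were pretty aimless: I just tried using an RLock object, based on as many Google tutorials as I could find, but I'm 99% sure I misused it anyway, as the result was identical.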

Answer 1

The problem here is that you are not understanding how scrapy really works.

Scrapy is a crawling framework for building website spiders, not just for making requests; that is what the requests module is for.

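Scrapy's requests work asynchronously. When you call yield Request(...), you are only adding the request to a queue of pending requests that will be executed at some later point (you have no control over when). This means you can't expect the request to have completed, or its callback to have run, by the time the code after yield Request(...) executes. In fact, your methods should always end by yielding a Request or an Item.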
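To see why the list is still empty at the print, here is a minimal sketch that runs without Scrapy. The `Request` tuple and the `run` function are hypothetical stand-ins for scrapy.Request and the Scrapy engine, not the real implementation; they only model the fact that callbacks run after the generator that yielded the requests has moved on.

```python
from collections import deque, namedtuple

# Hypothetical stand-in for scrapy.Request: just a URL and a callback.
Request = namedtuple("Request", ["url", "callback"])

results = []  # plays the role of self.rawTranslations
log = []      # records what parse() saw at its last line


def parse_translation(url):
    # Runs later, when the "engine" gets around to this request.
    results.append("translation of " + url)


def parse():
    for url in ("page1", "page2"):
        yield Request(url, parse_translation)
    # Reached once the last request has been pulled out of the generator;
    # none of the callbacks has run yet, so results is still empty here.
    log.append(list(results))


def run(start):
    # Assumption: the engine collects the yielded requests first and only
    # afterwards executes their callbacks.
    pending = deque(start())  # exhausts parse(), including its last line
    while pending:
        request = pending.popleft()
        request.callback(request.url)


run(parse)
print(log)      # [[]]
print(results)  # ['translation of page1', 'translation of page2']
```

The list does get filled, just not by the time the last line of parse() runs, which matches the "[]" in the question.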

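Now, from what I can see, and as in most cases of confusion with scrapy, you want to keep populating an item that you created in one method, but the information you need comes from a different request.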

In that case, the communication is usually done through the Request's meta parameter, like this:

    ...
    yield Request(url, callback=self.second_method, meta={'item': myitem, 'moreinfo': 'moreinfo', 'foo': 'bar'})

def second_method(self, response):
    previous_meta_info = response.meta
    # I can access the previous item with `response.meta['item']`
    ...
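To make the hand-off concrete, here is a rough sketch of the same pattern that runs without Scrapy installed. `Request`, `Response` and `crawl` are hypothetical stand-ins written for this example (they are not Scrapy's classes), and the URLs and field names are made up; only the flow of the item through meta is the point.

```python
from collections import deque, namedtuple

# Hypothetical stand-ins for scrapy.Request and the response object; only
# the fields needed to show the pattern are modelled.
Request = namedtuple("Request", ["url", "callback", "meta"])
Response = namedtuple("Response", ["url", "meta"])


def parse(response):
    item = {"title": "title from " + response.url}
    # Hand the half-built item to the next callback through meta.
    yield Request("translation-page", parse_translation, {"item": item})


def parse_translation(response):
    item = response.meta["item"]  # the same dict that parse() created
    item["translation"] = "text from " + response.url
    yield item                    # the item is complete, so emit it


def crawl(start_url, start_callback):
    # Toy driver: runs each callback, queues yielded requests and
    # collects everything else as a finished item.
    items = []
    pending = deque([Request(start_url, start_callback, {})])
    while pending:
        req = pending.popleft()
        for out in req.callback(Response(req.url, req.meta)):
            if isinstance(out, Request):
                pending.append(out)
            else:
                items.append(out)
    return items


items = crawl("index-page", parse)
print(items)
```

Each method ends by yielding either a request or an item, so nothing has to wait on shared state: the item is emitted by whichever callback fills in its last field.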

Answer 2

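So this feels like a bit of a hacky solution, especially since I just learned about Scrapy's request priority feature, but here is my new code, which gives the desired result: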

class LbcVideosSpider(scrapy.Spider):

    ...code omitted...

    done = 0 #variable to keep track of subtopic iterations
    rawTranscripts = []
    rawTranslations = []

    def parse(self, response):
        #initialize containers for each field
        rawTitles = []
        rawVideos = []

        ...code omitted...

            index = 0
            query = sel.xpath('li/ul/li/a')
            for sub in query: #scrape the sublist items
                index += 1
                if index%2!=0: #odd numbered entries are the transcripts
                    transcriptLink = sub.xpath('@href').extract()
                    #url = response.urljoin(transcriptLink[0])
                    #yield scrapy.Request(url, callback=self.parse_transcript)
                else: #even numbered entries are the translations
                    translationLink = sub.xpath('@href').extract()
                    url = response.urljoin(translationLink[0])
                    yield scrapy.Request(url, callback=self.parse_translation, \
                        meta={'index': index/2, 'maxIndex': len(query)/2})

        print rawTitles
        print rawVideos

    def parse_translation(self, response):
        #grab meta variables
        i = response.meta['index']
        maxIndex = response.meta['maxIndex']

        #interested in p nodes without class
        query = response.xpath('//p[not(@class)]')
        for sel in query:
            rawTranslation = sel.xpath('text()').extract()
            rawTranslation = ''.join(rawTranslation) #collapse each line
            self.rawTranslations.append(rawTranslation)

        #increment number of requests done (once per response), check if finished
        self.done += 1
        print self.done
        if self.done==maxIndex:
            print self.rawTranslations

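Basically, I just keep track of how many requests have completed and make some code conditional on the final request. This prints the fully populated list.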
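The counting idea can be checked in isolation. The sketch below is not spider code: `TranslationCollector` is a hypothetical helper and the shuffled list only imitates responses coming back in an arbitrary order. It shows that the "finished" branch fires on the last response, whichever one that turns out to be.

```python
import random


class TranslationCollector(object):
    """Hypothetical helper: collects results and notices the last one."""

    def __init__(self, expected):
        self.expected = expected  # how many responses to wait for
        self.done = 0
        self.translations = []
        self.final = None         # set once every response has arrived

    def on_response(self, text):
        self.translations.append(text)
        self.done += 1            # exactly once per response
        if self.done == self.expected:
            self.final = list(self.translations)


pages = ["page %d" % n for n in range(1, 6)]
arrival_order = pages[:]
random.shuffle(arrival_order)     # responses may come back in any order

collector = TranslationCollector(expected=len(pages))
for text in arrival_order:
    assert collector.final is None  # not finished before the last response
    collector.on_response(text)

print(sorted(collector.final) == sorted(pages))
```

Scrapy runs spider callbacks one at a time in a single thread, so the counter needs no RLock. One caveat: if a request fails, its callback is never called, so the counter never reaches maxIndex and nothing is printed. A spider can also define a closed(self, reason) method, which Scrapy calls when the spider finishes, and that may be a simpler place for the final print.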

synchronization

web-crawler

scrapy

python-2.7

scrapy-spider