Scrapy (Python): Iterating over 'next' page without multiple functions

I'm using Scrapy to scrape stock data from Yahoo! Finance.

Sometimes I need to iterate over several pages, 19 of them in this example, in order to get all of the stock data.

Previously (when I knew there would only be two pages), I would use one function for each page, like this:

def stocks_page_1(self, response):

    returns_page1 = []

    #Grabs data here...

    current_page = response.url
    next_page = current_page + "&z=66&y=66"
    yield Request(next_page, self.stocks_page_2, meta={'returns_page1': returns_page1})

def stocks_page_2(self, response):

    # Grab data again...

Now, I'm wondering if there's a way to loop through the iterations within a single function to grab all of the data from all of the pages available for a given stock, rather than writing 19 or more functions.

Something like this:

        for x in range(30): # 30 was randomly selected
            current_page = response.url
            # Grabs Data
            # Check if there is a 'next' page:
            if response.xpath('//td[@align="right"]/a[@rel="next"]').extract() != ' ': 
                u = x * 66
                next_page = current_page + "&z=66&y={0}".format(u)
                # Go to the next page somehow within the function???

Updated code:

It works, but it only returns one page of data.

class DmozSpider(CrawlSpider):

    name = "dnot"
    allowed_domains = ["finance.yahoo.com", "http://eoddata.com/"]
    start_urls = ['http://finance.yahoo.com/q?s=CAT']
    rules = [
        Rule(LinkExtractor(restrict_xpaths='//td[@align="right"]/a[@rel="next"]'),
             callback='stocks1',
             follow=True),
    ]

    def stocks1(self, response):
        returns = []
        rows = response.xpath('//table[@class="yfnc_datamodoutline1"]//table/tr')[1:]
        for row in rows:
            cells = row.xpath('.//td/text()').extract()
            try:
                values = cells[-1]
                try:
                    float(values)
                    returns.append(values)
                except ValueError:
                    continue
            except ValueError:
                continue  

        unformatted_returns = response.meta.get('returns_pages')
        returns = [float(i) for i in returns]
        global required_amount_of_returns, counter
        if counter == 1 and "CAT" in response.url:
            required_amount_of_returns = len(returns)
        elif required_amount_of_returns == 0:
            raise CloseSpider("'Error with initiating required amount of returns'")

        counter += 1
        print counter

        # Iterator to calculate Rate of return 
        # ====================================
        if data_intervals == "m": 
            k = 12
        elif data_intervals == "w":
            k = 4
        else: 
            k = 30

        sub_returns_amount = required_amount_of_returns - k
        sub_returns = returns[:sub_returns_amount]
        rate_of_return = []

        if len(returns) == required_amount_of_returns or "CAT" in response.url:
            for number in sub_returns:
                numerator = number - returns[k]
                rate = numerator/returns[k]
                if rate == '': 
                    rate = 0
                rate_of_return.append(rate)
                k += 1

        item = Website()
        items = []
        item['url'] = response.url
        item['name'] = response.xpath('//div[@class="title"]/h2/text()').extract()
        item['avg_returns'] = numpy.average(rate_of_return)
        item['var_returns'] = numpy.cov(rate_of_return)
        item['sd_returns'] = numpy.std(rate_of_return)
        item['returns'] = returns
        item['rate_of_returns'] = rate_of_return
        item['exchange'] = response.xpath('//span[@class="rtq_exch"]/text()').extract()
        item['ind_sharpe'] = ((numpy.average(rate_of_return) - RFR) / numpy.std(rate_of_return))
        items.append(item)
        yield item

You see, a parse callback is just a function that takes a response and returns or yields Items or Requests, or both. There is no problem at all with reusing these callbacks, so you can simply pass the same callback for every request.
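
For illustration, here is a minimal sketch of that pattern, assuming Scrapy 1.0+; the spider name and the returns_so_far meta key are made up for the example and are not taken from the code above. One callback keeps requesting the 'next' page with itself as the callback, carrying the values collected so far along in meta:

    import scrapy
    from scrapy import Request


    class StockPagesSpider(scrapy.Spider):
        # Minimal sketch (Scrapy 1.0+): one callback reused for every 'next' page.
        name = 'stock_pages'  # illustrative name, not from the original code
        start_urls = ['http://finance.yahoo.com/q/hp?s=PWF.TO&a=04&b=19&c=2005&d=04&e=19&f=2010&g=d&z=66&y=132']

        def parse(self, response):
            # Merge this page's last-column values with whatever earlier pages passed along.
            returns = response.meta.get('returns_so_far', [])
            returns += response.xpath(
                '//table[@class="yfnc_datamodoutline1"]//table/tr/td[last()]/text()').extract()

            next_href = response.xpath('//td[@align="right"]/a[@rel="next"]/@href').extract()
            if next_href:
                # A 'next' page exists: request it with this very same callback.
                yield Request(response.urljoin(next_href[0]),
                              callback=self.parse,
                              meta={'returns_so_far': returns})
            else:
                # Last page reached: emit everything collected across all pages.
                yield {'returns': returns}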

Now, you could pass the current page's info along in the Request meta, as sketched above, but I would rather leverage a CrawlSpider to crawl every page. It's really simple: start by generating the spider from the command line:

scrapy genspider --template crawl finance finance.yahoo.com

Then write it up like this:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

Scrapy 1.0 has deprecated the scrapy.contrib namespace for the modules above, but if you're stuck with 0.24, use scrapy.contrib.linkextractors and scrapy.contrib.spiders.
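
That is, on 0.24 the equivalent imports would read:

    from scrapy.contrib.linkextractors import LinkExtractor
    from scrapy.contrib.spiders import CrawlSpider, Rule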

from yfinance.items import YfinanceItem


class FinanceSpider(CrawlSpider):
    name = 'finance'
    allowed_domains = ['finance.yahoo.com']
    start_urls = ['http://finance.yahoo.com/q/hp?s=PWF.TO&a=04&b=19&c=2005&d=04&e=19&f=2010&g=d&z=66&y=132']

    rules = (
        Rule(LinkExtractor(restrict_css='[rel="next"]'),
             callback='parse_items',
             follow=True),
    )

The LinkExtractor will pick up the links in the response to follow, but it can be limited with XPath (or CSS) and regular expressions. See the documentation for more.
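
For example, the extractor could be narrowed both by the pagination cell and by a URL pattern; the allow regex below is only a guess at what the paging URLs look like:

    LinkExtractor(restrict_xpaths='//td[@align="right"]/a[@rel="next"]',
                  allow=r'z=66&y=\d+')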

The Rules will follow the links and call the callback on every response; follow=True will keep extracting links on every new response, but this can be limited by depth. See the documentation again.
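
Depth can be capped with Scrapy's DEPTH_LIMIT setting, for instance in settings.py (the value 20 here is arbitrary):

    # settings.py
    DEPTH_LIMIT = 20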

    def parse_items(self, response):
        for line in response.css('.yfnc_datamodoutline1 table tr')[1:-1]:
            yield YfinanceItem(date=line.css('td:first-child::text').extract()[0])

Just yield the Items; the Requests for the next pages will be handled by the CrawlSpider Rules.
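
To try it out, the spider can be run with a feed export, e.g. (the output filename is arbitrary):

    scrapy crawl finance -o dates.csv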