Scrapy: display response.request.url inside zip()

I'm trying to create a simple Scrapy spider that loops over a set of standard URLs and extracts their Alexa rank. The output I want has just two columns: one showing the scraped Alexa rank and one showing the scraped URL.

Everything seems to work except that I can't get the scraped URL to display correctly in my output. My current code is:

import scrapy

class AlexarSpider(scrapy.Spider):
    name = 'AlexaR'
    #Will update allowed domains and start URL once I fix this problem
    start_urls = ['http://www.alexa.com/siteinfo/google.com/', 
    'https://www.alexa.com/siteinfo/reddit.com']

    def parse(self, response):
        rank = response.css(".rankmini-rank::text").extract()
        url_raw = response.request.url
    
        #extract content into rows
        for item in zip(url_raw,rank):
            scraped_info = {
                str('url_raw') : item[0],
                'rank' : item[1]
            }

        yield scraped_info

When run, the code outputs a table showing:

AlexaRank Output

url_raw rank
h
t 21
t
h
t 1
t

Those are the correct scraped ranks (21 and 1), but the url_raw field shows "h" or "t" instead of the actual URL string. I've tried converting the url_raw variable to a string, with no success.

How can I set up the variable so that it displays the correct URL?

Thanks in advance for your help!

Here zip() takes the list rank and the string url_raw, so each iteration pulls a single character from url_raw.
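A quick standalone illustration of the mismatch, with made-up rank values:

```python
# A string is an iterable of characters, so zip() pairs each rank
# with one character of the URL instead of with the whole URL.
url_raw = "http://www.alexa.com/siteinfo/google.com/"
rank = ["21", "1"]

pairs = list(zip(url_raw, rank))
print(pairs)  # [('h', '21'), ('t', '1')]
```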

Solution using cycle:

import scrapy
from itertools import cycle


class AlexarSpider(scrapy.Spider):
    name = 'AlexaR'
    #Will update allowed domains and start URL once I fix this problem
    start_urls = ['http://www.alexa.com/siteinfo/google.com/',
                  'https://www.alexa.com/siteinfo/reddit.com']

    def parse(self, response):
        rank = response.css(".rankmini-rank::text").extract()
        url_raw = response.request.url
        #extract content into rows
        for item in zip(cycle([url_raw]), rank):
            scraped_info = {
                'url_raw': item[0],
                'rank': item[1]
            }
            yield scraped_info
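cycle([url_raw]) repeats the one-element list endlessly, and zip() stops at the shorter iterable, so every rank gets paired with the full URL. A standalone sketch with made-up ranks:

```python
from itertools import cycle

url_raw = "https://www.alexa.com/siteinfo/reddit.com"
rank = ["21", "1"]

# cycle yields url_raw forever; zip stops once rank is exhausted
pairs = list(zip(cycle([url_raw]), rank))
print(pairs)
# [('https://www.alexa.com/siteinfo/reddit.com', '21'),
#  ('https://www.alexa.com/siteinfo/reddit.com', '1')]
```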

Solution using a list:

import scrapy


class AlexarSpider(scrapy.Spider):
    name = 'AlexaR'
    #Will update allowed domains and start URL once I fix this problem
    start_urls = ['http://www.alexa.com/siteinfo/google.com/',
                  'https://www.alexa.com/siteinfo/reddit.com']

    def parse(self, response):
        rank = response.css(".rankmini-rank::text").extract()
        url_raw = [response.request.url for i in range(len(rank))]
        #extract content into rows
        for item in zip(url_raw, rank):
            scraped_info = {
                'url_raw': item[0],
                'rank': item[1]
            }
            yield scraped_info
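The comprehension builds one copy of the URL per rank; multiplying a one-element list by len(rank) produces the same list, if a shorter spelling is preferred (made-up values):

```python
url = "https://www.alexa.com/siteinfo/reddit.com"
rank = ["21", "1"]

# both expressions build a list with one URL per rank entry
assert [url for i in range(len(rank))] == [url] * len(rank)
```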