Scrapy display response.request.url inside zip()
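I am trying to write a simple Scrapy spider that loops through a fixed set of URLs and extracts their Alexa rank. The output I want has just two columns: one with the scraped Alexa rank and one with the URL that was scraped.
Everything seems to work, except that I cannot get the scraped URL to display correctly in my output. My current code is: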
    import scrapy

    class AlexarSpider(scrapy.Spider):
        name = 'AlexaR'
        # Will update allowed domains and start URL once I fix this problem
        start_urls = ['http://www.alexa.com/siteinfo/google.com/',
                      'https://www.alexa.com/siteinfo/reddit.com']

        def parse(self, response):
            rank = response.css(".rankmini-rank::text").extract()
            url_raw = response.request.url
            # extract content into rows
            for item in zip(url_raw, rank):
                scraped_info = {
                    str('url_raw'): item[0],
                    'rank': item[1]
                }
                yield scraped_info
When I run it, the code outputs a table like this:
AlexaRank Output

url_raw | rank |
---|---|
h | |
t | 21 |
t | |
h | |
t | 1 |
t | |
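The scraped ranks (21 and 1) are correct, but the url_raw field shows "h" or "t" instead of the actual URL string. I have tried converting the url_raw variable to a string, but that did not help.
How do I set the variable so that it shows the correct URL?
Thanks in advance for your help!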
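Here zip() receives the list 'rank' and the string 'url_raw'. A string is itself an iterable of characters, so each iteration takes a single character from 'url_raw' instead of the whole URL.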
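To make the cause concrete, here is a minimal sketch outside of Scrapy. The contents of `rank` are made up for illustration (the question does not show the raw list); only the behavior of `zip()` matters here.

```python
# Hypothetical values standing in for response.request.url and the
# extracted rank list; the real scraped data may look different.
url_raw = "https://www.alexa.com/siteinfo/reddit.com"
rank = ["\n", "21", "\n"]

# zip() iterates the string one character at a time and stops
# as soon as the shorter input (the rank list) is exhausted.
rows_broken = list(zip(url_raw, rank))
print(rows_broken)  # [('h', '\n'), ('t', '21'), ('t', '\n')]
```

The first column holds 'h', 't', 't', which are simply the first three characters of the URL.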
Solution with cycle:
    import scrapy
    from itertools import cycle

    class AlexarSpider(scrapy.Spider):
        name = 'AlexaR'
        # Will update allowed domains and start URL once I fix this problem
        start_urls = ['http://www.alexa.com/siteinfo/google.com/',
                      'https://www.alexa.com/siteinfo/reddit.com']

        def parse(self, response):
            rank = response.css(".rankmini-rank::text").extract()
            url_raw = response.request.url
            # extract content into rows
            for item in zip(cycle([url_raw]), rank):
                scraped_info = {
                    str('url_raw'): item[0],
                    'rank': item[1]
                }
                yield scraped_info
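As a possible variation on the cycle approach, `itertools.repeat` yields the same object indefinitely without needing the one-element list. The sketch below is standalone so it can be run without Scrapy; `pair_with_url` is a hypothetical helper name and the sample rank is made up.

```python
from itertools import repeat

def pair_with_url(url, ranks):
    """Pair every rank with the same URL (hypothetical helper, not part of Scrapy)."""
    # repeat(url) is infinite, but zip() stops when ranks runs out.
    return [{'url_raw': u, 'rank': r} for u, r in zip(repeat(url), ranks)]

rows = pair_with_url('https://www.alexa.com/siteinfo/reddit.com', ['21'])
print(rows)
```

Inside the spider, the same idea would be `zip(repeat(url_raw), rank)` in place of `zip(cycle([url_raw]), rank)`.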
Solution with a list:
    import scrapy

    class AlexarSpider(scrapy.Spider):
        name = 'AlexaR'
        # Will update allowed domains and start URL once I fix this problem
        start_urls = ['http://www.alexa.com/siteinfo/google.com/',
                      'https://www.alexa.com/siteinfo/reddit.com']

        def parse(self, response):
            rank = response.css(".rankmini-rank::text").extract()
            # one copy of the URL per extracted rank, so zip() pairs them up
            url_raw = [response.request.url for i in range(len(rank))]
            # extract content into rows
            for item in zip(url_raw, rank):
                scraped_info = {
                    str('url_raw'): item[0],
                    'rank': item[1]
                }
                yield scraped_info
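Since the URL is the same for every row of a given response, zip() may not be needed at all. The following is only a sketch under two assumptions: that a plain loop over the ranks is acceptable, and that the selector can also return whitespace-only text nodes that should be skipped. `build_rows` is a hypothetical helper and the sample data is invented.

```python
def build_rows(url, ranks):
    """Yield one row per non-empty rank, all sharing the same URL (hypothetical helper)."""
    for text in ranks:
        value = text.strip()
        if value:  # skip whitespace-only entries (assumed, not confirmed by the question)
            yield {'url_raw': url, 'rank': value}

rows = list(build_rows('http://www.alexa.com/siteinfo/google.com/', ['\n', ' 1 ', '\n']))
print(rows)
```

In the spider's parse method this would amount to looping over `rank` directly and using `response.request.url` as the value of 'url_raw' in each yielded dict.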