Scrapy collecting data in various steps
I'm trying to scrape data from a football website, but I'm running into some difficulties. I have a series of links of two kinds:
- 1) websitefake.com/player/p1234
- 2) websitefake.com/player/p1234/statistics
So the bot should log in, then start scraping each link.
Here is my attempt:
import re

from scrapy import FormRequest, Request
from scrapy.spiders import CrawlSpider

class fanta(CrawlSpider):
    from website.players_id import players_id  # list with all player ids like p40239
    name = 'bot2'
    login_page = "https://loginpage.com/"

    #### HELPER ####
    prova = "https://websitefake.com/player/p40239"
    # This is the part that generates the 600 player profile urls
    start_urls = [prova.replace("p40239", i) for i in players_id]

    def start_requests(self):  # LOGIN
        return [FormRequest(
            self.login_page,
            formdata={'name': 'aaa', 'pass': 'aaa'},
            callback=self.logged_in)]

    def logged_in(self, response):
        if "Attenzione" in response.body:  # Login check
            self.log("Could not log in")
        else:
            self.log("Logged In")  # If logged in, start scraping
            for url in self.start_urls:
                yield Request(url, callback=self.parse)

    # Scrape the data from the https://websitefake.com/player/p1234 page
    def parse(self, response):
        name = response.css("response name::text").extract()
        surname = response.css("response surname::text").extract()
        team_name = response.css("response team_name::text").extract()
        role = response.css("response role_name::text").extract()

        # Add /statistics after the "p1234", creating the url for parse_stats
        p = re.findall(r"p\d+", response.url)
        new_string = p[0] + "/statistics"
        url_replaced = re.sub(r"p\d+", new_string, response.url)

        # Create the Request for https://websitefake.com/player/p1234/statistics,
        # passing the extracted fields through to parse_stats via meta
        r = Request(url_replaced, callback=self.parse_stats, encoding="utf-8")
        r.meta['name'] = name
        r.meta['surname'] = surname
        r.meta['team_name'] = team_name
        r.meta['role'] = role
        yield r

    def parse_stats(self, response):
        player = Player()
        stats = response.xpath("/response/Stat").extract()  # N. of Stat tags
        for s in range(1, len(stats) + 1):
            time = response.xpath("/response/Stat[{}]/timestamp/text()".format(s)).extract()
            player['name'] = response.meta['name']
            player['surname'] = response.meta['surname']
            player['team_name'] = response.meta['team_name']
            player['role'] = response.meta['role']
            #### DATA FROM THE STATISTICS PAGE ####
            yield player
The problem is that when I run the spider it keeps scraping with the parse method (the player's page) and never follows the parse_stats callback, so what I get is:
- 200 crawled websitefake.com/player/p1234
- 200 crawled websitefake.com/player/p1111
- 200 crawled websitefake.com/player/p2222
instead of this:
- 200 crawled websitefake.com/player/p1234
- 200 crawled websitefake.com/player/p1234/statistics
- 200 crawled websitefake.com/player/p1111
- 200 crawled websitefake.com/player/p1111/statistics
I've tried everything I could think of; maybe I'm getting the yield wrong, I don't know :S
Thanks in advance for any answers!
You can't use CrawlSpider and override parse at the same time. Since you aren't using any rules, you probably want a plain Spider instead.
See the warning in the documentation.
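
A minimal sketch of that change, assuming the login form, selectors, and meta passing stay as in the question (the FantaSpider class name and the single placeholder start URL here are just illustrative; keep building start_urls from players_id as before):

import re
import scrapy
from scrapy import FormRequest, Request

class FantaSpider(scrapy.Spider):   # plain Spider instead of CrawlSpider
    name = 'bot2'
    login_page = "https://loginpage.com/"
    start_urls = ["https://websitefake.com/player/p40239"]  # build from players_id as before

    def start_requests(self):
        # Log in first; the profile pages are only requested from logged_in
        return [FormRequest(
            self.login_page,
            formdata={'name': 'aaa', 'pass': 'aaa'},
            callback=self.logged_in)]

    def logged_in(self, response):
        if b"Attenzione" in response.body:   # response.body is bytes under Python 3
            self.logger.error("Could not log in")
            return
        for url in self.start_urls:
            yield Request(url, callback=self.parse)

    def parse(self, response):
        # Extract the profile fields as before, then request the /statistics page.
        # With a plain Spider, parse is just an ordinary callback, so the request
        # below is scheduled and parse_stats actually runs.
        stats_url = re.sub(r"(p\d+)", r"\1/statistics", response.url)
        yield Request(stats_url, callback=self.parse_stats,
                      meta={'name': response.css("response name::text").extract()})

    def parse_stats(self, response):
        # Build and yield the Player item here, reading the profile fields from response.meta
        self.logger.info("statistics for %s: name=%s", response.url, response.meta['name'])

Alternatively, if you want to keep CrawlSpider (for example to add rules later), the same documentation warning applies: rename the profile callback from parse to something like parse_player so it no longer shadows the parse method CrawlSpider uses internally.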