Use scrapy to collect information for one item from multiple pages (and output it as a nested dictionary)
I am trying to scrape data from a tournament website.
Each tournament has information such as the venue, date, price, and so on.
There is also a ladder of the participating teams; the ladder is a table that only gives each team's name and its position in the ranking.
Clicking a team's name takes you to a page where you can get the list of players that team selected for the tournament.
I would like to scrape the data into the following shape:
[{
    "name": "Grand Tournament",
    "venue": "...",
    "date": "...",
    "rank": [
        {"team_name": "Team name",
         "rank": 1,
         "roster": ["player1", "player2", "..."]},
        {"team_name": "Team name",
         "rank": 2,
         "roster": ["player1", "player2", "..."]}
    ]
}]
I have the following spider to scrape a single tournament page (usage: scrapy crawl tournamentspider -a start_url="<tournamenturl>"):
from urllib.parse import urlparse

import scrapy


class TournamentSpider(scrapy.Spider):
    name = "tournamentspider"
    allowed_domains = ["..."]

    def start_requests(self):
        try:
            yield scrapy.Request(url=self.start_url, callback=self.parse)
        except AttributeError:
            raise ValueError("You must use this spider with argument start_url.")

    def parse(self, response):
        tournament_item = TournamentItem()
        tournament_item['teams'] = []
        tournament_item['name'] = "Tournament Name"
        tournament_item['date'] = "Date"
        tournament_item['venue'] = "Venue"
        ladder = response.css('#ladder')
        for row in ladder.css('table tbody tr'):
            row_cells = row.xpath('td')
            participation_item = PlayerParticipationItem()
            participation_item['team_name'] = "Team Name"
            participation_item['rank'] = "x"
            # Parse roster
            roster_url_page = row_cells[2].xpath('a/@href').get()
            # Follow link to extract list
            base_url = urlparse(response.url)
            absolute_url = f'{base_url.scheme}://{base_url.hostname}/{roster_url_page}'
            request = scrapy.Request(absolute_url, callback=self.parse_roster_page)
            request.meta['participation_item'] = participation_item
            yield request
            # Include participation item in the ladder
            tournament_item['teams'].append(participation_item)
        yield tournament_item

    def parse_roster_page(self, response):
        participation_item = response.meta['participation_item']
        participation_item['roster'] = ["Player1", "Player2", "..."]
        return participation_item
My problem is that this spider produces the following output:
[{
    "name": "Grand Tournament",
    "venue": "...",
    "date": "...",
    "rank": [
        {"team_name": "Team name",
         "rank": 1},
        {"team_name": "Team name",
         "rank": 2}
    ]
},
{"team_name": "Team name",
 "rank": 1,
 "roster": ["player1", "player2", "..."]},
{"team_name": "Team name",
 "rank": 2,
 "roster": ["player1", "player2", "..."]}]
I understand that those extra items in the output are generated by the yield request line. When I remove it, the roster pages are no longer scraped, so the extra items disappear, but then I have no roster data.
Is it possible to get the output I want?
I know an alternative might be to scrape the tournament information first and then group the items using a field that identifies the tournament (a sketch of that fallback follows). But I would like to know whether the original approach is feasible.
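For reference, that grouping fallback could be a small post-processing step after the crawl. This is only a minimal sketch under stated assumptions: it assumes the spider is run with -o items.jl (JSON lines output) and that every yielded item carries a hypothetical tournament_id field linking teams to their tournament; neither of those exists in the spider above.

import json
from collections import defaultdict

tournaments = {}
rosters = defaultdict(list)

# items.jl holds one JSON object per line: tournament items (which have a
# "venue" key) and team items (which carry a roster). "tournament_id" is an
# assumed, illustrative field, not part of the original spider.
with open('items.jl') as f:
    for line in f:
        item = json.loads(line)
        if 'venue' in item:
            tournaments[item['tournament_id']] = item
        else:
            rosters[item['tournament_id']].append(item)

# Attach each tournament's team items, rosters included, under "rank".
for tid, tournament in tournaments.items():
    tournament['rank'] = rosters[tid]

print(json.dumps(list(tournaments.values()), indent=2))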
You can use scrapy-inline-requests to call parse_roster_page, and you will get the roster data without yielding it as a separate item.
The only change you need to include is the @inline_requests decorator on the parse_roster_page function:
from inline_requests import inline_requests

class TournamentSpider(scrapy.Spider):

    def parse(self, response):
        ...

    @inline_requests
    def parse_roster_page(self, response):
        ...
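For context, the documented pattern of scrapy-inline-requests is that the decorated method can yield a Request (without a callback) and receive the Response back inline at that point. Below is a minimal sketch of how the whole spider could be restructured around that pattern; note that here the decorator sits on parse, the method that issues the roster requests, which is one possible wiring rather than necessarily the answer's exact intent. The selectors and field values are the question's placeholders, TournamentItem and PlayerParticipationItem are assumed to be importable from the project's items module, and extract_roster is an illustrative helper name.

from urllib.parse import urlparse

import scrapy
from inline_requests import inline_requests


class TournamentSpider(scrapy.Spider):
    name = "tournamentspider"

    @inline_requests
    def parse(self, response):
        tournament_item = TournamentItem()
        tournament_item['teams'] = []
        tournament_item['name'] = "Tournament Name"

        for row in response.css('#ladder table tbody tr'):
            participation_item = PlayerParticipationItem()
            participation_item['team_name'] = "Team Name"
            participation_item['rank'] = "x"

            roster_url_page = row.xpath('td')[2].xpath('a/@href').get()
            base_url = urlparse(response.url)
            absolute_url = f'{base_url.scheme}://{base_url.hostname}/{roster_url_page}'

            # The response comes back inline here instead of going to a
            # separate callback, so the roster can be attached before the
            # tournament item is ever yielded.
            roster_response = yield scrapy.Request(absolute_url)
            participation_item['roster'] = self.extract_roster(roster_response)

            tournament_item['teams'].append(participation_item)

        # A single nested item, with every roster already filled in.
        yield tournament_item

    def extract_roster(self, response):
        # Placeholder parsing, as in the question.
        return ["Player1", "Player2", "..."]

Because no team item is yielded on its own, the output contains only the one nested tournament dictionary the question asks for.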