Multiple nested requests with scrapy
I am trying to scrape some aircraft schedule information from the www.flightradar24.com website for a research project.
The hierarchy of the json file I want to obtain is like this:
Object ID
- country
  - link
  - name
  - airports
    - airport0
      - code_total
      - link
      - lat
      - lon
      - name
      - schedule
        - ...
        - ...
    - airport1
      - code_total
      - link
      - lat
      - lon
      - name
      - schedule
        - ...
        - ...
Country and Airport are stored using items, and as you can see in the json file, a CountryItem (link, name attributes) ultimately stores several AirportItem (code_total, link, lat, lon, name, schedule):
import scrapy

class CountryItem(scrapy.Item):
    name = scrapy.Field()
    link = scrapy.Field()
    airports = scrapy.Field()
    other_url = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)

class AirportItem(scrapy.Item):
    name = scrapy.Field()
    code_little = scrapy.Field()
    code_total = scrapy.Field()
    lat = scrapy.Field()
    lon = scrapy.Field()
    link = scrapy.Field()
    schedule = scrapy.Field()
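For reference, the nesting I am after is just a plain Python list stored in the country's airports field; a minimal sketch (the values here are made up for illustration):

    # hypothetical values, for illustration only
    country = CountryItem(name='France', link='https://www.flightradar24.com/data/airports/france')
    airport = AirportItem(name='Paris Charles de Gaulle', code_little='CDG')
    country['airports'] = [dict(airport)]  # a list of airport dicts nested under the country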
And here is my scrapy spider AirportsSpider that is supposed to do that:
import json

import jmespath
import scrapy
from bs4 import BeautifulSoup
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.spiders import Rule

class AirportsSpider(scrapy.Spider):
    name = "airports"
    start_urls = ['https://www.flightradar24.com/data/airports']
    allowed_domains = ['flightradar24.com']

    def clean_html(self, html_text):
        soup = BeautifulSoup(html_text, 'html.parser')
        return soup.get_text()

    rules = [
        # Extract links matching 'data/airports/' (note: rules only take effect on a CrawlSpider)
        Rule(LxmlLinkExtractor(allow=('data/airports/',)), callback='parse')
    ]

    def parse(self, response):
        count_country = 0
        countries = []
        for country in response.xpath('//a[@data-country]'):
            if count_country > 5:
                break
            item = CountryItem()
            url = country.xpath('./@href').extract()
            name = country.xpath('./@title').extract()
            item['link'] = url[0]
            item['name'] = name[0]
            count_country += 1
            countries.append(item)
            yield scrapy.Request(url[0], meta={'my_country_item': item}, callback=self.parse_airports)

    def parse_airports(self, response):
        item = response.meta['my_country_item']
        airports = []
        for airport in response.xpath('//a[@data-iata]'):
            url = airport.xpath('./@href').extract()
            iata = airport.xpath('./@data-iata').extract()
            iatabis = airport.xpath('./small/text()').extract()
            name = ''.join(airport.xpath('./text()').extract()).strip()
            lat = airport.xpath("./@data-lat").extract()
            lon = airport.xpath("./@data-lon").extract()

            iAirport = AirportItem()
            iAirport['name'] = self.clean_html(name)
            iAirport['link'] = url[0]
            iAirport['lat'] = lat[0]
            iAirport['lon'] = lon[0]
            iAirport['code_little'] = iata[0]
            iAirport['code_total'] = iatabis[0]
            airports.append(iAirport)

        for airport in airports:
            json_url = ('https://api.flightradar24.com/common/v1/airport.json'
                        '?code={code}&plugin[]=&plugin-setting[schedule][mode]='
                        '&plugin-setting[schedule][timestamp]={timestamp}'
                        '&page=1&limit=50&token=').format(code=airport['code_little'], timestamp="1484150483")
            yield scrapy.Request(json_url, meta={'airport_item': airport}, callback=self.parse_schedule)

        item['airports'] = airports
        yield {"country": item}

    def parse_schedule(self, response):
        item = response.request.meta['airport_item']
        jsonload = json.loads(response.body_as_unicode())
        json_expression = jmespath.compile("result.response.airport.pluginData.schedule")
        item['schedule'] = json_expression.search(jsonload)
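For readers unfamiliar with jmespath: the compiled expression simply walks a dotted path into nested JSON. A toy example, with a made-up payload shaped like the API response:

    import jmespath

    data = {"result": {"response": {"airport": {"pluginData": {"schedule": {"arrivals": []}}}}}}
    expr = jmespath.compile("result.response.airport.pluginData.schedule")
    print(expr.search(data))  # -> {'arrivals': []}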
Explanation:

In my first-level parse, I call a request for the link of each country I find, passing the CountryItem created via meta={'my_country_item': item}. Each of these requests calls back self.parse_airports.

At the second parse level, parse_airports, I grab the CountryItem created earlier with item = response.meta['my_country_item'] and create a new item, iAirport = AirportItem(), for each airport I find on this country page. Now I want to get the schedule information for each AirportItem created and stored in the airports list, so still in parse_airports I run a for loop over airports to fetch the schedule with a new request. Since I want this schedule information to end up in my AirportItem, I pass that item in the meta: meta={'airport_item': airport}. The callback of this request runs parse_schedule.

At the third parse level, parse_schedule, I inject the schedule information collected by scrapy into the AirportItem previously created, using response.request.meta['airport_item'].

But I have a problem in my source code: scrapy correctly scrapes all the information (country, airport, schedule), yet my understanding of nested items seems to be wrong. As you can see, the json I produce contains country > list of (airport), but not country > list of (airport > schedule).

My code is on github: https://github.com/IDEES-Rouen/Flight-Scrapping
The problem is that you are forking your item: by your logic you want only one item per country, so after parsing a country you can't yield multiple items at any point. What you want to do is stack them all into one single item.
For that you need to create a parsing loop:
# needed at the top of the spider module:
import json
from scrapy import Request

def parse_airports(self, response):
    item = response.meta['my_country_item']
    item['airports'] = []

    for airport in response.xpath('//a[@data-iata]'):
        url = airport.xpath('./@href').extract()
        iata = airport.xpath('./@data-iata').extract()
        iatabis = airport.xpath('./small/text()').extract()
        name = ''.join(airport.xpath('./text()').extract()).strip()
        lat = airport.xpath("./@data-lat").extract()
        lon = airport.xpath("./@data-lon").extract()

        iAirport = dict()
        iAirport['name'] = 'foobar'
        iAirport['link'] = url[0]
        iAirport['lat'] = lat[0]
        iAirport['lon'] = lon[0]
        iAirport['code_little'] = iata[0]
        iAirport['code_total'] = iatabis[0]
        item['airports'].append(iAirport)

    urls = []
    for airport in item['airports']:
        json_url = ('https://api.flightradar24.com/common/v1/airport.json'
                    '?code={code}&plugin[]=&plugin-setting[schedule][mode]='
                    '&plugin-setting[schedule][timestamp]={timestamp}'
                    '&page=1&limit=50&token=').format(
            code=airport['code_little'], timestamp="1484150483")
        urls.append(json_url)
    if not urls:
        return item

    # start with the first url; pop(0) keeps the url order in sync with the index i
    next_url = urls.pop(0)
    return Request(next_url, self.parse_schedule,
                   meta={'airport_item': item, 'airport_urls': urls, 'i': 0})

def parse_schedule(self, response):
    """we want to loop this continuously for every schedule item"""
    item = response.meta['airport_item']
    i = response.meta['i']
    urls = response.meta['airport_urls']
    jsonload = json.loads(response.body_as_unicode())
    item['airports'][i]['schedule'] = 'foobar'  # placeholder; see the sketch below

    # now do the next schedule item
    if not urls:
        yield item
        return
    url = urls.pop(0)
    yield Request(url, self.parse_schedule,
                  meta={'airport_item': item, 'airport_urls': urls, 'i': i + 1})
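The two 'foobar' placeholders above are stand-ins for your own extraction. As a sketch, plugging the question's jmespath lookup back into parse_schedule would look like this (assuming the API response keeps the shape shown in the question, with import jmespath at the top):

    # inside parse_schedule, replacing the 'foobar' placeholder:
    jsonload = json.loads(response.body_as_unicode())
    schedule = jmespath.search("result.response.airport.pluginData.schedule", jsonload)
    item['airports'][i]['schedule'] = schedule

Running scrapy crawl airports -o countries.json should then yield one item per country, each carrying its full airports list with a schedule per airport.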