Trying to parse JSON files using Scrapy
I'm trying to parse a file like this one, but for many latitudes and longitudes. The spider loops through all of the pages, but doesn't output anything.
Here is my code:
import scrapy
import json
from tutorial.items import DmozItem
from scrapy.http import Request
from scrapy.contrib.spiders import CrawlSpider, Rule

class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["proadvisorservice.intuit.com"]

    min_lat = 35
    max_lat = 40
    min_long = -100
    max_long = -90

    def start_requests(self):
        for i in range(self.min_lat, self.max_lat):
            for j in range(self.min_long, self.max_long):
                yield scrapy.Request('http://proadvisorservice.intuit.com/v1/search?latitude=%d&longitude=%d&radius=100&pageNumber=1&pageSize=&sortBy=distance' % (i, j),
                                     meta={'index': (i, j)},
                                     callback=self.parse)

    def parse(self, response):
        jsonresponse = json.loads(response.body_as_unicode())
        for x in jsonresponse['searchResults']:
            item = DmozItem()
            item['firstName'] = x['firstName']
            item['lastName'] = x['lastName']
            item['phoneNumber'] = x['phoneNumber']
            item['email'] = x['email']
            item['companyName'] = x['companyName']
            item['qbo'] = x['qbopapCertVersions']
            item['qbd'] = x['papCertVersions']
            yield item
When using CrawlSpider, you should not override the parse() method:
When writing crawl spider rules, avoid using parse as callback, since
the CrawlSpider uses the parse method itself to implement its logic.
So if you override the parse method, the crawl spider will no longer
work.
(source)
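In other words, CrawlSpider expects its link-following rules to point at a callback with a different name, so that its internal parse() stays intact. A minimal sketch of that pattern (the URL pattern and the parse_item name are illustrative, not from your code):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class ExampleSpider(CrawlSpider):
    name = "example"
    start_urls = ["http://example.com"]

    # Rules must use a custom callback; CrawlSpider reserves
    # parse() for its own link-following logic
    rules = (
        Rule(LinkExtractor(allow=r'/items/'), callback='parse_item'),
    )

    def parse_item(self, response):
        # process each followed page here
        pass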
But since you are building your requests manually and aren't using any CrawlSpider functionality anyway, I'd suggest not inheriting from it at all. Instead, inherit from scrapy.Spider:
class DmozSpider(scrapy.Spider):
...