How can I jump to next page in Scrapy
I'm trying to scrape results from here using Scrapy. The problem is that not all of the classes show up on the page until the 'load more results' tab is clicked.

The issue can be seen here:

My code looks like this:
class ClassCentralSpider(CrawlSpider):
    name = "class_central"
    allowed_domains = ["www.class-central.com"]
    start_urls = (
        'https://www.class-central.com/courses/recentlyAdded',
    )
    rules = (
        Rule(
            LinkExtractor(
                # allow=("index\d00\.html",),
                restrict_xpaths=('//div[@id="show-more-courses"]',)
            ),
            callback='parse',
            follow=True
        ),
    )

    def parse(self, response):
        x = response.xpath('//span[@class="course-name-text"]/text()').extract()
        item = ClasscentralItem()
        for y in x:
            item['name'] = y
            print item['name']
The second page of the site seems to be generated by an AJAX call. If you look at the network tab of any browser's inspect tool, you'll see something like this:

In this case it retrieves a json file from https://www.class-central.com/maestro/courses/recentlyAdded?page=2&_=1469471093134

The url parameter _=1469471093134 doesn't seem to do anything, so you can trim it down to: https://www.class-central.com/maestro/courses/recentlyAdded?page=2

The returned json contains the html code of the next page:
import json

from scrapy import Selector

# so you just need to load it up with
data = json.loads(response.body)
# and convert it to a scrapy selector
sel = Selector(text=data['table'])
To replicate this in your code, try something like:
import json

from scrapy import Request, Selector
from w3lib.url import add_or_replace_parameter


def parse(self, response):
    # check if the response is json; if so, convert it to a selector
    if response.meta.get('is_json', False):
        # convert the json to a scrapy Selector here for parsing
        sel = Selector(text=json.loads(response.body)['table'])
    else:
        sel = Selector(response)

    # parse the page for items
    x = sel.xpath('//span[@class="course-name-text"]/text()').extract()
    item = ClasscentralItem()
    for y in x:
        item['name'] = y
        print(item['name'])

    # do next page
    next_page_el = sel.xpath("//div[@id='show-more-courses']")
    if next_page_el:  # there is a next page
        next_page = response.meta.get('page', 1) + 1
        # make the next page url from the AJAX endpoint found above
        url = add_or_replace_parameter(
            'https://www.class-central.com/maestro/courses/recentlyAdded',
            'page', next_page)
        yield Request(url, self.parse, meta={'page': next_page, 'is_json': True})