Python LinkExtractor to go to next pages doesn't work
Below is the code I have for trying to scrape a site that spans more than one page... I can't get the Rule class to work. What am I doing wrong?
#import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from tutorial.items import SkodaItem

class SkodaSpider(CrawlSpider):
    name = "skodas"
    allowed_domains = ["marktplaats.nl"]
    start_urls = [
        "http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryId=151&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010&mileageTo=150.000&attributes=S%2C1185&attributes=S%2C484&attributes=M%2C11564&startDateFrom=always"
    ]

    rules = [
        Rule(LinkExtractor(restrict_xpaths=('//a[@class="button secondary medium pagination-next"]/a',)), follow=True),
    ]

    # def parse_item(self, response):
    def parse(self, response):
        #self.logger.info('Hi, this is an item page! %s', response.url)
        x = 0
        items = []
        for sel in response.xpath('//*[@id="search-results"]/section[2]/article'):
            x = x + 1
            item = SkodaItem()
            item["title"] = sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[1]/div[1]/h2/a/span').re('.+>(.+)</span>')
            #print sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[1]/div[1]/h2/a/span').extract()
            item["leeftijd"] = sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[1]/div[2]/span[1]').re('.+">(.+)</span>')
            item["prijs"] = sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[2]/div[1]/div/div').re('.+\n +(.+)\n.+')
            item["km"] = sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[1]/div[2]/span[3]').re('.+">(.+)</span>')

            # handle output (print or save to database)
            items.append(item)
            print item["title"], item["leeftijd"], item["prijs"], item["km"]
A few things to change:

- When using a CrawlSpider, you should not redefine the parse method; it is where all the "magic" happens for this particular spider type. Quoting the Scrapy docs:
  "When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work."
- As I mentioned in the comments, your XPath needs to be fixed by removing the extra /a at the end (an <a> inside an <a> will not match any element).
- CrawlSpider rules need a callback method if you want to extract items from the pages they follow.
- To also parse the elements from the start URLs, you need to define a parse_start_url method.

Here is a minimal CrawlSpider that follows the 3 pages from your example input and prints how many "articles" each page has:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SkodaSpider(CrawlSpider):
    name = "skodas"
    allowed_domains = ["marktplaats.nl"]
    start_urls = [
        "http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryId=151&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010&mileageTo=150.000&attributes=S%2C1185&attributes=S%2C484&attributes=M%2C11564&startDateFrom=always"
    ]

    rules = [
        Rule(LinkExtractor(restrict_xpaths=('//a[@class="button secondary medium pagination-next"]',)),
             follow=True,
             callback='parse_page'),
    ]

    def parse_page(self, response):
        articles = response.css('#search-results > section + section > article')
        self.logger.info('%d articles' % len(articles))

    # define this, otherwise "parse_page" will not be called for the URLs in start_urls
    parse_start_url = parse_page
Output:
$ scrapy runspider 001.py
2016-02-09 11:07:16 [scrapy] INFO: Scrapy 1.0.4 started (bot: scrapybot)
2016-02-09 11:07:16 [scrapy] INFO: Optional features available: ssl, http11
2016-02-09 11:07:16 [scrapy] INFO: Overridden settings: {}
2016-02-09 11:07:16 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-02-09 11:07:16 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-02-09 11:07:16 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-02-09 11:07:16 [scrapy] INFO: Enabled item pipelines:
2016-02-09 11:07:16 [scrapy] INFO: Spider opened
2016-02-09 11:07:16 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-02-09 11:07:16 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-02-09 11:07:16 [scrapy] DEBUG: Crawled (200) <GET http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryId=151&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010&mileageTo=150.000&attributes=S%2C1185&attributes=S%2C484&attributes=M%2C11564&startDateFrom=always> (referer: None)
2016-02-09 11:07:16 [skodas] INFO: 32 articles
2016-02-09 11:07:17 [scrapy] DEBUG: Crawled (200) <GET http://www.marktplaats.nl/z/auto-s/skoda.html?attributes=S%2C1185+S%2C484+M%2C11564&categoryId=151&currentPage=2&mileageTo=150.000&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010> (referer: http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryId=151&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010&mileageTo=150.000&attributes=S%2C1185&attributes=S%2C484&attributes=M%2C11564&startDateFrom=always)
2016-02-09 11:07:17 [skodas] INFO: 30 articles
2016-02-09 11:07:17 [scrapy] DEBUG: Crawled (200) <GET http://www.marktplaats.nl/z/auto-s/skoda.html?attributes=S%2C1185+S%2C484+M%2C11564&categoryId=151&currentPage=3&mileageTo=150.000&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010> (referer: http://www.marktplaats.nl/z/auto-s/skoda.html?attributes=S%2C1185+S%2C484+M%2C11564&categoryId=151&currentPage=2&mileageTo=150.000&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010)
2016-02-09 11:07:17 [skodas] INFO: 7 articles
2016-02-09 11:07:17 [scrapy] INFO: Closing spider (finished)
2016-02-09 11:07:17 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1919,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 96682,
'downloader/response_count': 3,
'downloader/response_status_count/200': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 2, 9, 10, 7, 17, 638179),
'log_count/DEBUG': 4,
'log_count/INFO': 10,
'request_depth_max': 2,
'response_received_count': 3,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2016, 2, 9, 10, 7, 16, 452272)}
2016-02-09 11:07:17 [scrapy] INFO: Spider closed (finished)
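If you then want your original fields back, one option is to move the extraction into parse_page and yield items instead of printing them. The sketch below is only a starting point: it assumes your SkodaItem from tutorial.items still defines the title, leeftijd, prijs and km fields, and it simply rewrites your absolute, counter-based XPaths as XPaths relative to each article (the prijs value may still need whitespace cleanup, since your regex was trimming surrounding newlines and spaces):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from tutorial.items import SkodaItem

class SkodaSpider(CrawlSpider):
    name = "skodas"
    allowed_domains = ["marktplaats.nl"]
    start_urls = [
        "http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryId=151&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010&mileageTo=150.000&attributes=S%2C1185&attributes=S%2C484&attributes=M%2C11564&startDateFrom=always"
    ]

    rules = [
        Rule(LinkExtractor(restrict_xpaths=('//a[@class="button secondary medium pagination-next"]',)),
             follow=True,
             callback='parse_page'),
    ]

    def parse_page(self, response):
        # iterate over each article and use XPaths relative to it ("./..."),
        # instead of rebuilding absolute paths with a manual counter
        for article in response.xpath('//*[@id="search-results"]/section[2]/article'):
            item = SkodaItem()
            # relative versions of the original absolute XPaths, with /text()
            # appended so no regex is needed to strip the surrounding tags
            item["title"] = article.xpath('./div/div[1]/div[1]/h2/a/span/text()').extract()
            item["leeftijd"] = article.xpath('./div/div[1]/div[2]/span[1]/text()').extract()
            item["prijs"] = article.xpath('./div/div[2]/div[1]/div/div/text()').extract()
            item["km"] = article.xpath('./div/div[1]/div[2]/span[3]/text()').extract()
            # yielding the item sends it through Scrapy's item pipelines and
            # feed exports, instead of just printing it to the console
            yield item

    # still needed so the start URL itself is parsed as well
    parse_start_url = parse_page

Yielding items this way also lets you run the spider with "scrapy crawl skodas -o skodas.json" and inspect the exported data directly.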