Scrapy spider not following links and error
I am trying to write my first web crawler/data scraper with scrapy and cannot get it to follow links. I also get an error:

ERROR: Spider error processing <GET https://en.wikipedia.org/wiki/Wikipedia:Unusual_articles>

I know the spider is scanning the page once, because I was able to pull information from the a tags and the h1 element I experimented with.

Does anyone know how I can make it follow the links on the page and get rid of the error?
import scrapy
from scrapy.linkextractors import LinkExtractor
from wikiCrawler.items import WikicrawlerItem
from scrapy.spiders import Rule


class WikispyderSpider(scrapy.Spider):
    name = "wikiSpyder"
    allowed_domains = ['https://en.wikipedia.org/']
    start_urls = ['https://en.wikipedia.org/wiki/Wikipedia:Unusual_articles']

    rules = (
        Rule(LinkExtractor(canonicalize=True, unique=True), follow=True, callback="parse"),
    )

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        items = []
        links = LinkExtractor(canonicalize=True, unique=True).extract_links(response)
        for link in links:
            item = WikicrawlerItem()
            item['url_from'] = response.url
            item['url_to'] = link.url
            items.append(item)
        print(items)
        return items
If you want to use link extractor rules, you need to use a special spider class - CrawlSpider:

from scrapy.spiders import CrawlSpider


class WikispyderSpider(CrawlSpider):
    # ...
Here is a simple spider that follows the links starting from your start url and prints out the page titles:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider
from scrapy.spiders import Rule


class WikispyderSpider(CrawlSpider):
    name = "wikiSpyder"
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Wikipedia:Unusual_articles']

    rules = (
        Rule(LinkExtractor(canonicalize=True, unique=True), follow=True, callback="parse_link"),
    )

    def parse_link(self, response):
        print(response.xpath("//title/text()").extract_first())
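If you also want to collect the link items from your original spider instead of just printing titles, a minimal sketch along these lines should work (assuming WikicrawlerItem defines the url_from and url_to fields shown in your question):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from wikiCrawler.items import WikicrawlerItem


class WikispyderSpider(CrawlSpider):
    name = "wikiSpyder"
    # Only the domain here, not a full URL, otherwise the offsite
    # middleware filters out every followed request.
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Wikipedia:Unusual_articles']

    rules = (
        # Do not name the callback "parse": CrawlSpider uses parse()
        # internally to apply the rules, so overriding it breaks link following.
        Rule(LinkExtractor(canonicalize=True, unique=True), follow=True, callback="parse_link"),
    )

    def parse_link(self, response):
        # Yield one item per outgoing link found on the crawled page.
        for link in LinkExtractor(canonicalize=True, unique=True).extract_links(response):
            item = WikicrawlerItem()
            item['url_from'] = response.url
            item['url_to'] = link.url
            yield item

You can then run it from the project directory and export the results, for example with scrapy crawl wikiSpyder -o links.json.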