Scrapy not executing all rules for CrawlSpider
I have the following scraper:
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SpiderOpUpcoming(CrawlSpider):
    name = "upcoming"
    start_urls = ["https://www.oddsportal.com/tennis/"]
    custom_settings = {"USER_AGENT": "*"}

    tournament_linkxtr = LinkExtractor(
        allow="/tennis/",
        restrict_xpaths=(
            "//table[@id='sport_content_tennis' and @class='table-main sportcount']"
            "//a[@foo='f']"
        ),
    )
    match_linkxtr = LinkExtractor(
        allow="/tennis/",
        restrict_xpaths=("//td[@class='name table-participant']//a"),
    )

    rules = (
        Rule(tournament_linkxtr, callback="parse_tournament", follow=True),
        Rule(match_linkxtr, callback="parse_match", follow=True),
    )

    handle_httpstatus_list = [301]

    def parse_tournament(self, response):
        print("TOURNAMENT -", response.url)

    def parse_match(self, response):
        print("MATCH -", response.url)


process = CrawlerProcess()
process.crawl(SpiderOpUpcoming)
process.start()
parse_tournament prints fine, but parse_match never does. To troubleshoot, I changed start_urls to the URL of one of the 'tournament' pages that the scraper above picks up via tournament_linkxtr, and removed the Rule with the callback to parse_tournament. See below:
start_urls = ["https://www.oddsportal.com/tennis/usa/champaign-challenger-men"]
rules = (Rule(match_linkxtr, callback="parse_match", follow=True))
The scraper then prints from parse_match, so the XPath is fine. I also don't think the rule order is the problem, so I'm stumped. Can anyone point out where I'm going wrong?
Scrapy version is 2.4.1, running on macOS Monterey on a 2016 MacBook Pro.
Changing:
handle_httpstatus_list = [301]
to:
httperror_allowed_codes = [301]
sorted this out for me. See the docs here.
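For clarity, a minimal sketch of what that change looks like in the spider from the question (only the changed attribute is shown; the rest of the class stays exactly as posted above):

from scrapy.spiders import CrawlSpider


class SpiderOpUpcoming(CrawlSpider):
    name = "upcoming"
    # ... start_urls, link extractors, rules and callbacks unchanged from the question ...

    # handle_httpstatus_list = [301]
    # With handle_httpstatus_list set on the spider, RedirectMiddleware skips
    # 301s for this spider, so the pages behind those redirects never reach
    # the crawl rules.
    httperror_allowed_codes = [301]  # replacement suggested in this answer

The assumption here is that the 301s were redirects to the real tournament/match pages, which the crawl needs RedirectMiddleware to follow.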