Scrapy CrawlSpider not following links
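I'm trying to scrape some attributes from all (#123) detail pages listed on this category page, http://stinkybklyn.com/shop/cheese/, but Scrapy is not following the link pattern I set. I have checked the Scrapy documentation and some tutorials, but no luck!
Here is the code: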
import scrapy

from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule


class Stinkybklyn(CrawlSpider):
    name = "Stinkybklyn"
    allowed_domains = ["stinkybklyn.com"]
    start_urls = [
        "http://stinkybklyn.com/shop/cheese/chandoka",
    ]
    Rule(LinkExtractor(allow=r'\/shop\/cheese\/.*'),
         callback='parse_items', follow=True)

    def parse_items(self, response):
        print "response", response
        hxs= HtmlXPathSelector(response)
        title=hxs.select("//*[@id='content']/div/h4").extract()
        title="".join(title)
        title=title.strip().replace("\n","").lstrip()
        print "title is:",title
Can someone tell me what I'm doing wrong here?
You seem to have some syntax errors.
Try this:
import scrapy

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector


class Stinkybklyn(CrawlSpider):
    name = "Stinkybklyn"
    allowed_domains = ["stinkybklyn.com"]
    start_urls = [
        "http://stinkybklyn.com/shop/cheese/",
    ]
    rules = (
        Rule(LinkExtractor(allow=(r'/shop/cheese/')), callback='parse_items'),
    )

    def parse_items(self, response):
        print "response", response
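The key problem with your code is that you have not set rules for the CrawlSpider: the Rule(...) expression in the class body is never assigned to the rules attribute, so the spider has no rules to follow.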
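To see why an unassigned Rule has no effect, here is a minimal sketch in plain Python. The Rule and BaseSpider classes below are hypothetical stand-ins (not Scrapy's own classes), assuming only that the base spider defaults to an empty rules tuple, as CrawlSpider does.

```python
# Hypothetical stand-ins for scrapy's Rule and CrawlSpider, used only to
# illustrate plain Python class-attribute behaviour (no Scrapy required).
class Rule(object):
    def __init__(self, pattern, callback=None, follow=False):
        self.pattern = pattern
        self.callback = callback
        self.follow = follow


class BaseSpider(object):
    # Like CrawlSpider, the base class defaults to an empty tuple of rules.
    rules = ()


class BrokenSpider(BaseSpider):
    # A bare expression: the Rule object is created and immediately discarded.
    Rule(r'/shop/cheese/', callback='parse_items', follow=True)


class FixedSpider(BaseSpider):
    # Assigning to the `rules` attribute is what makes the rule visible.
    rules = (
        Rule(r'/shop/cheese/', callback='parse_items', follow=True),
    )


print(len(BrokenSpider.rules))  # 0
print(len(FixedSpider.rules))   # 1
```

The broken class is still valid Python, which is why no error is raised and the spider simply crawls nothing beyond its start URL.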
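Other improvements I would suggest:
- there is no need to instantiate HtmlXPathSelector; you can use response directly
- select() is now deprecated; use xpath() instead
- get the text() of the title element so that you retrieve, for example, Chandoka instead of <h4>Chandoka</h4>
- I think you meant to start from the cheese shop catalog page instead: http://stinkybklyn.com/shop/cheese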
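The difference between selecting the h4 element and selecting its text can be sketched without Scrapy. This uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors, and the markup is a hypothetical, simplified version of the product page, so treat it as an illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the product page markup.
page = "<div id='content'><div><h4>Chandoka</h4></div></div>"

content = ET.fromstring(page)
h4 = content.find("./div/h4")

# Serialising the element keeps the tags, like extracting ".../h4".
markup = ET.tostring(h4, encoding="unicode")

# Reading only the text node drops the tags, like extracting ".../h4/text()".
text = h4.text.strip()

print(markup)  # <h4>Chandoka</h4>
print(text)    # Chandoka
```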
Complete code with the improvements applied:
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule


class Stinkybklyn(CrawlSpider):
    name = "Stinkybklyn"
    allowed_domains = ["stinkybklyn.com"]
    start_urls = [
        "http://stinkybklyn.com/shop/cheese",
    ]
    rules = [
        Rule(LinkExtractor(allow=r'\/shop\/cheese\/.*'), callback='parse_items', follow=True)
    ]

    def parse_items(self, response):
        title = response.xpath("//*[@id='content']/div/h4/text()").extract()
        title = "".join(title)
        title = title.strip().replace("\n", "").lstrip()
        print "title is:", title
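As a quick sanity check on the allow pattern, the sketch below tests it with Python's re module, assuming the link extractor applies each allow pattern as a regex search against the absolute URL. The charcuterie URL is a hypothetical example of a non-cheese link, and matches() is a helper introduced here for illustration.

```python
import re

escaped = re.compile(r'\/shop\/cheese\/.*')  # pattern used in the code above
plain = re.compile(r'/shop/cheese/')         # simpler pattern, slashes unescaped

urls = [
    "http://stinkybklyn.com/shop/cheese/chandoka",     # detail page from the question
    "http://stinkybklyn.com/shop/cheese",              # start URL, no trailing slash
    "http://stinkybklyn.com/shop/charcuterie/salami",  # hypothetical non-cheese URL
]


def matches(pattern, url):
    """Return True if the pattern is found anywhere in the URL."""
    return pattern.search(url) is not None


# Each entry is (url, matched by escaped pattern, matched by plain pattern).
results = [(url, matches(escaped, url), matches(plain, url)) for url in urls]
for row in results:
    print(row)
```

Both patterns accept and reject the same URLs here, so the backslashes before the slashes and the trailing .* appear to be unnecessary. The start URL without a trailing slash matches neither pattern, which should be harmless as long as the pattern is only applied to links extracted from responses.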