Scrapy: extracting from links
I'm trying to extract information from certain links, but I never reach those links; I get the information from the start_url instead, and I'm not sure why.
Here is my code:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from tutorial.items import DmozItem
from scrapy.selector import HtmlXPathSelector

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python"
    ]
    rules = [Rule(SgmlLinkExtractor(allow=[r'Books']), callback='parse')]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = DmozItem()
        # Extract links
        item['link'] = hxs.select("//li/a/text()").extract()  # XPath selector for tag(s)
        print item['link']
        for cont, i in enumerate(item['link']):
            print "link: ", cont, i
I don't get the links from "http://www.dmoz.org/Computers/Programming/Languages/Python/Books"; instead I get the links from "http://www.dmoz.org/Computers/Programming/Languages/Python".
Why?
For rules to work, you need to use a CrawlSpider, not the generic scrapy Spider.
You also need to rename your first parse function to something other than parse; otherwise you will be overriding an important method of CrawlSpider and it will not work. See the warning in the docs: http://doc.scrapy.org/en/0.24/topics/spiders.html?highlight=rules#crawlspider
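The name clash can be illustrated without Scrapy at all. The sketch below uses a toy base class (a stand-in, not Scrapy's actual implementation) whose crawl logic calls self.parse internally, just as CrawlSpider does:

```python
# Toy illustration of why overriding parse breaks CrawlSpider.
# BaseSpider here is a hypothetical stand-in for the framework class.
class BaseSpider(object):
    def parse(self, response):
        # The framework relies on this default implementation
        # to follow the crawling rules.
        return "followed rules for " + response

class BrokenSpider(BaseSpider):
    def parse(self, response):           # shadows the framework's method
        return "user callback for " + response

class FixedSpider(BaseSpider):
    def parse_item(self, response):      # distinct name: framework method intact
        return "user callback for " + response

print(BrokenSpider().parse("page"))      # framework logic is gone
print(FixedSpider().parse("page"))       # rule-following still works
```

When the framework later calls parse on BrokenSpider, it gets the user callback instead of the rule-following logic, which is exactly why the links are never followed.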
Your code is scraping the links from "http://www.dmoz.org/Computers/Programming/Languages/Python" because the rules directive is ignored by the generic Spider.
This code should work:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from dmoz.items import DmozItem
from scrapy.selector import HtmlXPathSelector

class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python"
    ]
    rules = [Rule(SgmlLinkExtractor(allow=[r'Books']), callback='parse_item')]

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = DmozItem()
        # Extract links
        item['link'] = hxs.select("//li/a/text()").extract()  # XPath selector for tag(s)
        print item['link']
        for cont, i in enumerate(item['link']):
            print "link: ", cont, i
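As an aside, the allow=[r'Books'] argument is a regular expression matched against each candidate URL. A minimal sketch of that filtering step, using made-up URLs for illustration (real link extractors also canonicalize and deduplicate):

```python
import re

# The allow pattern from the Rule above.
allow = r'Books'

# Hypothetical candidate URLs, for illustration only.
candidates = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]

# Keep only URLs that match the allow pattern, as the extractor does.
matched = [url for url in candidates if re.search(allow, url)]
print(matched)  # only the .../Books/ link survives the filter
```

So only pages whose URL matches Books are followed, and parse_item is then called on each of them.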