Python web-crawl script, following a tutorial and having issues
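I'm following a tutorial on YouTube:
Scraping web pages with Scrapy
It's old and written for Python 2.x, while I'm learning with 3.x. So far I've run into a few problems that I could solve with Google, but now I'm getting this error: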
  File "/usr/lib64/python3.5/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/skeer/PycharmProjects/scrape_craigslists/scrape_cl/scrape_cl/spiders/scrape.py", line 11, in parse
    xpath = scrapy.selector(response)
TypeError: 'module' object is not callable
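Googling earlier, I found references suggesting that for other people this was caused by a lowercase character, as if the 's' in selector should be capitalized. I tried that and got an error saying the scrapy.Selector module could not be found.
Here is my code: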
from scrapy.spider import Spider
import scrapy.selector

class MySpider(Spider):
    name = "craigslist"
    allowed_domains = ["craigslist.org"]
    start_urls = ["https://helena.craigslist.org/search/sad"]

    def parse(self, response):
        xpath = scrapy.selector(response)
        titles = xpath.select("//p")
        for titles in titles:
            title = xpath("/body/section/form/div/li/p[@class]()").extract()
            link = xpath("/body/section/form/div/ul/li/a[@href]").extract()
            print (title, link)
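scrapy.selector is the module that contains the Selector class, so calling scrapy.selector(response) calls the module itself, which is what raises the TypeError. Try
from scrapy.selector import Selector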
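The same module-versus-class mix-up can be reproduced without Scrapy. The following is only an illustrative sketch using the standard library's datetime module, which plays the role of scrapy.selector here, with datetime.datetime playing the role of Selector; the variable names error_message and moment are made up for the example.

```python
import datetime  # a module, standing in for scrapy.selector

try:
    # Calling the module itself, analogous to scrapy.selector(response)
    datetime(2020, 1, 1)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# The callable class lives inside the module,
# analogous to scrapy.selector.Selector(response)
moment = datetime.datetime(2020, 1, 1)
print(moment.year)
```

The first call fails with the same kind of TypeError as in the traceback; the second succeeds because it names the class inside the module.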
However, importing Selector is not necessary, because the response object already has a selector interface and an xpath method, so you can do this instead:
def parse(self, response):
    xpath = response.xpath
    titles = xpath("//p")
    for titles in titles:
        title = xpath("/body/section/form/div/li/p[@class]()").extract()
        link = xpath("/body/section/form/div/ul/li/a[@href]").extract()
        print(title, link)
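One thing worth noticing in the loop above: the two queries inside it run against the whole document each time, so every iteration sees the same nodes. A common pattern is to query relative to the node being iterated. The sketch below shows that idea with the standard library's xml.etree.ElementTree so it runs without Scrapy installed; the markup in PAGE is hypothetical and not craigslist's real HTML. With Scrapy, the analogous step would be calling .xpath() with a relative path (one starting with "./") on each selector yielded by the loop.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, loosely shaped like a listing page.
PAGE = """
<ul>
  <li><p class="row"><a href="/post/1">First post</a></p></li>
  <li><p class="row"><a href="/post/2">Second post</a></p></li>
</ul>
""".strip()

root = ET.fromstring(PAGE)

results = []
for paragraph in root.findall(".//p"):
    # Query relative to the current <p>, not the whole document.
    anchor = paragraph.find("./a")
    results.append((anchor.text, anchor.get("href")))

print(results)
```

Each iteration yields the title and link that belong to that one paragraph, rather than repeating a document-wide result.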
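Also, if you plan to scrape craigslist, you will need a very good list of proxies. They ban IPs quickly, specifically to prevent scraping.
I recommend studying the official docs, and also the curated resources.
For your problem, check the official docs for Scrapy Selectors: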
from scrapy.selector import Selector

class MySpider(Spider):
    ...
    def parse(self, response):
        xpath = Selector(response)
        ...
Change the function definition:
def parse(self, response):
    xpath = scrapy.selector.Selector(response)
    # .xpath() replaces the deprecated .select()
    titles = xpath.xpath("//p")
    for titles in titles:
        title = xpath.xpath("/body/section/form/div/li/p[@class]()").extract()
        link = xpath.xpath("/body/section/form/div/ul/li/a[@href]").extract()
        print(title, link)
Note that xpath here is a Selector object, so the queries go through its xpath method:
xpath("/body/section/form/div/li/p[@class]()")
-> xpath.xpath("/body/section/form/div/li/p[@class]()")