Scrapy Tutorial Example
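I'm hoping someone can point me in the right direction for using Scrapy in Python.
I've been trying to follow the example for several days but still can't get the expected output. I used the Scrapy tutorial, http://doc.scrapy.org/en/latest/intro/tutorial.html#defining-our-item, and even downloaded the exact project from the GitHub repository, but the output I get differs from what the tutorial describes.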
from scrapy.spiders import Spider
from scrapy.selector import Selector
from dirbot.items import Website

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        """
        The lines below is a spider contract. For more info see:
        http://doc.scrapy.org/en/latest/topics/contracts.html
        @url http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/
        @scrapes name
        """
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []
        for site in sites:
            item = Website()
            item['name'] = site.xpath('a/text()').extract()
            item['url'] = site.xpath('a/@href').extract()
            item['description'] = site.xpath('text()').re(r'-\s[^\n]*\r')
            items.append(item)
        return items
After downloading the project from GitHub, I ran "scrapy crawl dmoz" in the top-level directory. I get the following output:
2016-08-31 00:08:19 [scrapy] INFO: Scrapy 1.1.1 started (bot: scrapybot)
2016-08-31 00:08:19 [scrapy] INFO: Overridden settings: {'DEFAULT_ITEM_CLASS': 'dirbot.items.Website', 'NEWSPIDER_MODULE': 'dirbot.spiders', 'SPIDER_MODULES': ['dirbot.spiders']}
2016-08-31 00:08:19 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-08-31 00:08:19 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-08-31 00:08:19 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-08-31 00:08:19 [scrapy] INFO: Enabled item pipelines:
['dirbot.pipelines.FilterWordsPipeline']
2016-08-31 00:08:19 [scrapy] INFO: Spider opened
2016-08-31 00:08:19 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-08-31 00:08:19 [scrapy] DEBUG: Telnet console listening on 128.1.2.1:2700
2016-08-31 00:08:20 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2016-08-31 00:08:20 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2016-08-31 00:08:20 [scrapy] INFO: Closing spider (finished)
2016-08-31 00:08:20 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 514,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 16179,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 8, 31, 7, 8, 20, 314625),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2016, 8, 31, 7, 8, 19, 882944)}
2016-08-31 00:08:20 [scrapy] INFO: Spider closed (finished)
According to the tutorial, the expected output is:
[scrapy] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u' - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.]\n'],
'link': [u'http://gnosis.cx/TPiP/'],
'title': [u'Text Processing in Python']}
[scrapy] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u' - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\n'],
'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'],
'title': [u'XML Processing with Python']}
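The spider in the tutorial appears to be outdated. The website has changed somewhat, so none of the XPath expressions match anything anymore. This is easy to fix: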
def parse(self, response):
    sites = response.xpath('//div[@class="title-and-desc"]/a')
    for site in sites:
        item = dict()
        item['name'] = site.xpath("text()").extract_first()
        item['url'] = site.xpath("@href").extract_first()
        item['description'] = site.xpath("following-sibling::div/text()").extract_first('').strip()
        yield item
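For future reference, you can always test whether a specific XPath expression works by using the scrapy shell command.
For example, this is what I did to test it: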
$ scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
# test sites xpath
response.xpath('//ul[@class="directory-url"]/li')
[]
# ok it doesn't work, check out page in web browser
view(response)
# find correct xpath and test that:
response.xpath('//div[@class="title-and-desc"]/a')
# 21 result nodes printed
# it works!
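The same comparison can also be scripted outside the shell. What follows is only a minimal sketch: it swaps Scrapy's selectors for the standard library's xml.etree.ElementTree (which supports a limited XPath subset), and the HTML fragment is a made-up, simplified stand-in for the DMOZ markup, not the real page. It shows the old expression matching nothing while the new one finds the links.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page markup (not the real DMOZ HTML).
SAMPLE = """
<html><body>
  <div class="title-and-desc">
    <a href="http://example.com/book-one"><div class="site-title">Book One</div></a>
    <div class="site-descr">First description</div>
  </div>
  <div class="title-and-desc">
    <a href="http://example.com/book-two"><div class="site-title">Book Two</div></a>
    <div class="site-descr">Second description</div>
  </div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# Old expression from the tutorial: the sample markup has no such <ul>.
old_matches = root.findall(".//ul[@class='directory-url']/li")

# New expression: one <a> element per listed site.
new_matches = root.findall(".//div[@class='title-and-desc']/a")

print(len(old_matches))
print([a.get("href") for a in new_matches])
```

ElementTree needs well-formed markup and does not support axes such as following-sibling, so for real pages the scrapy shell remains the more reliable check.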
Here is a corrected version of the Scrapy code for extracting details from DMOZ:
import scrapy

class MozSpider(scrapy.Spider):
    name = "moz"
    allowed_domains = ["www.dmoz.org"]
    start_urls = ['http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
                  'http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/']

    def parse(self, response):
        sites = response.xpath('//div[@class="title-and-desc"]')
        for site in sites:
            name = site.xpath('a/div[@class="site-title"]/text()').extract_first()
            url = site.xpath('a/@href').extract_first()
            # Default to '' so strip() is never called on None when there is no description.
            description = site.xpath('div[@class="site-descr "]/text()').extract_first('').strip()
            yield {'Name': name, 'URL': url, 'Description': description}
To export the results to CSV, open the spider folder in Terminal/CMD and run:
scrapy crawl moz -o result.csv
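Once the export has run, the file can be inspected with Python's csv module. This is a sketch under assumptions: the column names Name, URL and Description are taken from the dictionary keys yielded by the spider above, load_rows is a hypothetical helper name, and the sample text is a made-up stand-in for result.csv so the snippet runs without crawling.

```python
import csv
import io

# Made-up sample standing in for the contents of result.csv.
SAMPLE_CSV = (
    "Name,URL,Description\r\n"
    "Book One,http://example.com/book-one,First description\r\n"
    "Book Two,http://example.com/book-two,\r\n"
    "No Link,,Row without a URL\r\n"
)

def load_rows(handle):
    """Return the exported rows as dicts, skipping rows that have no URL."""
    return [row for row in csv.DictReader(handle) if row.get("URL")]

rows = load_rows(io.StringIO(SAMPLE_CSV, newline=""))
for row in rows:
    print(row["Name"], "->", row["URL"])
```

To read the real export instead, pass a file opened with open("result.csv", newline="", encoding="utf-8") to load_rows.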
Here is another basic Scrapy example: extracting company details from Yellow Pages:
import scrapy

class YlpSpider(scrapy.Spider):
    name = "ylp"
    allowed_domains = ["www.yellowpages.com"]
    start_urls = ['http://www.yellowpages.com/search?search_terms=Translation&geo_location_terms=Virginia+Beach%2C+VA']

    def parse(self, response):
        companies = response.xpath('//*[@class="info"]')
        for company in companies:
            name = company.xpath('h3/a/span[@itemprop="name"]/text()').extract_first()
            phone = company.xpath('div/div[@class="phones phone primary"]/text()').extract_first()
            website = company.xpath('div/div[@class="links"]/a/@href').extract_first()
            yield {'Name': name, 'Phone': phone, 'Website': website}
To export the results to CSV, open the spider folder in Terminal/CMD and run:
scrapy crawl ylp -o result.csv
This Scrapy code extracts company details from Yelp:
import scrapy

class YelpSpider(scrapy.Spider):
    name = "yelp"
    allowed_domains = ["www.yelp.com"]
    start_urls = ['https://www.yelp.com/search?find_desc=Java+Developer&find_loc=Denver,+CO']

    def parse(self, response):
        companies = response.xpath('//*[@class="biz-listing-large"]')
        for company in companies:
            name = company.xpath('.//span[@class="indexed-biz-name"]/a/span/text()').extract_first()
            # '' is the default returned when nothing matches, so strip() is never called on None.
            address1 = company.xpath('.//address/text()').extract_first('').strip()
            address2 = company.xpath('.//address/text()[2]').extract_first('').strip()
            address = address1 + " - " + address2
            phone = company.xpath('.//*[@class="biz-phone"]/text()').extract_first('').strip()
            # The same default keeps the concatenation from failing when no link is found.
            website = "https://www.yelp.com" + company.xpath('.//@href').extract_first('')
            yield {'Name': name, 'Address': address, 'Phone': phone, 'Website': website}
To export the results to CSV, open the spider folder in Terminal/CMD and run:
scrapy crawl yelp -o result.csv
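One detail in the Yelp spider above: when the second address line is missing, address1 + " - " + address2 leaves a dangling separator in the exported value. A possible helper, shown only as a sketch (join_parts is a hypothetical name, not part of Scrapy), joins just the non-empty parts and tolerates None:

```python
def join_parts(*parts, sep=" - "):
    """Join the non-empty parts, stripping whitespace and ignoring None."""
    cleaned = [p.strip() for p in parts if p is not None]
    return sep.join(p for p in cleaned if p)

print(join_parts("  123 Main St ", "Denver, CO"))
print(join_parts("123 Main St", ""))
print(join_parts(None, None))
```

In the spider this would replace the manual concatenation, for example address = join_parts(address1, address2).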