Adding conditions when scraping
I'm trying to scrape a web page where each listing seems to get a different div depending on how much the user paid or what type of page it has. Example:
<div class="figuration Web company-stats">
..information i want to scrap..
</div>
<div class="figuration Commercial" >
..information i want to scrap..
</div>
There seem to be more than 3 types of these divs, so I'm wondering whether there is a way to select every div whose class contains the first word, figuration. Here is my spider code:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from pagina.items import PaginaItem

class MySpider(CrawlSpider):
    name = "pagina"
    allowed_domains = ["paginasamarillas.com.co"]
    start_urls = ["http://www.paginasamarillas.com.co/busqueda/bicicletas-medellin"]

    # Follow the pagination links and run parse_item on every result page
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//ul[@class="paginator"]'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # This matches only divs whose class is exactly "figuration Web company-stats"
        for sel in response.xpath('//div[@class="figuration Web company-stats"]'):
            item = PaginaItem()
            item['nombre'] = sel.xpath('.//h2[@class="titleFig"]/a/text()').extract()
            #item['lugar'] = sel.xpath('.//div[@class="infoContact"]/div/h3/text()').extract()
            #item['numero'] = sel.xpath('.//div[@class="infoContact"]/span/text()').extract()
            #item['pagina'] = sel.xpath('.//div[@class="infoContact"]/a/@href').extract()
            #item['sobre'] = sel.xpath('.//p[@class="CopyText"]/div/h3/text()').extract()
            yield item
Using a CSS selector:
for sel in response.css('div.figuration'):
...
The CSS selector above will work, but if you want to use an XPath selector instead, you can write it like this:
for each in response.xpath('//div[contains(@class,"figuration")]'):
...
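One caveat: contains(@class, "figuration") matches a substring of the whole attribute value, so it would also match an unrelated class such as "configuration". If that ever matters, the stricter XPath equivalent of the CSS class selector is the usual token-matching idiom:

for each in response.xpath(
        '//div[contains(concat(" ", normalize-space(@class), " "), " figuration ")]'):
    ...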
In practice, though, response.xpath('//div[contains(@class,"figuration")]') and response.css('div.figuration') can be used interchangeably here, since in the sample markup every class value begins with the figuration token.
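Putting the answer back into the spider, parse_item can simply loop over the broader selector. A minimal sketch, assuming the same PaginaItem fields as in the question (the commented-out 'tipo' field is hypothetical and would have to be added to PaginaItem first):

    def parse_item(self, response):
        # Match every listing div, whatever tokens follow "figuration" in its class
        for sel in response.css('div.figuration'):
            item = PaginaItem()
            item['nombre'] = sel.xpath('.//h2[@class="titleFig"]/a/text()').extract()
            # The class tokens after "figuration" (e.g. "Web", "Commercial") identify
            # the listing variant, if you want to keep that information:
            #item['tipo'] = sel.xpath('@class').extract_first().split()[1:]
            yield item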