Extract data from nested XPath
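I am new to XPath.
I want to extract every title, body, link, and release date from this link.
Everything seems fine except the body. How can I extract each body from a nested XPath? Thanks in advance :)
Here is my source: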
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from thehack.items import ThehackItem

class MySpider(BaseSpider):
    name = "thehack"
    allowed_domains = ["thehackernews.com"]
    start_urls = ["http://thehackernews.com/search/label/mobile%20hacking"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.xpath('//article[@class="post item module"]')
        items = []
        for titles in titles:
            item = ThehackItem()
            item['title'] = titles.select('span/h2/a/text()').extract()
            item['link'] = titles.select('span/h2/a/@href').extract()
            item['body'] = titles.select('span/div/div/div/div/a/div/text()').extract()
            item['date'] = titles.select('span/div/span/text()').extract()
            items.append(item)
        return items
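Can anybody help me fix the body part? The problem is only in the body...
Thanks in advance, mate.
Here is a screenshot of Inspect Element on the website: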
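I guess you are struggling with the selectors, right? I think you should take a look at the selectors documentation; there is a lot of useful information there. In this particular case, using CSS selectors, I think it would be something like this: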
import scrapy
from thehack.items import ThehackItem

class MySpider(scrapy.Spider):
    name = "thehack"
    allowed_domains = ["thehackernews.com"]
    start_urls = ["http://thehackernews.com/search/label/mobile%20hacking"]

    def parse(self, response):
        # One item per <article class="post ..."> element
        for article in response.css('article.post'):
            item = ThehackItem()
            item['title'] = article.css('.post-title>a::text').extract_first()
            item['link'] = article.css('.post-title>a::attr(href)').extract_first()
            # Join the text of every element nested under the summary block
            item['body'] = ''.join(article.css('[id^=summary] *::text').extract()).strip()
            item['date'] = article.css('[itemprop="datePublished"]::attr(content)').extract_first()
            yield item
Converting them to XPath selectors would be a good exercise for you. Maybe also check out ItemLoaders; used together they are very useful.
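To see why the nested body needs different handling, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree instead of Scrapy. The SAMPLE markup and the parse_articles helper are made up for illustration and are not the real page or part of any library. The point is that a fixed path ending in text() only returns text sitting directly in that one element, while the body text is spread over nested elements and has to be collected from all descendants and joined. In Scrapy itself, a relative XPath along the lines of './/*[starts-with(@id, "summary")]//text()' should play the same role as the CSS selector above, assuming the ids on the page really start with "summary".

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page's structure may differ.
SAMPLE = """<root>
  <article class="post item module">
    <h2 class="post-title"><a href="http://example.com/one">First title</a></h2>
    <div id="summary1"><div>Nested <b>body</b> text.</div></div>
    <meta itemprop="datePublished" content="2016-01-01"/>
  </article>
</root>"""


def parse_articles(markup):
    """Return one dict per <article>, gathering the body from nested text nodes."""
    root = ET.fromstring(markup)
    items = []
    for article in root.iter("article"):
        # Relative paths start with ".//" so they search inside this article only
        link = article.find(".//h2[@class='post-title']/a")
        # ElementTree's XPath subset has no starts-with(), so filter ids in Python
        summary = next(
            (el for el in article.iter("div")
             if el.get("id", "").startswith("summary")),
            None,
        )
        date = article.find(".//meta[@itemprop='datePublished']")
        items.append({
            "title": link.text if link is not None else None,
            "link": link.get("href") if link is not None else None,
            # itertext() walks every descendant text node, not just direct children
            "body": "".join(summary.itertext()).strip() if summary is not None else "",
            "date": date.get("content") if date is not None else None,
        })
    return items


if __name__ == "__main__":
    for item in parse_articles(SAMPLE):
        print(item)
```

Joining all descendant text is the same idea as the ''.join(...).strip() call in the CSS version; only the way the summary element is located differs.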