How to get item using XPath in Scrapy

Question

I'm updating this tutorial because it is out of date:
http://mherman.org/blog/2012/11/05/scraping-web-pages-with-scrapy/#.VwpeOfl96Ul

It should get the link and title of each job listing for non-profits on Craigslist. The link is extracted, but the title is not.

Here is the page markup for the element:

<span class="pl"> 
  <time datetime="2016-04-09 14:10" title="Sat 09 Apr 02:10:57 PM">Apr 9</time> 
  <a href="/nby/npo/5531527495.html" data-id="5531527495" class="hdrlnk">
  <span id="titletextonly">Therapist</span>

Here is the spider code:

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.xpath("//span[@class='pl']")
        items = []
        for titles in titles:
            item = CraigslistSampleItem()
            item["title"] = titles.select("a/text()").extract()
            item["link"] = titles.select("a/@href").extract()
            items.append(item)
        return items

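If I inspect the element in Chrome and copy its XPath, I get this for the title: //*[@id='titletextonly'], but that gives me the full list of titles rather than just the title for the link (in this case, I should get the link '/nby/npo/5531527495.html' and the title 'Therapist').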

I know this line

item["title"] = titles.select("a/text()").extract()

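needs to be updated, but if I enter //*[@id='titletextonly'] I get every single title. So I'm close, but I can't figure out how to get the XPath for the 'titletextonly' element relative to the element currently being processed.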

I'm new to Scrapy and XPath, so please bear with me.

Thanks.

Answer 1

Change the XPath as follows so that it steps through the 'span' tag.

item["title"] = titles.select("a/span/text()").extract()

Answer 2

a/text() selects only the text nodes that are direct children of the a element. The text you want is not a child of the a element; it is inside the span.

I haven't used Scrapy, but I suggest trying this:

item["title"] = titles.select("string(a)").extract()

This should get the string value of the a element, which includes all the text inside it.

If that still doesn't work, you could also try:

item["title"] = titles.select("a//text()").extract()
