CSS selector or XPath that gets information between two i tags?
I am trying to scrape price information, and the website's HTML looks like this:
<span class="def-price" datasku='....'>
<i>$</i>
"999"
<i>.00<i>
</span>
I want 999. (I don't want the dollar sign or the .00.) Right now I have
product_price_sn = product.css('.def-price i').extract()
I know that is wrong, but I'm not sure how to fix it. Any idea how to scrape the price? Thanks!
Scrapy implements an extension for this, since it is not part of the CSS selector standard, so this should work for you:
product_price_sn = product.css('.def-price i::text').extract()
Here is what the docs say:
Per W3C standards, CSS selectors do not support selecting text nodes or attribute values. But selecting these is so essential in a web scraping context that Scrapy (parsel) implements a couple of non-standard pseudo-elements:
to select text nodes, use ::text
to select attribute values, use ::attr(name) where name is the name of the attribute that you want the value of
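To make that concrete, here is a minimal, self-contained sketch that uses parsel (the selector library behind Scrapy) directly on the question's HTML; the datasku value 'ABC123' is a made-up placeholder and the second <i> tag is closed for clarity:

from parsel import Selector

html = '''<span class="def-price" datasku='ABC123'>
<i>$</i>
"999"
<i>.00</i>
</span>'''

sel = Selector(text=html)

# ::text selects the text nodes of the matched elements
print(sel.css('.def-price i::text').getall())   # ['$', '.00']
print(sel.css('.def-price::text').getall())     # ['\n', '\n"999"\n', '\n']

# ::attr(name) selects the value of an attribute
print(sel.css('.def-price::attr(datasku)').get())  # 'ABC123'

getall() and get() are the newer names for extract() and extract_first(); both spellings work in current parsel and Scrapy.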
You can use this XPath: //span[@class="def-price"]/text()
Make sure you use /text() and not //text(); otherwise it will return all the text nodes inside the span tag.
Or use this CSS selector: .def-price::text. When using the CSS selector, do not write .def-price ::text (with a space); that would return all text nodes, just like //text() in XPath.
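A small sketch (again with parsel, and again with the second <i> tag closed for clarity) showing the difference between the two XPath forms and the two CSS forms:

from parsel import Selector

html = '''<span class="def-price">
<i>$</i>
"999"
<i>.00</i>
</span>'''

sel = Selector(text=html)

# /text(): only the span's own (direct) text nodes
print(sel.xpath('//span[@class="def-price"]/text()').getall())
# ['\n', '\n"999"\n', '\n']

# //text(): every text node inside the span, including the <i> contents
print(sel.xpath('//span[@class="def-price"]//text()').getall())
# ['\n', '$', '\n"999"\n', '.00', '\n']

# the same distinction with the CSS pseudo-element
print(sel.css('.def-price::text').getall())    # direct text nodes only
print(sel.css('.def-price ::text').getall())   # all descendant text nodes, like //text()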
Using the scrapy response.xpath object
from scrapy.http import Request, HtmlResponse as Response
content = '''<span class="def-price" datasku='....'>
<i>$</i>
"999"
<i>.00<i>
</span>'''.encode('utf-8')
url = 'http://example.com'  # hypothetical placeholder URL; any URL works for the mocked objects
# mocking scrapy request object
request = Request(url=url)
# mocking scrapy response object
response = Response(url=url, request=request, body=content)
# using xpath
print(response.xpath('//span[@class="def-price"]/text()').extract())
# outputs ['\n ', '\n "999"\n ']
print(''.join(response.xpath('//span[@class="def-price"]/text()').extract()).strip())
# outputs "999"
# using css selector
print(response.css('.def-price::text').extract())
# outputs ['\n ', '\n "999"\n ']
print(''.join(response.css('.def-price::text').extract()).strip())
# outputs "999"
See it in action here
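Since the question asks for just 999, note that the joined-and-stripped result still contains the literal quote characters from the page, so one more clean-up step is needed. One way to do it, continuing from the response object above and using a regular expression to pull out the digits:

import re

raw = ''.join(response.xpath('//span[@class="def-price"]/text()').extract()).strip()
# raw is '"999"' at this point
match = re.search(r'\d[\d.,]*', raw)
price = match.group(0) if match else None
print(price)  # 999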
Using the lxml html parser
from lxml import html
parser = html.fromstring("""
<span class="def-price" datasku='....'>
<i>$</i>
"999"
<i>.00<i>
</span>
"""
)
print(parser.xpath('//span[@class="def-price"]/text()'))
# outputs ['\n ', '\n "999"\n ']
print(''.join(parser.xpath('//span[@class="def-price"]/text()')).strip())
# outputs "999"
See it in action here
With BeautifulSoup, you can use the CSS selector .def-price and then .find_all(text=True, recursive=0) to get all of the element's immediate (direct) text nodes.
For example:
from bs4 import BeautifulSoup
txt = '''<span class="def-price" datasku='....'>
<i>$</i>
"999"
<i>.00<i>
</span>'''
soup = BeautifulSoup(txt, 'html.parser')
print( ''.join(soup.select_one('.def-price').find_all(text=True, recursive=0)).strip() )
Prints:
"999"