CSS selector or XPath that gets information between two i tags?
I am trying to scrape price information, and the website's HTML looks like this:
<span class="def-price" datasku='....'>
<i>$</i>
"999"
<i>.00<i>
</span>
I want 999. (I don't want the dollar sign or the .00.) Right now I have
product_price_sn = product.css('.def-price i').extract()
I know that is wrong, but I'm not sure how to fix it. Any idea how to scrape the price? Thanks!
Scrapy implements an extension for this, since it is not part of the CSS selector standard, so this should work for you:
product_price_sn = product.css('.def-price i::text').extract()
Here is what the docs say:
Per W3C standards, CSS selectors do not support selecting text nodes or attribute values. But selecting these is so essential in a web scraping context that Scrapy (parsel) implements a couple of non-standard pseudo-elements:
to select text nodes, use ::text
to select attribute values, use ::attr(name) where name is the name of the attribute that you want the value of
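To make that concrete, here is a minimal, self-contained sketch that uses parsel (the selector library behind Scrapy) directly on the question's HTML; the datasku value 'ABC123' is a made-up placeholder and the second <i> tag is closed for clarity:

from parsel import Selector

html = '''<span class="def-price" datasku='ABC123'>
<i>$</i>
"999"
<i>.00</i>
</span>'''

sel = Selector(text=html)

# ::text selects the text nodes of the matched elements
print(sel.css('.def-price i::text').getall())   # ['$', '.00']
print(sel.css('.def-price::text').getall())     # ['\n', '\n"999"\n', '\n']

# ::attr(name) selects the value of an attribute
print(sel.css('.def-price::attr(datasku)').get())  # 'ABC123'

getall() and get() are the newer names for extract() and extract_first(); both spellings work in current parsel and Scrapy.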
You can use this XPath: //span[@class="def-price"]/text()
Make sure you use /text() and not //text(); otherwise it will return all the text nodes inside the span tag.
Or use this CSS selector: .def-price::text. When using the CSS selector, do not write .def-price ::text (with a space); that would return all text nodes, just like //text() in XPath.
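A small sketch (again with parsel, and again with the second <i> tag closed for clarity) showing the difference between the two XPath forms and the two CSS forms:

from parsel import Selector

html = '''<span class="def-price">
<i>$</i>
"999"
<i>.00</i>
</span>'''

sel = Selector(text=html)

# /text(): only the span's own (direct) text nodes
print(sel.xpath('//span[@class="def-price"]/text()').getall())
# ['\n', '\n"999"\n', '\n']

# //text(): every text node inside the span, including the <i> contents
print(sel.xpath('//span[@class="def-price"]//text()').getall())
# ['\n', '$', '\n"999"\n', '.00', '\n']

# the same distinction with the CSS pseudo-element
print(sel.css('.def-price::text').getall())    # direct text nodes only
print(sel.css('.def-price ::text').getall())   # all descendant text nodes, like //text()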
Using the scrapy response.xpath object
from scrapy.http import Request, HtmlResponse as Response
content = '''<span class="def-price" datasku='....'>
<i>$</i>
"999"
<i>.00<i>
</span>'''.encode('utf-8')
url = 'http://example.com'  # hypothetical placeholder URL; any URL works for the mocked objects
# mocking scrapy request object
request = Request(url=url)
# mocking scrapy response object
response = Response(url=url, request=request, body=content)
# using xpath
print(response.xpath('//span[@class="def-price"]/text()').extract())
# outputs ['\n ', '\n "999"\n ']
print(''.join(response.xpath('//span[@class="def-price"]/text()').extract()).strip())
# outputs "999"
# using css selector
print(response.css('.def-price::text').extract())
# outputs ['\n ', '\n "999"\n ']
print(''.join(response.css('.def-price::text').extract()).strip())
# outputs "999"
See it in action here
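Since the question asks for just 999, note that the joined-and-stripped result still contains the literal quote characters from the page, so one more clean-up step is needed. One way to do it, continuing from the response object above and using a regular expression to pull out the digits:

import re

raw = ''.join(response.xpath('//span[@class="def-price"]/text()').extract()).strip()
# raw is '"999"' at this point
match = re.search(r'\d[\d.,]*', raw)
price = match.group(0) if match else None
print(price)  # 999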
Using the lxml html parser
from lxml import html
parser = html.fromstring("""
<span class="def-price" datasku='....'>
<i>$</i>
"999"
<i>.00<i>
</span>
"""
)
print(parser.xpath('//span[@class="def-price"]/text()'))
# outputs ['\n ', '\n "999"\n ']
print(''.join(parser.xpath('//span[@class="def-price"]/text()')).strip())
# outputs "999"
See it in action here
With BeautifulSoup, you can use the CSS selector .def-price and then .find_all(text=True, recursive=0) to get all of the element's immediate (direct) text nodes.
For example:
from bs4 import BeautifulSoup
txt = '''<span class="def-price" datasku='....'>
<i>$</i>
"999"
<i>.00<i>
</span>'''
soup = BeautifulSoup(txt, 'html.parser')
print( ''.join(soup.select_one('.def-price').find_all(text=True, recursive=0)).strip() )
Prints:
"999"