Scrapy selector XPath: extract a regex match or slice the string

Question

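I'm new to Scrapy and know a little Python.

I want to populate item['rating']. The rating comes back as the string "rating is 4", but I only want the number. How can I get it?

I came up with the following two attempts, but I don't know whether they make sense. Neither of them works.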

> item_pub['rating'] = review.xpath('/html/body//*/div[@class="details"]/table[@class="detailtoptable"]/tbody/tr[1]/td/img/@alt').re(r'\d+') #to extract only the number since the result with extract() would be "rating is 4"

or

 > item_pub['rating'] = review.xpath('/html/body//*/div[@class="details"]/table[@class="detailtoptable"]/tbody/tr[1]/td/img/@alt')[-1:].extract() #to extract only the number since the result with extract() would be "rating is 4"

Thanks a lot for your help, and sorry for my poor English. I hope my question is clear.

Answer 1

With Beautiful Soup, you can do it like this:

>>> import re
>>> from bs4 import BeautifulSoup
>>> s = r'''<td> <img alt="rating is 4" title="rating is 4" src="/Shared\images\ratingstars_web8.gif"/> </td>'''
>>> soup = BeautifulSoup(s, 'html.parser')
>>> [re.search(r'\d+', i['alt']).group() for i in soup.select('td > img[alt*="rating"]')]
['4']
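
To compare the two ideas from the question (regex versus slicing) on the plain alt string, here is a minimal sketch using only the standard library. The helper names rating_by_regex and rating_by_slice and the two-digit sample value are made up for illustration; they are not part of Scrapy or Beautiful Soup.

```python
import re


def rating_by_regex(alt_text):
    """Return the first run of digits in alt_text, or None if there is none."""
    match = re.search(r"\d+", alt_text)
    return match.group() if match else None


def rating_by_slice(alt_text):
    """Return only the last character; right only for one-digit ratings."""
    return alt_text[-1:]


single = "rating is 4"
double = "rating is 10"  # hypothetical two-digit rating, not from the question

print(rating_by_regex(single), rating_by_slice(single))
print(rating_by_regex(double), rating_by_slice(double))
```

Slicing only works while the number is a single trailing character, so the regex is probably the safer choice if the rating format can vary.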

Answer 2

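Your idea is fine: use a regex. Your XPath is the problem.
Here are some hints:

You don't need /html/body//; you can just use //
There is no need to select every element with //* only to select a single element afterwards. You can go straight to the element you want: //div
If you got this XPath from a browser, there is probably no tbody element in the actual page source, because browsers often add them

Try it like this: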

item_pub['rating'] = review.xpath('//div[@class="details"]/table[@class="detailtoptable"]/tr[1]/td/img/@alt').re_first(r'\d+')

