Scraper throws an error even if the right xpath is used

Question

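I've written a script in Python using the lxml library to parse some prices (80 and 100 in this case) out of a chunk of HTML elements. I use xpaths to do the job. When I parsed the markup with .fromstring(), all the xpaths in the scraper below worked. However, when I use HTML imported from lxml.etree instead, the xpath containing a contains() expression fails. It turns out that the scraper works when I match the full compound class name, but it throws an error when I pick a single class name out of the compound class names.

How can I handle this without using compound class names, using a single class name with the contains() pattern or some other approach instead?

This is my attempt: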

from lxml.etree import HTML

elements =\
"""
    <li class="ProductPrice">
      <span class="Regular Price">80.00</span>
    </li>
    <li class="ProductPrice">
      <span class="Regular Price">100.00</span>
    </li>
"""
root = HTML(elements)
for item in root.findall(".//*[@class='ProductPrice']"):
    # regular = item.find('.//span[@class="Regular Price"]').text
    regular = item.find('.//span[contains(@class,"Regular")]').text
    print(regular)

By the way, the commented-out xpath in the script above works fine. But when I go for the contains() expression, it throws the following error:

Traceback (most recent call last):
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python36-32\SO.py", line 15, in <module>
    regular = item.find('.//span[contains(@class,"Regular")]').text
  File "src\lxml\etree.pyx", line 1526, in lxml.etree._Element.find
  File "src\lxml\_elementpath.py", line 311, in lxml._elementpath.find
  File "src\lxml\_elementpath.py", line 300, in lxml._elementpath.iterfind
  File "src\lxml\_elementpath.py", line 283, in lxml._elementpath._build_path_iterator
  File "src\lxml\_elementpath.py", line 229, in lxml._elementpath.prepare_predicate
SyntaxError: invalid predicate

One last thing: I don't want to use compound class names because a few websites generate them dynamically. Thanks.

Answer 1

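.find() supports only a basic subset of XPath (ElementPath), which does not include functions such as contains(). That is why the predicate is rejected as invalid.

Try .xpath() instead, which evaluates full XPath 1.0 expressions and returns a list of matches.

Example (untested):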

regular = item.xpath('.//span[contains(@class,"Regular")]')[0].text

For details, see http://lxml.de/xpathxslt.html.
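For anyone who wants something runnable, here is a minimal sketch built on the markup from the question. The helper name `regular_prices` is made up for illustration. It also swaps in a slightly stricter predicate than plain contains(@class, "Regular"): the common XPath 1.0 idiom of padding the class attribute with spaces, so that "Regular" matches only as a whole class token and not as part of a longer name such as "IrregularPrice". That stricter match is an assumption about what you want; the plain contains() form works with .xpath() too.

```python
from lxml.etree import HTML

elements = """
    <li class="ProductPrice">
      <span class="Regular Price">80.00</span>
    </li>
    <li class="ProductPrice">
      <span class="Regular Price">100.00</span>
    </li>
"""

root = HTML(elements)

# XPath 1.0 idiom for "class attribute contains this whole token".
# Padding both sides with spaces avoids matching substrings of other names.
REGULAR_SPAN = (
    './/span[contains(concat(" ", normalize-space(@class), " "), " Regular ")]'
)


def regular_prices(tree):
    """Hypothetical helper: collect the text of each 'Regular' price span."""
    prices = []
    # findall() is fine here: [@class='...'] is within the ElementPath subset.
    for item in tree.findall(".//*[@class='ProductPrice']"):
        # xpath() returns a list, so guard against an empty result
        # instead of indexing [0] blindly.
        spans = item.xpath(REGULAR_SPAN)
        if spans:
            prices.append(spans[0].text)
    return prices


# For comparison: the same kind of function call inside find() is rejected.
try:
    root.find('.//span[contains(@class,"Regular")]')
except SyntaxError as exc:
    print("find() rejected the predicate:", exc)

print(regular_prices(root))
```

Checking for an empty list before indexing keeps the loop from raising IndexError on items that have no matching span, which is easy to hit on real pages.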

python

xpath

lxml

web-scraping

python-3.x