Struggling with xPath sub selecting with conditionals

Question

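I've been struggling with some XPath manipulation lately. I have an HTML scraper in Python that walks the HTML tree to a specific set of <li> elements and extracts their text(). The problem is that some of those <li> elements contain an <i class='ok'></i> that has no text inside it.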

<html>
  <body>
    <div>
      <ul>
        <li>Text...</li>
        <li>Other text...</li>
        <li><i class='ok'></i></li>
        <li><i class='ok'></i>Another text...</li>
      </ul>
    </div>
  </body>
</html>

My current XPath selector is as follows:

row_value = '(//div[contains(@id,"phone_columns")]' \
            '/div/ul[contains(@class,"phone_column_features")]' \
            '/li/text() | ' \
            '//div[contains(@id,"phone_columns")]' \
            '/div/ul[contains(@class,"phone_column_features")]' \
            '/li/i/@class)'

I want to get the class value in some cases, but most of the time text() alone is fine.

Current output:

[ "Text...", "Other text...", "ok", "ok", "Another text..." ]

Desired output:

[ "Text...", "Other text...", "ok", "ok Another text..." ]

Thanks in advance, 塞萨尔·利德克

Answer 1

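Normally an XPath like //li/concat(i/@class, text()) should do the trick, but I'm fairly sure lxml doesn't support that syntax: a function call used as a path step is XPath 2.0, and lxml only implements XPath 1.0.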

You can use somewhat more involved code instead:

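import lxml.html

source = lxml.html.fromstring(your_HTML)  # your_HTML: the page markup as a string
li_nodes = source.xpath("//div/ul/li")  # replace this simplified XPath with the actual XPath for the li nodes

# class of the <i> child if there is one, otherwise a blank placeholder
class_values = [i.xpath("./i/@class")[0] if i.xpath("./i/@class") else " " for i in li_nodes]
# all text inside the <li>, otherwise a blank placeholder
text_nodes = [i.text_content() if i.text_content() else " " for i in li_nodes]

# join class and text, then strip the padding left by the placeholders
output = [" ".join(item).strip() for item in zip(class_values, text_nodes)]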

Output of print(output):

['Text...', 'Other text...', 'ok', 'ok Another text...']
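
As an alternative sketch: the concat() idea can still be used with lxml if it is the whole XPath expression and is evaluated once per <li>, which is valid XPath 1.0. The snippet below assumes the sample markup from the question and the same simplified //div/ul/li path. The names HTML and expr are just illustrative. Note that normalize-space() also collapses runs of whitespace inside the text, which may or may not be what you want.

```python
import lxml.html

# Sample markup taken from the question
HTML = """
<html>
  <body>
    <div>
      <ul>
        <li>Text...</li>
        <li>Other text...</li>
        <li><i class='ok'></i></li>
        <li><i class='ok'></i>Another text...</li>
      </ul>
    </div>
  </body>
</html>
"""

source = lxml.html.fromstring(HTML)
li_nodes = source.xpath("//div/ul/li")  # simplified XPath; replace with the real one

# concat() is the top-level expression here, evaluated relative to each <li>.
# i/@class becomes "" when there is no <i>; "." is the full text of the <li>.
# normalize-space() trims the separator left over when one side is empty.
expr = "normalize-space(concat(i/@class, ' ', .))"
output = [str(li.xpath(expr)) for li in li_nodes]

print(output)
```

For the sample markup this should print the same list as above, without building the two intermediate lists.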

python

xpath

lxml