Search for texts recursively under a td node in an html table

Question

I'm scraping an HTML table with Python. So far, I have managed to parse out the table:

from lxml import etree

root = etree.fromstring(browser.page_source, etree.HTMLParser())
rows = root.xpath("//table[@class='ms-listviewtable']/tbody/tr")

Now I want to parse the columns of each row one by one in a for loop, for example:

for row in rows:
    cols = row.xpath("./td")
    texts = [col.xpath("./findtextforme()") for col in cols]
    # findtextforme() is an imaginary function

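Why can't I simply do col.xpath("./text()") or col.findtext("./")? Because the place where they put the text is inconsistent between the table's columns, and even within a single column: td/text(), td/div/a/text(), td/div/font/text(), td/div/div/text(), and so on.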

So I want something that finds text recursively under a given td node. How can I do this?

Answer 1

You can use .text_content() to aggregate the "text" of an HTML element:

Returns the text content of the element, including the text content of its children, with no markup.

texts = [col.text_content() for col in cols]
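
As a minimal sketch of how this could fit into the loop from the question: to my understanding, text_content() is a method of lxml.html elements, so the example below parses with lxml.html.fromstring rather than etree.HTMLParser(). The page_source string is made-up sample markup standing in for browser.page_source, and table_texts is a hypothetical name. For elements parsed with plain lxml.etree, the XPath string() function is shown as one possible alternative.

```python
import lxml.html

# Made-up sample markup standing in for browser.page_source (an assumption).
# Each cell nests its text differently, as described in the question.
page_source = (
    "<html><body>"
    "<table class='ms-listviewtable'><tbody><tr>"
    "<td>plain</td>"
    "<td><div><a href='#'>link</a></div></td>"
    "<td><div><font>styled</font></div></td>"
    "<td><div><div>nested</div></div></td>"
    "</tr></tbody></table>"
    "</body></html>"
)

# lxml.html.fromstring returns lxml.html elements, which provide text_content().
root = lxml.html.fromstring(page_source)
rows = root.xpath("//table[@class='ms-listviewtable']/tbody/tr")

table_texts = []    # one list of cell texts per row (hypothetical name)
via_xpath = []      # the same data, collected with XPath string() instead
for row in rows:
    cols = row.xpath("./td")
    # text_content() concatenates all text under the td, however deeply nested.
    table_texts.append([col.text_content().strip() for col in cols])
    # string() returns the string value of the td node, i.e. the concatenated
    # text of all its descendants; it does not depend on lxml.html.
    via_xpath.append([col.xpath("string()").strip() for col in cols])

print(table_texts)
```

Both approaches return one string per cell. If the pieces of text need to be kept separate, col.itertext() yields them one by one instead of joining them.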

python

xpath

lxml

web-crawler