Beautiful Soup filter function fails to find all rows of a table

Question

I am trying to parse a large HTML document using the Python Beautiful Soup 4 library.

The page contains a very large table, structured like this:

<table summary='foo'>
    <tbody>
        <tr> 
            A bunch of data 
        </tr>
        <tr>
            More data 
        </tr>
        .
        .
        .
        100s of <tr> tags later
    </tbody>
</table>

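I have a function that evaluates whether a given tag in soup.descendants is of the kind I am looking for. This is necessary because the page is large (BeautifulSoup tells me the document contains roughly 4000 tags). It looks like this: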

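def isrow(tag):
    # Match only <tr> tags whose grandparent is a <table> with a summary attribute.
    if tag.name != 'tr':
        return False
    grandparent = tag.parent.parent if tag.parent is not None else None
    return (grandparent is not None
            and grandparent.name == 'table'
            and grandparent.has_attr('summary'))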

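My problem is that when I iterate over soup.descendants, the function only returns True for the first 77 rows of the table, even though I know the <tr> tags continue for hundreds of rows.

Is there something wrong with my function, or do I misunderstand how BeautifulSoup generates its collection of descendants? I suspect this could be a Python or bs4 memory issue, but I don't know how to troubleshoot it.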

Answer 1

This is more of an educated guess, but I'll give it a shot.

How BeautifulSoup parses HTML depends heavily on the underlying parser. If you don't specify one explicitly, BeautifulSoup picks one automatically based on an internal ranking:

If you don’t specify anything, you’ll get the best HTML parser that’s installed. Beautiful Soup ranks lxml’s parser as being the best, then html5lib’s, then Python’s built-in parser.
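
To get a rough idea of which parser that default would be on a given machine, the ranking quoted above can be approximated with the standard library alone. This is only a sketch: guess_default_parser is a hypothetical helper, not part of bs4, and it simply mirrors the quoted order (lxml, then html5lib, then the built-in parser) rather than Beautiful Soup's actual selection logic.

```python
import importlib.util


def _is_installed(module_name):
    """Return True if the module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def guess_default_parser(is_installed=_is_installed):
    """Hypothetical helper: guess the parser picked when none is specified.

    Assumes the ranking quoted from the documentation:
    lxml first, then html5lib, then Python's built-in html.parser.
    """
    for parser_name in ("lxml", "html5lib"):
        if is_installed(parser_name):
            return parser_name
    return "html.parser"


print("Likely default parser:", guess_default_parser())
```

If the result differs between the machine where the parse works and the one where it stops early, that would be a hint that the parser is the variable to control.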

In your case, I'd try switching parsers and see what results you get:

soup = BeautifulSoup(data, "lxml")  # needs lxml to be installed
soup = BeautifulSoup(data, "html5lib")  # needs html5lib to be installed
soup = BeautifulSoup(data, "html.parser")  # uses built-in html.parser
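
To compare the parsers side by side, one option is to count the matching rows under each of them. The following is a minimal sketch, assuming bs4 is installed; SAMPLE_HTML and count_rows are made up for illustration, and in practice you would pass in your own document. Parsers that are not installed are reported as None instead of raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical stand-in for the real page; replace with your own HTML.
SAMPLE_HTML = """
<table summary="foo">
  <tbody>
    <tr><td>row 1</td></tr>
    <tr><td>row 2</td></tr>
    <tr><td>row 3</td></tr>
  </tbody>
</table>
"""


def count_rows(html, parser):
    """Count <tr> tags inside a <table> that has a summary attribute.

    Returns None when the requested parser is not installed.
    """
    try:
        soup = BeautifulSoup(html, parser)
    except FeatureNotFound:
        return None
    return sum(
        1
        for row in soup.find_all("tr")
        if row.find_parent("table", summary=True) is not None
    )


for parser in ("lxml", "html5lib", "html.parser"):
    print(parser, count_rows(SAMPLE_HTML, parser))
```

If the counts differ on your real document, the parser that stops early is probably tripping over malformed markup near the last row it finds.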
