BeautifulSoup. Wrong element index

Question

I have been parsing HTML ol elements and ran into a problem with element indexes.

Suppose we have the following element:

html_document = """
<ol>
    <li>Test lists</li>
    <li>Second option</li>
    <li>Third option</li>
</ol>
"""

So, let's parse it:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_document, 'html.parser')
all_li = tuple(soup.find_all('li'))
result = [el.parent.index(el) for el in all_li]
print(result)  # [1, 3, 5]

Why 1, 3, 5? Or am I missing something?

Answer 1

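You are calling index() on the parent tag, so you get each item's position among all of the parent's children. Call index() on your tuple of <li> tags instead.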

html_document = """
<ol>
    <li>Test lists</li>
    <li>Second option</li>
    <li>Third option</li>
</ol>
"""

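from bs4 import BeautifulSoup

soup = BeautifulSoup(html_document, 'lxml')
all_li = tuple(soup.find_all('li'))
# Index each element within the tuple of <li> tags, not within its parent.
result = [all_li.index(el) for el in all_li]
print(result)

Output: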

[0, 1, 2]
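
If the goal is simply each item's position among the matched <li> tags, enumerate() yields it directly, without searching the tuple once per element. A minimal sketch, assuming the built-in "html.parser" backend (my choice here, not the answer's; lxml should behave the same for this input):

```python
from bs4 import BeautifulSoup

html_document = """
<ol>
    <li>Test lists</li>
    <li>Second option</li>
    <li>Third option</li>
</ol>
"""

soup = BeautifulSoup(html_document, "html.parser")

# Pair each <li> with its position in the result list.
positions = [(i, li.get_text()) for i, li in enumerate(soup.find_all("li"))]
print(positions)  # [(0, 'Test lists'), (1, 'Second option'), (2, 'Third option')]
```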

Answer 2

In the definition of the index() method, we see the following code:

    def index(self, element):
        """
        Find the index of a child by identity, not value. Avoids issues with
        tag.contents.index(element) getting the index of equal elements.
        """
        for i, child in enumerate(self.contents):
            if child is element:
                return i
        raise ValueError("Tag.index: element not in tag")
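
The docstring's point about identity versus value can be seen with two <li> tags that have identical markup. The snippet below is only an illustrative sketch: the one-line document and the "html.parser" backend are my assumptions, not part of the question.

```python
from bs4 import BeautifulSoup

# Two <li> tags with identical markup and no whitespace between them.
soup = BeautifulSoup("<ol><li>same</li><li>same</li></ol>", "html.parser")
ol = soup.ol
first, second = ol.find_all("li")

by_value = ol.contents.index(second)  # list.index compares with ==
by_identity = ol.index(second)        # Tag.index compares with `is`

print(first == second, first is second)  # True False
print(by_value, by_identity)             # 0 1
```

Because the two tags compare equal, the list lookup stops at the first one, while Tag.index() finds the exact object that was passed in.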

So you really need to look at the .contents attribute, which shows the following members (the children of the <ol> tag):

0 <class 'bs4.element.NavigableString'> 
1 <class 'bs4.element.Tag'> <li>Test lists</li>
2 <class 'bs4.element.NavigableString'> 
3 <class 'bs4.element.Tag'> <li>Second option</li>
4 <class 'bs4.element.NavigableString'> 
5 <class 'bs4.element.Tag'> <li>Third option</li>
6 <class 'bs4.element.NavigableString'>

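In other words, the <ol> parent of the <li> tags has other children as well: the navigable strings that hold the whitespace between the tags. You did not capture them directly because you searched only for list items (soup.find_all('li')).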
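
A listing like the one above can be reproduced by enumerating .contents. To get positions that count only element children, one option is to skip the strings before indexing. This is a rough sketch: tag_index is a hypothetical helper name of my own, and "html.parser" is an assumed backend.

```python
from bs4 import BeautifulSoup

html_document = """
<ol>
    <li>Test lists</li>
    <li>Second option</li>
    <li>Third option</li>
</ol>
"""

soup = BeautifulSoup(html_document, "html.parser")
ol = soup.ol

# Every direct child of <ol>, whitespace strings included.
for i, child in enumerate(ol.contents):
    print(i, type(child).__name__, repr(child))


def tag_index(element):
    """Hypothetical helper: position of `element` among its parent's Tag children."""
    # find_all(True, recursive=False) returns only the direct child tags.
    siblings = element.parent.find_all(True, recursive=False)
    for i, sibling in enumerate(siblings):
        if sibling is element:  # identity check, as in Tag.index()
            return i
    raise ValueError("element not found among its parent's tags")


result = [tag_index(li) for li in soup.find_all("li")]
print(result)  # [0, 1, 2]
```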

python

beautifulsoup

html-parsing