Scraping un-closed meta tags with BS4

Question

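I'm trying to get the content of meta tags. The problem is that BS4 can't parse the tags properly on some sites, where the tags are not closed the way they should be. Taking the tag below as an example, the output of my function includes a lot of clutter, including other tags such as script, link, etc. I believe browsers automatically close the meta tag somewhere near the end of the head, and that this behaviour confuses BS4.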

My code works for this:

<meta name="description" content="content" />

but not for this:

<meta name="description" content="content">

This is the code of my BS4 function:

from bs4 import BeautifulSoup

html = BeautifulSoup(open('/path/file.html'), 'html.parser')
desc = html.find(attrs={'name':'description'})

print(desc)

Is there any way to make it work with those un-closed meta tags?

Answer 1

The html5lib or lxml parser (both are separate installs) handles this properly:

In [1]: from bs4 import BeautifulSoup
   ...: 
   ...: data = """
   ...: <html>
   ...:     <head>
   ...:         <meta name="description" content="content">
   ...:         <script>
   ...:             var i = 0;
   ...:         </script>
   ...:     </head>
   ...:     <body>
   ...:         <div id="content">content</div>
   ...:     </body>
   ...: </html>"""
   ...: 

In [2]: BeautifulSoup(data, 'html.parser').find(attrs={'name': 'description'})
Out[2]: <meta content="content" name="description">\n<script>\n            var i = 0;\n        </script>\n</meta>

In [3]: BeautifulSoup(data, 'html5lib').find(attrs={'name': 'description'})
Out[3]: <meta content="content" name="description"/>

In [4]: BeautifulSoup(data, 'lxml').find(attrs={'name': 'description'})
Out[4]: <meta content="content" name="description"/>
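Building on the session above, here is a minimal sketch of a helper that tries the parsers in a fixed order and reads only the content attribute. The names PREFERRED_PARSERS, make_soup and meta_description are made up for illustration, and the parser order is just an assumption about what you would prefer. Reading the attribute rather than printing the whole tag also means any children wrongly attached to the meta tag never show up in the result.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical preference order: 'lxml' and 'html5lib' are optional
# third-party installs, 'html.parser' ships with Python.
PREFERRED_PARSERS = ("lxml", "html5lib", "html.parser")


def make_soup(markup):
    """Parse markup with the first parser in PREFERRED_PARSERS that is installed."""
    for parser in PREFERRED_PARSERS:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            continue  # this parser is not installed, try the next one
    raise RuntimeError("no usable parser found")


def meta_description(markup):
    """Return the content attribute of <meta name="description">, or None."""
    tag = make_soup(markup).find("meta", attrs={"name": "description"})
    if tag is None:
        return None
    # Only the attribute is read, so whatever the parser nested
    # inside an un-closed <meta> tag is ignored.
    return tag.get("content")


sample = (
    '<html><head><meta name="description" content="content">'
    "<script>var i = 0;</script></head><body></body></html>"
)
print(meta_description(sample))
```

If all you need is the description text, the attribute lookup alone may be enough even when you stay on 'html.parser'.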

Answer 2

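Here is something more that I hope will help. My guess is that whenever BeautifulSoup finds an element without a proper end tag, it keeps going through the following elements until it reaches the end tag of the parent. In case the idea isn't clear yet, I made a small demo: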

    hello.html
<!DOCTYPE html>
    <html lang="en">
    <meta name="description" content="content">
    <head>
        <meta charset="UTF-8">
        <title>Title</title>
    </head>
    <div>
    <p class="title"><b>The Dormouse's story</b>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    </p></div>
    </body>
    </html>

Running it the same way you did gives the result below:

<meta content="content" name="description">
<head>
<meta charset="utf-8">
<title>Title</title>
</meta></head>
<body>
...
</div></body>
</meta>

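OK! BeautifulSoup generated the closing meta tag automatically and placed it after the </body> tag, yet the end tag of meta's parent, </html>, is nowhere to be seen. So what I mean is that the generated end tag ends up at the same level as its start tag. I still couldn't fully convince myself of this, so I ran a test: I removed the end tag of <p class='title'>, so that only one </p> tag is left inside <div>...</div>, but after running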

c = soup.find_all('p', attrs={'class':'title'})
print(c[0])

the result contains two </p> tags. So it is just as I said above.
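To see where this comes from, it may help to look at the event stream underneath. BS4's 'html.parser' builder is driven by the standard library's HTMLParser, which only reports the tags it actually sees. The sketch below uses the standard library alone (EventRecorder and tag_events are made-up names) and records start and end events for a head with an un-closed meta tag. No end event is ever reported for meta, so the tree builder has to decide for itself where to close it, and the exact placement can differ between BS4 versions.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Record (event, tag) pairs emitted by the standard-library parser."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def tag_events(markup):
    """Feed markup to an EventRecorder and return the recorded events."""
    recorder = EventRecorder()
    recorder.feed(markup)
    recorder.close()
    return recorder.events


unclosed = (
    '<head><meta name="description" content="content">'
    "<title>Title</title></head>"
)
events = tag_events(unclosed)
print(events)
# The un-closed tag produces a start event but never an end event.
print(("end", "meta") in events)
```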

python

beautifulsoup

html-parsing

python-3.x