How to make BeautifulSoup "understand" the plus html entity

Question

Suppose we have an html file like this:

test.html

<div>
<i>Some text here.</i>
Some text here also.<br>
2 &plus; 4 = 6<br>
2 &lt; 4 = True
</div>

If I pass this html to BeautifulSoup, it escapes the & sign of the plus entity, and the output html looks like this:

<div>
<i>Some text here.</i>
Some text here also.<br>
2 &amp;plus 4 = 6<br>
2 &lt; 4 = True
</div>

Example python3 code:

from bs4 import BeautifulSoup

with open('test.html', 'rb') as file:
    soup = BeautifulSoup(file, 'html.parser')

print(soup)

How can I avoid this behavior?

Answer 1

This should solve your problem: parse with html5lib instead of html.parser.

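from bs4 import BeautifulSoup

s = '''
<div>
<i>Some text here.</i>
Some text here also.<br>
2 &plus; 4 = 6<br>
2 &lt; 4 = True
</div>'''

# html5lib is a separate package and must be installed (pip install html5lib)
soup = BeautifulSoup(s, 'html5lib')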

You get:

>>> soup

<html><head></head><body><div>
<i>Some text here.</i>
Some text here also.<br/>
2 + 4 = 6<br/>
2 &lt; 4 = True
</div></body></html>
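
If installing html5lib is not an option, one possible workaround is to rewrite HTML5-only named entities as numeric character references before parsing. This is only a sketch and not part of the answer above: the helper name numericize_html5_entities is hypothetical, and it assumes the parser decodes numeric references such as &#43; even when it does not recognize the name plus. It relies only on the standard library tables html.entities.html5 and html.entities.name2codepoint.

```python
import re
from html.entities import html5, name2codepoint

# Matches named entities that end with a semicolon, e.g. &plus; or &lt;
_ENTITY_RE = re.compile(r'&([A-Za-z][A-Za-z0-9]*);')


def numericize_html5_entities(markup: str) -> str:
    """Rewrite HTML5-only named entities as numeric character references.

    Hypothetical helper for illustration. Entities from the HTML 4 table
    (such as &lt; and &amp;) and unknown names are left untouched.
    """
    def replace(match):
        name = match.group(1)
        if name in name2codepoint:
            # HTML 4 entity: keep it as written
            return match.group(0)
        chars = html5.get(name + ';')
        if chars is None:
            # Not a known HTML5 entity either: keep it as written
            return match.group(0)
        # Some HTML5 entities expand to more than one code point
        return ''.join('&#%d;' % ord(c) for c in chars)

    return _ENTITY_RE.sub(replace, markup)


print(numericize_html5_entities('2 &plus; 4 = 6<br>'))  # 2 &#43; 4 = 6<br>
```

The rewritten string can then be passed to BeautifulSoup(..., 'html.parser') as in the question. Entities such as &lt; are deliberately left alone so the markup stays valid.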
