How to make BeautifulSoup "understand" the &plus; HTML entity
Suppose we have an HTML file like this:
test.html
<div>
<i>Some text here.</i>
Some text here also.<br>
2 &plus; 4 = 6<br>
2 &lt; 4 = True
</div>
If I pass this HTML to BeautifulSoup, it escapes the & sign of the plus entity, and the output HTML looks something like this:
<div>
<i>Some text here.</i>
Some text here also.<br>
2 &amp;plus 4 = 6<br>
2 &lt; 4 = True
</div>
Sample Python 3 code:
from bs4 import BeautifulSoup

with open('test.html', 'rb') as file:
    soup = BeautifulSoup(file, 'html.parser')
print(soup)
How can I avoid this behavior?
Read the descriptions of the different parser libraries: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
This may solve your problem:
from bs4 import BeautifulSoup  # the html5lib parser must be installed separately

s = '''
<div>
<i>Some text here.</i>
Some text here also.<br>
2 &plus; 4 = 6<br>
2 &lt; 4 = True
</div>'''
soup = BeautifulSoup(s, 'html5lib')
You get:
>>> soup
<html><head></head><body><div>
<i>Some text here.</i>
Some text here also.<br/>
2 + 4 = 6<br/>
2 &lt; 4 = True
</div></body></html>
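If switching parsers is not an option, here is a rough sketch of a pre-processing workaround. It rests on an assumption the thread does not confirm: that the problem comes from &plus; being an HTML5 named reference that is missing from the older HTML 4 entity table (Python's standard library exposes both tables in html.entities). The helper name expand_html5_only_entities is made up for this illustration, and Beautiful Soup's behavior may differ between versions.

```python
import re
from html.entities import html5, name2codepoint

# Matches named references such as "&plus;" (hypothetical helper, not part of bs4).
_NAMED_REF = re.compile(r'&([A-Za-z][A-Za-z0-9]*);')

# Characters that must stay escaped so the markup structure is not altered.
_MARKUP_CHARS = {'<', '>', '&', '"', "'"}


def expand_html5_only_entities(markup: str) -> str:
    """Replace named references that exist only in the HTML5 table
    with their literal characters, leaving HTML 4 entities untouched."""
    def _replace(match):
        name = match.group(1)
        if name in name2codepoint:
            # Known HTML 4 entity (e.g. lt, amp): leave it for the parser.
            return match.group(0)
        char = html5.get(name + ';')
        if char is None or char in _MARKUP_CHARS:
            # Unknown name, or a character that would change the markup.
            return match.group(0)
        return char
    return _NAMED_REF.sub(_replace, markup)


sample = "2 &plus; 4 = 6<br>\n2 &lt; 4 = True"
print(expand_html5_only_entities(sample))
```

The result can then be passed to BeautifulSoup(..., 'html.parser') as before. Only &plus; is rewritten to a literal +; &lt; is left alone so the parser still sees it as an escaped character and not as the start of a tag.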