Tags are converted to HTML entities?

I am trying to use BeautifulSoup to parse some dirty HTML. One such page is http://f10.5post.com/forums/showthread.php?t=1142017

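Two things go wrong. First, the tree is missing a large chunk of the page. Second, tostring(tree) converts about half of the page's tags (such as <div>) into HTML entities (such as &lt;/div&gt;). For example: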

Original:

<div class="smallfont" align="center">All times are GMT -4. The time now is <span class="time">02:12 PM</span>.</div>

tostring(tree) gives:

&lt;div class="smallfont" align="center"&gt;All times are GMT -4. The time now is &lt;span class="time"&gt;02:12 PM&lt;/span&gt;.&lt;/div&gt;

Here is my code:

from BeautifulSoup import BeautifulSoup
import urllib2

page = urllib2.urlopen("http://f10.5post.com/forums/showthread.php?t=1142017")
soup = BeautifulSoup(page)

print soup

Thanks

Use beautifulsoup4 and the extremely lenient html5lib parser:

import urllib2
from bs4 import BeautifulSoup  # NOTE: importing beautifulsoup4 here

page = urllib2.urlopen("http://f10.5post.com/forums/showthread.php?t=1142017")
soup = BeautifulSoup(page, "html5lib")

print soup
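
For anyone on Python 3, here is a rough sketch of the same idea. It is not part of the original answer: `parse_lenient` is a hypothetical helper name, the fallback to the built-in `html.parser` is my own assumption for machines where html5lib is not installed, and the inline snippet stands in for the downloaded page so the check runs without network access.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_lenient(markup):
    """Hypothetical helper: prefer html5lib, fall back to html.parser.

    bs4 raises FeatureNotFound when the requested parser is not installed.
    """
    try:
        return BeautifulSoup(markup, "html5lib")
    except FeatureNotFound:
        # Assumption: html.parser is good enough for this small snippet.
        return BeautifulSoup(markup, "html.parser")


# Stand-in for the page content. In Python 3 the download itself would use
# urllib.request.urlopen(url).read() instead of urllib2.urlopen(url).
snippet = (
    '<div class="smallfont" align="center">All times are GMT -4. '
    'The time now is <span class="time">02:12 PM</span>.</div>'
)

soup = parse_lenient(snippet)
div = soup.find("div", class_="smallfont")
span_text = div.find("span", class_="time").get_text()
serialized = str(div)

print(span_text)
# True when the tags survived as real tags instead of &lt;...&gt; entities
print("&lt;" not in serialized)
```

If the second line printed is `True`, the footer was kept as real elements rather than escaped text, which is the behaviour the question was missing.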