Tags are converted to HTML entities?

Question

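I am trying to use BeautifulSoup to parse some messy HTML. One such page is http://f10.5post.com/forums/showthread.php?t=1142017

Two things go wrong. First, the tree is missing a large chunk of the page. Second, tostring(tree) converts the tags on half of the page (such as <div>) into HTML entities (such as &lt;div&gt;). For example: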

Original:

<div class="smallfont" align="center">All times are GMT -4. The time now is <span class="time">02:12 PM</span>.</div>

tostring(tree) gives:

&lt;div class="smallfont" align="center"&gt;All times are GMT -4. The time now is &lt;span class="time"&gt;02:12 PM&lt;/span&gt;.&lt;/div&gt;
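
The two snippets differ only in entity escaping, which is consistent with the parser treating that part of the page as plain text and escaping it on output. A minimal sketch of that check, assuming Python 3.4 or later (the `html` module used here is not available under that name in Python 2); the variable names are illustrative only:

```python
import html

# The markup as it appears in the page source.
original = (
    '<div class="smallfont" align="center">All times are GMT -4. '
    'The time now is <span class="time">02:12 PM</span>.</div>'
)

# The same fragment as it appears in the serialized tree.
escaped = (
    '&lt;div class="smallfont" align="center"&gt;All times are GMT -4. '
    'The time now is &lt;span class="time"&gt;02:12 PM&lt;/span&gt;.&lt;/div&gt;'
)

# Escaping the original as text (leaving quotes alone) reproduces the output,
# and unescaping the output recovers the original markup.
print(html.escape(original, quote=False) == escaped)
print(html.unescape(escaped) == original)
```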

This is my code:

from BeautifulSoup import BeautifulSoup
import urllib2

page = urllib2.urlopen("http://f10.5post.com/forums/showthread.php?t=1142017")
soup = BeautifulSoup(page)

print soup

Thanks.

Answer 1

Use beautifulsoup4 with the extremely lenient html5lib parser:

import urllib2
from bs4 import BeautifulSoup  # NOTE: importing beautifulsoup4 here

page = urllib2.urlopen("http://f10.5post.com/forums/showthread.php?t=1142017")
soup = BeautifulSoup(page, "html5lib")

print soup
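
For Python 3, the same idea might look like the sketch below. It is only an illustration under a few assumptions: `make_soup` and the inline `broken` snippet are hypothetical names introduced here, and the fallback order (html5lib, then lxml, then the built-in html.parser) is one reasonable choice rather than a requirement. `FeatureNotFound` is the exception bs4 raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("html5lib", "lxml", "html.parser")):
    """Parse markup with the first parser in `parsers` that is installed.

    Returns a (soup, parser_name) tuple. Hypothetical helper, not part of bs4.
    """
    for name in parsers:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


# Deliberately broken markup: the <div> is never closed.
broken = (
    '<div class="smallfont" align="center">The time now is '
    '<span class="time">02:12 PM</span>.'
)

soup, used = make_soup(broken)
span = soup.find("span", class_="time")
print(used, span.get_text())
```

For the real page, the bytes returned by `urllib.request.urlopen(url).read()` would be passed in place of `broken`. html5lib is a separate package and has to be installed alongside beautifulsoup4 for the first choice to be used.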

html

python

parsing

beautifulsoup

html-parsing