Tags are converted to HTML entities?
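I'm trying to use BeautifulSoup to parse some dirty HTML. One such page is http://f10.5post.com/forums/showthread.php?t=1142017
Two things go wrong. First, the tree is missing a large chunk of the page. Second, tostring(tree) converts the tags on half of the page (such as <div>) into HTML entities (such as &lt;/div&gt;). For example:
Original: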
<div class="smallfont" align="center">All times are GMT -4. The time now is <span class="time">02:12 PM</span>.</div>
toString(tree) gives:
&lt;div class="smallfont" align="center"&gt;All times are GMT -4. The time now is &lt;span class="time"&gt;02:12 PM&lt;/span&gt;.&lt;/div&gt;
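To make the symptom concrete, here is a minimal Python 3 sketch that uses only the standard library. It simulates what happens when markup is treated as plain text and entity-escaped on output, then counts the tag-like sequences that ended up as entities. The helper name escaped_tag_count and the regular expression are illustrative assumptions, not part of BeautifulSoup.

```python
import html
import re


def escaped_tag_count(serialized):
    """Count tag-like sequences that appear as entities, e.g. &lt;div ...&gt;."""
    return len(re.findall(r"&lt;/?[A-Za-z][^&]*?&gt;", serialized))


original = (
    '<div class="smallfont" align="center">All times are GMT -4. '
    'The time now is <span class="time">02:12 PM</span>.</div>'
)

# Simulate the symptom: markup handled as text has < and > escaped on output.
mangled = html.escape(original, quote=False)

print(escaped_tag_count(original))  # real tags, nothing escaped
print(escaped_tag_count(mangled))   # every tag now shows up as entities
```

Running a check like this on the serialized tree is a quick way to tell whether a given parser swallowed part of the page as text.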
Here's my code:
from BeautifulSoup import BeautifulSoup
import urllib2
page = urllib2.urlopen("http://f10.5post.com/forums/showthread.php?t=1142017")
soup = BeautifulSoup(page)
print soup
Thanks
Use beautifulsoup4 with the extremely lenient html5lib parser:
import urllib2
from bs4 import BeautifulSoup # NOTE: importing beautifulsoup4 here
page = urllib2.urlopen("http://f10.5post.com/forums/showthread.php?t=1142017")
soup = BeautifulSoup(page, "html5lib")
print soup
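For readers on Python 3, the same idea can be sketched without hitting the network. The snippet below is an assumption-laden illustration: DIRTY_HTML is a hypothetical stand-in for the forum page (its div is deliberately left unclosed), and parse_lenient is a helper name invented here. It tries html5lib first and falls back to other parsers only if html5lib is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical stand-in for the forum page: the <div> is never closed.
DIRTY_HTML = (
    '<div class="smallfont" align="center">All times are GMT -4. '
    'The time now is <span class="time">02:12 PM</span>.'
)


def parse_lenient(markup):
    """Parse with html5lib if available, otherwise fall back to another parser."""
    for parser in ("html5lib", "lxml", "html.parser"):
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # this parser is not installed, try the next one
    raise RuntimeError("no usable parser found")


soup, parser_used = parse_lenient(DIRTY_HTML)
time_text = soup.find("span", class_="time").get_text()
print(parser_used, time_text)
```

To parse the live page on Python 3, urllib.request.urlopen takes the place of urllib2.urlopen, and print becomes a function.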