BeautifulSoup troubleshooting involving a flat HTML hierarchy and a next_sibling loop
So I have HTML with a flat hierarchy, like this:
<div class="caption">
<strong>July 1</strong>
<br>
<em>Top Gun</em>
<br>
"Location: Millennium Park"
<br>
"Amenities: Please be a volleyball tournament..."
<br>
<em>Captain Phillips</em>
<br>
"Location: Montgomery Ward Park"
<br>
<br>
<strong>July 2</strong>
<br>
<em>The Fantastic Mr. Fox </em>
And I've been tripping over my code from the very start. Am I using next_sibling incorrectly, or is something else wrong here? I get nothing back when I run print title. Thanks, all.
import csv
import re
from bs4 import BeautifulSoup
from urllib2 import urlopen

URL = 'http://www.thrillist.com/entertainment/chicago/free-outdoor-summer-movies-chicago'
html = urlopen(URL).read()
soup = BeautifulSoup(html)
root = soup.find_all("strong")
for row in root:
    sibling = row.next_sibling
    while sibling and sibling.name != "strong":
        if sibling.name == "em":
            title = sibling.text
            print title  # <---- still not getting the movie titles under the <em> tag
        sibling = sibling.next_sibling
Setting the underlying parser to lxml (which needs to be installed) or html.parser solved the problem for me (as usual, all credit goes to @abarnert). Demo:
>>> from bs4 import BeautifulSoup
>>> from urllib2 import urlopen
>>>
>>> URL = 'http://www.thrillist.com/entertainment/chicago/free-outdoor-summer-movies-chicago'
>>> html = urlopen(URL).read()
>>> len(BeautifulSoup(html, "html.parser").find_all('strong'))
81
>>> len(BeautifulSoup(html, "lxml").find_all('strong'))
81
>>> len(BeautifulSoup(html, "html5lib").find_all('strong'))
0
Note that if you don't specify a parser explicitly, BeautifulSoup picks one automatically:
If you don’t specify anything, you’ll get the best HTML parser that’s
installed. Beautiful Soup ranks lxml’s parser as being the best, then
html5lib’s, then Python’s built-in parser.
I suspect that in your case html5lib was chosen and, as the demo shows, it has trouble with this page: it finds no strong tags, which is why you never see a title printed.
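To avoid depending on whichever parser happens to be installed, the parser name can be passed explicitly. Here is a minimal sketch in Python 3 that uses a shortened inline copy of the markup from the question instead of the live page. The names SAMPLE and count_strong are illustrative and not part of the original code, and this small sample is not expected to reproduce the html5lib failure seen on the real page.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative copy of the markup from the question (an assumption,
# not the real page).
SAMPLE = """
<div class="caption">
<strong>July 1</strong><br>
<em>Top Gun</em><br>
Location: Millennium Park<br>
<strong>July 2</strong><br>
<em>The Fantastic Mr. Fox</em>
</div>
"""


def count_strong(markup, parser="html.parser"):
    """Count <strong> tags using an explicitly named parser."""
    soup = BeautifulSoup(markup, parser)
    return len(soup.find_all("strong"))


# "html.parser" ships with Python; "lxml" would need a separate install.
print(count_strong(SAMPLE))
```

Naming the parser in every BeautifulSoup call keeps the result the same from one machine to another, whatever else is installed.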
Also, as @abarnert pointed out, you need to exit the inner loop once you hit the next strong tag:
root = soup.find_all("strong")
for row in root:
    for sibling in row.next_siblings:
        if sibling.name == "strong":
            break
        if sibling.name == "em":
            print sibling.text
This prints:
A League of Their Own
It's a Mad, Mad, Mad, Mad World
Monsters University
...
Cloudy with a Chance of Meatballs 2
Best in Show
Ironman 3
Sean Cooley is Thrillist's Chicago Editor and is still mad that Ben Affleck is the new Batman. Follow him @SeanCooley.
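If the goal is to keep each title attached to its date rather than just print it, the same sibling walk can collect results into a dictionary. This is a sketch in Python 3 under stated assumptions: SAMPLE is a trimmed copy of the markup from the question, titles_by_date is a hypothetical helper name, and html.parser is chosen explicitly so the result does not depend on which parsers are installed.

```python
from bs4 import BeautifulSoup, Tag

# Trimmed, illustrative copy of the markup from the question.
SAMPLE = """<div class="caption">
<strong>July 1</strong>
<br>
<em>Top Gun</em>
<br>
Location: Millennium Park
<br>
<em>Captain Phillips</em>
<br>
Location: Montgomery Ward Park
<br>
<br>
<strong>July 2</strong>
<br>
<em>The Fantastic Mr. Fox </em>
</div>"""


def titles_by_date(markup):
    """Map each <strong> date to the <em> titles that follow it."""
    soup = BeautifulSoup(markup, "html.parser")
    result = {}
    for strong in soup.find_all("strong"):
        titles = []
        for sibling in strong.next_siblings:
            # Skip plain text nodes; only tags can be <strong> or <em>.
            if not isinstance(sibling, Tag):
                continue
            # The next date starts a new group, so stop here.
            if sibling.name == "strong":
                break
            if sibling.name == "em":
                titles.append(sibling.get_text(strip=True))
        result[strong.get_text(strip=True)] = titles
    return result


for date, titles in titles_by_date(SAMPLE).items():
    print(date, titles)
```

On the real page some em tags hold text that is not a movie title (the author line at the end of the output above is one), so extra filtering would likely be needed there.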