Beautiful Soup: fetching table data from Wikipedia
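I am following the book "Practical Web Scraping for Data Science: Best Practices and Examples with Python" by Seppe vanden Broucke and Bart Baesens.
The following code is supposed to fetch data from Wikipedia, namely the list of Game of Thrones episodes: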
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/w/index.php' + \
      '?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
r = requests.get(url)
html_contents = r.text
html_soup = BeautifulSoup(html_contents, 'html.parser')
# We'll use a list to store our episode list
episodes = []
ep_tables = html_soup.find_all('table', class_='wikiepisodetable')
for table in ep_tables:
    headers = []
    rows = table.find_all('tr')
    for header in table.find('tr').find_all('th'):
        headers.append(header.text)
        for row in table.find_all('tr')[1:]:
            values = []
            for col in row.find_all(['th','td']):
                values.append(col.text)
                if values:
                    episode_dict = {headers[i]: values[i] for i in
                                    range(len(values))}
                    episodes.append(episode_dict)
                    for episode in episodes:
                        print(episode)
But when I run it, the following error is shown:
{'No.overall': '1'}
IndexError                                Traceback (most recent call last)
<ipython-input-8-d2e64c7e0540> in <module>
20 if values:
21 episode_dict = {headers[i]: values[i] for i in
---> 22 range(len(values))}
23 episodes.append(episode_dict)
24 for episode in episodes:
<ipython-input-8-d2e64c7e0540> in <dictcomp>(.0)
19 values.append(col.text)
20 if values:
---> 21 episode_dict = {headers[i]: values[i] for i in
22 range(len(values))}
23 episodes.append(episode_dict)
IndexError: list index out of range
Can anyone tell me why this happens?
Your traceback is
{'No.overall': '1'}
Traceback (most recent call last):
  File "/Users/karl/code/deleteme/foo.py", line 20, in <module>
    episode_dict = {headers[i]: values[i] for i in
  File "/Users/karl/code/deleteme/foo.py", line 20, in <dictcomp>
    episode_dict = {headers[i]: values[i] for i in
IndexError: list index out of range
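The code is probably indented too deeply, and the choice of variable names makes it a little hard to read. It would help to know exactly what you are trying to extract. The list of episodes?
The table structure may have changed since the book was written.
If it is the list of episodes, then each relevant episode title has this shape: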
<td class="summary" style="text-align:left">"<a href="/wiki/Stormborn" title="Stormborn">Stormborn</a>"</td>
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
r = requests.get(url)
html_contents = r.text
html_soup = BeautifulSoup(html_contents, 'html.parser')
# We'll use a list to store our episode list
episodes = []
ep_tables = html_soup.find_all('table', class_='wikiepisodetable')
for table in ep_tables:
    headers = []
    rows = table.find_all('tr')
    for header in table.find('tr').find_all('th'):
        headers.append(header.text)
    for row in table.find_all('tr')[1:]:
        values = []
        for col in row.find_all('td', class_='summary'):
            print(col.text)
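As a minimal offline sketch of the selector used above, the snippet below parses a small hand-written table instead of the live Wikipedia page. The sample HTML (including the "Example Title" row) and the helper name extract_titles are assumptions made for illustration. Only the shape of the td cell with class "summary" is taken from the markup shown earlier.

```python
from bs4 import BeautifulSoup

# Hand-written sample that imitates the shape of one episode table.
# It is NOT the real Wikipedia markup, only an illustration.
SAMPLE_HTML = """
<table class="wikitable plainrowheaders wikiepisodetable">
  <tr><th>No.overall</th><th>No. inseason</th><th>Title</th></tr>
  <tr>
    <th>1</th><td>1</td>
    <td class="summary" style="text-align:left">"<a href="/wiki/Stormborn" title="Stormborn">Stormborn</a>"</td>
  </tr>
  <tr>
    <th>2</th><td>2</td>
    <td class="summary" style="text-align:left">"Example Title"</td>
  </tr>
</table>
"""


def extract_titles(html):
    """Return the text of every td.summary cell inside episode tables."""
    soup = BeautifulSoup(html, 'html.parser')
    titles = []
    # class_ matches any table whose class list contains 'wikiepisodetable'
    for table in soup.find_all('table', class_='wikiepisodetable'):
        for cell in table.find_all('td', class_='summary'):
            # The quote characters are part of the cell text, so strip
            # surrounding whitespace first and then the quotes.
            titles.append(cell.get_text().strip().strip('"'))
    return titles


titles = extract_titles(SAMPLE_HTML)
print(titles)
```

Working against a fixed string like this makes it easier to check the selector before pointing the same loop at the downloaded page.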
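The problem is not the code itself but its indentation. The third for loop should be at the same level as the second one, not nested inside it. This is how it appears in the book: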
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/w/index.php' + \
      '?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
r = requests.get(url)
html_contents = r.text
html_soup = BeautifulSoup(html_contents, 'html.parser')
# We'll use a list to store our episode list
episodes = []
ep_tables = html_soup.find_all('table', class_='wikitable plainrowheaders wikiepisodetable')
for table in ep_tables:
    headers = []
    rows = table.find_all('tr')
    # Start by fetching the header cells from the first row to determine
    # the field names
    for header in table.find('tr').find_all('th'):
        headers.append(header.text)
    # Then go through all the rows except the first one
    for row in table.find_all('tr')[1:]:
        values = []
        # And get the column cells, the first one being inside a th-tag
        for col in row.find_all(['th','td']):
            values.append(col.text)
        if values:
            episode_dict = {headers[i]: values[i] for i in
                            range(len(values))}
            episodes.append(episode_dict)
# Show the results
for episode in episodes:
    print(episode)
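To see why the nesting matters, here is a small sketch that replays both loop structures on plain lists, with no network access or HTML parsing. The sample header and row values and the function names build_nested and build_parallel are assumptions made for illustration. When the row loop runs inside the header loop, headers holds only one entry the first time a row is processed, so headers[1] raises the IndexError seen in the question.

```python
# Illustrative data standing in for the parsed table (not real page content).
HEADER_CELLS = ['No.overall', 'No. inseason', 'Title']
ROWS = [['1', '1', 'Winter Is Coming'], ['2', '2', 'The Kingsroad']]


def build_nested(header_cells, rows):
    """Row loop nested inside the header loop, as in the question."""
    episodes = []
    headers = []
    for header in header_cells:
        headers.append(header)
        # On the first pass headers has one entry, but each row has three.
        for values in rows:
            episodes.append({headers[i]: values[i] for i in range(len(values))})
    return episodes


def build_parallel(header_cells, rows):
    """Row loop placed after the header loop, as in the book."""
    episodes = []
    headers = []
    for header in header_cells:
        headers.append(header)
    # headers is complete before any row is turned into a dict.
    for values in rows:
        episodes.append({headers[i]: values[i] for i in range(len(values))})
    return episodes


try:
    build_nested(HEADER_CELLS, ROWS)
    nested_error = None
except IndexError as exc:
    nested_error = exc

print('nested version raised:', repr(nested_error))
print(build_parallel(HEADER_CELLS, ROWS))
```

The only difference between the two functions is the indentation level of the row loop, which mirrors the fix described above.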