Beautiful Soup, fetching table data from Wikipedia

Question

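I am following the book "Practical Web Scraping for Data Science: Best Practices and Examples with Python" by Seppe vanden Broucke and Bart Baesens.

The following code is supposed to fetch data from Wikipedia, specifically the list of Game of Thrones episodes: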

import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/w/index.php' + \
'?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
r = requests.get(url)
html_contents = r.text
html_soup = BeautifulSoup(html_contents, 'html.parser')
# We'll use a list to store our episode list
episodes = []
ep_tables = html_soup.find_all('table', class_='wikiepisodetable')
for table in ep_tables:
    headers = []
    rows = table.find_all('tr')
    for header in table.find('tr').find_all('th'):
        headers.append(header.text)
        for row in table.find_all('tr')[1:]:
            values = []
            for col in row.find_all(['th','td']):
                values.append(col.text)
                if values:
                    episode_dict = {headers[i]: values[i] for i in
                                    range(len(values))}
                    episodes.append(episode_dict)
                    for episode in episodes:
                        print(episode)

But when I run it, I get the following output and error:

{'No.overall': '1'}

IndexError                                Traceback (most recent call last)

<ipython-input-8-d2e64c7e0540> in <module>
     20                 if values:
     21                     episode_dict = {headers[i]: values[i] for i in
---> 22                                     range(len(values))}
     23                     episodes.append(episode_dict)
     24                     for episode in episodes:

<ipython-input-8-d2e64c7e0540> in <dictcomp>(.0)
     19                 values.append(col.text)
     20                 if values:
---> 21                     episode_dict = {headers[i]: values[i] for i in
     22                                     range(len(values))}
     23                     episodes.append(episode_dict)

IndexError: list index out of range

Can anyone tell me why this happens?

Answer 1

Your traceback is

{'No.overall': '1'}
Traceback (most recent call last):
  File "/Users/karl/code/deleteme/foo.py", line 20, in <module>
    episode_dict = {headers[i]: values[i] for i in
  File "/Users/karl/code/deleteme/foo.py", line 20, in <dictcomp>
    episode_dict = {headers[i]: values[i] for i in
IndexError: list index out of range

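The code is probably indented too far, and the variable names make it a little hard to read. It would help to know exactly what you want to extract. The list of episode titles? The table structure may have changed since the book was written.

If so, each relevant episode title cell has this shape: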

<td class="summary" style="text-align:left">"<a href="/wiki/Stormborn" title="Stormborn">Stormborn</a>"</td>

import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
r = requests.get(url)
html_contents = r.text
html_soup = BeautifulSoup(html_contents, 'html.parser')
# We'll use a list to store our episode list
episodes = []
ep_tables = html_soup.find_all('table', class_='wikiepisodetable')
for table in ep_tables:
    # Skip the header row, then print the title cell of every episode row.
    # The row loop is not nested inside a header loop, so each title is
    # printed once.
    for row in table.find_all('tr')[1:]:
        for col in row.find_all('td', class_='summary'):
            print(col.text)
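
As a small offline illustration of the same idea, here is a sketch that pulls the title cells out with a CSS selector. The HTML fragment is hand-written for this example (only the Stormborn cell is copied from above; the other row and the header names are hypothetical), so it assumes the real page uses the same `summary` class on its title cells.

```python
from bs4 import BeautifulSoup

# Hand-written fragment imitating the episode table; NOT fetched from Wikipedia.
sample_html = """
<table class="wikitable plainrowheaders wikiepisodetable">
  <tr><th>No.overall</th><th>Title</th></tr>
  <tr><th>1</th><td class="summary">"<a href="#">Example Episode</a>"</td></tr>
  <tr><th>2</th><td class="summary" style="text-align:left">"<a href="/wiki/Stormborn" title="Stormborn">Stormborn</a>"</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# select() takes a CSS selector: every td.summary inside an episode table.
title_cells = soup.select("table.wikiepisodetable td.summary")

# get_text(strip=True) joins the text nodes; strip('"') drops the surrounding quotes.
titles = [cell.get_text(strip=True).strip('"') for cell in title_cells]
print(titles)
```

Matching on a single class with a selector is a design choice here: it keeps working if the table gains or loses other classes, whereas an exact multi-class string would not.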

Answer 2

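The problem is not the code itself but its indentation. The third for loop should be at the same level as the second one, not nested inside it. This is how it appears in the book: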

import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/w/index.php' + \
'?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
r = requests.get(url)
html_contents = r.text
html_soup = BeautifulSoup(html_contents, 'html.parser')
# We'll use a list to store our episode list
episodes = []
ep_tables = html_soup.find_all('table', class_='wikitable plainrowheaders wikiepisodetable')
for table in ep_tables:
    headers = []
    rows = table.find_all('tr')
    # Start by fetching the header cells from the first row to determine
    # the field names
    for header in table.find('tr').find_all('th'):
        headers.append(header.text)
    # Then go through all the rows except the first one
    for row in table.find_all('tr')[1:]:
        values = []
        # And get the column cells, the first one being inside a th-tag
        for col in row.find_all(['th','td']):
            values.append(col.text)
        if values:
            episode_dict = {headers[i]: values[i] for i in
                            range(len(values))}
            episodes.append(episode_dict)
# Show the results
for episode in episodes:
    print(episode)
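
To see why the nesting matters, the failure can be reproduced without any network access. The sketch below uses plain lists as stand-ins for the cell texts (only 'No.overall' comes from the output in the question; the other names and the helper functions `build_nested` and `build_flat` are assumptions made for illustration). It also shows a `zip`-based pairing, which is a variation and not what the book does.

```python
# Hypothetical stand-ins for the text of the header cells and one data row.
header_cells = ["No.overall", "No. inseason", "Title"]
row_cells = ["1", "1", '"Winter Is Coming"']


def build_nested(header_cells, row_cells):
    """Mimic the question's indentation: everything runs inside the header loop."""
    headers = []
    for header in header_cells:
        headers.append(header)  # only one header is known on the first pass
        values = []
        for col in row_cells:
            values.append(col)
            # Runs after every appended cell, so values quickly outgrows headers.
            episode = {headers[i]: values[i] for i in range(len(values))}
            print(episode)


def build_flat(header_cells, row_cells):
    """Collect all headers first, then pair them with the row's cells."""
    headers = list(header_cells)
    values = list(row_cells)
    # zip() stops at the shorter list, so mismatched lengths cannot raise IndexError.
    return dict(zip(headers, values))


try:
    build_nested(header_cells, row_cells)  # prints {'No.overall': '1'} and then fails
except IndexError as exc:
    print("IndexError:", exc)

print(build_flat(header_cells, row_cells))
```

The nested version prints one single-key dictionary and then fails on the second cell, which matches the output in the question. Note that `zip` silently drops unmatched cells, so it hides malformed rows instead of reporting them.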

Tags: python, beautifulsoup, web-crawler, web-scraping