Use Pandas to Get Multiple Tables From Webpage

Question

I am using Pandas to parse data from the following page: http://kenpom.com/index.php?y=2014

To get the data, I am writing:

dfs = pd.read_html(url)

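The data looks good and the parsing works perfectly, except that it only pulls data from the first 40 rows. This seems to be a problem with the table being split into separate sections, which prevents pandas from getting all of the information.

How can I get pandas to pull all of the data from all of the tables on that web page?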

Answer 1

The HTML of the page you posted has multiple <thead> and <tbody> tags, which confuses pandas.read_html.

Following this approach, you can manually unwrap those tags:

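import io
import urllib.request

import pandas as pd
from bs4 import BeautifulSoup

url = "http://kenpom.com/index.php?y=2014"
html_table = urllib.request.urlopen(url).read()

# Fix the HTML: unwrap every <thead>/<tbody> so the rows sit directly under <table>
soup = BeautifulSoup(html_table, "html.parser")
# Warning: the id "ratings-table" is specific to this page
for table in soup.find_all(attrs={"id": "ratings-table"}):
    # Copy the children into a list first, because unwrap() modifies the tree
    for c in list(table.children):
        if c.name in ["tbody", "thead"]:
            c.unwrap()

# Wrap the markup in StringIO; newer pandas versions deprecate passing a raw HTML string
df = pd.read_html(io.StringIO(str(soup)), flavor="bs4")
len(df[0])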

which returns 369.
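If you want to check the unwrapping step without hitting the live site, here is a minimal offline sketch. The sample markup, the helper name unwrap_sections, and the team names are all made up for illustration; only the id "ratings-table" comes from the page discussed above. It uses find_all with recursive=False instead of walking .children, which is a slightly different way to reach the same direct <thead>/<tbody> children.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of a table split into repeated <thead>/<tbody> sections
SAMPLE_HTML = """
<table id="ratings-table">
  <thead><tr><th>Rank</th><th>Team</th></tr></thead>
  <tbody><tr><td>1</td><td>A</td></tr><tr><td>2</td><td>B</td></tr></tbody>
  <thead><tr><th>Rank</th><th>Team</th></tr></thead>
  <tbody><tr><td>3</td><td>C</td></tr></tbody>
</table>
"""


def unwrap_sections(html, table_id):
    """Return the HTML with direct <thead>/<tbody> wrappers removed from the given table."""
    soup = BeautifulSoup(html, "html.parser")
    for table in soup.find_all(attrs={"id": table_id}):
        # find_all returns a list, so unwrapping inside the loop is safe
        for section in table.find_all(["thead", "tbody"], recursive=False):
            section.unwrap()
    return str(soup)


fixed_html = unwrap_sections(SAMPLE_HTML, "ratings-table")

# Re-parse the result and count the rows that now sit directly under <table>
fixed_table = BeautifulSoup(fixed_html, "html.parser").find(id="ratings-table")
direct_rows = fixed_table.find_all("tr", recursive=False)
print(len(direct_rows))  # 2 header rows + 3 data rows = 5
```

The returned string can then be wrapped in io.StringIO and handed to pd.read_html. Note that the repeated header rows are still present in the markup after unwrapping, so you may need to filter them out of the resulting DataFrame yourself.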


python

html-parsing

web-scraping

pandas