How do you drop a header from a Pandas DataFrame formed by scraping a table using BeautifulSoup? (Python)

Question

I scraped a table from pro-football-reference and created a DataFrame, but because the HTML had to be converted to a string, I seem to have run into an issue.

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
rb_r = requests.get('https://www.pro-football-reference.com/years/2021/rushing.htm')
rb_webpage = bs(rb_r.content, features='lxml')
rb_table = rb_webpage.find('table', attrs={'id': 'rushing'})
rb_df = pd.read_html(str(rb_table))[0]
print(rb_df.head().to_string())

Output:

  Unnamed: 0_level_0 Unnamed: 1_level_0 Unnamed: 2_level_0 Unnamed: 3_level_0 Unnamed: 4_level_0 Games     Rushing                                Unnamed: 14_level_0
                  Rk             Player                 Tm                Age                Pos     G  GS     Att   Yds  TD   1D Lng  Y/A    Y/G                 Fmb
0                  1  Jonathan Taylor*+                IND                 22                 RB    17  17     332  1811  18  107  83  5.5  106.5                   4
1                  2      Najee Harris*                PIT                 23                 RB    17  17     307  1200   7   62  37  3.9   70.6                   0
2                  3         Joe Mixon*                CIN                 25                 RB    16  16     292  1205  13   60  32  4.1   75.3                   2
3                  4     Antonio Gibson                WAS                 23                 RB    16  14     258  1037   7   65  27  4.0   64.8                   6
4                  5       Dalvin Cook*                MIN                 26                 RB    13  13     249  1159   6   57  66  4.7   89.2                   3

I'm trying to drop the "Unnamed: 0_level_0 ..." header row, but nothing I've tried has worked. Thanks in advance!

Answer 1

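You are close to your goal. Just add the header parameter to pandas.read_html() to select the correct header row. The index is zero-based, so header=1 uses the second header row (Rk, Player, Tm, ...) as the column names: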

pd.read_html(str(rb_table), header=1)[0]

Example

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
rb_r = requests.get('https://www.pro-football-reference.com/years/2021/rushing.htm')
rb_webpage = bs(rb_r.content, features='lxml')
rb_table = rb_webpage.find('table', attrs={'id': 'rushing'})
rb_df = pd.read_html(str(rb_table), header=1)[0]
print(rb_df.head().to_string())

Output

	Rk	Player	Tm	Age	Pos	G	GS	Att	Yds	TD	1D	Lng	Y/A	Y/G	Fmb
0	1	Jonathan Taylor*+	IND	22	RB	17	17	332	1811	18	107	83	5.5	106.5	4
1	2	Najee Harris*	PIT	23	RB	17	17	307	1200	7	62	37	3.9	70.6	0
2	3	Joe Mixon*	CIN	25	RB	16	16	292	1205	13	60	32	4.1	75.3	2
3	4	Antonio Gibson	WAS	23	RB	16	14	258	1037	7	65	27	4	64.8	6
4	5	Dalvin Cook*	MIN	26	RB	13	13	249	1159	6	57	66	4.7	89.2	3
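
If the DataFrame has already been built with the two-level header, the outer level can also be dropped afterwards instead of re-reading the HTML. This is only a minimal sketch: the small frame below is a hypothetical stand-in for the scraped table (built by hand so it runs without a network request), and flat_df is a name chosen for illustration.

```python
import pandas as pd

# Hypothetical sample mimicking the two-level header from the question.
columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "Rk"),
    ("Unnamed: 1_level_0", "Player"),
    ("Games", "G"),
    ("Rushing", "Att"),
    ("Rushing", "Yds"),
])
rb_df = pd.DataFrame(
    [
        [1, "Jonathan Taylor*+", 17, 332, 1811],
        [2, "Najee Harris*", 17, 307, 1200],
    ],
    columns=columns,
)

# Drop the outer (level 0) header, keeping only the inner column names.
flat_df = rb_df.droplevel(0, axis=1)
print(flat_df.columns.tolist())
```

Note that this approach assumes the inner column names are unique. If two groups shared an inner name, dropping the outer level would leave duplicate column labels.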
