How do you drop a header from a Pandas DataFrame formed by scraping a table using BeautifulSoup? (Python)
I scraped a table from pro-football-reference and created a DataFrame, but because the HTML has to be converted to a string, I seem to have run into a problem.
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
rb_r = requests.get('https://www.pro-football-reference.com/years/2021/rushing.htm')
rb_webpage = bs(rb_r.content, features='lxml')
rb_table = rb_webpage.find('table', attrs={'id': 'rushing'})
rb_df = pd.read_html(str(rb_table))[0]
print(rb_df.head().to_string())
Output:
Unnamed: 0_level_0 Unnamed: 1_level_0 Unnamed: 2_level_0 Unnamed: 3_level_0 Unnamed: 4_level_0 Games Rushing Unnamed: 14_level_0
Rk Player Tm Age Pos G GS Att Yds TD 1D Lng Y/A Y/G Fmb
0 1 Jonathan Taylor*+ IND 22 RB 17 17 332 1811 18 107 83 5.5 106.5 4
1 2 Najee Harris* PIT 23 RB 17 17 307 1200 7 62 37 3.9 70.6 0
2 3 Joe Mixon* CIN 25 RB 16 16 292 1205 13 60 32 4.1 75.3 2
3 4 Antonio Gibson WAS 23 RB 16 14 258 1037 7 65 27 4.0 64.8 6
4 5 Dalvin Cook* MIN 26 RB 13 13 249 1159 6 57 66 4.7 89.2 3
I'm trying to drop the "Unnamed: 0_level_0 ..." header row, but nothing I have tried has worked. Thanks in advance!
You're close: just pass the header parameter to pandas.read_html() to select the correct header row:
pd.read_html(str(rb_table), header=1)[0]
Example
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
rb_r = requests.get('https://www.pro-football-reference.com/years/2021/rushing.htm')
rb_webpage = bs(rb_r.content, features='lxml')
rb_table = rb_webpage.find('table', attrs={'id': 'rushing'})
rb_df = pd.read_html(str(rb_table), header=1)[0]
print(rb_df.head().to_string())
Output
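| | Rk | Player | Tm | Age | Pos | G | GS | Att | Yds | TD | 1D | Lng | Y/A | Y/G | Fmb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Jonathan Taylor*+ | IND | 22 | RB | 17 | 17 | 332 | 1811 | 18 | 107 | 83 | 5.5 | 106.5 | 4 |
| 1 | 2 | Najee Harris* | PIT | 23 | RB | 17 | 17 | 307 | 1200 | 7 | 62 | 37 | 3.9 | 70.6 | 0 |
| 2 | 3 | Joe Mixon* | CIN | 25 | RB | 16 | 16 | 292 | 1205 | 13 | 60 | 32 | 4.1 | 75.3 | 2 |
| 3 | 4 | Antonio Gibson | WAS | 23 | RB | 16 | 14 | 258 | 1037 | 7 | 65 | 27 | 4 | 64.8 | 6 |
| 4 | 5 | Dalvin Cook* | MIN | 26 | RB | 13 | 13 | 249 | 1159 | 6 | 57 | 66 | 4.7 | 89.2 | 3 |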
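For anyone who wants to try this without hitting the site, here is a minimal offline sketch. The inline HTML is a made-up stand-in for the scraped table (two header rows, a few of the real columns), and the variable names are only illustrative. It also shows an alternative that is not part of the answer above: parsing with the default settings and then dropping the top column level with droplevel(0). Recent pandas versions deprecate passing a literal HTML string to read_html, so the sketch wraps the string in StringIO.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the scraped table: two header rows, similar to
# the grouped "Games" / "Rushing" header on the real page.
HTML = """
<table id="rushing">
  <thead>
    <tr><th></th><th></th><th colspan="2">Games</th><th colspan="2">Rushing</th></tr>
    <tr><th>Rk</th><th>Player</th><th>G</th><th>GS</th><th>Att</th><th>Yds</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>Jonathan Taylor*+</td><td>17</td><td>17</td><td>332</td><td>1811</td></tr>
    <tr><td>2</td><td>Najee Harris*</td><td>17</td><td>17</td><td>307</td><td>1200</td></tr>
  </tbody>
</table>
"""

# Default parsing keeps both header rows as a two-level column index,
# which is where the "Unnamed: N_level_0" labels come from.
multi_df = pd.read_html(StringIO(HTML))[0]

# Option 1 (the answer above): use only the second header row as column names.
flat_df = pd.read_html(StringIO(HTML), header=1)[0]

# Option 2 (an alternative): parse normally, then drop the top column level.
dropped_df = multi_df.copy()
dropped_df.columns = dropped_df.columns.droplevel(0)

print(list(flat_df.columns))
print(list(dropped_df.columns))
```

Both options should end up with the same flat column names. Option 2 can be handy when the DataFrame has already been built and re-parsing the HTML is inconvenient.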