Trying to scrape a webpage with multiple data tables, but only the first table is being extracted?
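I'm trying to pull basketball player data from Basketball-Reference for a project I'm working on. On B-R, each player page has multiple data tables, and I want to scrape all of them. However, when I try to get the tables from the page, I only get the first instance of a table tag, i.e. only the first table.
I searched the HTML and noticed that, apart from the first instance, all of the table tags sit inside comment blocks. When I parse their parent tag and search it for the child tag containing the table, it returns nothing. Here is a link to an example page, and this is my code: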
import requests
from bs4 import BeautifulSoup

url = 'https://www.basketball-reference.com/players/j/jamesle01.html'
get = requests.get(url)
soup = BeautifulSoup(get.text, 'html.parser')
per_36 = soup.find(id='all_per_minute')
table = per_36.find('table')  # comes back empty
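This returns nothing; however, if I look for the first table, it does return content. I don't understand what's going on, but I think it might have something to do with those comment blocks?
To scrape content inside HTML comments with BeautifulSoup, you can use this script: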
import requests
from bs4 import BeautifulSoup, Comment

url = 'https://www.basketball-reference.com/players/j/jamesle01.html'
get = requests.get(url)
soup = BeautifulSoup(get.text, 'html.parser')

# Find the placeholder element, then the first HTML comment that comes after it
pl = soup.select_one('#all_per_minute .placeholder')
comments = pl.find_next(string=lambda text: isinstance(text, Comment))

# Parse the comment text as HTML so the tags inside it become searchable
soup = BeautifulSoup(comments, 'html.parser')

rows = []
for tr in soup.select('tr'):
    rows.append([td.get_text(strip=True) for td in tr.select('td, th')])

for row in rows:
    print(''.join('{: ^7}'.format(td) for td in row))
Prints:
Season Age Tm Lg Pos G GS MP FG FGA FG% 3P 3PA 3P% 2P 2PA 2P% FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PTS
2003-04 19 CLE NBA SG 79 79 3122 7.2 17.2 .417 0.7 2.5 .290 6.4 14.7 .438 4.0 5.3 .754 1.1 3.8 5.0 5.4 1.5 0.7 3.1 1.7 19.1
2004-05 20 CLE NBA SF 80 80 3388 8.4 17.9 .472 1.1 3.3 .351 7.3 14.6 .499 5.1 6.8 .750 1.2 5.1 6.2 6.1 1.9 0.6 2.8 1.6 23.1
2005-06 21 CLE NBA SF 79 79 3361 9.4 19.5 .480 1.4 4.1 .335 8.0 15.5 .518 6.4 8.7 .738 0.8 5.2 6.0 5.6 1.3 0.7 2.8 1.9 26.5
2006-07 22 CLE NBA SF 78 78 3190 8.7 18.3 .476 1.1 3.5 .319 7.6 14.8 .513 5.5 7.9 .698 0.9 5.0 5.9 5.3 1.4 0.6 2.8 1.9 24.1
2007-08 23 CLE NBA SF 75 74 3027 9.4 19.5 .484 1.3 4.3 .315 8.1 15.3 .531 6.5 9.2 .712 1.6 5.5 7.0 6.4 1.6 1.0 3.0 2.0 26.8
2008-09 24 CLE NBA SF 81 81 3054 9.3 19.0 .489 1.6 4.5 .344 7.7 14.5 .535 7.0 9.0 .780 1.2 6.0 7.2 6.9 1.6 1.1 2.8 1.6 27.2
2009-10 25 CLE NBA SF 76 76 2966 9.3 18.5 .503 1.6 4.7 .333 7.8 13.8 .560 7.2 9.4 .767 0.9 5.9 6.7 7.9 1.5 0.9 3.2 1.4 27.4
2010-11 26 MIA NBA SF 79 79 3063 8.9 17.5 .510 1.1 3.3 .330 7.8 14.2 .552 5.9 7.8 .759 0.9 6.0 6.9 6.5 1.5 0.6 3.3 1.9 24.8
2011-12 27 MIA NBA SF 62 62 2326 9.6 18.1 .531 0.8 2.3 .362 8.8 15.8 .556 6.0 7.8 .771 1.5 6.2 7.6 6.0 1.8 0.8 3.3 1.5 26.0
...and so on.
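Since the question asks for every table on the page, the same idea can be extended to all comments at once. The following is only a sketch under assumptions: `SAMPLE_HTML` is a small made-up stand-in for the real page (so it runs without a network request, and its values are placeholders), and `extract_all_tables` is a hypothetical helper name, not something provided by BeautifulSoup. It also shows why the original `find('table')` comes back empty: the parser keeps a comment as a single string node, so the tags inside it are not part of the tree until the comment text is parsed again.

```python
from bs4 import BeautifulSoup, Comment

# Made-up stand-in for the page: one live table, two tables inside HTML comments.
SAMPLE_HTML = """
<div id="all_per_game">
<table id="per_game"><tr><th>Season</th><th>PTS</th></tr><tr><td>s1</td><td>1</td></tr></table>
</div>
<div id="all_per_minute"><div class="placeholder"></div>
<!--
<table id="per_minute"><tr><th>Season</th><th>PTS</th></tr><tr><td>s1</td><td>2</td></tr></table>
-->
</div>
<div id="all_advanced"><div class="placeholder"></div>
<!-- <table id="advanced"><tr><th>Season</th><th>PER</th></tr><tr><td>s1</td><td>3</td></tr></table> -->
</div>
<!-- an ordinary comment with no table in it -->
"""


def extract_all_tables(html):
    """Return {table id: list of rows} for live tables and commented-out tables."""
    soup = BeautifulSoup(html, "html.parser")
    tables = {}

    def collect(fragment):
        # Store every table found in this parsed fragment, keyed by its id.
        # Assumes each table has a unique id; tables without one would collide.
        for table in fragment.find_all("table"):
            tables[table.get("id")] = [
                [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
                for tr in table.find_all("tr")
            ]

    collect(soup)  # tables that are part of the normal tree
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        if "<table" in comment:  # skip comments that hold no table markup
            collect(BeautifulSoup(comment, "html.parser"))
    return tables


# The commented-out table is invisible to a normal search of its parent div.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(soup.find(id="all_per_minute").find("table"))

tables = extract_all_tables(SAMPLE_HTML)
for table_id, rows in tables.items():
    print(table_id, rows)
```

To try this on the real page, pass `requests.get(url).text` instead of `SAMPLE_HTML`; whether every table there carries an `id` attribute is an assumption you would need to check.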