Auto-search a list of players and scrape a table
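I want to automate the search process on a website and scrape each individual player's table (I get the players' names from an Excel sheet). I want to add the scraped information to the existing Excel sheet alongside the list of players. The player's name needs to be in the first column for every year that player was in the league. So far I can read the information from the existing Excel sheet, but I'm not sure how to use it to automate the search. I'm not sure whether Selenium can help. The website is https://basketball.realgm.com/.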
import openpyxl

path = r"C:\Users\Name\Desktop\NBAPlayers.xlsx"
workbook = openpyxl.load_workbook(path)
sheet = workbook.active
rows = sheet.max_row
cols = sheet.max_column
print(rows)
print(cols)
# Start at row 2 and column 2, and print each row's cells on one line
for r in range(2, rows + 1):
    for c in range(2, cols + 1):
        print(sheet.cell(row=r, column=c).value, end=" ")
    print()
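Before any search can be automated, the names need to end up in a plain Python list. The sketch below is one way to do that, not the only one. The helper `player_names` and its `name_col` parameter are hypothetical names introduced here, and it assumes each row arrives as a tuple of cell values with the player's name in a known column.

```python
def player_names(rows, name_col=0):
    """Collect non-empty player names from one column of each row.

    `rows` is any iterable of tuples of cell values; `name_col` is the
    zero-based position of the name column (an assumption, adjust it to
    match the real sheet layout).
    """
    names = []
    for row in rows:
        # Skip rows that are empty or too short to contain the name column
        if not row or len(row) <= name_col:
            continue
        value = row[name_col]
        if value is None:
            continue
        name = str(value).strip()
        if name:
            names.append(name)
    return names


# Stand-in data; blank and empty cells are dropped
sample_rows = [("Lester Hudson",), (None,), ("  Marcus Denmon ",), ("",)]
print(player_names(sample_rows))
```

With openpyxl, the rows could come from `sheet.iter_rows(min_row=2, values_only=True)`, which yields one tuple of values per row.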
You need to build a URL for each player and scrape the page with Beautiful Soup.
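from urllib.request import urlopen
from bs4 import BeautifulSoup

# urllib2 exists only in Python 2; urllib.request is the Python 3 equivalent
soup = BeautifulSoup(urlopen('http://example.com').read(), 'html.parser')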
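Building one URL per player could look like the sketch below. It assumes the site's search accepts the name as a `q` query parameter (the `/search?q=` pattern used elsewhere in this thread); `build_search_url` and `SEARCH_URL` are hypothetical names. Using `quote_plus` from the standard library keeps names with more than two parts, or with apostrophes, intact.

```python
from urllib.parse import quote_plus

# Assumed search endpoint, taken from the URL pattern used in this thread
SEARCH_URL = "https://basketball.realgm.com/search?q={}"


def build_search_url(name):
    """Return the search URL for a player name.

    Repeated whitespace is collapsed first; quote_plus then turns spaces
    into "+" and percent-encodes other reserved characters.
    """
    cleaned = " ".join(name.split())
    return SEARCH_URL.format(quote_plus(cleaned))


print(build_search_url("Lester Hudson"))
# https://basketball.realgm.com/search?q=Lester+Hudson
```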
I'm assuming you have already read the names from the Excel sheet, so I used a list of names. I used the Python requests module to fetch the page text, Beautiful Soup to extract the table, and pandas to load it into a DataFrame.
Code:
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

playernames = ['Dominique Jones', 'Joe Young', 'Darius Adams', 'Lester Hudson', 'Marcus Denmon', 'Courtney Fortson']

for name in playernames:
    # Join the name parts with "+" to form the search query
    url = "https://basketball.realgm.com/search?q={}".format("+".join(name.split()))
    print(url)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    table = soup.select_one(".tablesaw")
    if table is None:
        # No matching table on the page; skip this player
        continue
    # Wrap the HTML in StringIO; newer pandas versions deprecate passing a literal HTML string
    dfs = pd.read_html(StringIO(str(table)))
    for df in dfs:
        print(df)
Output:
https://basketball.realgm.com/search?q=Dominique+Jones
Player Pos HT ... Draft Year College NBA
0 Dominique Jones G 6-4 ... 2010 South Florida Dallas Mavericks
1 Dominique Jones G 6-2 ... 2009 Liberty -
2 Dominique Jones PG 5-9 ... 2011 Fort Hays State -
[3 rows x 8 columns]
https://basketball.realgm.com/search?q=Joe+Young
Player Pos HT ... Draft Year College NBA
0 Joe Young F 6-6 ... 2007 Holy Cross -
1 Joe Young G 6-0 ... 2009 Canisius -
2 Joe Young G 6-2 ... 2015 Oregon Indiana Pacers
3 Joe Young G 6-2 ... 2009 Central Missouri -
[4 rows x 8 columns]
https://basketball.realgm.com/search?q=Darius+Adams
Player Pos HT ... Draft Year College NBA
0 Darius Adams PG 6-1 ... 2011 Indianapolis -
1 Darius Adams G 6-0 ... 2018 Coast Guard Academy -
[2 rows x 8 columns]
https://basketball.realgm.com/search?q=Lester+Hudson
Season Team GP GS MIN ... STL BLK PF TOV PTS
0 2009-10 * All Teams 25 0 5.3 ... 0.32 0.12 0.48 0.56 2.32
1 2009-10 * BOS 16 0 4.4 ... 0.19 0.12 0.44 0.56 1.38
2 2009-10 * MEM 9 0 6.8 ... 0.56 0.11 0.56 0.56 4.00
3 2010-11 WAS 11 0 6.7 ... 0.36 0.09 0.91 0.64 1.64
4 2011-12 * All Teams 16 0 20.9 ... 0.88 0.19 1.62 2.00 10.88
5 2011-12 * CLE 13 0 24.2 ... 1.08 0.23 2.00 2.31 12.69
6 2011-12 * MEM 3 0 6.5 ... 0.00 0.00 0.00 0.67 3.00
7 2014-15 LAC 5 0 11.1 ... 1.20 0.20 0.80 0.60 3.60
8 CAREER NaN 57 0 10.4 ... 0.56 0.14 0.91 0.98 4.70
[9 rows x 23 columns]
https://basketball.realgm.com/search?q=Marcus+Denmon
Season Team Location GP GS ... STL BLK PF TOV PTS
0 2012-13 SAN Las Vegas 5 0 ... 0.4 0.0 1.60 0.20 5.40
1 2013-14 SAN Las Vegas 5 1 ... 0.8 0.0 2.20 1.20 10.80
2 2014-15 SAN Las Vegas 6 2 ... 0.5 0.0 1.50 0.17 5.00
3 2015-16 SAN Salt Lake City 2 0 ... 0.0 0.0 0.00 0.00 0.00
4 CAREER NaN NaN 18 3 ... 0.5 0.0 1.56 0.44 6.17
[5 rows x 24 columns]
https://basketball.realgm.com/search?q=Courtney+Fortson
Season Team GP GS MIN FGM ... AST STL BLK PF TOV PTS
0 2011-12 * All Teams 10 0 9.5 1.10 ... 1.00 0.3 0.0 0.50 1.00 3.50
1 2011-12 * HOU 6 0 8.2 1.00 ... 0.83 0.5 0.0 0.33 0.83 3.00
2 2011-12 * LAC 4 0 11.5 1.25 ... 1.25 0.0 0.0 0.75 1.25 4.25
3 CAREER NaN 10 0 9.5 1.10 ... 1.00 0.3 0.0 0.50 1.00 3.50
[4 rows x 23 columns]
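The output shows two kinds of result: an ambiguous name returns a list of candidates (columns starting with `Player`), while a unique name returns the stats table (columns starting with `Season`). To get one row per season with the player's name in the first column, something like the sketch below could be used. `season_rows` and `skip_career` are hypothetical names, and dropping the `CAREER` summary row is an assumption about what "each year" should mean.

```python
def season_rows(player, header, rows, skip_career=True):
    """Prefix each season row with the player's name.

    Returns an empty list when the table has no "Season" column, which
    in the output above means the search returned several candidates
    instead of a stats table.
    """
    if "Season" not in header:
        return []
    season_idx = header.index("Season")
    result = []
    for row in rows:
        # Assumption: the CAREER summary line is not wanted per season
        if skip_career and str(row[season_idx]).strip().upper() == "CAREER":
            continue
        result.append([player, *row])
    return result


# Values taken from the Lester Hudson output above (columns truncated)
header = ["Season", "Team", "GP"]
rows = [["2010-11", "WAS", 11], ["2014-15", "LAC", 5], ["CAREER", None, 57]]
for line in season_rows("Lester Hudson", header, rows):
    print(line)
```

With pandas, `header` and `rows` could come from `list(df.columns)` and `df.values.tolist()`. Each returned list can then be written with openpyxl's `sheet.append(row)`, followed by `workbook.save(path)`.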