How to scrape NHL skater stats using XPath?
I'm trying to scrape the 2017/2018 NHL skater stats. I've started on the code, but I'm running into problems parsing the data and writing it to Excel.
Here is my code so far:
# import modules
from urllib.request import urlopen
from lxml.html import fromstring
import pandas as pd

# connect to url
url = "https://www.hockey-reference.com/leagues/NHL_2018_skaters.html"

# remove HTML comment markup
content = str(urlopen(url).read())
comment = content.replace("-->","").replace("<!--","")
tree = fromstring(comment)

# setting up excel columns
columns = ("names", "gp", "g", "s", "team")
df = pd.DataFrame(columns=columns)

# attempt at parsing data while using loop
for nhl, skater_row in enumerate(tree.xpath('//table[contains(@class,"stats_table")]/tr')):
    names = pitcher_row.xpath('.//td[@data-stat="player"]/a')[0].text
    gp = skater_row.xpath('.//td[@data-stat="games_played"]/text()')[0]
    g = skater_row.xpath('.//td[@data-stat="goals"]/text()')[0]
    s = skater_row.xpath('.//td[@data-stat="shots"]/text()')[0]
    try:
        team = skater_row.xpath('.//td[@data-stat="team_id"]/a')[0].text
    # create pandas dataframe to export data to excel
    df.loc[nhl] = (names, team, gp, g, s)

# write data to excel
writer = pd.ExcelWriter('NHL skater.xlsx')
df.to_excel(writer, 'Sheet1')
writer.save()
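Could someone explain how to parse this data? Do you have any tips for writing the XPath so that I can loop through the data?
I'm having trouble writing this line:
for nhl, skater_row in enumerate(tree.xpath...
How did you find the XPath? Did you use XPath Finder or XPath Helper?
Also, I'm getting an error on the following line:
df.loc[nhl] = (names, team, gp, g, s)
It reports invalid syntax at df.
I'm new to web scraping and have no prior coding experience. Any help would be greatly appreciated. Thanks in advance for your time!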
IIUC, you can do it like this with BeautifulSoup and pandas read_html:
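from io import StringIO

import requests
import pandas as pd  # the code below refers to pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.hockey-reference.com/leagues/NHL_2018_skaters.html'
pg = requests.get(url)
bsf = BeautifulSoup(pg.content, 'html5lib')
# find_all is the current name of findAll
tables = bsf.find_all('table', attrs={'id': 'stats'})
# read_html returns a list of DataFrames; wrapping the markup in StringIO
# avoids the warning newer pandas versions emit for literal HTML strings
dfs = pd.read_html(StringIO(tables[0].prettify()))
df = dfs[0]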
The resulting DataFrame contains every column of the table; use pandas to select the columns you need.
# Keep only columns 1, 3 and 5; any other required columns can be selected the same way.
dff = df[df.columns[[1, 3, 5]]]
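Selecting by position is fragile if the site reorders its table, so selecting by label may be easier to read. The following is only a sketch on a hand-built stand-in DataFrame: the labels "Rk", "Player", "Tm", "GP", "G" and "S" are assumptions, not confirmed column names from the page, and if read_html returns a multi-level header it would need to be flattened first.

```python
import pandas as pd

# Hypothetical miniature of the scraped table; the column labels are assumed.
df = pd.DataFrame({
    "Rk": [1, 2],
    "Player": ["Justin Abdelkader", "Noel Acciari"],
    "Tm": ["DET", "BOS"],
    "GP": [75, 60],
    "G": [13, 10],
    "S": [110, 66],
})

# Select the wanted columns by label instead of by position.
wanted = ["Player", "Tm", "GP", "G", "S"]
dff = df[wanted]
print(dff)
```

If one of the labels is missing, pandas raises a KeyError, which makes a changed page layout show up immediately instead of silently returning the wrong columns.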
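If you would still rather stick with XPath and fetch only the data you need instead of filtering the full table, you can try the following (tree is the parsed document from the question's code):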
# skip the repeated header rows inside the table body
for row in tree.xpath('//table[@id="stats"]/tbody/tr[not(@class="thead")]'):
    name = row.xpath('.//td[@data-stat="player"]')[0].text_content()
    gp = row.xpath('.//td[@data-stat="games_played"]')[0].text_content()
    g = row.xpath('.//td[@data-stat="goals"]')[0].text_content()
    s = row.xpath('.//td[@data-stat="shots"]')[0].text_content()
    team = row.xpath('.//td[@data-stat="team_id"]')[0].text_content()
    print(name, gp, g, s, team)
Output:
Justin Abdelkader 75 13 110 DET
Pontus Aberg 53 4 70 TOT
Pontus Aberg 37 2 39 NSH
Pontus Aberg 16 2 31 EDM
Noel Acciari 60 10 66 BOS
Kenny Agostino 5 0 11 BOS
Sebastian Aho 78 29 200 CAR
...
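As for the "invalid syntax" error in the question: a try: block must be followed by except (or finally), so Python reports the error on the next statement, which is the df.loc line. The values assigned to a row also have to follow the same order as the DataFrame columns. Below is a rough sketch of how the XPath loop could feed a DataFrame. It runs on a small hand-written HTML snippet rather than the live page; SAMPLE_HTML, the cell helper and the "spacer" row are made up for illustration.

```python
from lxml.html import fromstring
import pandas as pd

# Hypothetical, hand-written stand-in for the downloaded page.
SAMPLE_HTML = """<html><body>
<table id="stats"><tbody>
  <tr>
    <td data-stat="player"><a href="#">Justin Abdelkader</a></td>
    <td data-stat="team_id"><a href="#">DET</a></td>
    <td data-stat="games_played">75</td>
    <td data-stat="goals">13</td>
    <td data-stat="shots">110</td>
  </tr>
  <tr class="thead"><th>Player</th></tr>
  <tr class="spacer"><td></td></tr>
  <tr>
    <td data-stat="player"><a href="#">Noel Acciari</a></td>
    <td data-stat="team_id"><a href="#">BOS</a></td>
    <td data-stat="games_played">60</td>
    <td data-stat="goals">10</td>
    <td data-stat="shots">66</td>
  </tr>
</tbody></table>
</body></html>"""


def cell(row, stat):
    """Return the stripped text of the <td> whose data-stat equals stat."""
    return row.xpath('.//td[@data-stat="%s"]' % stat)[0].text_content().strip()


tree = fromstring(SAMPLE_HTML)
records = []
for row in tree.xpath('//table[@id="stats"]/tbody/tr[not(@class="thead")]'):
    try:
        records.append({
            "names": cell(row, "player"),
            "team": cell(row, "team_id"),
            "gp": int(cell(row, "games_played")),
            "g": int(cell(row, "goals")),
            "s": int(cell(row, "shots")),
        })
    except (IndexError, ValueError):
        # Row lacks one of the cells or holds a non-numeric value: skip it.
        continue

# Building the frame once from a list is simpler than growing it with df.loc.
df = pd.DataFrame(records, columns=["names", "team", "gp", "g", "s"])
print(df)
```

From there, df.to_excel('NHL skater.xlsx', index=False) should write the file, provided an Excel engine such as openpyxl is installed.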