Pandas dataframe vectorizing/filtering: ValueError: Can only compare identically-labeled Series objects
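I have two dataframes of NHL hockey statistics. One contains every game each team has played over the past ten years, and the other is where I want to store calculated values. In short, I want to take a metric from a team's first five games, sum it, and put the result into the other df. I've trimmed my dfs below to exclude the other stats and will only look at one stat.
df_all contains all the games: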
>>> df_all
season gameId playerTeam opposingTeam gameDate xGoalsFor xGoalsAgainst
1 2008 2008020001 NYR T.B 20081004 2.287 2.689
6 2008 2008020003 NYR T.B 20081005 1.793 0.916
11 2008 2008020010 NYR CHI 20081010 1.938 2.762
16 2008 2008020019 NYR PHI 20081011 3.030 3.020
21 2008 2008020034 NYR N.J 20081013 1.562 3.454
... ... ... ... ... ... ... ...
142576 2015 2015030185 L.A S.J 20160422 2.927 2.042
142581 2017 2017030171 L.A VGK 20180411 1.275 2.279
142586 2017 2017030172 L.A VGK 20180413 1.907 4.642
142591 2017 2017030173 L.A VGK 20180415 2.452 3.159
142596 2017 2017030174 L.A VGK 20180417 2.427 1.818
df_sum_all will hold the calculated stats; right now it has a bunch of empty (zero-filled) columns:
>>> df_sum_all
season team xg5 xg10 xg15 xg20
0 2008 NYR 0 0 0 0
1 2009 NYR 0 0 0 0
2 2010 NYR 0 0 0 0
3 2011 NYR 0 0 0 0
4 2012 NYR 0 0 0 0
.. ... ... ... ... ... ...
327 2014 L.A 0 0 0 0
328 2015 L.A 0 0 0 0
329 2016 L.A 0 0 0 0
330 2017 L.A 0 0 0 0
331 2018 L.A 0 0 0 0
Here is my function for calculating the ratio of xGoalsFor to xGoalsAgainst.
def calcRatio(statfor, statagainst, games, season, team, statsdf):
    tempFor = float(statsdf[(statsdf.playerTeam == team) & (statsdf.season == season)].nsmallest(games, 'gameDate').eval(statfor).sum())
    tempAgainst = float(statsdf[(statsdf.playerTeam == team) & (statsdf.season == season)].nsmallest(games, 'gameDate').eval(statagainst).sum())
    tempRatio = tempFor / tempAgainst
    return tempRatio
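I believe the logic is sound. I pass in the stats I want a ratio of, the number of games to sum, the season and team to match, and the dataframe to take the stats from. I've tested these pieces separately and know that the filtering and summing work fine. Here is a standalone example of the tempFor calculation: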
>>> statsdf = df_all
>>> team = 'TOR'
>>> season = 2015
>>> games = 3
>>> tempFor = float(statsdf[(statsdf.playerTeam == team) & (statsdf.season == season)].nsmallest(games, 'gameDate').eval(statfor).sum())
>>> print(tempFor)
8.618
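See? It returns a value. But I can't do the same thing across the whole dataframe. What am I missing? I thought it would essentially work row by row, setting the 'xg5' column to the output of calcRatio, which would use that row's 'season' and 'team' to filter df_all.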
>>> df_sum_all['xg5'] = calcRatio('xGoalsFor','xGoalsAgainst',5,df_sum_all['season'], df_sum_all['team'], df_all)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 2, in calcRatio
File "/home/sebastian/.local/lib/python3.6/site-packages/pandas/core/ops/__init__.py", line 1142, in wrapper
raise ValueError("Can only compare identically-labeled " "Series objects")
ValueError: Can only compare identically-labeled Series objects
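Cheers, and thanks for any help!
Update: I used iterrows() and it worked fine, so I must not understand vectorization very well. It's the same function, though. Why does it work one way and not the other?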
>>> emptyseries = []
>>> for index, row in df_sum_all.iterrows():
...     emptyseries.append(calcRatio('xGoalsFor','xGoalsAgainst',5,row['season'],row['team'], df_all))
...
>>> df_sum_all['xg5'] = emptyseries
__main__:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
>>> df_sum_all
season team xg5 xg10 xg15 xg20
0 2008 NYR 0.826260 0 0 0
1 2009 NYR 1.288390 0 0 0
2 2010 NYR 0.915942 0 0 0
3 2011 NYR 0.730498 0 0 0
4 2012 NYR 0.980744 0 0 0
.. ... ... ... ... ... ...
327 2014 L.A 0.823998 0 0 0
328 2015 L.A 1.147412 0 0 0
329 2016 L.A 1.054947 0 0 0
330 2017 L.A 1.369005 0 0 0
331 2018 L.A 0.721411 0 0 0
[332 rows x 6 columns]
"ValueError: Can only compare identically-labeled Series objects"
tempFor = float(statsdf[(statsdf.playerTeam == team) & (statsdf.season == season)].nsmallest(games, 'gameDate').eval(statfor).sum())
tempAgainst = float(statsdf[(statsdf.playerTeam == team) & (statsdf.season == season)].nsmallest(games, 'gameDate').eval(statagainst).sum())
The inputs for the variables are:
team: df_sum_all['team']
season: df_sum_all['season']
statsdf: df_all
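So in the code, (statsdf.playerTeam == team) compares a Series from df_all with a Series from df_sum_all.
If the two Series are not labeled identically (that is, their indexes differ), you get the error above. With iterrows(), row['team'] and row['season'] are scalars, so each comparison is a Series against a single value and works fine.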
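To make that concrete, here is a minimal sketch with tiny made-up stand-ins for df_all and df_sum_all (the numbers are illustrative, not real NHL data). The helper calc_ratio is a lightly restructured version of calcRatio, and the xg2 column name is hypothetical. It first reproduces the error by passing whole columns, then calls the same function once per row with DataFrame.apply(axis=1), which passes scalars just like the iterrows() loop does.

```python
import pandas as pd

# Made-up miniature versions of the two frames (illustrative values only).
df_all = pd.DataFrame(
    {
        "season": [2008, 2008, 2008, 2009, 2009],
        "playerTeam": ["NYR", "NYR", "NYR", "NYR", "NYR"],
        "gameDate": [20081004, 20081005, 20081010, 20091002, 20091003],
        "xGoalsFor": [2.0, 1.0, 3.0, 4.0, 2.0],
        "xGoalsAgainst": [1.0, 2.0, 3.0, 2.0, 2.0],
    },
    index=[1, 6, 11, 16, 21],
)
df_sum_all = pd.DataFrame({"season": [2008, 2009], "team": ["NYR", "NYR"]})


def calc_ratio(statfor, statagainst, games, season, team, statsdf):
    """Ratio of summed statfor to summed statagainst over the first `games` games."""
    mask = (statsdf.playerTeam == team) & (statsdf.season == season)
    first_games = statsdf[mask].nsmallest(games, "gameDate")
    return float(first_games[statfor].sum()) / float(first_games[statagainst].sum())


# Passing whole columns: statsdf.playerTeam (index 1, 6, 11, ...) is compared
# with df_sum_all["team"] (index 0, 1), so pandas refuses the comparison.
try:
    calc_ratio("xGoalsFor", "xGoalsAgainst", 2,
               df_sum_all["season"], df_sum_all["team"], df_all)
    raised = False
except ValueError as exc:
    raised = True
    print(exc)

# Row by row: each call receives scalar season/team values, so the comparison
# is Series-versus-scalar and broadcasts as intended.
df_sum_all["xg2"] = df_sum_all.apply(
    lambda row: calc_ratio("xGoalsFor", "xGoalsAgainst", 2,
                           row["season"], row["team"], df_all),
    axis=1,
)
print(df_sum_all)
```

Note that apply(axis=1) is still a Python-level loop over rows, so it is tidier than iterrows() but not truly vectorized; it simply avoids handing two differently indexed Series to ==.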