How to populate a df with rolling head-to-head records between teams?

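I have a df containing data about matches between teams, and I want to create a new column holding the h2h record between the two teams before each match.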

For example:

df = pd.DataFrame(data = [['LAC','LAL', 1, '15/02/2022'], ['LAC','LAL', 1, '16/02/2022'], ['LAL','LAC', 1, '17/02/2022'],
                           ['LAL','LAC', 1, '18/02/2022'], ['LAL','LAC', 1, '19/02/2022'], ['LAC','LAL', 1, '20/02/2022'],
                           ['LAL','LAC', 1, '21/02/2022'], ['LAC','LAL', 1, '22/02/2022']],
                   columns = ['winner', 'loser', 'won', 'date'])

In this example, the h2h record before each match (the winner's earlier wins first) should be: 0-0, 1-0, 0-2, 1-2, 2-2, 2-3, 3-3, 3-4

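I want to calculate the h2h win percentage, but I think getting the number of wins one team has against the other is the first step. I can compute the final h2h with groupby, but I'm not sure how to compute it for every match, because a team can appear in either of the two columns. Note that this df follows a winner/loser format, so 'won' is always 1. Alternatively, I could change the df to a long version (one match = two rows), but I'm not sure whether that would help. I also have other columns, but I don't think they are relevant to this question (more stats, IDs, etc.).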

Based on @拟人's reply, I can do the following:

df['winner_wins'] = df.groupby(['winner', 'loser'])['won'].cumsum()
df['winner_wins'] = df.groupby(['winner', 'loser'])['winner_wins'].shift(1)

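This correctly gives the number of wins the 'winner' team had over this opponent before the match. But I don't know how to get the same thing for the 'loser' team.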

If I understand your question correctly, the cumsum and expanding methods may be useful to you.

Code:

import pandas as pd

# Create a sample dataframe
df = pd.DataFrame(data = [['LAC','LAL', 1, '15/02/2022'], ['LAC','LAL', 1, '16/02/2022'], ['LAL','LAC', 1, '17/02/2022'], ['LAL','LAC', 1, '18/02/2022'], ['LAL','LAC', 1, '19/02/2022'], ['LAC','LAL', 1, '20/02/2022'], ['LAL','LAC', 1, '21/02/2022'], ['LAC','LAL', 1, '22/02/2022']], columns = ['winner', 'loser', 'won', 'date'])

# Calculate h2h records
df = df.sort_values('date').assign(
    LAC_h2h_wins=(df.winner=='LAC').cumsum(),
    LAL_h2h_wins=(df.winner=='LAL').cumsum(),
    LAC_h2h_wins_pct=(df.winner=='LAC').expanding().agg(lambda s: 100 * s.sum() / len(s)),
    LAL_h2h_wins_pct=(df.winner=='LAL').expanding().agg(lambda s: 100 * s.sum() / len(s)),
)

print(df)

Output:

  winner loser  won        date  LAC_h2h_wins  LAL_h2h_wins  LAC_h2h_wins_pct  LAL_h2h_wins_pct
0    LAC   LAL    1  15/02/2022             1             0               100                 0
1    LAC   LAL    1  16/02/2022             2             0               100                 0
2    LAL   LAC    1  17/02/2022             2             1           66.6667           33.3333
3    LAL   LAC    1  18/02/2022             2             2                50                50
4    LAL   LAC    1  19/02/2022             2             3                40                60
5    LAC   LAL    1  20/02/2022             3             3                50                50
6    LAL   LAC    1  21/02/2022             3             4           42.8571           57.1429
7    LAC   LAL    1  22/02/2022             4             4                50                50
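
One caveat about the `sort_values('date')` call above: the dates are `dd/mm/yyyy` strings, so they sort lexicographically. That matches chronological order here only because every game falls in the same month. A minimal sketch of parsing the dates first, assuming the column is always day/month/year (the helper name `sort_by_date` is hypothetical, not part of the answer):

```python
import pandas as pd

def sort_by_date(df, col="date", fmt="%d/%m/%Y"):
    """Return a copy of df sorted chronologically by a day/month/year string column."""
    out = df.copy()
    out[col] = pd.to_datetime(out[col], format=fmt)
    # mergesort is stable, so games played on the same day keep their original order.
    return out.sort_values(col, kind="mergesort", ignore_index=True)

games = pd.DataFrame(
    {
        "winner": ["LAL", "LAC"],
        "loser": ["LAC", "LAL"],
        "date": ["01/03/2022", "15/02/2022"],
    }
)
# Sorting the raw strings would put 01/03/2022 (1 March) before 15/02/2022 (15 February).
print(sort_by_date(games))
```

Parsing once up front also means any later `sort_values` on the column orders games correctly across month and year boundaries.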

[EDIT]

In response to the OP's comment.

Code:

import pandas as pd

# Create a sample dataframe with more data points
df = pd.DataFrame(data = [['LAC','LAL', 1, '15/02/2022'], ['LAC','LAL', 1, '16/02/2022'], ['LAL','LAC', 1, '17/02/2022'], ['LAL','LAC', 1, '18/02/2022'], ['LAL','LAC', 1, '19/02/2022'], ['LAC','LAL', 1, '20/02/2022'], ['LAL','LAC', 1, '21/02/2022'], ['LAC','LAL', 1, '22/02/2022'], ['ABC','LAL', 1, '15/02/2022'], ['ABC','LAL', 1, '16/02/2022'], ['LAL','ABC', 1, '17/02/2022'], ['LAL','ABC', 1, '18/02/2022'], ['LAL','ABC', 1, '19/02/2022'], ['ABC','LAL', 1, '20/02/2022'], ['LAL','ABC', 1, '21/02/2022'], ['ABC','LAL', 1, '22/02/2022'], ['ABC','XYZ', 1, '15/02/2022'], ['ABC','XYZ', 1, '16/02/2022'], ['XYZ','ABC', 1, '17/02/2022'], ['XYZ','ABC', 1, '18/02/2022'], ['XYZ','ABC', 1, '19/02/2022'], ['ABC','XYZ', 1, '20/02/2022'], ['XYZ','ABC', 1, '21/02/2022'], ['ABC','XYZ', 1, '22/02/2022'], ['LAC','XYZ', 1, '15/02/2022'], ['LAC','XYZ', 1, '16/02/2022'], ['XYZ','LAC', 1, '17/02/2022'], ['XYZ','LAC', 1, '18/02/2022'], ['XYZ','LAC', 1, '19/02/2022'], ['LAC','XYZ', 1, '20/02/2022'], ['XYZ','LAC', 1, '21/02/2022'], ['LAC','XYZ', 1, '22/02/2022']], columns = ['winner', 'loser', 'won', 'date'])

# In order to group by games, make sorted game titles like "LAC-LAL"
df['game'] = df.apply(lambda r: '-'.join(sorted([r.winner, r.loser])), axis=1)

# Sort by game and then by date (dates must be in ascending order within each game)
df = df.sort_values(['game', 'date'], ignore_index=True)

# True if the winner is the left team in the game title, otherwise False. For example, "LAC" is the left team in the game title "LAC-LAL"
df['left_win'] = df.apply(lambda r: f'{r.winner}-{r.loser}'==r.game, axis=1)

# The right team won whenever the left team did not.
df['right_win'] = ~df.left_win

# Calculate the cumulative sums within each game.
df[['left_win_cumsum', 'right_win_cumsum']] = df.groupby('game')[['left_win', 'right_win']].cumsum()

# Shift and fill the first games as 0
df[['h2h_winner', 'h2h_loser']] = df.groupby('game')[['left_win_cumsum', 'right_win_cumsum']].shift().fillna(0).astype(int)

# Check the order in a pair of winner and loser columns. If the order is different from the game title, reverse the cumsum values
f = lambda r: [r.h2h_winner, r.h2h_loser] if f'{r.winner}-{r.loser}'==r.game else [r.h2h_loser, r.h2h_winner]
df[['h2h_winner', 'h2h_loser']] = df.apply(f, axis=1).apply(pd.Series)

# Drop all the temporary columns
df = df.drop(['game', 'left_win', 'right_win', 'left_win_cumsum', 'right_win_cumsum'], axis=1)

print(df.to_markdown(stralign='center', numalign='center'))

Output (showing only the LAC - LAL games):

   winner loser  won        date  h2h_winner  h2h_loser
16    LAC   LAL    1  15/02/2022           0          0
17    LAC   LAL    1  16/02/2022           1          0
18    LAL   LAC    1  17/02/2022           0          2
19    LAL   LAC    1  18/02/2022           1          2
20    LAL   LAC    1  19/02/2022           2          2
21    LAC   LAL    1  20/02/2022           2          3
22    LAL   LAC    1  21/02/2022           3          3
23    LAC   LAL    1  22/02/2022           3          4
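
The row-wise `apply` calls above can get slow on a large df. As a possible alternative, here is a minimal vectorized sketch. It relies on the assumption that there are no draws (which the winner/loser format implies), so the loser's earlier wins are simply the earlier games between the pair minus the winner's earlier wins. The function name `add_h2h` is hypothetical, and it assumes the rows are already in chronological order.

```python
import numpy as np
import pandas as pd

def add_h2h(df):
    """Add pre-game head-to-head win counts for the winner and loser of each row.

    Assumes rows are already in chronological order and that there are no draws.
    """
    out = df.copy()
    # Earlier wins of this winner over this loser = earlier rows with the same ordered pair.
    out["h2h_winner"] = out.groupby(["winner", "loser"]).cumcount()
    # Order-independent pair key, so LAC-LAL and LAL-LAC fall into the same group.
    pair = np.sort(out[["winner", "loser"]].to_numpy(), axis=1)
    games_before = out.groupby([pair[:, 0], pair[:, 1]]).cumcount()
    # With no draws, every earlier game not won by the winner was won by the loser.
    out["h2h_loser"] = games_before - out["h2h_winner"]
    return out

results = ["LAC", "LAC", "LAL", "LAL", "LAL", "LAC", "LAL", "LAC"]
df = pd.DataFrame(
    {
        "winner": results,
        "loser": ["LAL" if w == "LAC" else "LAC" for w in results],
    }
)
print(add_h2h(df))
```

On the LAC/LAL sample this gives the same `h2h_winner` and `h2h_loser` values as the table above, and because both counts are grouped per pair it should also work when more teams are present.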

Try:

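import numpy as np
import pandas as pd

# Wins per team before each row (the shift excludes the current game).
# Note: these are wins against all opponents, which equals the h2h count
# only when df contains a single pair of teams, as in the example.
tmp = pd.crosstab(df.index, df["winner"]).shift(fill_value=0).cumsum()

# Add a zero column for any team that never wins, so the lookup below cannot fail:
tmp = tmp.reindex(columns=np.unique(df[["winner", "loser"]]), fill_value=0)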

df[["winner_cnt", "loser_cnt"]] = df.apply(
    lambda x: tmp.loc[x.name, x[["winner", "loser"]].values].values, axis=1
).apply(pd.Series)
print(df)

Prints:

  winner loser  won        date  winner_cnt  loser_cnt
0    LAC   LAL    1  15/02/2022           0          0
1    LAC   LAL    1  16/02/2022           1          0
2    LAL   LAC    1  17/02/2022           0          2
3    LAL   LAC    1  18/02/2022           1          2
4    LAL   LAC    1  19/02/2022           2          2
5    LAC   LAL    1  20/02/2022           2          3
6    LAL   LAC    1  21/02/2022           3          3
7    LAC   LAL    1  22/02/2022           3          4
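
Since the stated goal is an h2h win percentage, the two count columns can be turned into one along these lines. This is only a sketch: `h2h_win_pct` is a hypothetical helper, and leaving the value as NaN before the first meeting is an assumption about what is wanted there.

```python
import pandas as pd

def h2h_win_pct(wins, losses):
    """Percentage of earlier meetings won; NaN when the teams have not met yet."""
    total = wins + losses
    # where() turns a zero total into NaN, so the division never divides by zero.
    return 100 * wins / total.where(total > 0)

# Pre-game counts copied from the printed table above.
winner_cnt = pd.Series([0, 1, 0, 1, 2, 2, 3, 3])
loser_cnt = pd.Series([0, 0, 2, 2, 2, 3, 3, 4])
print(h2h_win_pct(winner_cnt, loser_cnt).round(1))
```

Calling it with the arguments swapped gives the loser's percentage instead.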