How to vectorize in Pandas when values depend on prior values

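I want to implement a function in Pandas that keeps a running balance, but I'm not sure whether it can be vectorized for speed.

In short, the problem I'm trying to solve is tracking consumption, generation, and a "bank" of over-generation.

"consumption" is how much is used in a given time period.
"generation" is how much is generated.
When generation exceeds consumption, the homeowner can "bank" the extra generation to apply in later periods. If their consumption exceeds their generation in a later month, they can draw on the bank.
This will be done for many entities, hence the "id" field. The time sequence is defined by "order".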
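
To pin down the rule described above, here is a minimal sketch of it as a plain Python loop for a single entity. `run_bank` is a hypothetical helper name, not something from the original code, and the inputs are the first five periods of id 1 in the example data. It assumes the rows are already in time order and that the bank starts empty.

```python
def run_bank(consume, generate):
    """Plain-loop reference for one entity (hypothetical helper)."""
    balance = 0
    rows = []
    for c, g in zip(consume, generate):
        begin = balance
        deposit = max(g - c, 0)               # surplus goes into the bank
        withdraw = min(begin, max(c - g, 0))  # shortfall is covered by the bank, as far as it lasts
        balance = begin + deposit - withdraw
        rows.append((begin, balance, withdraw))
    return rows

# (begin_bal, end_bal, withdraw) for each period
rows = run_bank([10, 17, 20, 11, 17], [20, 16, 17, 21, 9])
for row in rows:
    print(row)
```

This is the slow version that a vectorized solution would need to reproduce: each ending balance feeds the next beginning balance.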

Very basic example:

import numpy as np
import pandas as pd

id = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2]
order = [1,2,3,4,5,6,7,8,9,18,11,12,13,14,15,1,2,3,4,5,6,7,8,9,10,11]
consume = [10, 17, 20, 11, 17, 19, 20, 10, 10, 19, 14, 12, 10, 14, 13, 19, 12, 17, 12, 18, 15, 14, 15, 20, 16, 15]
generate = [20, 16, 17, 21, 9, 13, 10, 16, 12, 10, 9, 9, 15, 13, 100, 15, 18, 16, 10, 16, 12, 12, 13, 20, 10, 15]
df = pd.DataFrame(list(zip(id, order, consume, generate)), 
       columns =['id','Order','Consume', 'Generate'])
begin_bal = [0,10,9,6,16,8,2,0,6,8,0,0,0,5,4,0,0,6,5,3,1,0,0,0,0,0]
end_bal = [10,9,6,16,8,2,0,6,8,0,0,0,5,4,91,0,6,5,3,1,0,0,0,0,0,0]
withdraw = [0,1,3,0,8,6,2,0,0,8,0,0,0,1,4,0,0,1,2,2,1,0,0,0,0,0]
df_solution = pd.DataFrame(list(zip(id, order, consume, generate, begin_bal, end_bal, withdraw)), 
       columns =['id','Order','Consume', 'Generate', 'begin_bal', 'end_bal', 'Withdraw'])

def bank(df):
  # deposit all excess when generation exceeds consumption
  deposit = (df['Generate'] > df['Consume']) * (df['Generate'] - df['Consume'])
  df['end_bal'] = 0

  # beginning balance = prior period ending balance
  df = df.sort_values(by=['id', 'Order'])
  df['begin_bal'] = df['end_bal'].shift(periods=1)
  df.loc[df['Order']==1, 'begin_bal'] = 0  # set first month beginning balance of each customer to 0

  # calculate withdrawal
  df['Withdraw'] = 0
  ok_to_withdraw = df['Consume'] > df['Generate']
  df.loc[ok_to_withdraw,'Withdraw'] = np.minimum(df.loc[ok_to_withdraw, 'begin_bal'],
                                               df.loc[ok_to_withdraw, 'Consume'] -
                                               df.loc[ok_to_withdraw, 'Generate'] -
                                               deposit[ok_to_withdraw])
  # ending balance = beginning balance + deposit - withdraw
  df['end_bal'] = df['begin_bal'] + deposit - df['Withdraw'] 
  return df

df = bank(df)
df.head()
    id  Order   Consume Generate    end_bal begin_bal   Withdraw
0   1   1       10      20          10.0    0.0         0.0
1   1   2       17      16          0.0     0.0         0.0
2   1   3       20      17          0.0     0.0         0.0
3   1   4       11      21          10.0    0.0         0.0
4   1   5       17      9           0.0     0.0         0.0

df_solution.head()

    id  Order   Consume Generate    begin_bal   end_bal Withdraw
0   1   1       10      20          0           10      0
1   1   2       17      16          10          9       1
2   1   3       20      17          9           6       3
3   1   4       11      21          6           16      0
4   1   5       17      9           16          8       8

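I have tried to implement this through various iterations of cumsum and shift, but the fact remains that each row's value seems to need recalculating based on the previous row, and I'm not sure whether this can be vectorized.

Code to generate some test datasets: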

import random

def generate_testdata():
  random.seed(42*42)
  np.random.seed(42*42)
  numids = 10
  numorders = 12
  id = []
  order = []
  for i in range(numids):
    id = id + [i]*numorders
    order = order + list(range(1,numorders+1))
  consume = np.random.uniform(low = 10, high = 40, size = numids*numorders)
  generate = np.random.uniform(low = 10, high = 40, size = numids*numorders)
  df = pd.DataFrame(list(zip(id, order, consume, generate)), 
           columns =['id','Order','Consume', 'Generate'])
  return df

I'm not sure I fully understand your question, but I'll try to answer. Let me rephrase what I understood...

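1. Source data

The source data is a DataFrame with four columns:

  • id - the entity ID number
  • order - indicates the sequence of periods
  • consume - how much was consumed during the period
  • generate - how much was generated during the period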

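2. Calculations

For each id, we want to calculate:

  • diff - the difference between generate and consume for each period
  • opening balance - the closing balance of the previous order
  • closing balance - the cumulative sum of diff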

3. Code

I would try to solve this with groupby, cumsum, and shift.

# Make sure the df is sorted
df = df.sort_values(['id','order'])
df['diff'] = df['generate'] - df['consume'] 
df['closing_balance'] = df.groupby('id')['diff'].cumsum()
# Opening balance equals the closing balance from the previous period
df['opening_balance'] = df.groupby('id')['closing_balance'].shift(1)

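If I have misunderstood something, feel free to correct me and I will try to come up with a better answer.
In particular, I'm not sure how to handle closing_balance going negative. Should it show a negative balance? Should it void the "debts"?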
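
To make that open question concrete, here is a small sketch on made-up numbers (not taken from the question) showing what the plain grouped cumsum does once consumption outruns the bank. The `floored` column is a hypothetical extra, added only to show that clipping the result afterwards is not the same as voiding the debt.

```python
import pandas as pd

# Made-up data for one entity; column names follow the lowercase naming used above
df = pd.DataFrame({
    "id": [1, 1, 1, 1],
    "order": [1, 2, 3, 4],
    "consume": [10, 30, 10, 10],
    "generate": [20, 10, 15, 10],
})
df = df.sort_values(["id", "order"])
df["diff"] = df["generate"] - df["consume"]
df["closing_balance"] = df.groupby("id")["diff"].cumsum()
# Clipping only hides the negative value; the debt still drags down later rows
df["floored"] = df["closing_balance"].clip(lower=0)
print(df[["diff", "closing_balance", "floored"]])
```

Here the cumulative sum reads 10, -10, -5, -5. If debts are meant to be void, the balance should recover to 5 in the third period, but clipping the cumsum gives 0 there.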

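Here is a numpy-ish approach, mainly because I'm not that familiar with pandas:

The idea is to first compute the free (unconstrained) cumsum, and then subtract its cumulative minimum wherever that minimum is negative.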
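
Before handling several ids at once, the idea can be sketched for a single entity. This is a minimal illustration, not the full solution; the variable names `free` and `floor` are my own, and the inputs are the first eight periods of id 1 from the example data.

```python
import numpy as np

consume = np.array([10, 17, 20, 11, 17, 19, 20, 10])
generate = np.array([20, 16, 17, 21, 9, 13, 10, 16])

delta = generate - consume
free = delta.cumsum()                               # balance if debt were allowed
floor = np.minimum(0, np.minimum.accumulate(free))  # deepest debt so far, or 0 if none
end_bal = free - floor                              # lift the curve so it never dips below 0
begin_bal = np.concatenate(([0], end_bal[:-1]))     # previous period's ending balance
withdraw = np.maximum(0, begin_bal - end_bal)

print(end_bal)
print(withdraw)
```

This works because the recurrence end_bal[t] = max(0, end_bal[t-1] + delta[t]) has the closed form free[t] - min(0, min(free[0..t])). The extra work in the full function comes from stopping the cumulative minimum of one id from leaking into the next.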

import numpy as np
import pandas as pd

id = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2]
order = [1,2,3,4,5,6,7,8,9,18,11,12,13,14,15,1,2,3,4,5,6,7,8,9,10,11]
consume = [10, 17, 20, 11, 17, 19, 20, 10, 10, 19, 14, 12, 10, 14, 13, 19, 12, 17, 12, 18, 15, 14, 15, 20, 16, 15]
generate = [20, 16, 17, 21, 9, 13, 10, 16, 12, 10, 9, 9, 15, 13, 8, 15, 18, 16, 10, 16, 12, 12, 13, 20, 10, 15]
df = pd.DataFrame(list(zip(id, order, consume, generate)), 
           columns =['id','Order','Consume', 'Generate'])
begin_bal = [0,10,9,6,16,8,2,0,6,8,0,0,0,5,4,0,0,6,5,3,1,0,0,0,0,0]
end_bal = [10,9,6,16,8,2,0,6,8,0,0,0,5,4,0,0,6,5,3,1,0,0,0,0,0,0]
withdraw = [0,1,3,0,9,6,2,0,0,8,0,0,0,1,4,0,0,1,2,2,1,0,0,0,0,0]
df_solution = pd.DataFrame(list(zip(id, order, consume, generate, begin_bal, end_bal, withdraw)), 
           columns =['id','Order','Consume', 'Generate', 'begin_bal', 'end_bal', 'Withdraw'])

def f(df):
    # find block boundaries
    ids = df["id"].values
    bnds, = np.where(np.diff(ids, prepend=ids[0]-1, append=ids[-1]+1))
    # find raw balance change
    delta = (df["Generate"] - df["Consume"]).values
    # find offset, so cumulative min does not interfere across ids
    safe_total = (np.minimum(delta.min(), 0)-1) * np.diff(bnds[:-1])
    # must apply offset just before group switch, so it aligns the first
    # begin_bal, not end_bal, of the next group
    # also keep a copy of original values at switches
    delta_orig = delta[bnds[1:-1]-1]
    delta[bnds[1:-1]-1] += safe_total - np.add.reduceat(delta, bnds[:-2])
    # form free cumsum
    acc = delta.cumsum()
    # correct
    acc -= np.minimum(0, np.minimum.accumulate(acc))
    #  write solution back to df
    shft = np.empty_like(acc)
    shft[1:] = acc[:-1]
    shft[0] = 0
    # reinstate last end_bal of each group
    acc[bnds[1:-1]-1] = np.maximum(0, shft[bnds[1:-1]-1] + delta_orig)
    df["begin_bal"] = shft
    df["end_bal"] = acc
    df["Withdraw"] = np.maximum(0, df["begin_bal"] - df["end_bal"])

Test:

f(df)
df == df_solution

Prints:

      id  Order  Consume  Generate  begin_bal  end_bal  Withdraw
0   True   True     True      True       True     True      True
1   True   True     True      True       True     True      True
2   True   True     True      True       True     True      True
3   True   True     True      True       True     True      True
4   True   True     True      True       True     True     False
5   True   True     True      True       True     True      True
6   True   True     True      True       True     True      True
7   True   True     True      True       True     True      True
8   True   True     True      True       True     True      True
9   True   True     True      True       True     True      True
10  True   True     True      True       True     True      True
11  True   True     True      True       True     True      True
12  True   True     True      True       True     True      True
13  True   True     True      True       True     True      True
14  True   True     True      True       True     True      True
15  True   True     True      True       True     True      True
16  True   True     True      True       True     True      True
17  True   True     True      True       True     True      True
18  True   True     True      True       True     True      True
19  True   True     True      True       True     True      True
20  True   True     True      True       True     True      True
21  True   True     True      True       True     True      True
22  True   True     True      True       True     True      True
23  True   True     True      True       True     True      True
24  True   True     True      True       True     True      True
25  True   True     True      True       True     True      True

There is one False, but that appears to be a typo in the expected output provided.

Here is a pandas version using @PaulPanzer's logic.

def CalcEB(x):
    delta = x['Generate'] - x['Consume']
    return delta.cumsum() - delta.cumsum().cummin().clip(-np.inf,0)

df['end_bal'] = df.groupby('id', as_index=False).apply(CalcEB).values
df['begin_bal'] = df.groupby('id')['end_bal'].shift().fillna(0)
df['Withdraw'] = (df['begin_bal'] - df['end_bal']).clip(0,np.inf)

df_pandas = df.copy()

#Note the typo mentioned by Paul Panzer
df_pandas.reindex(df_solution.columns, axis=1) == df_solution

Output (checking the DataFrame)

      id  Order  Consume  Generate  begin_bal  end_bal  Withdraw
0   True   True     True      True       True     True      True
1   True   True     True      True       True     True      True
2   True   True     True      True       True     True      True
3   True   True     True      True       True     True      True
4   True   True     True      True       True     True     False
5   True   True     True      True       True     True      True
6   True   True     True      True       True     True      True
7   True   True     True      True       True     True      True
8   True   True     True      True       True     True      True
9   True   True     True      True       True     True      True
10  True   True     True      True       True     True      True
11  True   True     True      True       True     True      True
12  True   True     True      True       True     True      True
13  True   True     True      True       True     True      True
14  True   True     True      True       True     True      True
15  True   True     True      True       True     True      True
16  True   True     True      True       True     True      True
17  True   True     True      True       True     True      True
18  True   True     True      True       True     True      True
19  True   True     True      True       True     True      True
20  True   True     True      True       True     True      True
21  True   True     True      True       True     True      True
22  True   True     True      True       True     True      True
23  True   True     True      True       True     True      True
24  True   True     True      True       True     True      True
25  True   True     True      True       True     True      True
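
As a possible variant, the same logic can probably be written without `apply`, using the grouped `cumsum` and `cummin` methods directly. This is a sketch under the assumption that the column names match the question's DataFrame; the data is the first four periods of id 1 and the first three of id 2 from the example, and `free` and `floor` are names I chose for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "id":       [1, 1, 1, 1, 2, 2, 2],
    "Order":    [1, 2, 3, 4, 1, 2, 3],
    "Consume":  [10, 17, 20, 11, 19, 12, 17],
    "Generate": [20, 16, 17, 21, 15, 18, 16],
})
df = df.sort_values(["id", "Order"])

delta = df["Generate"] - df["Consume"]
free = delta.groupby(df["id"]).cumsum()                 # per-id balance if debt were allowed
floor = free.groupby(df["id"]).cummin().clip(upper=0)   # per-id deepest debt so far, or 0
df["end_bal"] = free - floor
df["begin_bal"] = df.groupby("id")["end_bal"].shift().fillna(0)
df["Withdraw"] = (df["begin_bal"] - df["end_bal"]).clip(lower=0)
print(df)
```

Because every step is a grouped cumulative operation, the result stays aligned with the original index and no reshaping of an `apply` result is needed.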