Alternative to looping? Vectorisation, cython?

Question

I have a pandas DataFrame that looks like this:

       Total    Yr_to_Use   First_Year_Del    Del_rate 2019 2020 2021 2022 2023 etc 
ref1    100       2020         5                 10    0    0    0    0   0
ref2    20        2028         2                 5     0    0    0    0   0 
ref3    30        2021         7                 16    0    0    0    0   0
ref4    40        2025         9                 18    0    0    0    0   0
ref5    10        2022         4                 30    0    0    0    0   0

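The 'Total' column shows how many products need to be delivered. 'First_Year_Del' tells you how many will be delivered in the first year. After that, the delivery rate reverts to 'Del_rate', a flat rate that is applied every year until all products have been delivered. The 'Yr_to_Use' column tells you the first year column in which deliveries start.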

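Example: ref1 has 100 units to deliver. Delivery starts in 2020: 5 are delivered in the first year and 10 in each year after that, until all 100 have been delivered.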
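
The rule described above can be sketched for a single row as follows. This is only an illustration: delivery_schedule is a hypothetical helper that does not appear in the question's own code, and it assumes Del_rate is positive.

```python
def delivery_schedule(total, first_year_del, del_rate):
    """Return the yearly deliveries for one row (hypothetical helper)."""
    if del_rate <= 0:
        # Assumption: a non-positive rate would never finish delivering.
        raise ValueError("del_rate must be positive")

    # First year: deliver First_Year_Del, but never more than Total.
    schedule = [min(first_year_del, total)]
    remaining = total - schedule[0]

    # Later years: deliver Del_rate per year until nothing is left.
    while remaining > 0:
        delivered = min(del_rate, remaining)
        schedule.append(delivered)
        remaining -= delivered
    return schedule


ref1 = delivery_schedule(100, 5, 10)
print(ref1)
```

For ref1 this gives 5 in the first year, then nine years of 10, then a final 5, which adds up to 100.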

Any ideas on how to approach this?

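I thought I might use something like the code below to reference the relevant columns in sequence, but I'm not even sure that helps, since it will depend on the solution (in the real version, base_date.year is defined as the first year column in the table, 2019):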

start_index_for_slice = df.columns.get_loc(base_date.year)
end_index_for_slice = start_index_for_slice+no_yrs_to_project
df.columns[start_index_for_slice:end_index_for_slice]
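
A self-contained version of the slice above might look like the sketch below. The values of base_date and no_yrs_to_project, the two-row frame, and the choice of integer year labels are all assumptions made for illustration; the question does not show how they are actually defined.

```python
import datetime as dt

import pandas as pd

# Assumed example values; the question does not define these.
base_date = dt.date(2019, 1, 1)
no_yrs_to_project = 5

# Small stand-in frame with integer year labels (an assumption).
df = pd.DataFrame(
    0,
    index=['ref1', 'ref2'],
    columns=['Total', 'Yr_to_Use', 'First_Year_Del', 'Del_rate',
             2019, 2020, 2021, 2022, 2023],
)

# Position of the first year column, then a positional slice of the labels.
start_index_for_slice = df.columns.get_loc(base_date.year)
end_index_for_slice = start_index_for_slice + no_yrs_to_project
year_cols = df.columns[start_index_for_slice:end_index_for_slice]
```

year_cols can then be used to select the year columns, for example with df[year_cols].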

I'm new to Python and not sure whether I'm getting a bit ahead of myself...

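The approach I had in mind was a for loop, or something using iterrows, but other posts seem to say that's a bad idea and that I should use vectorisation, cython or lambdas. Of those three, I've only managed a very simple lambda so far. The others are a bit of a mystery to me, since the solution seems to call for working through the years one after another until the total is reached.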

Any help is appreciated!

Thanks

Edit: expected output example below (I have changed some of the dates so you can see the logic better):

       Total    Yr_to_Use   First_Year_Del Del_rate 2019 2020 2021 2022 2023 etc
ref1    100       2020         5              10    0    5    10    10   10
ref2    20        2021         2              5     0    0    2     5    5 
ref3    30        2021         7              16    0    0    7     16   7
ref4    40        2019         9              18    9    18   13    0    0
ref5    10        2020         4              30    0    4    6     0    0

Answer 1

You can do this with a few user-defined functions and the apply method:

import pandas as pd
import numpy as np

df = pd.DataFrame(data={'id': ['ref1','ref2','ref3','ref4','ref5'], 
                        'Total': [100, 20, 30, 40, 10],
                        'Yr_to_Use': [2020, 2028, 2021, 2025, 2022],
                        'First_Year_Del': [5,2,7,9,4],
                        'Del_rate':[10,5,16,18,30]})

def f(r):
    ''' 
    Computes values per year and respective year
    '''

    n = (r['Total'] - r['First_Year_Del'])//r['Del_rate']
    leftover = (r['Total'] - r['First_Year_Del'])%r['Del_rate']
    # Append the remainder only when it is non-zero, to avoid a trailing zero year
    r['values'] = [r['First_Year_Del']] + [r['Del_rate']] * n + ([leftover] if leftover else [])
    r['years'] = np.arange(r['Yr_to_Use'], r['Yr_to_Use'] + len(r['values']))

    return r

df = df.apply(f, axis=1)


def get_year_range(r):
    '''
    Computes min and max year for each row
    '''

    r['y_min'] = min(r['years'])
    r['y_max'] = max(r['years'])
    return r 

df = df.apply(get_year_range, axis=1)

y_min = df['y_min'].min()
y_max = df['y_max'].max()

#Initialize each year value to zero
for year in range(y_min, y_max+1):
    df[year] = 0


def expand(r):
    '''
    Update value for each year
    '''
    for v, y in zip(r['values'], r['years']):
        r[y] = v 
    return r

# Apply and drop temporary columns
df = df.apply(expand, axis=1).drop(['values', 'years', 'y_min', 'y_max'], axis=1)

Answer 2

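Here is another option that separates out the calculation of the rates/years matrix, which is then appended to the input df afterwards. It still loops in the script itself (the loop is not "externalized" to some numpy / pandas function). I'd guess it should be fine for 5k rows.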

import pandas as pd
import numpy as np

# def gen_df1():

# create the initial df without years/rates
df = pd.DataFrame({'Total': [100, 20, 30, 40, 10], 
                   'Yr_to_Use': [2020, 2021, 2021, 2019, 2020], 
                   'First_Year_Del': [5, 2, 7, 9, 4],  # ref5 is 4 in the question's tables
                   'Del_rate': [10, 5, 16, 18, 30]})

# get number of rates + remainder
n, r = np.divmod((df['Total']-df['First_Year_Del']), df['Del_rate'])

# get the year of the last rate considering all rows
max_year = np.max(n + r.astype(bool) + df['Yr_to_Use'])  # np.bool was removed in NumPy 1.24

# get the offsets for the start of delivery, year zero is 2019
offset = df['Yr_to_Use'] - 2019
# subtracting the year zero lets you use this as an index...

# get a year index; this determines the columns that will be created
yrs = np.arange(2019, max_year+1)

# prepare an n*m array to hold the rates for all years, initialize with all zeros
out = np.zeros((df['Total'].shape[0], yrs.shape[0]))
# n: number of rows of the df, m: number of years where rates will have to be paid

# calculate the rates for each year and insert them into the output array
for i in range(df['Total'].shape[0]):
    # concatenate: year of the first rate, all yearly rates, a final rate if there was a remainder
    if r[i]: # if rest is not zero, append it as well
        rates = np.concatenate([[df['First_Year_Del'][i]], n[i]*[df['Del_rate'][i]], [r[i]]])
    else: # rest is zero, skip it
        rates = np.concatenate([[df['First_Year_Del'][i]], n[i]*[df['Del_rate'][i]]])
    # insert the rates at the appropriate location of the output array:
    out[i, offset[i]:offset[i]+rates.shape[0]] = rates

# add the years/rates matrix to the original df    
df = pd.concat([df, pd.DataFrame(out, columns=yrs.astype(str))], axis=1, sort=False)
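
Since the question asks about vectorisation, here is one way the per-row loop could be avoided with NumPy broadcasting. This is a sketch and is not taken from either answer: it uses the values from the question's expected-output table, assumes a fixed projection window of 2019 to 2023, and assumes no row starts delivering before the first year in that window. The idea is to compute the cumulative amount delivered by the end of each year, cap it at Total, and then take year-on-year differences.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Total': [100, 20, 30, 40, 10],
    'Yr_to_Use': [2020, 2021, 2021, 2019, 2020],
    'First_Year_Del': [5, 2, 7, 9, 4],
    'Del_rate': [10, 5, 16, 18, 30],
}, index=['ref1', 'ref2', 'ref3', 'ref4', 'ref5'])

years = np.arange(2019, 2024)  # assumed projection window

# Column vectors so they broadcast against the row vector of years.
start = df['Yr_to_Use'].to_numpy()[:, None]
total = df['Total'].to_numpy()[:, None]
first = df['First_Year_Del'].to_numpy()[:, None]
rate = df['Del_rate'].to_numpy()[:, None]

# k[i, j]: years elapsed since row i started delivering (negative before the start).
k = years[None, :] - start

# Cumulative units delivered by the end of each year, capped at Total;
# zero before deliveries start.
cum = np.where(k >= 0, np.minimum(first + rate * k, total), 0)

# Yearly deliveries are the differences between consecutive cumulative totals.
per_year = np.diff(cum, axis=1, prepend=0)

result = df.join(pd.DataFrame(per_year, index=df.index, columns=years))
print(result)
```

With these inputs the year columns match the expected output in the question, for example 0, 0, 7, 16, 7 for ref3. A row whose schedule runs past 2023, such as ref1, is simply cut off at the end of the window.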

Tags: vectorization, cython, dataframe, python-3.x, pandas