Speed up dataframe operations in pandas

Question

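I am currently switching from R to Python and would like to know whether I can speed up the following dataframe operation. I have a sales dataset with 500,000 rows and 17 columns, and I need to run some calculations on it before putting it into a dashboard. My data looks like this: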

location  time  product  sales
store1    2017  brandA   10
store1    2017  brandB   17 
store1    2017  brandC   15
store1    2017  brandD   19
store1    2017  catTot   86
store2    2017  brandA   8
store2    2017  brandB   23 
store2    2017  brandC   5
store2    2017  brandD   12
store2    2017  catTot   76
.         .     .         .
.         .     .         .
.         .     .         .
.         .     .         .

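catTot is a pre-aggregate that I get from the raw dataset; it shows the total sales for a given store in a given time period. As you can see, the other products are only a fraction of the total and never add up to it, but they are included in it. Because I want to reflect the total sales for a given location without showing every product (due to performance problems in the dashboard), I need to replace each catTot value with an aggregate that is the current value minus the sum of the other products.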

Currently I iterate through nested for loops to make the changes. The code looks like this:

df['location'] = df.location.astype('category')
df['time'] = df.time.astype('category')

for var_time in df.time.cat.categories:
    for var_geo in df.location.cat.categories:
        # Rows for one store in one period; catTot is the last of them.
        df_tmp = df[(df['location'] == var_geo) & (df['time'] == var_time)]
        # Total minus the sum of all other products (column 3 is sales in the sample above).
        fct_eur = df_tmp.iloc[len(df_tmp)-1,3] - df_tmp.iloc[0:len(df_tmp)-1,3].sum()
        df.loc[(df['location'] == var_geo) & (df['time'] == var_time) & (df['product'] == 'catTot'), ['sales']] = fct_eur

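As you can see, catTot is always the last row in the masked dataframe. The operation currently takes about 9 minutes per run, because I have 23 store locations, about 880 products, 30 time periods and 5 different measures, which results in about 500,000 rows. Is there a more elegant, or at least faster, way to do this kind of operation?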

Answer 1

You can create a grouping key that labels everything other than "catTot" as "sales", then use pivot_table to aggregate the sales column, for example:

import numpy as np

agg = df.pivot_table(
    index=['location', 'time'],
    # Two column groups: the catTot rows, and everything else as 'sales'.
    columns=np.where(df['product'] == 'catTot', 'catTot', 'sales'),
    values='sales',
    aggfunc='sum'
)

This gives you:

               catTot  sales
location time
store1   2017      86     61
store2   2017      76     48

Then you can compute new_total = agg['catTot'] - agg['sales']:

location  time
store1    2017    25
store2    2017    28
dtype: int64
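
If the goal is to write the new totals back into the original frame rather than keep them in a separate Series, one possible sketch is below. It assumes the column names from the question's sample and uses a small toy frame; the groupby/transform route is an alternative I am suggesting here, not part of the pivot_table answer itself.

```python
import pandas as pd

# Toy data mirroring the sample in the question.
df = pd.DataFrame({
    "location": ["store1"] * 5 + ["store2"] * 5,
    "time": [2017] * 10,
    "product": ["brandA", "brandB", "brandC", "brandD", "catTot"] * 2,
    "sales": [10, 17, 15, 19, 86, 8, 23, 5, 12, 76],
})

is_total = df["product"] == "catTot"

# Sum of the non-total rows per (location, time), broadcast back to every row.
brand_sum = (
    df["sales"].where(~is_total, 0)
    .groupby([df["location"], df["time"]])
    .transform("sum")
)

# Overwrite only the catTot rows with "total minus listed products".
df.loc[is_total, "sales"] = df.loc[is_total, "sales"] - brand_sum[is_total]
```

Because transform returns a result aligned with the original index, no loop over stores or periods is needed, and the catTot row does not have to be the last row of each group.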

Answer 2

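A friend actually suggested the following approach to my problem. The code is his as well. It builds a nested dictionary and adds each row's measure to the value stored under that row's key, except that everything other than catTot is multiplied by -1 first. So in the end only the remainder is left.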

def safe_add(mapping, store, year, brand, count):
    if store not in mapping:
        mapping[store] = {}
    if year not in mapping[store]:
        mapping[store][year] = 0
    # Non-total rows are subtracted; the catTot row is added.
    if brand != 'catTot':
        count = count * -1
    mapping[store][year] += count

mapping = {}
for row in data:
    safe_add(mapping, row[0], int(row[1]), row[2], int(row[3]))
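
The same sign-flip idea can probably be expressed in pandas directly, without the Python-level loop. The sketch below assumes the column names from the question's sample and a toy frame; it is my reading of the approach, not my friend's code.

```python
import pandas as pd

# Toy data mirroring the sample in the question.
df = pd.DataFrame({
    "location": ["store1"] * 5 + ["store2"] * 5,
    "time": [2017] * 10,
    "product": ["brandA", "brandB", "brandC", "brandD", "catTot"] * 2,
    "sales": [10, 17, 15, 19, 86, 8, 23, 5, 12, 76],
})

# Keep catTot rows as they are and negate everything else,
# mirroring the count * -1 step in safe_add.
signed = df["sales"].where(df["product"] == "catTot", -df["sales"])

# Summing the signed values per (location, time) leaves only the remainder.
remainder = signed.groupby([df["location"], df["time"]]).sum()
```

Here remainder plays the role of the nested dictionary: it is indexed by (location, time) and holds one value per store and period.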

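Once I had the nested dictionary, I looped through it once to count the number of rows I needed to write out. I did this so that I could pre-allocate an empty df and then fill it.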

counter = 0
for geo in mapping.keys():
    for time in mapping[geo].keys():
        counter += 1
df_annex = pd.DataFrame(data=None, index=np.arange(0, counter), columns=df.columns)
for geo in mapping.keys():
    for time in mapping[geo].keys():
        df_annex.iloc[counterb, 0] = geo
        .
        .

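After writing out the dictionary, I simply subset the old totals out of df and concatenated it with the annex. This brought the runtime down to 7.88 seconds, compared with 9 minutes.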
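
For anyone repeating this, the counting pass and the pre-allocated frame can likely be skipped by building the annex from a list of records. This is only a sketch: the mapping values and the df_without_totals frame below are made-up stand-ins for the real dictionary and the original frame with its old catTot rows removed.

```python
import pandas as pd

# Hypothetical result of the dictionary pass: mapping[store][year] = remainder.
mapping = {"store1": {2017: 25}, "store2": {2017: 28}}

# One record per (store, year); no row counter or pre-sized frame is needed.
records = [
    {"location": store, "time": year, "product": "catTot", "sales": value}
    for store, years in mapping.items()
    for year, value in years.items()
]
df_annex = pd.DataFrame.from_records(records)

# Stand-in for the original frame after dropping the old catTot rows.
df_without_totals = pd.DataFrame({
    "location": ["store1", "store2"],
    "time": [2017, 2017],
    "product": ["brandA", "brandA"],
    "sales": [10, 8],
})

# Append the recomputed totals to the remaining rows.
result = pd.concat([df_without_totals, df_annex], ignore_index=True)
```

Filling a frame cell by cell with iloc tends to be slow in pandas, so constructing it in one call is usually the cheaper option.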

Tags: python, etl, dataframe, python-3.x, pandas