More effective way of calculating/adding new columns using Pandas for large dataset

I have the following dataset:

              CustomerID Date                     Amount           Department  \
0                 395134 2019-01-01               199              Home   
1                 395134 2019-01-01               279              Home   
2                1356012 2019-01-07               279              Home   
3                1921374 2019-01-08               269              Home   
4                 395134 2019-01-01               279              Home   
...                  ...        ...               ...               ...   
18926474         1667426 2021-06-30               349        Womenswear   
18926475         1667426 2021-06-30               299        Womenswear   
18926476          583105 2021-06-30               349        Womenswear   
18926477          538137 2021-06-30               279        Womenswear   
18926478          825382 2021-06-30              2499              Home   

                  DaysSincePurchase  
0                 986 days  
1                 986 days  
2                 980 days  
3                 979 days  
4                 986 days  
...                    ...  
18926474           75 days  
18926475           75 days  
18926476           75 days  
18926477           75 days  
18926478           75 days  

I want to do some feature engineering and add a few columns after aggregating by CustomerID (using groupby). The Date column is unimportant and easy to drop. I want a dataset in which each row is a unique CustomerID, represented simply as the integers 1, 2, ... (first column), and the other columns are:

  1. Total purchase amount
  2. Days since the last purchase
  3. Number of distinct departments

This is what I did, and it works. However, when I timed it, it took about 1.5 hours. Is there a more efficient way to do this?

customer_group = joinedData.groupby(['CustomerID'])
n = originalData['CustomerID'].nunique()

# First arrange the data in a matrix, one row per customer.
matrix = np.zeros((n, 4))  # Pre-allocate matrix

for i in range(n):
    matrix[i, 0] = i + 1
    matrix[i, 1] = sum(customer_group.get_group(i + 1)['Amount'])
    matrix[i, 2] = min(customer_group.get_group(i + 1)['DaysSincePurchase']).days
    matrix[i, 3] = customer_group.get_group(i + 1)['Department'].nunique()

# The above loop takes approx. 6300 s

# Convert the matrix to a dataframe and name the columns
newData = pd.DataFrame(matrix)
newData = newData.rename(columns={0: "CustomerID",
                                  1: "TotalDemand",
                                  2: "DaysSinceLastPurchase",
                                  3: "nrDepartments"})

Use agg:

>>> df.groupby('CustomerID').agg(TotalDemand=('Amount', 'sum'),
                                 DaysSinceLastPurchase=('DaysSincePurchase', 'min'),
                                 nrDepartments=('Department', 'nunique'))

I ran this on a dataframe containing 20,000,000 records. It executes in a matter of seconds:

>>> %timeit df.groupby('CustomerID').agg(...)
14.7 s ± 225 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
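One detail: the loop stored integer day counts (via `.days`), whereas taking `min` of a timedelta column yields timedeltas. If integer days are wanted, the column can be converted after aggregation. A minimal sketch on a tiny made-up frame (the data values here are illustrative assumptions):

```python
import pandas as pd

# Miniature stand-in for the real data
df = pd.DataFrame({
    'CustomerID': [1, 1, 2],
    'Amount': [199, 279, 349],
    'DaysSincePurchase': pd.to_timedelta([986, 980, 75], unit='D'),
    'Department': ['Home', 'Home', 'Womenswear'],
})

result = df.groupby('CustomerID').agg(
    TotalDemand=('Amount', 'sum'),
    DaysSinceLastPurchase=('DaysSincePurchase', 'min'),
    nrDepartments=('Department', 'nunique'),
).reset_index()

# Convert the timedelta column back to an integer day count,
# mirroring the `.days` call in the original loop
result['DaysSinceLastPurchase'] = result['DaysSinceLastPurchase'].dt.days
```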

Generated data:

N = 20000000
df = pd.DataFrame(
    {'CustomerID': np.random.randint(1000, 10000, N), 
     'Date': np.random.choice(pd.date_range('2020-01-01', '2020-12-31'), N),
     'Amount': np.random.randint(100, 1000, N),
     'Department': np.random.choice(['Home', 'Sport', 'Food', 'Womenswear',
                                     'Menswear', 'Furniture'], N)})
df['DaysSincePurchase'] = pd.Timestamp.today().normalize() - df['Date']
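If you also want the sequential 1..n customer numbers that the loop version produced (rather than the real CustomerID values in the index), they can be assigned after the aggregation. A sketch on generated data in the same shape as above (the column names match the question; everything else is illustrative):

```python
import numpy as np
import pandas as pd

# Small generated sample in the same shape as the question's data
rng = np.random.default_rng(0)
N = 1000
df = pd.DataFrame(
    {'CustomerID': rng.integers(1000, 1010, N),
     'Amount': rng.integers(100, 1000, N),
     'Department': rng.choice(['Home', 'Sport', 'Food'], N)})
df['DaysSincePurchase'] = pd.to_timedelta(rng.integers(1, 365, N), unit='D')

agg = df.groupby('CustomerID').agg(
    TotalDemand=('Amount', 'sum'),
    DaysSinceLastPurchase=('DaysSincePurchase', 'min'),
    nrDepartments=('Department', 'nunique')).reset_index()

# Replace the real IDs with sequential integers 1..n, as in the loop version
agg['CustomerID'] = np.arange(1, len(agg) + 1)
```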