How to group and sum certain columns of an array based on their classification (e.g. to group cities by country)

Question

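The issue

I have arrays that track certain items over time. The items belong to certain categories. I want to calculate the sum by time and category, e.g. to go from a table by time and city to one by time and country.

I have found a couple of ways, but they seem clunky - there must be a better way! Surely I am not the first one with this issue? Maybe using np.where?

More specifically:

I have a number of numpy arrays of shape (p x i), where p is the number of periods and i is the number of items I am tracking over time. I then have a separate array of shape i which classifies the items into categories (red, green, yellow, etc.).

What I want to do is calculate an array of shape (p x number of unique categories) which sums the values of the big array by time and category. In pictures: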
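As a small illustration of the target shape, here is a minimal sketch on made-up toy data (the names `values`, `labels` and `summed` are hypothetical, and the per-label boolean mask is only meant to show the expected result, not to be the fast solution):

```python
import numpy as np

# Hypothetical toy data: 2 periods (rows) x 4 items (columns).
values = np.array([[1, 2, 3, 4],
                   [10, 20, 30, 40]])
# One category label per item (column).
categories = np.array(['red', 'green', 'red', 'green'])

# Unique labels, sorted alphabetically: green, then red.
labels = np.unique(categories)

# For each label, sum the columns that carry that label.
summed = np.column_stack(
    [values[:, categories == label].sum(axis=1) for label in labels]
)

print(labels)  # green, red
print(summed)  # rows [6, 4] and [60, 40]
```

The result has one row per period and one column per unique category, which is the (p x number of unique categories) array described above.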

I need the code to be as efficient as possible, as I need to do this many times on arrays which can be up to 400 x 1,000,000.

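What I have tried:

This question covers a number of ways to group by without resorting to pandas. I like the scipy.ndimage approach, but AFAIK it works on one dimension only.

I have tried a solution with pandas:

I create a dataframe of shape periods x items
I unpivot it with pd.melt(), join the categories and do a crosstab period/categories

I have also tried a set of loops, optimised with numba:

A first loop creates an array which converts the categories into integers, i.e. the first category in alphabetical order becomes 0, the second 1, etc.
A second loop iterates over all the items, then for each item iterates over all the periods and sums by category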
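The two steps above (convert categories to integer codes, then sum by code) can also be sketched in plain NumPy. This is only an illustration under stated assumptions: the helper name `group_sum_columns` is hypothetical, the one-hot matrix product is a swapped-in technique rather than the loop from the post, and it has not been benchmarked against the numba version.

```python
import numpy as np

def group_sum_columns(x, categories):
    """Sum the columns of x (periods x items) by category label."""
    # labels: sorted unique categories.
    # codes[i]: position of categories[i] in labels (alphabetical order -> 0, 1, ...).
    labels, codes = np.unique(categories, return_inverse=True)

    # One-hot matrix of shape (items x categories): entry [i, c] is 1 if item i has code c.
    onehot = (codes[:, None] == np.arange(len(labels))).astype(x.dtype)

    # (periods x items) @ (items x categories) -> (periods x categories)
    return labels, x @ onehot

# Toy usage with the same category pattern as in the question's code.
cats = np.tile(['red', 'green', 'yellow', 'brown'], 3)
x = np.arange(24).reshape(2, 12)
labels, out = group_sum_columns(x, cats)
print(labels)
print(out)
```

The one-hot matrix costs items x categories elements of memory, so this is likely to suit a small number of categories better than a large one.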

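My findings

For small arrays, pandas is faster
For large arrays, numba is better, but it is better to set parallel = False in the numba decorator
For very large arrays, numba with parallel = True shines. parallel = True makes use of numba's parallelisation by using numba.prange on the outer loops.

PS I am aware of the pitfalls of premature optimisation etc. - I am only looking into this because a significant amount of time is spent doing precisely this

The code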

import numpy as np
import pandas as pd
import time
import numba

periods = 300
n = int(2000)
categories = np.tile(['red','green','yellow','brown'],n)
my_array = np.random.randint(low = 0, high = 10, size = (periods, len(categories) ))
# my_array has shape (periods x (n * number of categories))


#---- pandas
start = time.time()

df_categories = pd.DataFrame(data = categories).reset_index().rename(columns ={'index':'item',0:'category'})
df = pd.DataFrame(data = my_array)
unpiv = pd.melt(df.reset_index(), id_vars ='index', var_name ='item', value_name ='value').rename( columns = {'index':'time'})
unpiv = pd.merge(unpiv, df_categories, on='item' )
crosstab = pd.crosstab( unpiv['time'], unpiv['category'], values = unpiv['value'], aggfunc='sum' )

print("panda crosstab in:")
print(time.time() - start)
# yep, I know that timeit.timer would have been better, but I was in a hurry :)
print("")


#---- numba
@numba.jit(nopython = True, parallel = True, nogil = True)
def numba_classify(x, categories):
    cat_uniq = np.unique(categories)
    num_categories = len(cat_uniq)
    num_items = x.shape[1]
    periods = x.shape[0]
    categories_converted = np.zeros(len(categories), dtype = np.int32)
    out = np.zeros(( periods, num_categories))
    
    
    # before running the actual classification, I must convert the categories, which can be strings, to
    # the corresponding number in cat_uniq, e.g. if brown is the first category by alphabetical sorting, then
    # brown --> 0, etc
    
    for i in numba.prange(num_items):
        for c in range(num_categories):
            if categories[i] == cat_uniq[c]:
                categories_converted[i] = c
      
    # parallelise over periods, so that each thread writes to its own row of out;
    # a prange over items would let several threads update the same out[p, c] at once
    for p in numba.prange(periods):
        for i in range(num_items):
            out[ p, categories_converted[i] ] += x[p,i]


    return out

start = time.time()

numba_out = numba_classify(my_array, categories)
print("numba done in:")
print(time.time() - start)
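One caveat about timings of this kind: a numba-jitted function is compiled on its first call, so a single timed call includes compilation time. A minimal sketch of a timing helper with an untimed warm-up call, using the standard-library timeit module, could look like this (the helper `best_time` and the stand-in workload `column_sums` are hypothetical names, used only to keep the example self-contained):

```python
import timeit
import numpy as np

def best_time(func, *args, repeats=5):
    """Return the fastest of `repeats` timed calls, after one untimed warm-up call."""
    func(*args)  # warm-up: for a numba-jitted function this is where compilation happens
    timings = timeit.repeat(lambda: func(*args), number=1, repeat=repeats)
    return min(timings)

# Stand-in workload; in practice this would be numba_classify or the pandas version.
def column_sums(x):
    return x.sum(axis=0)

x = np.random.randint(low=0, high=10, size=(300, 8000))
elapsed = best_time(column_sums, x)
print(f"best of 5: {elapsed:.6f} seconds")
```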

Answer 1

You can use df.groupby(categories, axis=1).sum() for a substantial speedup.

import numpy as np
import pandas as pd
import time


def make_data(periods, n):
    categories = np.tile(['red','green','yellow','brown'],n)
    my_array = np.random.randint(low = 0, high = 10, size = (periods, len(categories) ))
    
    return categories, pd.DataFrame(my_array)

for n in (200, 2000, 20000):
    categories, df = make_data(300, n)
    true_n = n * 4
    
    start = time.time()
    tabulation = df.groupby(categories, axis=1).sum()
    elapsed = time.time() - start
    
    print(f"300 x {true_n:5}: {elapsed:.3f} seconds")

# prints:
# 300 x   800: 0.005 seconds
# 300 x  8000: 0.021 seconds
# 300 x 80000: 0.673 seconds
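Note that the axis argument of DataFrame.groupby has been deprecated since pandas 2.1. On newer versions, the same grouping can be sketched by transposing, grouping the rows, and transposing back. The toy data below is made up for illustration, and this variant has not been timed against the numbers above:

```python
import numpy as np
import pandas as pd

# Toy data: 2 periods x 12 items, with the same category pattern as above.
categories = np.tile(['red', 'green', 'yellow', 'brown'], 3)
df = pd.DataFrame(np.arange(24).reshape(2, 12))

# Transpose so that items become rows, group the rows by category,
# sum within each group, then transpose back to periods x categories.
tabulation = df.T.groupby(categories).sum().T
print(tabulation)
```

The resulting columns are the unique categories in sorted order, one row per period.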
