Group by with numpy.mean

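How can I calculate the mean for each workerid below? Here is my sample NumPy ndarray. Column 0 is the workerid, column 1 is the latitude, and column 2 is the longitude.
I want to calculate the mean latitude and longitude for each workerid. I want to keep all of this in NumPy (ndarray), without converting to Pandas.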

import numpy
from scipy.spatial.distance import cdist, euclidean
import itertools
from itertools import groupby

class WorkerPatientScores:

    '''
    I read from the Patient and Worker tables in SchedulingOptimization.
    '''
    def __init__(self, dist_weight=1):
        self.a = []

        self.a = ([[25302, 32.133598100000000, -94.395845200000000],
                   [25302, 32.145095132560200, -94.358041585705600],
                   [25302, 32.160400000000000, -94.330700000000000],
                   [25305, 32.133598100000000, -94.395845200000000],
                   [25305, 32.115095132560200, -94.358041585705600],
                   [25305, 32.110400000000000, -94.330700000000000],
                   [25326, 32.123598100000000, -94.395845200000000],
                   [25326, 32.125095132560200, -94.358041585705600],
                   [25326, 32.120400000000000, -94.330700000000000],
                   [25341, 32.173598100000000, -94.395845200000000],
                   [25341, 32.175095132560200, -94.358041585705600],
                   [25341, 32.170400000000000, -94.330700000000000],
                   [25376, 32.153598100000000, -94.395845200000000],
                   [25376, 32.155095132560200, -94.358041585705600],
                   [25376, 32.150400000000000, -94.330700000000000]])

        ndarray = numpy.array(self.a)
        ndlist = ndarray.tolist()
        geo_tuple = [(p[1], p[2]) for p in ndlist]
        nd1 = numpy.array(geo_tuple)
        mean_tuple = numpy.mean(nd1, 0)
        print(mean_tuple)

The output of the above is:

[ 32.14303108 -94.36152893]

Using the workerid and a list comprehension, it would be:

import numpy as np

a = np.array(self.a)
ids = np.unique(a[:, 0])  # array of unique ids
pos_mean = [np.mean(a[a[:, 0] == i, 1:], axis=0) for i in ids]

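But given that there always seem to be 3 consecutive measurements per workerid, there should be a relatively simple way to vectorize this.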
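
A minimal sketch of that vectorized idea, assuming every workerid has exactly 3 rows and that the rows of each workerid are contiguous. The names `k` and `blocks` are illustrative, and the sample array is shortened to two workers:

```python
import numpy as np

# Shortened copy of the sample data: two workerids, three rows each.
a = np.array([[25302, 32.133598100000000, -94.395845200000000],
              [25302, 32.145095132560200, -94.358041585705600],
              [25302, 32.160400000000000, -94.330700000000000],
              [25305, 32.133598100000000, -94.395845200000000],
              [25305, 32.115095132560200, -94.358041585705600],
              [25305, 32.110400000000000, -94.330700000000000]])

k = 3  # assumed fixed number of consecutive rows per workerid

# Reshape (n_rows, 3) into (n_groups, k, 3), then average over the k axis.
blocks = a.reshape(-1, k, a.shape[1])
ids = blocks[:, 0, 0]                      # one workerid per block
pos_mean = blocks[:, :, 1:].mean(axis=1)   # shape (n_groups, 2): mean lat, mean lon

print(ids)
print(pos_mean)
```

If the group sizes vary, or the rows are not ordered by workerid, the reshape either fails or silently mixes groups, so this shortcut only applies to data shaped like the sample.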

You can solve this with some creative array slicing and the where function.

means = {}
for i in numpy.unique(a[:,0]):
    tmp = a[numpy.where(a[:,0] == i)]
    means[i] = (numpy.mean(tmp[:,1]), numpy.mean(tmp[:,2]))

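The slice [:,0] is a handy way to extract a column (in this case the first one) from a 2D array. To get the means, we find the unique IDs in the first column, then for each ID we pull out the matching rows with where and average the remaining two columns. The end result is a dict of tuples, where the key is the ID and the value is a tuple containing the means of the other two columns. When I run it, it produces the following dict: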

{25302.0: (32.1463644108534, -94.36152892856853),
 25305.0: (32.11969774418673, -94.36152892856853),
 25326.0: (32.12303107752007, -94.36152892856853),
 25341.0: (32.17303107752007, -94.36152892856853),
 25376.0: (32.15303107752007, -94.36152892856853)}
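
To illustrate the row selection step, here is a small sketch on a toy array (the values are invented for illustration). It shows that numpy.where returns the indices of the matching rows, and that indexing with the boolean mask directly selects the same rows:

```python
import numpy

# Toy array, invented for illustration: column 0 is the ID.
a = numpy.array([[1, 10.0, 100.0],
                 [2, 20.0, 200.0],
                 [1, 30.0, 300.0]])

mask = a[:, 0] == 1        # boolean array, one entry per row
rows = numpy.where(mask)   # tuple holding the indices of the True entries

via_where = a[rows]        # rows picked by index
via_mask = a[mask]         # rows picked by the boolean mask

print(rows)
print(via_where)
print(numpy.array_equal(via_where, via_mask))
```

So the where call in the loop could be dropped in favor of the mask alone; which form to use is a matter of taste.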

Given this array, we want to group by the first column and take the means of the other two columns:

import numpy as np

X = np.asarray([[25302, 32.133598100000000, -94.395845200000000],
                [25302, 32.145095132560200, -94.358041585705600],
                [25302, 32.160400000000000, -94.330700000000000],
                [25305, 32.133598100000000, -94.395845200000000],
                [25305, 32.115095132560200, -94.358041585705600],
                [25305, 32.110400000000000, -94.330700000000000],
                [25326, 32.123598100000000, -94.395845200000000],
                [25326, 32.125095132560200, -94.358041585705600],
                [25326, 32.120400000000000, -94.330700000000000],
                [25341, 32.173598100000000, -94.395845200000000],
                [25341, 32.175095132560200, -94.358041585705600],
                [25341, 32.170400000000000, -94.330700000000000],
                [25376, 32.153598100000000, -94.395845200000000],
                [25376, 32.155095132560200, -94.358041585705600],
                [25376, 32.150400000000000, -94.330700000000000]])

Using only numpy and without loops:

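# Split off the group labels (column 0) and keep only the value columns.
groups = X[:,0].copy()
X = np.delete(X, 0, axis=1)

# Sort by label so that rows of the same group are contiguous.
_ndx = np.argsort(groups)
# Unique labels, start index of each group in sorted order, and group sizes.
_id, _pos, g_count = np.unique(groups[_ndx],
                               return_index=True,
                               return_counts=True)

# Sum each contiguous group of rows, then divide by the group size.
g_sum = np.add.reduceat(X[_ndx], _pos, axis=0)
g_mean = g_sum / g_count[:,None]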

Storing the results in a dictionary:

>>> dict(zip(_id, g_mean))
{25302.0: array([ 32.14636441, -94.36152893]),
 25305.0: array([ 32.11969774, -94.36152893]),
 25326.0: array([ 32.12303108, -94.36152893]),
 25341.0: array([ 32.17303108, -94.36152893]),
 25376.0: array([ 32.15303108, -94.36152893])}
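
As a rough illustration of how the sort, np.unique, and np.add.reduceat steps fit together, here is a sketch on a tiny made-up array whose group labels are deliberately unsorted. All values and the names `keys`, `vals`, `order`, and `start` are invented for this example:

```python
import numpy as np

# Invented data: group labels in arbitrary order, two value columns.
keys = np.array([2, 1, 2, 1, 1])
vals = np.array([[10.0, 1.0],
                 [20.0, 2.0],
                 [30.0, 3.0],
                 [40.0, 4.0],
                 [60.0, 6.0]])

# Sorting brings rows with equal keys next to each other.
order = np.argsort(keys, kind="stable")
uniq, start, count = np.unique(keys[order],
                               return_index=True,
                               return_counts=True)

# reduceat sums the sorted rows over the slices start[i]:start[i+1];
# the last slice runs to the end of the array.
sums = np.add.reduceat(vals[order], start, axis=0)
means = sums / count[:, None]

print(uniq)    # group labels
print(start)   # first row of each group in the sorted array
print(means)   # per-group column means
```

The sort is what makes reduceat applicable: it only sums contiguous slices, so without it rows from different groups would be added together.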