How to efficiently create a frequency table of the values in an array in Python

Question

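I am trying to implement an efficient way of creating a frequency table in Python from a fairly large numpy input array with ~30 million entries. Currently I am using a for-loop, but it takes far too long.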

The input is a sorted numpy array of the form

Y = np.array([4, 4, 4, 6, 6, 7, 8, 9, 9, 9..... etc])

and I would like an output of the following form:

Z = {4:3, 5:0, 6:2, 7:1,8:1,9:3..... etc} (as any data type)

Currently I am using the following implementation:

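# "+ 1" so that the largest value is included (arange and range exclude the stop value)
Z = pd.Series(0, index=np.arange(Y.min(), Y.max() + 1))

for i in range(Y.min(), Y.max() + 1):
    # one full pass over Y per candidate value, which is why this is slow
    Z[i] = (Y == i).sum()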

Is there a faster way of doing this, or a way that avoids iterating through a loop? Thanks for your help, and apologies if this has been asked before!

Answer 1

You can simply use Counter from the collections module for this. Please see the code below, which I ran on your test case.

import numpy as np
from collections import Counter
Y = np.array([4, 4, 4, 6, 6, 7, 8, 9, 9, 9,10,5,5,5])
print(Counter(Y))

It gives the following output:

Counter({4: 3, 9: 3, 5: 3, 6: 2, 7: 1, 8: 1, 10: 1})

You can easily work with this object from there. Hope this helps.
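Since the question asks for zero counts as well (for example 5:0), here is a minimal sketch of one way to expand the Counter into a dense table. The sample array and the variable names are only illustrative, and the array is assumed to be non-empty and to hold integers.

```python
from collections import Counter

import numpy as np

# Small illustrative input (an assumption; the real array has ~30 million entries).
Y = np.array([4, 4, 4, 6, 6, 7, 8, 9, 9, 9])

# tolist() converts to plain Python ints, so the keys are int rather than NumPy scalars.
counts = Counter(Y.tolist())

# A Counter returns 0 for missing keys, so values that never occur (such as 5)
# are filled in by walking the full range, including the maximum.
Z = {value: counts[value] for value in range(int(Y.min()), int(Y.max()) + 1)}
print(Z)  # {4: 3, 5: 0, 6: 2, 7: 1, 8: 1, 9: 3}
```

Counting still happens element by element in Python here, so this is mainly convenient for moderate sizes or when a dictionary is what you need downstream.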

Answer 2

I think numpy.unique is the solution you are looking for.

https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.unique.html

import numpy as np
t = np.random.randint(0, 1000, 100000000)
print(np.unique(t, return_counts=True))

This takes about 4 seconds for me. The collections.Counter approach takes about 10 seconds.

Note, however, that numpy.unique returns the frequencies as arrays, whereas collections.Counter returns them in a dictionary. Use whichever is more convenient.
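If a dictionary is wanted from the numpy.unique result, a small sketch follows. It also shows np.bincount, which is a different function from the one discussed above and is added here only as a possible way to get the zero counts the question asks for. It assumes a non-empty integer array; the sample data and variable names are illustrative.

```python
import numpy as np

# Small illustrative input (an assumption, not the asker's real data).
Y = np.array([4, 4, 4, 6, 6, 7, 8, 9, 9, 9])

# np.unique reports only the values that actually occur.
values, counts = np.unique(Y, return_counts=True)
sparse = dict(zip(values.tolist(), counts.tolist()))
print(sparse)  # {4: 3, 6: 2, 7: 1, 8: 1, 9: 3}

# np.bincount counts every non-negative integer from 0 up to the maximum,
# so shifting by the minimum yields a dense table that includes zeros.
offset = int(Y.min())
dense_counts = np.bincount(Y - offset)
dense = dict(zip(range(offset, offset + len(dense_counts)), dense_counts.tolist()))
print(dense)  # {4: 3, 5: 0, 6: 2, 7: 1, 8: 1, 9: 3}
```

The bincount result has one slot per integer between the minimum and the maximum, so it suits data whose value range is modest compared with the array length.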

Edit: I can't comment on other posts, so I'll write it here: the @lomereiters solution is lightning fast (linear) and should be accepted.

Answer 3

If your input array x is sorted, you can do the following to get the counts in linear time:

diff1 = np.diff(x)
# get indices of the elements at which jumps occurred
jumps = np.concatenate([[0], np.where(diff1 > 0)[0] + 1, [len(x)]])
unique_elements = x[jumps[:-1]]
counts = np.diff(jumps)
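For reference, a self-contained run of the snippet above on a small sorted array, followed by one possible way to expand the result into the dense table from the question. The sample data and the names table and Z are assumptions for illustration; x is assumed to be sorted, non-empty, and of integer type.

```python
import numpy as np

# Small illustrative input; it must already be sorted.
x = np.array([4, 4, 4, 6, 6, 7, 8, 9, 9, 9])

diff1 = np.diff(x)
# Indices where a new value starts, plus the two ends of the array.
jumps = np.concatenate([[0], np.where(diff1 > 0)[0] + 1, [len(x)]])
unique_elements = x[jumps[:-1]]
counts = np.diff(jumps)
print(unique_elements)  # [4 6 7 8 9]
print(counts)           # [3 2 1 1 3]

# Scatter the counts into a zero-filled table covering min..max inclusive,
# so that values which never occur (such as 5) get a count of 0.
table = np.zeros(int(x[-1] - x[0]) + 1, dtype=np.int64)
table[unique_elements - x[0]] = counts
Z = dict(zip(range(int(x[0]), int(x[-1]) + 1), table.tolist()))
print(Z)  # {4: 3, 5: 0, 6: 2, 7: 1, 8: 1, 9: 3}
```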

