Fast method to cycle through multiple lists of tuples to find max of each tuple list

Question

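I have tens of thousands of lists of tuples, where each tuple in a list is an (int, float) pair. I want to loop through all of the lists and, for each one, find the (int, float) pair whose float is the largest in that list. Consider a few such lists: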

[
[(0, 0.3792), (3, 0.5796)],
[(0, 0.9365), (1, 0.0512), (18, 0.0123)],
[(13, 0.8642)],
[(0, 0.6249), (1, 0.01), (2, 0.01), (3, 0.01), (4, 0.01), (5, 0.01)]
]

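For each list of tuples, I want the pair whose second number is the largest (for example, for the first list I want (3, 0.5796); for the fourth item, (0, 0.6249) should be returned). My current approach is to turn the tuples into a numpy array and then find the argmax and max: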

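import numpy as np
from typing import List, Tuple

def get_max(doc: List[Tuple[int, float]]) -> Tuple[int, float]:
    # Structured array: field 'f0' holds the ints, 'f1' holds the floats
    topic_prob_array = np.array(doc, dtype=np.dtype('int,float'))
    return topic_prob_array['f0'][np.argmax(topic_prob_array['f1'])], np.max(topic_prob_array['f1'])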

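I was hoping to turn this into a numpy vectorized function (via vec_func = np.vectorize(get_max, otypes=[int, float])) or a numpy ufunc (via vec_func = np.frompyfunc(get_max, nin=1, nout=1)). I am not sure whether I am specifying the input and output correctly. My reasoning is that I pass in a single list of tuples and return a single tuple, hence nin=1, nout=1. However, I have not been able to get a vectorized version to run.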

I also tried a version that does not convert to a numpy array first (it still calls np.argmax and np.max):

def get_max(doc: List[Tuple[int, float]]) -> Tuple[int, float]:

   ids, probabilities = zip(*doc)
   return ids[np.argmax(probabilities)], np.max(probabilities)

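Both take roughly the same time to run. For my roughly 80k lists, each implementation takes about 1 minute 10 seconds. I would really like to bring that down if possible.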

Answer 1

Do you need to use numpy for this? We can take a functional approach and map the max function, with a custom key, over the whole dataset.

from functools import partial
from operator import itemgetter

snd = itemgetter(1)
p = partial(max, key=snd)
list(map(p, data))
>>> [(3, 0.5796), (0, 0.9365), (13, 0.8642), (0, 0.6249)]

Then a quick timing on 80K lists drawn at random (with repetition) from the original dataset.

from random import choice

result = []
for _ in range(80_000):
    result.append(choice(data))

%timeit list(map(p, result))
42.2 ms ± 686 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
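
For a self-contained version of the approach above, here is a minimal sketch that uses the sample data from the question. The `default=None` argument is an addition that is not part of the answer: it assumes some lists may be empty, in which case plain `max` would raise `ValueError`. The name `safe_max` is hypothetical.

```python
from functools import partial
from operator import itemgetter

# Sample data copied from the question.
data = [
    [(0, 0.3792), (3, 0.5796)],
    [(0, 0.9365), (1, 0.0512), (18, 0.0123)],
    [(13, 0.8642)],
    [(0, 0.6249), (1, 0.01), (2, 0.01), (3, 0.01), (4, 0.01), (5, 0.01)],
]

snd = itemgetter(1)  # picks the float out of each (int, float) pair

# Assumption: an empty list should yield None instead of raising ValueError.
safe_max = partial(max, key=snd, default=None)

maxes = list(map(safe_max, data))
empty_case = safe_max([])
print(maxes)
print(empty_case)
```

If empty lists cannot occur in your data, dropping `default` keeps the behavior identical to the answer's `p`.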

Answer 2

As @gold_cy mentioned, I am not sure whether you are looking for a numpy answer. A non-numpy answer could be:

list_tuple = [
    [(0, 0.3792), (3, 0.5796)],
    [(0, 0.9365), (1, 0.0512), (18, 0.0123)],
    [(13, 0.8642)],
    [(0, 0.6249), (1, 0.01), (2, 0.01), (3, 0.01), (4, 0.01), (5, 0.01)]
]

[sorted(tup, key=lambda x: x[1], reverse=True).pop(0) for tup in list_tuple]

>>> [(3, 0.5796), (0, 0.9365), (13, 0.8642), (0, 0.6249)]
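
Sorting does more work than is needed to find a single maximum (roughly O(n log n) per list rather than O(n)). As a hedged sketch, the check below compares the sorting approach with `max` using the same key. Both keep the first occurrence when two floats tie, because Python's sort is stable even with `reverse=True`. The tied list is made up for illustration and is not from the question.

```python
list_tuple = [
    [(0, 0.3792), (3, 0.5796)],
    [(0, 0.9365), (1, 0.0512), (18, 0.0123)],
    [(13, 0.8642)],
    [(7, 0.5), (8, 0.5)],  # hypothetical list with a tie
]

# Sorting approach: full sort, then take the first element.
by_sort = [sorted(tup, key=lambda x: x[1], reverse=True)[0] for tup in list_tuple]

# Single pass per list with max and the same key.
by_max = [max(tup, key=lambda x: x[1]) for tup in list_tuple]

same = by_sort == by_max
print(same)
```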

Answer 3

An optimized non-numpy solution is:

from operator import itemgetter

get1 = itemgetter(1)

all_lists = [...]  # Whatever your actual list of list of tuples comes from

all_maxes = [max(lst, key=get1) for lst in all_lists]

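numpy is unlikely to gain you much, since the work being done is relatively cheap, and the benefit is smaller still if you convert to numpy arrays only for a single operation.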

Answer 4

In [462]: alist
Out[462]: 
[[(0, 0.3792), (3, 0.5796)],
 [(0, 0.9365), (1, 0.0512), (18, 0.0123)],
 [(13, 0.8642)],
 [(0, 0.6249), (1, 0.01), (2, 0.01), (3, 0.01), (4, 0.01), (5, 0.01)]]
In [463]: blist = alist*10000    # bigger test list

Trying various alternatives, I found this "brute force" function to be the fastest (though not by much):

def get_max3(doc):
    m = doc[0]
    for i in doc[1:]:
        if i[1]>m[1]: m=i
    return m

For small lists the list comprehension is slightly faster; for large lists the map version has an edge, but not by much.

In [465]: [get_max3(i) for i in alist]
Out[465]: [(3, 0.5796), (0, 0.9365), (13, 0.8642), (0, 0.6249)]

In [466]: timeit [get_max3(i) for i in alist]
1.9 µs ± 51.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [467]: timeit list(map(get_max3,blist))
15 ms ± 7.77 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
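
`get_max3` copies each list with `doc[1:]` before looping. A variant that walks an iterator avoids that copy. This is only a sketch: the name `get_max_iter` is hypothetical, and whether it is actually faster depends on list sizes and should be measured on your data.

```python
def get_max_iter(doc):
    """Return the pair with the largest second element, without slicing."""
    it = iter(doc)
    best = next(it)  # raises StopIteration on an empty list
    for pair in it:
        # Strict '>' keeps the first pair when floats tie, like get_max3.
        if pair[1] > best[1]:
            best = pair
    return best

# Sample data copied from the question.
alist = [
    [(0, 0.3792), (3, 0.5796)],
    [(0, 0.9365), (1, 0.0512), (18, 0.0123)],
    [(13, 0.8642)],
    [(0, 0.6249), (1, 0.01), (2, 0.01), (3, 0.01), (4, 0.01), (5, 0.01)],
]

result = [get_max_iter(d) for d in alist]
print(result)
```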

The versions that use numpy are all much slower; converting a list of tuples to a numpy array (even a structured array) takes time.
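
To see the conversion overhead for yourself, a rough benchmark sketch along these lines can be used. The helper names `max_python` and `max_numpy` are hypothetical, the explicit structured dtype is assumed to match the question's 'int,float' on a typical 64-bit platform, and the test list is smaller than the one above to keep the run short. Timings vary by machine, so none are quoted here.

```python
import timeit
from operator import itemgetter

import numpy as np

# Sample data copied from the question, repeated to form a larger test list.
alist = [
    [(0, 0.3792), (3, 0.5796)],
    [(0, 0.9365), (1, 0.0512), (18, 0.0123)],
    [(13, 0.8642)],
    [(0, 0.6249), (1, 0.01), (2, 0.01), (3, 0.01), (4, 0.01), (5, 0.01)],
]
blist = alist * 1000

get1 = itemgetter(1)
pair_dtype = np.dtype([('f0', np.int64), ('f1', np.float64)])

def max_python(doc):
    # Pure-Python single pass over the list.
    return max(doc, key=get1)

def max_numpy(doc):
    # Pays for a list-to-structured-array conversion on every call.
    arr = np.array(doc, dtype=pair_dtype)
    i = np.argmax(arr['f1'])
    return int(arr['f0'][i]), float(arr['f1'][i])

py_results = [max_python(d) for d in blist]
np_results = [max_numpy(d) for d in blist]

t_py = timeit.timeit(lambda: [max_python(d) for d in blist], number=5)
t_np = timeit.timeit(lambda: [max_numpy(d) for d in blist], number=5)
print(f"pure Python: {t_py:.4f} s, numpy: {t_np:.4f} s")
```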
