如何根据加权概率从 python 字典中选择键?

How to choose keys from a python dictionary based on weighted probability?

我有一个 Python 字典,其中键代表一些项目,值代表所述项目的一些(规范化)权重。例如:

d = {'a': 0.0625, 'c': 0.625, 'b': 0.3125}
# Note that sum([v for k,v in d.iteritems()]) == 1 for all `d`

鉴于项目与权重的这种相关性,我如何从 d 中选择一个键,以便 6.25% 的时间结果是 'a',32.25% 的时间结果是 'b',62.5%的结果是'c'?

def weighted_random_by_dct(dct):
    rand_val = random.random()
    total = 0
    for k, v in dct.items():
        total += v
        if rand_val <= total:
            return k
    assert False, 'unreachable'

应该可以解决问题。遍历每个键并保持 运行 总和,如果随机值(介于 0 和 1 之间)落在插槽中,它 returns 那个键

我的理解是:你需要一个简单的随机函数来生成一个介于 0 和 1 之间的随机数。如果值介于 0 to 0.0625 之间,你将 select 键a,如果是在0.0625 and (0.0625 + 0.625)之间,那么你就select键c等,这就是这个里面实际提到的。

由于随机数将统一生成,因此预期与较大权重相关联的键与其他键相比selected更多。

如果你会用numpy,你可以使用numpy.random.choice函数,像这样:

import numpy as np

d = {'a': 0.0625, 'c': 0.625, 'b': 0.3125}

def pick_by_weight(d):
    d_choices = []
    d_probs = []
    for k,v in d.iteritems():
      d_choices.append(k)
      d_probs.append(v)
    return np.random.choice(d_choices, 1, p=d_probs)[0]


d = {'a': 0.0625, 'c': 0.625, 'b': 0.3125}
choice = pick_by_weight(d)

如果您打算经常这样做,您可以使用 numpy 到 select 列表中的密钥,使用 np.random.choice() 加权概率。下面的示例将使用加权概率选择您的密钥 10,000 次。

import numpy as np

probs = [0.0625, 0.625, 0.3125]
keys = ['a', 'c', 'b']

choice_list = np.random.choice(keys, 10000, replace=True, p=probs)

保留一个 "inverted" 字典可能会有用,其中键是权重值,值是您可以获得的键的列表。这样就可以更容易地分发它,以防更多的密钥具有相同的权重:

from collections import defaultdict
import random

dict = {'a': 0.0625, 'd': 0.0625, 'c': 0.625, 'b': 0.3125}

inverted_dict = defaultdict(list)

for k, v in dict.items():
    inverted_dict[v].append(k)

# Here first you get a random value between 0 and 1, which is your weigth
# Then, you choose a random value from the list of keys that have the same weight
print(random.choice(inverted_dict[random.choice(inverted_dict.keys())]))

不确定您的用例是什么,但您可以查看 NLTK 包中的频率 distribution/probability 分布 类,它处理所有细节。

FreqDist is an extension of a counter, which can be passed to a ProbDistI界面。 ProbDistI 接口公开了一个 "generate()" 方法,可用于对分布进行采样,以及一个 "prob(sample)" 方法,可用于获取给定键的概率。

对于您的情况,您希望使用最大似然估计,因此 MLEProbDist。如果你想平滑分布,你可以试试 LaplaceProbDist 或 SimpleGoodTuringProbDist。

例如:

from nltk.probability import FreqDist, MLEProbDist

d = {'a': 6.25, 'c': 62.5, 'b': 31.25}
freq_dist = FreqDist(d)
prob_dist = MLEProbDist(freq_dist)

print prob_dist.prob('a')
print prob_dist.prob('b')
print prob_dist.prob('c')
print prob_dist.prob('d')

将打印“0.0625 0.3125 0.625 0.0”。

要生成新样本,您可以使用:

prob_dist.generate()

从 Python 3.6 开始,您可以使用内置的 random.choices() 而不是必须使用 Numpy。

那么,如果我们想从您的字典中采样(替换)25 个键,其中值是被采样的 weights/probabilities,我们可以简单地写:

import random
random.choices(list(my_dict.keys()), weights=my_dict.values(), k=25)

这会输出采样键列表:

['c', 'b', 'c', 'b', 'b', 'c', 'c', 'c', 'b', 'c', 'b', 'c', 'b', 'c', 'c', 'c', 'c', 'c', 'a', 'b']

如果您只想要一个键,请将 k 设置为 1 并从列表中提取单个元素 random.choices returns:

random.choices(list(my_dict.keys()), weights=my_dict.values(), k=1)[0]

(如果您不将 my_dict.keys() 转换为列表,您将收到关于它如何不可订阅的 TypeError。)

这是来自 docs 的相关片段:

random.choices(population, weights=None, *, cum_weights=None, k=1)

Return a k sized list of elements chosen from the population with replacement. If the population is empty, raises IndexError.

If a weights sequence is specified, selections are made according to the relative weights. Alternatively, if a cum_weights sequence is given, the selections are made according to the cumulative weights (perhaps computed using itertools.accumulate()). For example, the relative weights [10, 5, 30, 5] are equivalent to the cumulative weights [10, 15, 45, 50]. Internally, the relative weights are converted to cumulative weights before making selections, so supplying the cumulative weights saves work.

If neither weights nor cum_weights are specified, selections are made with equal probability. If a weights sequence is supplied, it must be the same length as the population sequence. It is a TypeError to specify both weights and cum_weights.

The weights or cum_weights can use any numeric type that interoperates with the float values returned by random() (that includes integers, floats, and fractions but excludes decimals). Weights are assumed to be non-negative.

For a given seed, the choices() function with equal weighting typically produces a different sequence than repeated calls to choice(). The algorithm used by choices() uses floating point arithmetic for internal consistency and speed. The algorithm used by choice() defaults to integer arithmetic with repeated selections to avoid small biases from round-off error.

根据 的评论,random.choices 对于小数组更快,而 numpy.random.choice 对于大数组更快。 numpy.random.choice 还提供了一个无需替换的采样选项,而没有内置的 Python 标准库函数。