How to make N random choices for each value in a 3D numpy array without using loops

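I have:

What I am trying to do is the following:

So far I have managed to do this with loops, as shown below. That is fine for a small number of simulations, but in practice num_sims will be in the thousands, which makes my code far too slow.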

def allocate_N(N, var_index):
  """Make N choices from cats for a given variable, and return
  the indices of each category.
  var_index is the position of the variable in n_choices"""
  allocation = np.random.choice(cats, size=N, p=probs[:, var_index])
  allocation_sorted = np.argsort(cats)
  ypos = np.searchsorted(cats[allocation_sorted], allocation)
  cat_indices = allocation_sorted[ypos]
  return cat_indices

def add_to_sim(sims, cat_indices, var_index):
  """Takes the category indices from allocate_N and adds 1 to
  sims at the corresponding location for each occurrence of
  the category in cat_indices"""
  from collections import Counter
  a = Counter(list(cat_indices))
  vals = [1*a[j] for j in cat_indices]
  pos = [(var_index, x) for x in cat_indices]
  sims[tuple(np.transpose(pos))] = vals

# For each variable and each row in sims, make N allocations
# and add results to 'sims'
for var_index in range(n_choices.shape[1]):
  # sim_slice is (vars x cats), a single row of 'sims'
  for sim_count, sim_slice in enumerate(sims):
    N = n_choices[sim_count, var_index]
    if N > 0:
      cat_indices = allocate_N(N, var_index)
      add_to_sim(sim_slice, cat_indices, var_index)

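I'm sure there must be a way to vectorize this? I was able to use the method to make a single random choice for every variable at once, but I'm not sure how to apply it to my specific problem.

Thanks for your help!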

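What you describe looks like sampling from a multinomial distribution, so you can draw the samples from that distribution directly. Unfortunately, the distribution parameters (number of trials and probabilities) change for every simulation and variable, and neither np.random.multinomial nor scipy.stats.multinomial allows vectorized sampling with multiple sets of parameters. That means that, if you want to do it this way, you still have to use loops. At the very least, your code can be simplified to the following: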

import numpy as np

np.random.seed(0)
# Problem size
n_cats = 10
n_vars = 50
n_sims = 100
n_maxchoices = 50
# Make example problem
probs = np.random.rand(n_cats, n_vars)
probs /= probs.sum(0)
n_choices = np.random.randint(n_maxchoices, size=(n_sims, n_vars))
sims = np.zeros((n_sims, n_vars, n_cats), np.int32)
# Sample multinomial distribution for each simulation and variable
for i_sim in range(n_sims):
    for i_var in range(n_vars):
        sims[i_sim, i_var] = np.random.multinomial(n_choices[i_sim, i_var],
                                                   probs[:, i_var])
# Check number of choices per simulation and variable is correct
print(np.all(sims.sum(2) == n_choices))
# True
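
As a side note for readers on a recent NumPy: to my knowledge, the newer Generator API (np.random.default_rng), unlike the legacy np.random.multinomial discussed above, can broadcast an array of trial counts against an array of probability vectors. I believe this was added in NumPy 1.22, so treat the following as a sketch under that assumption rather than something that works on every installation. The problem sizes are the same example values as above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Same example problem sizes as above
n_cats, n_vars, n_sims, n_maxchoices = 10, 50, 100, 50
probs = rng.random((n_cats, n_vars))
probs /= probs.sum(0)
n_choices = rng.integers(n_maxchoices, size=(n_sims, n_vars))

# Generator.multinomial expects the categories on the last axis: (n_vars, n_cats)
pvals = np.ascontiguousarray(probs.T)
# n_choices has shape (n_sims, n_vars) and is broadcast against
# pvals.shape[:-1] == (n_vars,), assuming NumPy >= 1.22
sims = rng.multinomial(n_choices, pvals)

# Check number of choices per simulation and variable is correct
print(np.all(sims.sum(2) == n_choices))
```

If the broadcasting works as assumed, sims has shape (n_sims, n_vars, n_cats) and no Python-level loop is needed.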

Note that, if you are willing to use Numba, you can still make it faster with a function like this:

import numpy as np
import numba as nb

@nb.njit(parallel=True)
def make_simulations(probs, n_choices, sims):
    # Take the sizes from the input instead of relying on global variables
    n_sims, n_vars = n_choices.shape
    for i_sim in nb.prange(n_sims):
        for i_var in nb.prange(n_vars):
            sims[i_sim, i_var] = np.random.multinomial(n_choices[i_sim, i_var],
                                                       probs[:, i_var])

EDIT: A possible alternative that uses only one loop and no multinomial sampling could look like this:

import numpy as np

np.random.seed(0)
# Problem size
n_cats = 10
n_vars = 50
n_sims = 100
n_maxchoices = 50
# Make example problem
probs = np.random.rand(n_cats, n_vars)
probs /= probs.sum(0)
n_choices = np.random.randint(n_maxchoices, size=(n_sims, n_vars))
sims = np.zeros((n_sims, n_vars, n_cats), np.int32)
# Fill simulations array
n_choices_var = n_choices.sum(0)
sims_r = np.arange(n_sims)
# For each variable
for i_var in range(n_vars):
    # Take choices for all simulations
    choices_var = np.random.choice(n_cats, n_choices_var[i_var], p=probs[:, i_var])
    # Increment choices counts in simulations array
    i_sim = np.repeat(sims_r, n_choices[:, i_var])
    np.add.at(sims, (i_sim, i_var, choices_var), 1)
# Check result
print(np.all(sims.sum(2) == n_choices))
# True
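
The np.add.at call matters here because the same (simulation, variable, category) position usually appears several times in the index arrays. A tiny illustration of the difference from plain fancy-indexed addition, using made-up index values:

```python
import numpy as np

# Index 2 appears three times
idx = np.array([0, 2, 2, 2])

# Fancy-indexed += is buffered: repeated indices are only incremented once
buffered = np.zeros(4, dtype=int)
buffered[idx] += 1

# np.add.at is unbuffered: every occurrence is accumulated
unbuffered = np.zeros(4, dtype=int)
np.add.at(unbuffered, idx, 1)

print(buffered)    # [1 0 1 0]
print(unbuffered)  # [1 0 3 0]
```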

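I'm not sure whether this single-loop version would actually be faster, since it creates many intermediate arrays. I suppose it depends on the specific parameters of the problem, but I would be surprised if the Numba solution were not the fastest.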
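
For completeness, the remaining loop over variables could in principle be removed as well by sampling every individual choice at once with inverse-CDF sampling and counting the results with np.bincount. This is only a sketch of that idea (the names cell, var, cdf and cat are mine, not from the question), and it trades the loop for a temporary array with one row per individual choice, so it may well use more memory and be no faster than the versions above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Same example problem sizes as above
n_cats, n_vars, n_sims, n_maxchoices = 10, 50, 100, 50
probs = rng.random((n_cats, n_vars))
probs /= probs.sum(0)
n_choices = rng.integers(n_maxchoices, size=(n_sims, n_vars))

# One entry per individual choice, labelled with its flat (sim, var) cell
cell = np.repeat(np.arange(n_sims * n_vars), n_choices.ravel())
var = cell % n_vars

# Cumulative probabilities per variable, shape (n_vars, n_cats)
cdf = np.cumsum(probs, axis=0).T
cdf[:, -1] = 1.0  # guard against rounding in the last threshold

# Inverse-CDF sampling: the category is the number of thresholds
# that the uniform draw has reached or passed
u = rng.random(cell.size)
cat = (u[:, None] >= cdf[var]).sum(1)

# Count occurrences of every (sim, var, cat) combination
counts = np.bincount(cell * n_cats + cat, minlength=n_sims * n_vars * n_cats)
sims = counts.reshape(n_sims, n_vars, n_cats)

# Check number of choices per simulation and variable is correct
print(np.all(sims.sum(2) == n_choices))
# True
```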