How can I vectorize this specific non-numpy function?

The docstrings make this post look longer than it really is. Also, my question concerns the function at the top of the function-call chain.

What works:

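I would like to make a contour plot of chi-square values for a specified distribution. I understand the basics of making a contour plot, but I have been unable to apply those techniques beyond basic examples. The problem probably lies in vectorizing my function. As an example, consider a sample Gaussian dataset of 1000 points with a mean and spread of 48 and 7, respectively.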

import random
from math import pi
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize
from scipy.stats import chisquare

dataset_gauss = [random.gauss(48, 7) for index in range(1000)]

My functions and variable names are set up this way because my full code handles multiple distributions (Gaussian, lognormal).

def equation_gauss(x, a, b):
    """
    This function returns the equation for the Gaussian distribution.
    """
    cnorm = 1 / (b* (2*pi)**(1/2))
    return cnorm * np.exp((-1) * (x - a)**2 / (2* b**2))

Using maximum log-likelihood, my script (irrelevant to the question, so the code is not shown) returns params_gauss = [47.972906400237889, 7.0241339595841286].

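To compute chi-square, one must first make a list of bin boundaries. The expectation value for each bin is then the integral of the distribution equation from the left edge of the bin to its right edge, scaled by the total number of observations. The observed value for each bin is the number of observations that fall inside that bin. Chi-square is the sum over all bins of the squared difference between the observed and expected values, divided by the expected value.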
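As a quick illustration of the sum described above, here is a minimal sketch in plain Python. The function name chi_square_statistic and the toy counts are made up for this example; they are not part of my script.

```python
def chi_square_statistic(observed, expected):
    """Sum over bins of (observed - expected)**2 / expected."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected need one entry per bin")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts for four bins (both lists total 100).
observed_counts = [18, 22, 30, 30]
expected_counts = [25.0, 25.0, 25.0, 25.0]

chi2 = chi_square_statistic(observed_counts, expected_counts)
print(chi2)  # (49 + 9 + 25 + 25) / 25, i.e. about 4.32
```

This is the same statistic that scipy.stats.chisquare reports as its first return value.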

def get_bins(distribution, num_bins=50):
    """
    This function returns num_bins equally spaced bin edges (that is,
    num_bins - 1 equally sized bins) over the domain of the distribution.
    """
    if distribution == 'gauss':
        dataset = dataset_gauss
    return np.linspace(min(dataset), max(dataset), num_bins)

def get_binned_expectations(distribution, args):
    """
    This function returns the expectation values per bin for a dataset
    given by the specified distribution.
    """
    if distribution == 'gauss':
        dataset = dataset_gauss
        func = equation_gauss
    num_obs = len(dataset)
    bins = get_bins(distribution)
    res = []
    for idx in range(len(bins)):
        if idx != len(bins)-1:
            res.append(quad(func, bins[idx] , bins[idx+1], args = (args[0] , args[1]))[0] * num_obs)
    return res

def get_binned_observations(distribution):
    """
    This function returns the observation values per bin for a dataset
    given by the specified distribution.
    """
    if distribution == 'gauss':
        dataset = dataset_gauss
    bins = get_bins(distribution)
    bin_count = []
    for idx in range(len(bins)):
        if idx != len(bins)-1:
            summ = 0
            for datum in dataset:
                if datum > bins[idx] and datum <= bins[idx+1]:
                    summ += 1
            bin_count.append(summ)
        if idx == len(bins)-1:
            pass
    return bin_count
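For comparison, the per-bin counting can probably also be done with np.histogram instead of the nested loops. This is only a sketch on made-up data (toy_data is hypothetical), and the edge convention differs from the loop above: np.histogram uses half-open bins [left, right) with the last bin closed on both sides, whereas the loop uses (left, right], which leaves out the smallest datum because it sits exactly on the first edge.

```python
import numpy as np

# Hypothetical data, chosen so that several values land exactly on bin edges.
toy_data = [1.0, 2.0, 2.5, 3.0, 4.0, 5.0]

# Same edge construction as get_bins: 5 edges give 4 bins.
edges = np.linspace(min(toy_data), max(toy_data), 5)

# np.histogram returns (counts, edges); only the counts are needed here.
counts, _ = np.histogram(toy_data, bins=edges)
print(counts)
```

With this convention every datum is counted exactly once, so the counts add up to the size of the dataset.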

def get_chi_square(distribution, params):
    """
    This function returns the chi square value for a specified
    distribution.

    EX:
        distribution    :   'gauss', 'lognormal'

        params          :   [a, b] for parameters a and b
                            'opt' (for optimized parameters)
    """
    values_observation = get_binned_observations(distribution)
    if params == 'opt':
        if distribution == 'gauss':
            params = params_gauss
    values_expectation = get_binned_expectations(distribution, params)
    return chisquare(values_observation, values_expectation)

As a check, let's try:

res = get_chi_square('gauss', params='opt')
print(res)
new_params = [40, 10]
new_res = get_chi_square('gauss', params=new_params)
print(new_res)

>> Power_divergenceResult(statistic=55.465132812431413, pvalue=0.21391356257718666)
>> Power_divergenceResult(statistic=14950.604250041084, pvalue=0.0)

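The first value, statistic, is the chi-square value obtained for the given parameters, while the second value, pvalue, is the probability of the fit for those parameters. For my purposes, it is best to take only the first element, as in print(new_res[0]). (Because the degrees of freedom were not specified, the probability is not very accurate.)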
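A small sketch of how the two fields can be read and how the degrees of freedom can be adjusted through the ddof argument of scipy.stats.chisquare. The counts are hypothetical, and ddof=2 is an assumption standing in for the two fitted parameters (a and b). As far as I know, newer SciPy releases raise an error when the observed and expected totals differ, so the toy counts below have the same total.

```python
from scipy.stats import chisquare

# Hypothetical counts for four bins; both lists total 100.
observed_counts = [18, 22, 30, 30]
expected_counts = [25.0, 25.0, 25.0, 25.0]

# Default: degrees of freedom = number of bins - 1.
res_default = chisquare(observed_counts, expected_counts)

# ddof=2 subtracts two more degrees of freedom (assumed: one per fitted parameter).
res_ddof = chisquare(observed_counts, expected_counts, ddof=2)

print(res_default.statistic, res_default.pvalue)
print(res_ddof.statistic, res_ddof.pvalue)
```

The statistic is the same in both calls; only the p-value changes with the degrees of freedom.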

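To make a contour plot, my understanding is that I need to generate a grid space from 2-dimensional arrays. First, I wrote a function that returns a list of numbers for each parameter. This is the function that returns x, y such that X, Y is their meshgrid.
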
def get_axis_data(param, frac, size):
    """
    This function returns a specified number of elements in a range
    centered around the value of the inputted parameter. The extrema
    of this range are specified as:
                    param ± param * frac
    """
    update = frac * param
    return np.linspace(param - update, param + update, size)
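To make the layout of the meshgrid concrete, here is a small sketch with round numbers. get_axis_data is copied from above; the parameter values 48.0 and 8.0 and the grid sizes are made up for illustration.

```python
import numpy as np

def get_axis_data(param, frac, size):
    """Evenly spaced values in [param - param*frac, param + param*frac]."""
    update = frac * param
    return np.linspace(param - update, param + update, size)

a_vals = get_axis_data(48.0, 0.25, 9)   # 36.0 ... 60.0 in steps of 3
b_vals = get_axis_data(8.0, 0.25, 5)    # 6.0 ... 10.0 in steps of 1

X, Y = np.meshgrid(a_vals, b_vals)

# With the default indexing, rows follow b_vals and columns follow a_vals:
# X[i, j] == a_vals[j] and Y[i, j] == b_vals[i].
print(X.shape, Y.shape)
print(X[2, 7], Y[2, 7])
```

So any Z meant for plt.contourf(X, Y, Z) needs the shape (len(b_vals), len(a_vals)), with Z[i, j] evaluated at (a_vals[j], b_vals[i]).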

My question:

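I know that I can use plt.contourf(X, Y, Z, cmap). However, I do not know how to format get_chi_square so that it takes meshgrid-ed parameters as input, because it calls the scipy module to compute chi-square from (effectively) a list of optimizable parameters. I have commented out the things I tried that failed.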

def get_grid_data(distribution, frac=1/4, size=9, func=get_chi_square, cmap='plasma'):
    """
    This function returns the grid values for a contour plot of the
    error metric as a function of the parameters of a specified
    distribution.

    EX:
        func:   'chi square', 'maximum log-likelihood' (error metric)
    """
    if distribution == 'gauss':
        opt_params = params_gauss
    a_vals = get_axis_data(opt_params[0], frac, size)
    b_vals = get_axis_data(opt_params[1], frac, size)
    X, Y = np.meshgrid(a_vals, b_vals)
    # func = np.vectorize(func)
    # Z = func(distribution, [X, Y])[0]
    return X, Y#, Z

X, Y = get_grid_data('gauss')
print("X")
print(X)
print("")
print("Y")
print(Y)

Running the above gives:

 X
[[ 35.9796798   38.97798645  41.9762931   44.97459975  47.9729064
   50.97121305  53.9695197   56.96782635  59.966133  ]
 [ 35.9796798   38.97798645  41.9762931   44.97459975  47.9729064
   50.97121305  53.9695197   56.96782635  59.966133  ]
 [ 35.9796798   38.97798645  41.9762931   44.97459975  47.9729064
   50.97121305  53.9695197   56.96782635  59.966133  ]
 [ 35.9796798   38.97798645  41.9762931   44.97459975  47.9729064
   50.97121305  53.9695197   56.96782635  59.966133  ]
 [ 35.9796798   38.97798645  41.9762931   44.97459975  47.9729064
   50.97121305  53.9695197   56.96782635  59.966133  ]
 [ 35.9796798   38.97798645  41.9762931   44.97459975  47.9729064
   50.97121305  53.9695197   56.96782635  59.966133  ]
 [ 35.9796798   38.97798645  41.9762931   44.97459975  47.9729064
   50.97121305  53.9695197   56.96782635  59.966133  ]
 [ 35.9796798   38.97798645  41.9762931   44.97459975  47.9729064
   50.97121305  53.9695197   56.96782635  59.966133  ]
 [ 35.9796798   38.97798645  41.9762931   44.97459975  47.9729064
   50.97121305  53.9695197   56.96782635  59.966133  ]]

Y
[[ 5.26810047  5.26810047  5.26810047  5.26810047  5.26810047  5.26810047
   5.26810047  5.26810047  5.26810047]
 [ 5.70710884  5.70710884  5.70710884  5.70710884  5.70710884  5.70710884
   5.70710884  5.70710884  5.70710884]
 [ 6.14611721  6.14611721  6.14611721  6.14611721  6.14611721  6.14611721
   6.14611721  6.14611721  6.14611721]
 [ 6.58512559  6.58512559  6.58512559  6.58512559  6.58512559  6.58512559
   6.58512559  6.58512559  6.58512559]
 [ 7.02413396  7.02413396  7.02413396  7.02413396  7.02413396  7.02413396
   7.02413396  7.02413396  7.02413396]
 [ 7.46314233  7.46314233  7.46314233  7.46314233  7.46314233  7.46314233
   7.46314233  7.46314233  7.46314233]
 [ 7.9021507   7.9021507   7.9021507   7.9021507   7.9021507   7.9021507
   7.9021507   7.9021507   7.9021507 ]
 [ 8.34115908  8.34115908  8.34115908  8.34115908  8.34115908  8.34115908
   8.34115908  8.34115908  8.34115908]
 [ 8.78016745  8.78016745  8.78016745  8.78016745  8.78016745  8.78016745
   8.78016745  8.78016745  8.78016745]]

I would like to print Z in the same format as X and Y in the code above. How can I obtain the chi-square function values this way?
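One possible direction, sketched under assumptions: np.vectorize broadcasts over every argument it receives, so passing the string and the list [X, Y] directly hands the function individual array elements rather than an (a, b) pair. Wrapping the call in a helper that takes the two scalars separately avoids this. Here toy_metric is a hypothetical stand-in for get_chi_square (same call signature, made-up formula), and grid_metric is a name invented for this sketch.

```python
import numpy as np

def toy_metric(distribution, params):
    """Hypothetical stand-in for get_chi_square: returns (value, None)."""
    a, b = params
    return ((a - 48.0) ** 2 + (b - 7.0) ** 2, None)

def grid_metric(distribution, X, Y, func=toy_metric):
    """Evaluate func at every (X[i, j], Y[i, j]) and keep the first element."""
    def scalar_func(a, b):
        return func(distribution, [a, b])[0]
    # otypes is given so np.vectorize does not need a trial call to infer it.
    return np.vectorize(scalar_func, otypes=[float])(X, Y)

a_vals = np.linspace(36.0, 60.0, 9)
b_vals = np.linspace(5.0, 9.0, 9)
X, Y = np.meshgrid(a_vals, b_vals)

Z = grid_metric('gauss', X, Y)
print(Z.shape)
```

Z then has the same shape as X and Y. Note that np.vectorize is a convenience wrapper around a Python loop, so this does not make the evaluation faster; it only handles the bookkeeping.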

Edit:

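If I rename the function get_grid_data to get_grid_params and redefine get_grid_data as shown below, I can generate 81 chi-square values. I think this is a step forward, but I am unsure about the ordering of the array elements in res (that is, Z above) that a contour plot requires.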

def get_grid_params(distribution, frac, size):
    """
    This function returns the grid values for a contour plot of the
    error metric as a function of the parameters of a specified
    distribution.

    EX:
        func:   'chi square', 'maximum log-likelihood' (error metric)
    """
    if distribution == 'gauss':
        opt_params = params_gauss
    a_vals = get_axis_data(opt_params[0], frac, size)
    b_vals = get_axis_data(opt_params[1], frac, size)
    X, Y = np.meshgrid(a_vals, b_vals)
    # func = np.vectorize(func)
    # Z = func(distribution, [X, Y])
    return X, Y

def get_grid_data(distribution, frac=1/4, size=9, func=get_chi_square):
    """
    This function prints the error-metric value at each grid point,
    looping over the grid row by row.
    """
    X, Y = get_grid_params(distribution, frac, size)
    res = []
    for idx in range(len(X)):
        for jdx in range(len(Y)):
            res.append(func(distribution, [X[idx][jdx], Y[idx][jdx]])[0])
    print(res)
get_grid_data('gauss')

This prints:

# 81 elements ==> 9x9 grid
[4208765217.1232886, 79756867.433148235, 2102012.2187297232, 77845.812346977109, 4299.2223157168837, 2529.7286507333743, 20486.858965000847, 257923.37090704756, 4854102.2912357552, 93281349.868633255, 3214630.1060019895, 149308.23999474355, 9526.0996064385563, 892.28204593366377, 1078.7222202890009, 6755.3095776326609, 53291.09528539874, 588864.18413363863, 4691132.998034155, 266721.46912966535, 20459.717521392733, 2093.3255539124393, 279.78284725132187, 577.3737260040574, 3111.9705345888774, 17462.38755758019, 125880.4188491786, 450519.22715869371, 40667.241172187212, 5020.7992346344054, 744.8798302729781, 116.9962855442742, 364.63898596547921, 1791.3456214870084, 7916.7426067634342, 40972.313769493878, 76104.092836489493, 10798.249475713539, 2013.1185415524558, 381.52353083113587, 66.126519584745949, 264.93942984225561, 1200.5798834763946, 4482.867919608283, 18107.837200860213, 21572.225934943446, 4551.094178016996, 1136.7099239043926, 253.51850353558262, 54.455759914884304, 218.13425049819415, 897.03841272531849, 2952.9334085022683, 9936.4277408736034, 9337.1516297669732, 2622.2698023608255, 789.26686546629082, 202.78664001629076, 60.365012999827258, 199.40257099587109, 726.84333101567586, 2159.6632005396755, 6339.5377293121628, 5372.7483380962221, 1815.8139713332946, 620.16531689499118, 184.61780691354744, 75.563465535153725, 196.96163816097214, 626.64757117448494, 1701.8233311097256, 4494.3117008380068, 3664.4699687203392, 1400.0096023072927, 527.65588603959168, 182.94718825996048, 96.20249715692033, 204.59025315045054, 566.75361531867895, 1416.8609878368447, 3434.8994517014899]
# reshape as 9x9 shows the order of params is wrong.
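On the ordering question, a small check on a toy grid may help. This sketch assumes a made-up function toy_value in place of the chi-square call and a deliberately non-square grid, so that rows and columns cannot be confused. It fills a flat list with a double loop over rows and then columns, and reshapes it to the shape of X.

```python
import numpy as np

def toy_value(a, b):
    """Hypothetical metric whose value reveals which (a, b) produced it."""
    return a + 100.0 * b

a_vals = np.array([1.0, 2.0, 3.0])
b_vals = np.array([10.0, 20.0])
X, Y = np.meshgrid(a_vals, b_vals)   # both have shape (2, 3)

res = []
for idx in range(X.shape[0]):        # rows: one per value in b_vals
    for jdx in range(X.shape[1]):    # columns: one per value in a_vals
        res.append(toy_value(X[idx, jdx], Y[idx, jdx]))

# Row-major reshape puts res back on the same grid as X and Y.
Z = np.array(res).reshape(X.shape)
print(Z)
```

If the loop runs over X.shape[0] and then X.shape[1] like this, Z[idx, jdx] lines up with X[idx, jdx] and Y[idx, jdx], which is the layout plt.contourf(X, Y, Z) expects.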

Combining the code above, up to and including the part that defines get_chi_square, with the code in this answer produces the desired output.