如何在 python 中实现概率分布的合并?

How to implement Conflation for probability distribution in python?

我在网上查找了将多个连续概率分布合并为一个连续概率分布的方法。这种方法叫做Conflation,方法可以在下面的文章中找到:An Optimal Method for Consolidating Data from Different Experiments。在本文中,我发现执行 Conflation 而不是平均来合并分布会更好。

根据我从这篇文章中的理解,方程式的执行方式是将几个概率分布中的每个概率密度值乘以几个概率分布中每个概率密度值的乘积的积分,对于连续分布和离散分布它是通过将几个概率分布的每个概率密度值乘以几个概率分布的每个概率密度值的总和来完成的。 (详见文章第5页)

假设我有大约 4 个列表,例如 4 个范数分布,例如

list_1 = [5, 8, 6, 2, 1]
list_2 = [2, 6, 1, 3, 8]
list_3 = [1, 9, 2, 7, 5]
list_4 = [3, 2, 4, 1, 6]

并实施 Conflation 结果列表变为,

Con_list = [2.73, 34.56, 3.69, 3.23, 12]

(如有错误请指正)

如何将照片中的两个等式实现到 python 以获得输入的 PDF 分布的合并?

我之前发现关于平均列表的stackflow问题,代码如下,

def average(l):
    llen = len(l)
    def divide(x):
        return x / llen
    # return map(divide, map(sum, zip(*l)))
    return map(divide, map(sum, zip(l)))

我一直在尝试重新编码此函数以遵循上面的等式,但我找不到一种方法来获取连续分布的合并 pdf。

编辑 1:

根据 @Josh Purtell 的回答,我重写了代码,但是,我不断收到以下错误消息:

错误信息:

Traceback (most recent call last):
  File "/tmp/sessions/c903d99d60f20c3b/main.py", line 72, in <module>
    graph=conflate_pdf(domain, dists,lb,ub)
  File "/tmp/sessions/c903d99d60f20c3b/main.py", line 58, in conflate_pdf
    denom = quad(prod_pdf, lb, ub, args=(dists))[0]
  File "/usr/local/lib/python3.6/dist-packages/scipy/integrate/quadpack.py", line 341, in quad
    points)
  File "/usr/local/lib/python3.6/dist-packages/scipy/integrate/quadpack.py", line 448, in _quad
    return _quadpack._qagse(func,a,b,args,full_output,epsabs,epsrel,limit)
TypeError: only size-1 arrays can be converted to Python scalars

代码:

def prod_pdf(x,pdfs):
    prod=np.ones(pdfs[0].shape[0])
    for pdf in pdfs:
        prod=prod*pdf
    return prod

def conflate_pdf(x,dists,lb,ub):
    denom = quad(prod_pdf, lb, ub, args=(dists))[0]
    return prod_pdf(x,dists)/denom

lb=-10
ub=10
domain=np.arange(lb,ub,.01)

dist_1 = stats.norm.pdf(domain, 2,1)
dist_2 = stats.norm.pdf(domain, 2.5,1.5)
dist_3 = stats.norm.pdf(domain, 2.2,1.6)
dist_4 = stats.norm.pdf(domain, 2.4,1.3)
dist_5 = stats.norm.pdf(domain, 2.7,1.5)

dists=[dist_1, dist_2, dist_3, dist_4, dist_5]
graph=conflate_pdf(domain, dists,lb,ub)

from matplotlib import pyplot as plt
plt.plot(domain, dist_1)
plt.plot(domain, dist_2)
plt.plot(domain, dist_3)
plt.plot(domain, dist_4)
plt.plot(domain, dist_5)
plt.plot(domain,graph)
plt.xlabel("domain")
plt.ylabel("pdf")
plt.title("Conflated PDF")
plt.show()

从代码看,是什么原因导致这个错误?

编辑 2:

我设法重写了代码以查看分发列表,而不是在 编辑 1 的产品函数中获取 pdf,但我仍然遇到同样的错误在 编辑 1.

代码:

def prod_pdf(x,pdfs):
    prod=np.ones(np.array(pdfs)[0].shape)
    for pdf in pdfs:
        print(prod)
        for c,y in enumerate(pdf):
            prod[c]=prod[c]*y
        print('final:', prod)
    return prod

def conflate_pdf(x,dists,lb,ub):
    denom = quad(prod_pdf, lb, ub, args=(dists))[0]
    print('Denom: ',denom)
    print('product pdf: ', prod_pdf(x,dists))
    conflated_pdf=prod_pdf(x,dists)/denom
    print(conflated_pdf)
    return conflated_pdf

lb=-10
ub=10
domain=np.arange(lb,ub,.01)

dist_1 = st.norm.pdf(domain, 2,1)
dist_2 = st.norm.pdf(domain, 2.5,1.5)
dist_3 = st.norm.pdf(domain, 2.2,1.6)
dist_4 = st.norm.pdf(domain, 2.4,1.3)
dist_5 = st.norm.pdf(domain, 2.7,1.5)

from matplotlib import pyplot as plt
plt.plot(domain, dist_1, 'r')
plt.plot(domain, dist_2, 'g')
plt.plot(domain, dist_3, 'b')
plt.plot(domain, dist_4, 'y')
plt.plot(domain, dist_5, 'c')

dists=[dist_1, dist_2, dist_3, dist_4, dist_5]
graph=conflate_pdf(domain, dists,lb,ub)


plt.plot(domain,graph, 'm')
plt.xlabel("domain")
plt.ylabel("pdf")
plt.title("Conflated PDF")
plt.show()

编辑 3:

我尝试 运行 以下代码(基于 @Josh Purtell 的回答),但是,我一直在获取一个变量,它获取整个数组在 product 函数之后,它会产生关于 size-1 数组的相同错误消息。请参阅以下带有部分输出的代码:

代码:

from scipy.integrate import quad
from scipy import stats
import numpy as np

def prod_pdf(x,dists):
    p_pdf=1
    print('Incoming Array:', p_pdf)
    for dist in dists:
        p_pdf=p_pdf*dist
        print('final:', p_pdf)
    return p_pd

def conflate_pdf(x,dists,lb,ub):
    print('Input product pdf: ', prod_pdf(x,dists))
    denom = quad(prod_pdf, lb, ub, args=(dists,))[0]
    # denom = simps(prod_pdf)
    # denom = nquad(func=(prod_pdf), ranges=([lb, ub]), args=(dists,))[0]
    print('Denom: ', denom)
    conflated_pdf=prod_pdf(x,dists)/denom
    print('Conflated PDF: ', conflated_pdf)
    return conflated_pdf

lb=-10
ub=10
domain=np.arange(lb,ub,.01)

dist_1 = st.norm.pdf(domain, 2,1)
dist_2 = st.norm.pdf(domain, 2.5,1.5)
dist_3 = st.norm.pdf(domain, 2.2,1.6)
dist_4 = st.norm.pdf(domain, 2.4,1.3)
dist_5 = st.norm.pdf(domain, 2.7,1.5)

from matplotlib import pyplot as plt
plt.xlabel("domain")
plt.ylabel("pdf")
plt.title("Conflated PDF")
plt.legend()
plt.plot(domain, dist_1, 'r', label='Dist. 1')
plt.plot(domain, dist_2, 'g', label='Dist. 2')
plt.plot(domain, dist_3, 'b', label='Dist. 3')
plt.plot(domain, dist_4, 'y', label='Dist. 4')
plt.plot(domain, dist_5, 'c', label='Dist. 5')

dists=[dist_1, dist_2, dist_3, dist_4, dist_5]
print('distribution list: \n', dists)
graph=conflate_pdf(domain, dists,lb,ub)

plt.plot(domain,graph, 'm', label='Conflated Dist.')
plt.show()

这是输出的一小部分:

Incoming Array: 1
final: [2.14638374e-32 2.41991991e-32 2.72804284e-32 ... 6.41980576e-15
 5.92770938e-15 5.47278628e-15]
final: [4.75178372e-48 5.66328097e-48 6.74864868e-48 ... 7.03075979e-21
 6.27970218e-21 5.60806584e-21]
final: [2.80912097e-61 3.51131870e-61 4.38823989e-61 ... 1.32670185e-26
 1.14952951e-26 9.95834610e-27]
final: [1.51005552e-81 2.03116529e-81 2.73144352e-81 ... 1.76466623e-34
 1.46198598e-34 1.21092834e-34]
final: [1.09076800e-97 1.55234627e-97 2.20861552e-97 ... 3.72095218e-40
 2.98464396e-40 2.39335035e-40]
Input product pdf:  [1.09076800e-97 1.55234627e-97 2.20861552e-97 ... 3.72095218e-40
 2.98464396e-40 2.39335035e-40]
Incoming Array: 1
final: [2.14638374e-32 2.41991991e-32 2.72804284e-32 ... 6.41980576e-15
 5.92770938e-15 5.47278628e-15]
final: [4.75178372e-48 5.66328097e-48 6.74864868e-48 ... 7.03075979e-21
 6.27970218e-21 5.60806584e-21]
final: [2.80912097e-61 3.51131870e-61 4.38823989e-61 ... 1.32670185e-26
 1.14952951e-26 9.95834610e-27]
final: [1.51005552e-81 2.03116529e-81 2.73144352e-81 ... 1.76466623e-34
 1.46198598e-34 1.21092834e-34]
final: [1.09076800e-97 1.55234627e-97 2.20861552e-97 ... 3.72095218e-40
 2.98464396e-40 2.39335035e-40]

我设法在 编辑 3 中查看代码以实现相同的方法,我编辑了代码,它从每个分布中获取第一个变量,但是,对于其余部分它不断打印相同值的循环,它不会转到列表中的下一个值,并且合并分布是单个变量。请参阅以下带有部分输出的代码:

代码:

from scipy.integrate import quad
from scipy import stats
import numpy as np

def prod_pdf(x,dists):
    p_pdf=1
    print('Incoming Array:', p_pdf)
    for c,dist in enumerate(dists):
        p_pdf=p_pdf*dist[c]
        print('final:', p_pdf)
    return p_pdf

def conflate_pdf(x,dists,lb,ub):
    print('Input product pdf: ', prod_pdf(x,dists))
    denom = quad(prod_pdf, lb, ub, args=(dists,))[0]
    # denom = simps(prod_pdf)
    # denom = nquad(func=(prod_pdf), ranges=([lb, ub]), args=(dists,))[0]
    print('Denom: ', denom)
    conflated_pdf=prod_pdf(x,dists)/denom
    print('Conflated PDF: ', conflated_pdf)
    return conflated_pdf

lb=-10
ub=10
domain=np.arange(lb,ub,.01)

dist_1 = st.norm.pdf(domain, 2,1)
dist_2 = st.norm.pdf(domain, 2.5,1.5)
dist_3 = st.norm.pdf(domain, 2.2,1.6)
dist_4 = st.norm.pdf(domain, 2.4,1.3)
dist_5 = st.norm.pdf(domain, 2.7,1.5)

from matplotlib import pyplot as plt
plt.xlabel("domain")
plt.ylabel("pdf")
plt.title("Conflated PDF")
plt.legend()
plt.plot(domain, dist_1, 'r', label='Dist. 1')
plt.plot(domain, dist_2, 'g', label='Dist. 2')
plt.plot(domain, dist_3, 'b', label='Dist. 3')
plt.plot(domain, dist_4, 'y', label='Dist. 4')
plt.plot(domain, dist_5, 'c', label='Dist. 5')

dists=[dist_1, dist_2, dist_3, dist_4, dist_5]
print('distribution list: \n', dists)
graph=conflate_pdf(domain, dists,lb,ub)

plt.plot(domain,graph, 'm', label='Conflated Dist.')
plt.show()

部分输出:

Incoming Array: 1
final: 2.1463837356630605e-32
final: 5.0231307782193034e-48
final: 3.266239495519432e-61
final: 2.187514996217005e-81
final: 1.979657878680375e-97
Incoming Array: 1
final: 2.1463837356630605e-32
final: 5.0231307782193034e-48
final: 3.266239495519432e-61
final: 2.187514996217005e-81
final: 1.979657878680375e-97
Denom:  3.95931575736075e-96
Incoming Array: 1
final: 2.1463837356630605e-32
final: 5.0231307782193034e-48
final: 3.266239495519432e-61
final: 2.187514996217005e-81
final: 1.979657878680375e-97
Conflated PDF:  0.049999999999999996

编辑 4:

我实现了以下代码,它似乎工作正常,而且,如果我将 quad 更改为 fixed_quad 并规范化,我设法解决了 quad 的问题pdf 列表。我会得到相同的结果。这是以下代码:

import scipy.stats as st
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, Normalizer, normalize, StandardScaler
from scipy.integrate import quad, simps, quad_vec, nquad, cumulative_trapezoid
from scipy.integrate import romberg, trapezoid, simpson, romb
from scipy.integrate import fixed_quad, quadrature, quad_explain
from scipy import stats
import time

def user_prod_pdf(x,dists):
p_list=[]
p_pdf=1
print('Incoming Array:', p_pdf)
for dist in dists:
print('Incoming Distribution Array:', dist.pdf(x))
p_pdf=p_pdf*dist.pdf(x)
print('Product PDF:', p_pdf)
p_list.append(p_pdf)
print('final Product PDF:', p_pdf)
print('Product PDF list: ', p_list)
return p_pdf

def user_conflate_pdf(x,dists,lb,ub):
print('Input product pdf: ', user_prod_pdf(x,dists))
denom = quad(user_prod_pdf, lb, ub, args=(dists,))[0]
print('Denom: ', denom)
conflated_pdf=user_prod_pdf(x,dists)/denom
print('Conflated PDF: ', conflated_pdf)
return conflated_pdf

def user_conflate_pdf_2(pdfs):
"""
Compute conflation of given pdfs.

[ARGS]
- pdfs: PDFs numpy array of shape (n, x)
where n is the number of PDFs
and x is the variable space.

[RETURN]
A 1d-array of normalized conflated PDF.
"""
# conflate
conflation = np.array(pdfs).prod(axis=0)
# normalize
conflation /= conflation.sum()
return conflation

def my_product_pdf(x,dists):
p_list=[]
p_pdf=1
print('Incoming Array:', p_pdf)
list_full_size=np.array(dists).shape
print('Full list size: ', list_full_size)
print('list size: ', list_full_size[0])
for x in range(list_full_size[1]):
p_pdf=1
for y in range(list_full_size[0]):
p_pdf=float(p_pdf)*dists[y][x]
print('Product value: ', p_pdf)
print('Product PDF:', p_pdf)
p_list.append(p_pdf)
print('final Product PDF:', p_pdf)
print('Product PDF list: ', p_list)
# return p_pdf
return p_list
# return np.array(p_list)

def my_conflate_pdf(x,dists,lb,ub):
print('\n')
# print('product pdf: ', prod_pdf(x,dists))
print('product pdf: ', my_product_pdf(x,dists))
denom = fixed_quad(my_product_pdf, lb, ub, args=(dists,), n=1)[0]
print('Denom: ', denom)
# conflated_pdf=prod_pdf(x,dists)/denom
conflated_pdf=my_product_pdf(x,dists)/denom
# conflated_pdf=[i / j for i,j in zip(my_product_pdf(x,dists), denom)]
print('Conflated PDF: ', conflated_pdf)
return conflated_pdf

lb=-10
ub=10
domain=np.arange(lb,ub,.01)

# dist_1 = st.norm(2,1)
# dist_2 = st.norm(2.5,1.5)
# dist_3 = st.norm(2.2,1.6)
# dist_4 = st.norm(2.4,1.3)
# dist_5 = st.norm(2.7,1.5)

# dist_1_pdf = st.norm.pdf(domain, 2,1)
# dist_2_pdf = st.norm.pdf(domain, 2.5,1.5)
# dist_3_pdf = st.norm.pdf(domain, 2.2,1.6)
# dist_4_pdf = st.norm.pdf(domain, 2.4,1.3)
# dist_5_pdf = st.norm.pdf(domain, 2.7,1.5)

# dist_1_pdf /= dist_1_pdf.sum()
# dist_2_pdf /= dist_2_pdf.sum()
# dist_3_pdf /= dist_3_pdf.sum()
# dist_4_pdf /= dist_4_pdf.sum()
# dist_5_pdf /= dist_5_pdf.sum()

dist_1 = st.norm(2,1)
dist_2 = st.norm(4,2)
dist_3 = st.norm(7,4)
dist_4 = st.norm(2.4,1.3)
dist_5 = st.norm(2.7,1.5)

dist_1_pdf = st.norm.pdf(domain, 2,1)
dist_2_pdf = st.norm.pdf(domain, 4,2)
dist_3_pdf = st.norm.pdf(domain, 7,4)
dist_4_pdf = st.norm.pdf(domain, 2.4,1.3)
dist_5_pdf = st.norm.pdf(domain, 2.7,1.5)

# dist_1_pdf /= dist_1_pdf.sum()
# dist_2_pdf /= dist_2_pdf.sum()
# dist_3_pdf /= dist_3_pdf.sum()
# dist_4_pdf /= dist_4_pdf.sum()
# dist_5_pdf /= dist_5_pdf.sum()

# User:
plt.xlabel("domain")
plt.ylabel("pdf")
plt.title("User Conflated PDF")
plt.plot(domain, dist_1_pdf, 'r', label='Dist. 1')
plt.plot(domain, dist_2_pdf, 'g', label='Dist. 2')
plt.plot(domain, dist_3_pdf, 'b', label='Dist. 3')
plt.plot(domain, dist_4_pdf, 'y', label='Dist. 4')
plt.plot(domain, dist_5_pdf, 'c', label='Dist. 5')

dists=[dist_1, dist_2, dist_3, dist_4, dist_5]
user_graph=user_conflate_pdf(domain,dists,lb,ub)
print('Final Conflated PDF: ', user_graph)

# user_graph /= user_graph.sum()

plt.plot(domain, user_graph, 'm', label='Conflated PDF')
plt.legend()
plt.show()

# User 2:
plt.xlabel("domain")
plt.ylabel("pdf")
plt.title("User Conflated PDF 2")
plt.plot(domain, dist_1_pdf, 'r', label='Dist. 1')
plt.plot(domain, dist_2_pdf, 'g', label='Dist. 2')
plt.plot(domain, dist_3_pdf, 'b', label='Dist. 3')
plt.plot(domain, dist_4_pdf, 'y', label='Dist. 4')
plt.plot(domain, dist_5_pdf, 'c', label='Dist. 5')

dists=[dist_1_pdf, dist_2_pdf, dist_3_pdf, dist_4_pdf, dist_5_pdf]
user_graph=user_conflate_pdf_2(dists)
print('Final User Conflated PDF 2 : ', user_graph)

# user_graph /= user_graph.sum()

plt.plot(domain, user_graph, 'm', label='Conflated PDF')
plt.legend()
plt.show()

# My Code:
# from matplotlib import pyplot as plt
plt.xlabel("domain")
plt.ylabel("pdf")
plt.title("My Conflated PDF Code")
plt.plot(domain, dist_1_pdf, 'r', label='Dist. 1')
plt.plot(domain, dist_2_pdf, 'g', label='Dist. 2')
plt.plot(domain, dist_3_pdf, 'b', label='Dist. 3')
plt.plot(domain, dist_4_pdf, 'y', label='Dist. 4')
plt.plot(domain, dist_5_pdf, 'c', label='Dist. 5')

dists=[dist_1_pdf, dist_2_pdf, dist_3_pdf, dist_4_pdf, dist_5_pdf]
my_graph=my_conflate_pdf(domain,dists,lb,ub)
print('Final Conflated PDF: ', my_graph)

my_graph /= np.array(my_graph).sum()

# my_graph = inverse_normalise(my_graph)

plt.plot(domain, my_graph, 'm', label='Conflated PDF')
plt.legend()
plt.show()

# Conflated PDF:
print('User Conflated PDF: ', user_graph)
print('My Conflated PDF: ', np.array(my_graph))

这是输出:

我的问题在这里,我知道我需要规范化 PDF 列表。但是,假设我没有对 PDF 进行规范化,我该如何修改我的合并代码以获得以下图表?

要获得上面的图和我的合并代码:

# user_graph /= user_graph.sum()
# dist_1_pdf /= dist_1_pdf.sum()
# dist_2_pdf /= dist_2_pdf.sum()
# dist_3_pdf /= dist_3_pdf.sum()
# dist_4_pdf /= dist_4_pdf.sum()
# dist_5_pdf /= dist_5_pdf.sum()

我没有规范化的合并代码图:

免责声明:我很可能误解了您或论文作者,在这种情况下,请建议对此答案进行修改。

这是一个简单的 not-especially-performant 实现,我认为合并可能看起来像

##define pdfs for discrete RV X = {1,2,3,4}
import numpy as np

def mult_list(pdfs):
    prod=np.ones(pdfs[0].shape[0])
    for pdf in pdfs:
        prod=prod*pdf
    return prod

def conflate(pdfs):
    return mult_list(pdfs)/sum(mult_list(pdfs))

pdf_1=np.array([.25,.25,.25,.25])
pdf_2=np.array([.33,.33,.33,.00])
pdf_3=np.array([.25,.12,.13,.50])

print(conflate([pdf_1,pdf_2,pdf_3]))

生成合并后的 pdf

>>> [0.5  0.24 0.26 0.  ]

它通过了粗略的嗅探测试。

在事情的连续性方面,以上内容转化为

from scipy.integrate import quad
from scipy import stats
import numpy as np

def prod_pdf(x,dists):
    p_pdf=1
    for dist in dists:
        p_pdf=p_pdf*dist.pdf(x)
    return p_pdf

def conflate_pdf(x,dists,lb,ub):
    denom = quad(prod_pdf, lb, ub, args=(dists))[0]
    return prod_pdf(x,dists)/denom

dists=[stats.norm(2,1),stats.norm(4,2)]
lb=-10
ub=10
domain=np.arange(lb,ub,.01)
graph=conflate_pdf(domain,dists,lb,ub)

from matplotlib import pyplot as plt
plt.plot(domain,graph)
plt.xlabel("domain")
plt.ylabel("pdf")
plt.title("Conflated PDF")
plt.show()
plt.savefig("conflatedpdf.png")

这给出

如您所见,分布并非双峰分布,正如人们所希望的那样。