Sample covariance matrix far from truth even for large sample size with 2D Gaussian
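Here is a very simple script that generates a 2D Gaussian with 10000 points.
The covariance matrix estimated by np.cov seems far from the one used to generate the data. Is there an explanation, and a fix?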
import numpy as np
import matplotlib.pyplot as plt
center=[0,0]
npoints=10000
data_covmat = np.array([[1,1],[1,0.5]])
lines=np.random.multivariate_normal(center,data_covmat,npoints)
print(f'2D gaussian centered at {center}, {npoints} points\nCovariance matrix =')
print(data_covmat)
plt.scatter(lines[:,0],lines[:,1],alpha=.1)
plt.axis('scaled')
plt.show()
print(f'Sample covariance matrix =\n{np.cov(lines,rowvar=False)}')
Covariance matrix =
[[1. 1.]
 [1. 0.5]]
Sample covariance matrix =
[[1.23880367 0.74585136]
 [0.74585136 0.85974812]]
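The array [[1, 1], [1, 0.5]] is not positive semidefinite: one of its eigenvalues is negative. The description of the cov parameter in the docstring of multivariate_normal says "Covariance matrix of the distribution. It must be symmetric and positive-semidefinite for proper sampling."
Try [[1, 0.6], [0.6, 0.5]] instead, which is symmetric and positive definite. It works as expected: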
In [37]: npoints = 10000
In [38]: center = [0, 0]
In [39]: data_covmat = np.array([[1, 0.6], [0.6, 0.5]])
In [40]: np.linalg.eigvals(data_covmat)
Out[40]: array([1.4, 0.1])
In [41]: lines = np.random.multivariate_normal(center, data_covmat, npoints)
In [42]: np.cov(lines, rowvar=False)
Out[42]:
array([[0.99782727, 0.60349542],
       [0.60349542, 0.50179535]])
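To catch this kind of problem before sampling, the matrix can be checked up front. The following is a minimal sketch, not part of the original answer: the helper names is_positive_semidefinite and sample_checked, the tolerance, and the seed are illustrative assumptions. It relies on np.linalg.eigvalsh and on the check_valid argument of multivariate_normal, which by default only issues a warning.

```python
import numpy as np


def is_positive_semidefinite(cov, tol=1e-10):
    """Return True if `cov` is symmetric with no eigenvalue below -tol.

    Hypothetical helper for illustration; `tol` is an assumed tolerance.
    """
    cov = np.asarray(cov, dtype=float)
    if not np.allclose(cov, cov.T):
        return False
    # eigvalsh is meant for symmetric matrices and returns real eigenvalues.
    return bool(np.linalg.eigvalsh(cov).min() >= -tol)


def sample_checked(mean, cov, size, seed=0):
    """Draw samples, raising ValueError instead of warning on an invalid cov."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size, check_valid="raise")


bad = np.array([[1, 1], [1, 0.5]])       # matrix from the question
good = np.array([[1, 0.6], [0.6, 0.5]])  # matrix from the answer

print(is_positive_semidefinite(bad))   # one eigenvalue is negative
print(is_positive_semidefinite(good))  # eigenvalues are 1.4 and 0.1

try:
    sample_checked([0, 0], bad, 10000)
except ValueError as exc:
    print(f"rejected: {exc}")

samples = sample_checked([0, 0], good, 10000)
print(np.cov(samples, rowvar=False))
```

A quick sanity check that needs no eigenvalues: a valid 2x2 covariance matrix requires |cov[0, 1]| <= sqrt(cov[0, 0] * cov[1, 1]). For the matrix in the question, sqrt(1 * 0.5) is about 0.71, so an off-diagonal entry of 1 would imply a correlation greater than 1.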