Fabricate a dataset to test PCA in Sklearn?

Question

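I want to test my PCA workflow. To do that, I would like to create a dataset with 3 features that have some relationship between them, then apply PCA and check whether those relationships are captured. What is the most straightforward way to do this in Python?

Thanks!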

Answer 1

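You can create samples in which two features are independent of each other and the third feature is a linear combination of the other two.

For example: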

import numpy as np
from numpy.random import random

N_SAMPLES = 1000

samples = random((N_SAMPLES, 3))

# Let us suppose that the column `1` will have the dependent feature, the other two being independent

samples[:, 1] = 3 * samples[:, 0] - 2 * samples[:, 2]
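
Before running PCA, it can help to confirm that the fabricated data really has the intended structure. The sketch below is a sanity check rather than part of the original answer: it rebuilds the same dataset with a seeded generator (the seed value and the names `rng`, `centered`, `singular_values`, and `rank` are assumptions made for illustration) and checks that the centered data matrix has rank 2, so that only two directions carry variance.

```python
import numpy as np

# Seeded generator so the check is reproducible (the seed is arbitrary).
rng = np.random.default_rng(0)

N_SAMPLES = 1000
samples = rng.random((N_SAMPLES, 3))

# Column 1 is an exact linear combination of columns 0 and 2.
samples[:, 1] = 3 * samples[:, 0] - 2 * samples[:, 2]

# PCA works on centered data, so center before inspecting the spectrum.
centered = samples - samples.mean(axis=0)

# Singular values in descending order; the last should be numerically zero.
singular_values = np.linalg.svd(centered, compute_uv=False)
rank = np.linalg.matrix_rank(centered)

print("singular values:", singular_values)
print("numerical rank:", rank)
```

If the rank came out as 3, the dependent column would not be an exact linear combination of the others, and two components could not explain all of the variance.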

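Now, if you run PCA on this sample asking for two principal components, the explained variance ratios should sum to 1 (up to floating-point rounding), because all of the variance lies in a two-dimensional plane.

For example: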

from sklearn.decomposition import PCA

pca2 = PCA(2)
pca2.fit(samples)

# Compare with a tolerance: an exact == 1.0 check can fail due to floating-point rounding
assert np.isclose(sum(pca2.explained_variance_ratio_), 1.0)
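
To check that PCA captured the specific relationship, and not just that two components suffice, one option is to fit all three components and inspect the last one. This sketch assumes the same construction as above, with a seeded generator added for reproducibility. The names `pca3`, `normal`, `alignment`, and `smallest_ratio` are illustrative. Since the data satisfies 3*x0 - x1 - 2*x2 = 0, the direction with (near) zero variance should be parallel to the vector (3, -1, -2), up to sign.

```python
import numpy as np
from sklearn.decomposition import PCA

# Rebuild the dataset (seed chosen arbitrarily for reproducibility).
rng = np.random.default_rng(0)
samples = rng.random((1000, 3))
samples[:, 1] = 3 * samples[:, 0] - 2 * samples[:, 2]

# Fit all three components so the zero-variance direction is visible.
pca3 = PCA(n_components=3).fit(samples)

# Unit normal of the plane 3*x0 - x1 - 2*x2 = 0 that contains the data.
normal = np.array([3.0, -1.0, -2.0])
normal /= np.linalg.norm(normal)

# The sign of a principal axis is arbitrary, so compare absolute values.
alignment = abs(pca3.components_[2] @ normal)
smallest_ratio = pca3.explained_variance_ratio_[2]

print("last component:", pca3.components_[2])
print("alignment with plane normal:", alignment)
print("variance ratio of last component:", smallest_ratio)
```

An alignment close to 1 together with a variance ratio close to 0 for the last component indicates that PCA recovered the linear dependency built into the data.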

python

numpy

pca

dataframe