How can loading factors from PCA be used to calculate an index that can be applied to each individual in a data frame in R?

Question

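I am using principal component analysis (PCA) based on ~30 variables to compose an index that classifies individuals into 3 different categories (top, middle, bottom) in R.

I have a data frame of ~2000 individuals with 28 binary and 2 continuous variables.

Now, I would like to use the loading factors from PC1 to construct an index that classifies my 2000 individuals into 3 different groups based on these 30 variables.

Problem: Despite extensive research, I could not find out how to extract the loading factors from PCA_loadings and use them to give each individual a score (based on the loadings of the 30 variables), which would subsequently allow me to rank each individual (for further classification). Does it make sense to display the loading factors in a graph?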

I have performed the following steps:

a) Ran a PCA using PCA_outcome <- prcomp(na.omit(df1), scale = T)

b) Extracted the loadings using PCA_loadings <- PCA_outcome$rotation

c) Removed all variables with a loading factor close to 0.

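I have been thinking of creating 30 new variables, one for each loading factor, which I would sum up for each binary variable == 1 (though I am not sure how to proceed with the continuous variables). That way, I would assign each individual a score. However, I do not know how to assemble the 30 values from the loading factors into a score for each individual.

R code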

df1 <- read.table(text=" 
          educ     call      house  merge_id    school  members       
A           1        0          1      12_3        0      0.9
B           0        0          0      13_3        1      0.8
C           1        1          1      14_3        0      1.1
D           0        0          0      15_3        1      0.8 
E           1        1          1      16_3        3      3.2", header=T)


## Run PCA
PCA_outcome <- prcomp(na.omit(df1), scale = T)

## Extract loadings
PCA_loadings <- PCA_outcome$rotation


## Explanation: A-E are 5 of the 2000 individuals and the variables (education, call, house, school, members) represent my 30 variables (binary and continuous).

Expected results: - Get a ranking score for each individual - Subsequently, assign a category 1-3 to each individual.

Answer 1

I'm not 100% sure what you're asking, but here's an answer to the question I think you're asking.

First of all, PC1 of a PCA won't necessarily provide you with an index of socio-economic status. As explained here, PC1 simply "accounts for as much of the variability in the data as possible". PC1 may well work as a good metric for socio-economic status for your data set, but you'll have to critically examine the loadings and see if this makes sense. Depending on the signs of the loadings, it could be that a very negative PC1 corresponds to a very positive socio-economic status. As I say: look at the results with a critical eye. An explanation of how PC scores are calculated can be found here. Anyway, that's a discussion that belongs on Cross Validated, so let's get to the code.
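To make the score calculation concrete, here is a minimal sketch using the example data from the question, with `merge_id` left out because it is not numeric. It assumes the PCA is run with centering and scaling switched on; the names `X`, `Z`, and `manual_scores` are purely illustrative and not part of the original code.

```r
# Example data from the question, without the non-numeric merge_id column
X <- data.frame(
  educ    = c(1, 0, 1, 0, 1),
  call    = c(0, 0, 1, 0, 1),
  house   = c(1, 0, 1, 0, 1),
  school  = c(0, 1, 0, 1, 3),
  members = c(0.9, 0.8, 1.1, 0.8, 3.2),
  row.names = c("A", "B", "C", "D", "E")
)

pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Standardise each variable with the same centre and scale prcomp() used
Z <- scale(X, center = pca$center, scale = pca$scale)

# A PC score is the loading-weighted sum of the standardised variables
manual_scores <- Z %*% pca$rotation

# Check that the hand-computed PC1 scores agree with those stored in pca$x
all.equal(unname(manual_scores[, 1]), unname(pca$x[, 1]))

# Inspect the PC1 loadings to decide what a high score actually means
pca$rotation[, "PC1"]
```

In other words, there should be no need to build the weighted sum by hand: `prcomp()` already returns it in `$x`, and binary and continuous variables are treated the same way once they are standardised. If the loadings on the variables you consider "good" turn out negative, multiplying the score by -1 gives an index where higher means better; that is an interpretive choice, not something PCA decides for you.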

It sounds like you want to perform the PCA, pull out PC1, and associate it with your original data frame (and merge_ids). If that's your goal, here's a solution.

# Create data frame
df <- read.table(text = "educ     call      house  merge_id    school  members       
A           1        0          1      12_3        0      0.9
B           0        0          0      13_3        1      0.8
C           1        1          1      14_3        0      1.1
D           0        0          0      15_3        1      0.8 
E           1        1          1      16_3        3      3.2", header = TRUE)

# Perform PCA
PCA <- prcomp(df[, names(df) != "merge_id"], scale = TRUE, center = TRUE)

# Add PC1
df$PC1 <- PCA$x[, 1]

# Look at new data frame
print(df)
#>   educ call house merge_id school members        PC1
#> A    1    0     1     12_3      0     0.9  0.1000145
#> B    0    0     0     13_3      1     0.8  1.6610864
#> C    1    1     1     14_3      0     1.1 -0.8882381
#> D    0    0     0     15_3      1     0.8  1.6610864
#> E    1    1     1     16_3      3     3.2 -2.5339491

Created on 2019-05-30 by the reprex package (v0.2.1.9000)
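To get from the PC1 score to the ranking and the three categories asked for in the question, one possible sketch follows. It assumes that three groups of roughly equal size are what is wanted; the data frame `scores`, the columns `rank` and `group`, and the labels are illustrative names. The PC1 values are copied from the printed output above so the snippet runs on its own.

```r
# PC1 scores copied from the printed output above
scores <- data.frame(
  id  = c("A", "B", "C", "D", "E"),
  PC1 = c(0.1000145, 1.6610864, -0.8882381, 1.6610864, -2.5339491)
)

# Rank individuals by PC1 (1 = lowest score).
# ties.method = "first" keeps ranks unique, so tied individuals that
# straddle a group boundary are split by their order in the data.
scores$rank <- rank(scores$PC1, ties.method = "first")

# Cut the range of ranks into three equal-width bins, which gives
# three groups of roughly equal size
scores$group <- cut(
  scores$rank,
  breaks = 3,
  labels = c("bottom", "middle", "top")
)

# Show individuals ordered from lowest to highest PC1
scores[order(scores$rank), ]
```

Here "top" only means "highest PC1". In this toy example the individuals with `educ` and `house` equal to 1 have the lower PC1 values, so whether "top" matches the intended meaning depends on the signs of the loadings and may call for reversing the score first.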

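As you say you have to use PCA, I'm assuming this is for a homework question, so I'd recommend reading up on PCA so that you get a feel for what it does and what it's useful for.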
