Clustering using daisy and pam in R

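I am trying to run a very simple cluster analysis, but I can't get the right results. My question for a large dataset is "Which diseases are frequently reported together?". The simplified data sample below should result in 2 clusters: 1) headache / dizziness 2) nausea / abd pain. However, I can't get the code right. I am using the pam and daisy functions. For this example I assign 2 clusters manually (k=2) because I know the desired result, but in reality I explore several values of k.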

Does anyone know what I'm doing wrong?

library(cluster)
library(dplyr)

dat <- data.frame(ID = c("id1","id1","id2","id2","id3","id3","id4","id4","id5","id5"),
                  PTName = c("headache","dizziness","nausea","abd pain","dizziness","headache","abd pain","nausea","headache","dizziness"))


gower_dist <- daisy(dat, metric = "gower")
k <- 2
pam_fit <- pam(gower_dist, diss = TRUE, k)  # performs cluster analysis
pam_results <- dat %>%
  mutate(cluster = pam_fit$clustering) %>%
  group_by(cluster) %>%
  do(the_summary = summary(.))
head(pam_results$the_summary)

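The format in which you pass the dataset to the clustering algorithm does not match your objective. You want to group diseases that are reported together, but ID is also included in the dissimilarity matrix, so the IDs take part in the distance computation. You don't want that, because your objective concerns only the diseases.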
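To make the effect of ID concrete, here is a minimal sketch that computes the Gower distance on the long-format data and compares two pairs of rows. It assumes both columns are converted to factors (via `stringsAsFactors = TRUE`, which is my addition) so that `daisy()` treats them as nominal variables; the variable names are only illustrative.

```r
library(cluster)

# Long-format data from the question. Strings are turned into factors
# explicitly (an assumption of this sketch) so daisy() sees nominal variables.
dat <- data.frame(
  ID = c("id1","id1","id2","id2","id3","id3","id4","id4","id5","id5"),
  PTName = c("headache","dizziness","nausea","abd pain","dizziness",
             "headache","abd pain","nausea","headache","dizziness"),
  stringsAsFactors = TRUE
)

# Each row is one (patient, disease) report, not one patient.
d_long <- as.matrix(daisy(dat, metric = "gower"))

same_disease_other_patient <- d_long[1, 6]  # id1/headache vs id3/headache
same_patient_other_disease <- d_long[1, 2]  # id1/headache vs id1/dizziness
print(c(same_disease_other_patient, same_patient_other_disease))
```

Both pairs differ in exactly one of the two variables, so they get the same distance: the ID column weighs as much as the disease itself, and the rows being clustered are single reports rather than patients.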

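So we need to build a dataset in which each row is a patient together with all the diseases he/she reported, and then build the dissimilarity matrix on the numeric features only. For this task I add a column presence with value 1 where a patient reported a disease; the function pivot_wider, through its values_fill argument, then fills in 0 automatically for the diseases a patient did not report.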

Here is the code I used. I think it gives what you wanted; let me know if it does.

library(cluster)
library(dplyr)
library(tidyr)

dat <- data.frame(ID = c("id1","id1","id2","id2","id3","id3","id4","id4","id5","id5"),
                  PTName = c("headache","dizziness","nausea","abd pain","dizziness","headache","abd pain","nausea","headache","dizziness"),
                  presence = 1)
# build the wider dataset: each row is a patient
dat_wider <- pivot_wider(
    dat,
    id_cols = ID,
    names_from = PTName,
    values_from = presence,
    values_fill = list(presence = 0)
)

# in the dissimilarity matrix construction, we leave out the column ID
gower_dist <- daisy(dat_wider %>% select(-ID), metric = "gower")
k <- 2

set.seed(123)
pam_fit <- pam(gower_dist, diss = TRUE, k) 
pam_results <- dat_wider %>%
    mutate(cluster = pam_fit$clustering) %>%
    group_by(cluster) %>%
    do(the_summary = summary(.))
head(pam_results$the_summary)
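To read the result in terms of the original question, it can help to look at which patients land in each cluster and which diseases dominate there. The following is a rough sketch, not the only way to do it: it rebuilds the patient-by-disease table with base R `table()` as a stand-in for `pivot_wider`, and the names `inc`, `cl`, `profile` and `disease_groups`, as well as the 0.5 threshold, are hypothetical choices of mine.

```r
library(cluster)

dat <- data.frame(
  ID = c("id1","id1","id2","id2","id3","id3","id4","id4","id5","id5"),
  PTName = c("headache","dizziness","nausea","abd pain","dizziness",
             "headache","abd pain","nausea","headache","dizziness")
)

# Patient-by-disease 0/1 table: one row per patient, one column per disease.
# (Base R stand-in for the pivot_wider step; rows are id1..id5.)
inc <- as.data.frame.matrix(table(dat$ID, dat$PTName))

# daisy() may warn that binary variables are treated as interval scaled.
pam_fit <- pam(daisy(inc, metric = "gower"), k = 2, diss = TRUE)

# Cluster label per patient.
cl <- setNames(pam_fit$clustering, rownames(inc))
print(cl)

# Share of patients in each cluster reporting each disease
# (rows = diseases, columns = clusters).
profile <- sapply(split(inc, cl), colMeans)
print(profile)

# Diseases reported by more than half of the patients in each cluster.
disease_groups <- lapply(split(inc, cl), function(g) names(g)[colMeans(g) > 0.5])
print(disease_groups)
```

On this toy data the patients with headache and dizziness end up in one cluster and those with nausea and abd pain in the other; on real data the profiles will be less clear-cut and the threshold is worth tuning.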

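In addition, because you are working only with binary data, you could consider the Simple Matching or Jaccard distance instead of the Gower distance, if they suit your data better. In R, you can compute them with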
p <- ncol(dat_wider) - 1  # number of binary variables (every column except ID)
sm_dist <- dist(dat_wider %>% select(-ID), method = "manhattan") / p
j_dist <- dist(dat_wider %>% select(-ID), method = "binary")

respectively, where p is the number of binary variables you want to consider.
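As a quick check of how these distances relate, the sketch below compares them on the example data and on a small made-up pair of patients. It again uses base R `table()` in place of `pivot_wider`, and the two vectors `a` and `b` are hypothetical, introduced only to show where Simple Matching and Jaccard disagree.

```r
library(cluster)

dat <- data.frame(
  ID = c("id1","id1","id2","id2","id3","id3","id4","id4","id5","id5"),
  PTName = c("headache","dizziness","nausea","abd pain","dizziness",
             "headache","abd pain","nausea","headache","dizziness")
)
inc <- as.data.frame.matrix(table(dat$ID, dat$PTName))
p <- ncol(inc)  # number of binary variables

sm_dist <- dist(inc, method = "manhattan") / p  # Simple Matching
j_dist  <- dist(inc, method = "binary")         # Jaccard
g_dist  <- daisy(inc, metric = "gower")         # Gower on 0/1 numeric columns

# With 0/1 numeric columns that each contain both values, Gower reduces
# to Simple Matching.
print(all.equal(as.vector(sm_dist), as.vector(g_dist)))

# Hypothetical patients: six possible diseases, mostly not reported.
a <- c(1, 1, 0, 0, 0, 0)
b <- c(1, 0, 0, 0, 0, 0)
sm_ab <- as.numeric(dist(rbind(a, b), method = "manhattan")) / length(a)
j_ab  <- as.numeric(dist(rbind(a, b), method = "binary"))
print(c(simple_matching = sm_ab, jaccard = j_ab))
```

Simple Matching counts shared absences as agreement, while Jaccard ignores them, so with many rarely reported diseases Jaccard tends to be the stricter of the two.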