Clustering using daisy and pam in R
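I am trying to run a very simple cluster analysis, but I cannot get the right results. My question for a large dataset is "Which diseases are frequently reported together?". The simplified example data below should result in 2 clusters: 1) headache/dizziness 2) nausea/abd pain. However, I can't get the code right. I am using the pam and daisy functions. For this example I manually assign 2 clusters (k = 2) because I know the desired result, but in reality I explore several values of k.
Does anyone know what I'm doing wrong?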
library(cluster)
library(dplyr)

# One row per (patient, reported disease) pair.
# stringsAsFactors = TRUE: daisy() expects factor columns, not character ones.
dat <- data.frame(
  ID = c("id1","id1","id2","id2","id3","id3","id4","id4","id5","id5"),
  PTName = c("headache","dizziness","nausea","abd pain","dizziness","headache","abd pain","nausea","headache","dizziness"),
  stringsAsFactors = TRUE
)

gower_dist <- daisy(dat, metric = "gower")
k <- 2
pam_fit <- pam(gower_dist, diss = TRUE, k) # performs cluster analysis
pam_results <- dat %>%
  mutate(cluster = pam_fit$clustering) %>%
  group_by(cluster) %>%
  do(the_summary = summary(.))
head(pam_results$the_summary)
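The format in which you feed the dataset to the clustering algorithm does not match your objective. You want to group diseases that are reported together, but the ID column is also included in the dissimilarity matrix, so it takes part in the distance computation. You don't want that, because your objective concerns only the diseases.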
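To make the effect of the ID column concrete, here is a minimal sketch that prints a few Gower dissimilarities computed on the long-format data from the question. It assumes the columns are factors (hence stringsAsFactors = TRUE), in which case each column contributes 0 when two rows match and 1 when they differ, and the result is the average over the two columns.

```r
library(cluster)

# Long-format data from the question: one row per (patient, disease) report
dat <- data.frame(
  ID = c("id1","id1","id2","id2","id3","id3","id4","id4","id5","id5"),
  PTName = c("headache","dizziness","nausea","abd pain","dizziness",
             "headache","abd pain","nausea","headache","dizziness"),
  stringsAsFactors = TRUE
)

# Gower dissimilarity over BOTH columns, ID included
d <- as.matrix(daisy(dat, metric = "gower"))

same_patient  <- unname(d[1, 2])  # id1/headache vs id1/dizziness
same_disease  <- unname(d[1, 6])  # id1/headache vs id3/headache
nothing_alike <- unname(d[1, 3])  # id1/headache vs id2/nausea

print(c(same_patient = same_patient,
        same_disease = same_disease,
        nothing_alike = nothing_alike))
```

Two reports from the same patient come out exactly as close as two reports of the same disease, so the clustering has no reason to prefer grouping by disease. Each row is also a single report rather than a patient, so co-occurrence within a patient is never compared directly.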
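Therefore, we need to build a dataset in which each row is a patient together with all the diseases he/she reported, and then construct the dissimilarity matrix on the numeric features only. For this task, I add a column presence with value 1 if the patient reported the disease; the function pivot_wider then fills in 0 for every disease the patient did not report.
Here is the code I used. I think it gives what you wanted; let me know if it does.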
library(cluster)
library(dplyr)
library(tidyr)

dat <- data.frame(
  ID = c("id1","id1","id2","id2","id3","id3","id4","id4","id5","id5"),
  PTName = c("headache","dizziness","nausea","abd pain","dizziness","headache","abd pain","nausea","headache","dizziness"),
  presence = 1
)

# build the wider dataset: each row is a patient, each disease a 0/1 column
dat_wider <- pivot_wider(
  dat,
  id_cols = ID,
  names_from = PTName,
  values_from = presence,
  values_fill = list(presence = 0)
)

# in the dissimilarity matrix construction, we leave out the column ID
gower_dist <- daisy(dat_wider %>% select(-ID), metric = "gower")
k <- 2
set.seed(123)
pam_fit <- pam(gower_dist, diss = TRUE, k)
pam_results <- dat_wider %>%
  mutate(cluster = pam_fit$clustering) %>%
  group_by(cluster) %>%
  do(the_summary = summary(.))
head(pam_results$the_summary)
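To read the result in terms of the original question, it can help to list the patients and the diseases that fall into each cluster. The following is a minimal sketch of that idea, not part of the code above. To stay self-contained it builds the patient-by-disease table with base R table() instead of pivot_wider, and the names ind, clusters and diseases_by_cluster are introduced here only for illustration.

```r
library(cluster)

dat <- data.frame(
  ID = c("id1","id1","id2","id2","id3","id3","id4","id4","id5","id5"),
  PTName = c("headache","dizziness","nausea","abd pain","dizziness",
             "headache","abd pain","nausea","headache","dizziness")
)

# Patient-by-disease 0/1 table (assumption: base R table() stands in for pivot_wider)
ind <- as.data.frame.matrix(table(dat$ID, dat$PTName))

gower_dist <- daisy(ind, metric = "gower")
pam_fit <- pam(gower_dist, k = 2, diss = TRUE)

# Cluster label of each patient
clusters <- setNames(pam_fit$clustering, rownames(ind))
print(clusters)

# Diseases reported by at least one patient in each cluster
diseases_by_cluster <- lapply(
  split(ind, clusters),
  function(d) names(d)[colSums(d) > 0]
)
print(diseases_by_cluster)
```

On this toy data one cluster should contain only headache and dizziness and the other only nausea and abd pain. On real data a disease can show up in several clusters, so comparing per-cluster frequencies would be more informative than simple presence.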
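In addition, since you are working with binary data only, you could consider the Simple Matching or Jaccard distance instead of the Gower distance, if they suit your data better. In R, you can compute them with
p <- ncol(dat_wider) - 1  # number of binary variables: every column except ID
sm_dist <- dist(dat_wider %>% select(-ID), method = "manhattan") / p
j_dist <- dist(dat_wider %>% select(-ID), method = "binary")
respectively, where p is the number of binary variables you want to consider.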
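As a rough check of how these alternatives relate to Gower, the sketch below computes all three on the example data and runs pam on the Jaccard distances. The 0/1 matrix ind is typed in by hand for this illustration (with abd_pain as a hypothetical column name), and the comparison relies on every column containing both a 0 and a 1, so that Gower's range scaling leaves the values unchanged.

```r
library(cluster)

# Patient-by-disease 0/1 matrix for the example data (typed by hand for this sketch)
ind <- rbind(
  id1 = c(0, 1, 1, 0),
  id2 = c(1, 0, 0, 1),
  id3 = c(0, 1, 1, 0),
  id4 = c(1, 0, 0, 1),
  id5 = c(0, 1, 1, 0)
)
colnames(ind) <- c("abd_pain", "dizziness", "headache", "nausea")

p <- ncol(ind)                                  # number of binary variables
sm_dist <- dist(ind, method = "manhattan") / p  # Simple Matching dissimilarity
j_dist <- dist(ind, method = "binary")          # Jaccard dissimilarity
gower_dist <- daisy(as.data.frame(ind), metric = "gower")

# Does Simple Matching agree with Gower on this 0/1 data?
same_as_gower <- isTRUE(all.equal(as.vector(sm_dist), as.vector(gower_dist)))
print(same_as_gower)

# PAM on the Jaccard distances
pam_jaccard <- pam(j_dist, k = 2, diss = TRUE)
print(pam_jaccard$clustering)
```

The practical difference is that Simple Matching counts a disease that neither patient reported as agreement, while Jaccard ignores such joint absences. With many rare diseases, Jaccard is therefore often the more sensible choice, but that depends on the data.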