R: Matching sample label order to hierarchically clustered order
I have a data frame named cleaned_mayo that looks like this:
Source Tissue RIN Diagnosis Gender AgeAtDeath ApoE FLOWCELL PMI N_unmapped N_multimapping N_noFeature N_ambiguous ENSG00000223972
1924_TCX MayoBrainBank_Dickson TemporalCortex 5.6 Control F 90_or_above 33 AC5R6PACXX 2 2773880 9656114 8225967 2876479 1
1926_TCX MayoBrainBank_Dickson TemporalCortex 7.8 Control F 88 33 AC44HKACXX 2 2279283 12410116 9503353 3600252 2
1935_TCX MayoBrainBank_Dickson TemporalCortex 8.6 Control F 88 33 AC5T2GACXX 3 3120169 8650081 9640468 4603751 0
1925_TCX MayoBrainBank_Dickson TemporalCortex 6.6 Control F 89 33 BC6178ACXX 4 2046886 10627577 7533671 3361385 1
1963_TCX MayoBrainBank_Dickson TemporalCortex 9.7 Control M 90_or_above 33 AC5T1WACXX 4 1810116 9611375 5343437 2983079 2
ENSG00000227232 ENSG00000278267 ENSG00000243485 ENSG00000274890 ENSG00000237613 ENSG00000268020 ENSG00000240361 ENSG00000186092 ENSG00000238009 ENSG00000239945
1924_TCX 80 7 1 0 0 0 0 0 3 0
1926_TCX 113 22 9 0 0 0 0 0 0 0
1935_TCX 181 21 2 0 0 0 0 0 0 0
1925_TCX 75 9 5 0 0 0 0 0 2 0
1963_TCX 73 14 1 0 0 0 0 0 3 0
ENSG00000233750
1924_TCX 18
1926_TCX 2
1935_TCX 8
1925_TCX 20
1963_TCX 13
I hierarchically cluster the expression columns of this data with the following code:
library(dendextend)  # assign_values_to_leaves_edgePar(), as.ggdend(), %>%
library(ggplot2)

# Create the dendrogram for visualization
dend_expr <- cleaned_mayo[, 14:60738] %>%  # Isolate expression data
  scale %>%          # Normalize
  dist %>%           # Compute distance measure
  hclust %>%         # Cluster hierarchically
  as.dendrogram %>%  # Convert to dendrogram type
  assign_values_to_leaves_edgePar(value = cleaned_mayo$Diagnosis, edgePar = "col") %>%  # Color branches by diagnosis
  as.ggdend()
I then visualize the dendrogram with:
# Plot dendrogram
ggplot(dend_expr, horiz = TRUE, theme = NULL, labels = FALSE) +
  ggtitle("Mayo Cohort: Hierarchical Clustering of Patients Colored by Diagnosis")
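My problem is that with this assign_values_to_leaves_edgePar branch-coloring technique, the order of my diagnoses no longer matches the clustered expression data. The branches are colored by the diagnoses in their original row order, which is wrong for the samples as they are now arranged in the tree.
How can I match the order of these data frames after clustering, or otherwise label the branches correctly?
Thanks!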
I found a solution to this myself and am posting it here in case it helps anyone in the future.
Start by creating the dendrogram:
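library(dendextend)  # assign_values_to_leaves_edgePar(), as.ggdend(), %>%
library(ggplot2)

# Create the dendrogram for visualization
# Column range as posted; the question used 14:60738, so adjust it to wherever the ENSG columns start in your data
dend_expr <- cleaned_mayo[, 15:60739] %>%  # Isolate expression data
  scale %>%        # Normalize
  dist %>%         # Compute distance measure
  hclust %>%       # Cluster hierarchically
  as.dendrogram()  # Convert to dendrogram type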
Then I can arrange my original data in the same order as the leaves of the clustered tree:
# Arrange labels in order with tree
tree_labels <- cleaned_mayo[order.dendrogram(dend_expr), ]
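To see why this reordering works, here is a minimal base-R sketch on toy data. The matrix `expr`, the data frame `meta`, and every value in them are made up for illustration and are not part of the Mayo data. `order.dendrogram()` returns the original row indices in leaf order, so indexing the metadata with it lines the rows up with the tree:

```r
# Toy expression matrix: 5 samples (rows) x 3 genes (columns); values are made up.
expr <- matrix(
  c(1.0, 1.1, 0.9,
    9.0, 9.2, 8.8,
    1.2, 0.9, 1.0,
    5.0, 5.1, 4.9,
    9.1, 8.9, 9.0),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("S", 1:5), c("g1", "g2", "g3"))
)

# Hypothetical per-sample metadata, in the same row order as expr.
meta <- data.frame(
  Diagnosis = c("Control", "AD", "Control", "PSP", "AD"),
  row.names = rownames(expr)
)

dend <- as.dendrogram(hclust(dist(expr)))

# Row indices of expr, listed in the order the leaves appear in the tree.
leaf_order <- order.dendrogram(dend)

# Reorder the metadata so that row i describes leaf i.
meta_in_tree_order <- meta[leaf_order, , drop = FALSE]

# Sanity check: the reordered row names match the leaf labels.
stopifnot(all(rownames(meta_in_tree_order) == labels(dend)))
```

The final `stopifnot()` is a cheap guard worth keeping in real code, since it fails loudly if the metadata and the tree ever fall out of step.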
Then I can use this order to color the branches of the dendrogram:
# Color branches by diagnosis
dend_expr <- assign_values_to_leaves_edgePar(dend_expr, value = tree_labels$Diagnosis, edgePar = "col") %>%
  as.ggdend()
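As a possible alternative that does not depend on row order at all, the color can be looked up from each leaf's own label. The sketch below uses only base R (`dendrapply()` and `is.leaf()` from stats) rather than dendextend. The toy data, the `diagnosis` vector, the `diag_colors` mapping, and the `color_leaf()` helper are all hypothetical names introduced for illustration:

```r
# Toy data: 6 samples x 4 genes; all values and diagnoses are made up.
set.seed(1)
expr <- matrix(rnorm(24), nrow = 6,
               dimnames = list(paste0("S", 1:6), paste0("gene", 1:4)))

# Hypothetical diagnosis per sample, named by sample ID.
diagnosis <- c(S1 = "Control", S2 = "AD", S3 = "Control",
               S4 = "AD", S5 = "Control", S6 = "AD")

# Arbitrary mapping from diagnosis to a valid R color name.
diag_colors <- c(Control = "steelblue", AD = "firebrick")

dend <- as.dendrogram(hclust(dist(scale(expr))))

# Color each leaf edge by looking up the leaf's own label, not its position.
color_leaf <- function(node) {
  if (is.leaf(node)) {
    sample_id <- attr(node, "label")
    attr(node, "edgePar") <- list(col = diag_colors[[diagnosis[[sample_id]]]])
  }
  node
}
dend_colored <- dendrapply(dend, color_leaf)
```

Calling `plot(dend_colored, horiz = TRUE)` should then draw the tree with base graphics. Because the lookup is keyed on sample IDs, it stays correct however the leaves end up arranged, and it maps diagnoses to explicit color names instead of passing the raw Diagnosis values as colors.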
Then visualize the result:
# Plot dendrogram
ggplot(dend_expr, horiz = TRUE, theme = NULL, labels = FALSE) +
  ggtitle("Mayo Cohort: Hierarchical Clustering of Patients Colored by Diagnosis")