Clustering with multiple features by week

Question

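I have a set of data broken down by store and by week, with 2 features: sales and page count (pp). I want to cluster the stores based on these features, ideally grouping together the ones with the most similar trading patterns across the year.

Is it possible to do this with the data I have? My understanding is that the features are columns and the rows are the labels that get assigned to clusters, but I have several weeks to take into account, so I don't know whether the weeks should go in rows or in columns.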

library(data.table)
dt <- data.table(weeks = rep(seq(1:5),5),
             store = c(rep("a", 5), rep("b", 5), rep("c", 5), 
                       rep("e", 5), rep("d", 5)),
             sales = rep(rnorm(5), 5),
             pp = rep(rnorm(5), 5))
# reshape to wide: one row per store, one column per feature-week combination
dt <- dcast(dt, store ~ weeks, value.var = c("sales", "pp"))

Thanks

Answer 1

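Since you have more than 1,000 stores, the demo below may not apply directly, but hopefully it points you in the right direction.

You can analyze store clusters week by week, or over any other multi-week window (N = 4, 8, ...).

Here we look at store clusters at a weekly frequency:

Test data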

library(dplyr)     #data manipulation
library(ggdendro)  #extracting clusters
library(ggplot2)   #plotting
library(gridExtra) #for arranging ggplot graphs in grids 

set.seed(42)

DF <- data.frame(weeks = rep(seq(1:5),5),
    store = c(rep("A", 5), rep("B", 5), rep("C", 5), 
            rep("E", 5), rep("D", 5)),
    sales = rnorm(25),
    pp = rnorm(25))


weekInput = unique(DF$weeks)

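Choosing the weeks:

In the simple case, I have included the discrete weeks, i.e. 1, 2, 3, 4 and 5.

Suppose you want N = 4 as the week frequency, i.e. weeks 1 to 4, 5 to 8, and so on. You can use the following to create weekInput partitions for the whole year and modify it to suit your requirements.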

#weekFreq = 4
#StartWeek = 1
#EndWeek = 52
#startPoints=seq(StartWeek,EndWeek,weekFreq)    
#endPoints= c(tail(startPoints,-1)-1,EndWeek)
#
#freqDF = data.frame(cbind(startPoints,endPoints))
#weekInput = lapply(1:nrow(freqDF),function(x) { z= freqDF[x,]; z=as.vector(as.matrix(z)) } )
#head(weekInput)
#[[1]]
#[1] 1 4
#
#[[2]]
#[1] 5 8
#
#[[3]]
#[1]  9 12
#
#[[4]]
#[1] 13 16
#
#[[5]]
#[1] 17 20
#
#[[6]]
#[1] 21 24
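
As a runnable variant of the commented-out snippet above, here is a minimal sketch that expands each partition into the full vector of week numbers rather than only its start and end points. The name weekRanges is introduced here for illustration and is not part of the original answer.

```r
# Partition a 52-week year into consecutive 4-week windows
weekFreq  <- 4
startWeek <- 1
endWeek   <- 52

startPoints <- seq(startWeek, endWeek, by = weekFreq)
endPoints   <- c(tail(startPoints, -1) - 1, endWeek)

# weekRanges (hypothetical name): one vector of week numbers per window
weekRanges <- Map(function(s, e) seq(s, e), startPoints, endPoints)

head(weekRanges, 2)
```

With ranges like these, the week filter in the plotting step would become `weeks %in% x`. Each store would then contribute several rows per window, so you would probably want to aggregate per store (for example, by averaging) before computing distances.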

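Cluster plotting

Good resources for plotting various dendrograms are here and here.

For each week, we compute the distance matrix of the numeric data, build a hierarchical clustering, plot it with the ggdendro package, and return a list of plot objects.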

plotList = lapply(weekInput,function(x) {

subsetWeek=DF %>% 
  group_by(weeks) %>% 
  filter(weeks==x) %>%  #for a week range, change this to `weeks %in% seq(x[1], x[2])`
  as.data.frame()  %>%  #where x[1] and x[2] are the start and end points from weekInput
  select(-weeks) %>% 
  as.data.frame()

#For numeric features of data, compute the distance matrix and form hierarchical cluster

numericDF= subsetWeek[,sapply(subsetWeek,is.numeric)]

clustDF = hclust(dist(numericDF))

#You can choose to limit the clusters to N = n, as per your discretion
#cutree() returns a membership vector, not a tree, so store it under a separate name
#clustGroups = cutree(clustDF, 4)


clustDF$labels = subsetWeek$store

#Use functions from ggdendro package for extracting clusters for ease in plotting

clustDendro = as.dendrogram(clustDF)

dendroData = dendro_data(clustDendro,type="rectangle")

Labels = label(dendroData)
#Illustrative grouping only: assumes exactly five stores, assigned in plotted order
Labels$group <- c(rep("Area1", 2), rep("Area2", 2), rep("Area3", 1))


gPlot = ggplot(segment(dendroData)) +
    geom_segment(aes(x=x,y=y,xend=xend,yend=yend)) + 
    geom_label(data=Labels,aes(label=label,x=x,y=0,colour=group),fontface="bold") +
    ggtitle(paste0("Store Clusters for Week:",x)) +
    labs(color="Area Names\n")


gPlot = gPlot + theme(legend.title = element_text(face = "bold"))

return(gPlot)


})
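
The commented-out cutree() line in the code above hints at how to get explicit cluster memberships. A minimal sketch for a single week follows, assuming the same test data as above and an arbitrary choice of k = 3; the names week1, week1Clust and week1Groups are introduced here for illustration only.

```r
set.seed(42)

# Same shape as the answer's test data
DF <- data.frame(weeks = rep(1:5, 5),
                 store = rep(c("A", "B", "C", "E", "D"), each = 5),
                 sales = rnorm(25),
                 pp    = rnorm(25))

# Cluster the stores on their week-1 features
week1      <- DF[DF$weeks == 1, ]
week1Clust <- hclust(dist(week1[, c("sales", "pp")]))
week1Clust$labels <- week1$store

# Cut the tree into k = 3 groups (k is an arbitrary choice for this sketch)
week1Groups <- cutree(week1Clust, k = 3)

# List the stores in each group
split(names(week1Groups), week1Groups)
```

Memberships obtained this way could also replace the hardcoded Area1/Area2/Area3 labels, so that the label colours reflect the actual clusters.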

Arranging the plots

The list of plot objects above can be arranged as needed. A useful link with more details is here.

grid::grid.newpage()
grid::grid.draw(do.call(rbind,lapply(plotList,function(x) ggplotGrob(x))))
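
If a single column of plots becomes hard to read, one option is a multi-column layout using gridExtra, which is already loaded above. This is a sketch with stand-in plots; demoPlots and gridLayout are hypothetical names, and in practice you would pass plotList instead.

```r
library(ggplot2)
library(gridExtra)

# Stand-in plots; in the answer's code these would be the elements of plotList
demoPlots <- lapply(1:5, function(w) {
  ggplot(data.frame(x = 1:3, y = c(2, 1, 3)), aes(x = x, y = y)) +
    geom_point() +
    ggtitle(paste("Week:", w))
})

# Arrange the plots in two columns instead of stacking them in one
gridLayout <- arrangeGrob(grobs = demoPlots, ncol = 2)

grid::grid.newpage()
grid::grid.draw(gridLayout)
```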

Single week:

All weeks:

Readability suffers slightly because of the single-column arrangement.
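
On the rows-versus-columns part of the question: the approach above clusters stores separately for each week. If the goal is instead to group stores by their pattern across the whole period, one possible alternative is to put one store per row and one feature-week combination per column, then cluster once. The sketch below is not the method used in this answer. It assumes that "similar trading pattern" can be approximated by Euclidean distance on standardized weekly values, and the names wideDF, featureMat, yearClust and storeGroups, as well as k = 2, are illustrative choices.

```r
set.seed(42)

# Same shape as the answer's test data
DF <- data.frame(weeks = rep(1:5, 5),
                 store = rep(c("A", "B", "C", "E", "D"), each = 5),
                 sales = rnorm(25),
                 pp    = rnorm(25))

# One row per store; columns sales.1, pp.1, ..., sales.5, pp.5
wideDF <- reshape(DF, idvar = "store", timevar = "weeks", direction = "wide")

# Standardize each column so sales and pp contribute on a comparable scale
featureMat <- scale(as.matrix(wideDF[, setdiff(names(wideDF), "store")]))
rownames(featureMat) <- wideDF$store

# Hierarchical clustering over the whole period, cut into k = 2 groups
yearClust   <- hclust(dist(featureMat))
storeGroups <- cutree(yearClust, k = 2)

storeGroups
```

With 52 weeks and two features this gives 104 columns per store, so for more than 1,000 stores it may be worth reducing dimensionality or aggregating weeks first.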
