Creating a rectangular treemap with progressive segmentation in R (or Python)
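I would like some ideas on how to tackle this problem, which I find interesting (at least for me). Suppose I have a population described by 3 categorical characteristic variables and some quantitative measures. An example: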
df
income expense education gender residence
1 153 2989 NoCollege F Own
2 289 872 College F Rent
3 551 98 NoCollege M Rent
4 286 320 College M Rent
5 259 372 NoCollege M Rent
6 631 221 NoCollege M Own
7 729 105 College M Rent
8 582 450 NoCollege M Own
9 570 253 College F Rent
10 1380 635 NoCollege F Rent
11 409 425 NoCollege M Rent
12 569 232 NoCollege F Own
13 317 856 College M Rent
14 199 283 College F Own
15 624 564 NoCollege M Own
16 1064 504 NoCollege M Own
17 821 169 NoCollege F Rent
18 402 175 College M Own
19 602 285 College M Rent
20 433 264 College M Rent
21 670 985 NoCollege F Own
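I can calculate the spending-to-income ratio (SIR) for the segments defined by the 3 characteristic variables: education, gender, and residence. At the first level, with no segmentation at all, the SIR is: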
df %>% summarise(count=n(), spending_ratio=sum(expense)/sum(income)*100)
>> count spending_ratio
1 21 95.8
Then I split the population into male and female groups, which gives:
df %>% group_by(gender) %>% summarise(count=n(), spending_ratio=sum(expense)/sum(income)*100)
>> gender count spending_ratio
1 F 8 138.0
2 M 13 67.3
We continue the process by introducing education:
df %>% group_by(gender, education) %>% summarise(count=n(), spending_ratio=sum(expense)/sum(income)*100)
>> gender education count spending_ratio
1 F College 3 133.1
2 F NoCollege 5 139.4
3 M College 6 72.4
4 M NoCollege 7 63.9
And finally we add residence:
df %>% group_by(gender, education, residence) %>% summarise(count=n(), spending_ratio=sum(expense)/sum(income)*100)
>> gender education residence count spending_ratio
1 F College Own 1 142.2
2 F College Rent 2 131.0
3 F NoCollege Own 3 302.2
4 F NoCollege Rent 2 36.5
5 M College Own 1 43.5
6 M College Rent 5 77.3
7 M NoCollege Own 4 59.9
8 M NoCollege Rent 3 73.4
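Since the title allows Python as well, here is a minimal standard-library sketch of the same level-by-level aggregation, mostly as a cross-check on the tables above. The names `ROWS`, `sir_by`, `ORDER` and `LEVELS` are made up for this illustration, and the rows are simply copied from the example data.

```python
from collections import defaultdict

# Rows copied from the example: (income, expense, education, gender, residence)
ROWS = [
    (153, 2989, "NoCollege", "F", "Own"), (289, 872, "College", "F", "Rent"),
    (551, 98, "NoCollege", "M", "Rent"), (286, 320, "College", "M", "Rent"),
    (259, 372, "NoCollege", "M", "Rent"), (631, 221, "NoCollege", "M", "Own"),
    (729, 105, "College", "M", "Rent"), (582, 450, "NoCollege", "M", "Own"),
    (570, 253, "College", "F", "Rent"), (1380, 635, "NoCollege", "F", "Rent"),
    (409, 425, "NoCollege", "M", "Rent"), (569, 232, "NoCollege", "F", "Own"),
    (317, 856, "College", "M", "Rent"), (199, 283, "College", "F", "Own"),
    (624, 564, "NoCollege", "M", "Own"), (1064, 504, "NoCollege", "M", "Own"),
    (821, 169, "NoCollege", "F", "Rent"), (402, 175, "College", "M", "Own"),
    (602, 285, "College", "M", "Rent"), (433, 264, "College", "M", "Rent"),
    (670, 985, "NoCollege", "F", "Own"),
]
# Position of each grouping variable inside a row tuple
FIELDS = {"education": 2, "gender": 3, "residence": 4}


def sir_by(keys):
    """Return {group: (count, SIR in percent)} for the given grouping keys."""
    sums = defaultdict(lambda: [0, 0, 0])  # count, total income, total expense
    for row in ROWS:
        group = tuple(row[FIELDS[k]] for k in keys)
        acc = sums[group]
        acc[0] += 1
        acc[1] += row[0]
        acc[2] += row[1]
    # SIR is the ratio of the sums, not the mean of per-person ratios
    return {g: (n, round(100 * exp / inc, 1)) for g, (n, inc, exp) in sums.items()}


# One entry per tree level: no split, gender, gender+education, all three
ORDER = ["gender", "education", "residence"]
LEVELS = [sir_by(ORDER[:depth]) for depth in range(len(ORDER) + 1)]

for depth, level in enumerate(LEVELS):
    for group, (count, sir) in sorted(level.items()):
        print(depth, group, count, sir)
```

Each entry of `LEVELS` corresponds to one row of rectangles in the plot I am after: the count would set the width and the SIR the color.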
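What I want to achieve is a treemap-like plot that contains all of the information above. But as you can see, the treemap is far from what I want. What I would like to get is a map similar to the image at the top, where the size of each rectangle represents the count, the color represents the SIR, and all levels of the tree are included.
Any help is greatly appreciated.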
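You can use the treemap package to do the aggregation at the different levels, but the output needs quite a bit of formatting. When treemap does its successive aggregation, it drops all additional variables from the data.table. So, since your aggregation function needs extra variables, I created some dummy variables. The variable 'index' is used to index into 'expense' and 'income' for each subset. Here is one way to do it: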
library(treemap)
library(data.table)
## Some dummy variables to aggregate by: ALL, i, and index
dat <- as.data.table(df)[, `:=`(total = factor("ALL"), i = 1, index = 1:.N)][]
indexList <- c('total', 'gender', 'education', 'residence') # order of aggregation
## Function to aggregate at each grouping level (SIR)
agg <- function(index, ...) {
  dots <- list(...)
  expense <- dots[["expense"]][index]
  income <- dots[["income"]][index]
  sum(expense) / sum(income) * 100
}
## Get treemap data
res <- treemap(dat, index=indexList, vSize='i', vColor='index',
               type="value", fun.aggregate = "agg",
               palette = 'RdYlBu',
               income=dat[["income"]],
               expense=dat[["expense"]]) # ... args get passed to fun.aggregate
## The useful variables: level (corresponds to indexList), vSize (bar size), vColor(SIR)
## Create a label variable that is the value of the variable in indexList at each level
out <- res$tm
out$label <- out[cbind(1:nrow(out), out$level)]
out$label <- with(out, ifelse(level==4, substring(label, 1, 1), label)) # shorten labels
out$level <- factor(out$level, levels=sort(unique(out$level), TRUE)) # factor levels
## Time to find label positions, scale to [0, 1] first
## x-value is cumsum by group, y will just be the level
out$xlab <- out$vSize / max(aggregate(vSize ~ level, data=out, sum)$vSize)
split(out$xlab, out$level) <- lapply(split(out$xlab, out$level), function(x) cumsum(x) - x/2)
## Make plot
library(ggplot2)
ggplot(out, aes(x=level, y=vSize, fill=color, group=interaction(level, label))) +
  geom_bar(stat='identity', position='fill') + # add another for black rectangles but not legend
  geom_bar(stat='identity', position='fill', color="black", show.legend=FALSE) +
  geom_text(data=out, aes(x=level, y=xlab, label=label), size=6, fontface=2,
            inherit.aes=FALSE) +
  coord_flip() +
  scale_fill_discrete('SIR', breaks=out$color, labels = round(out$vColor)) +
  theme_minimal() + # Then just some formatting
  xlab("") + ylab("") +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())
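The label-position step (scale the sizes to [0, 1], then take `cumsum(x) - x/2` within each level) may be easier to follow in isolation. Below is a small Python sketch of just that arithmetic. The function name `label_midpoints` is hypothetical, and it assumes every level sums to the same total, which holds here because each level covers all 21 people.

```python
from itertools import accumulate


def label_midpoints(sizes):
    """Centre of each stacked segment after scaling the bar to [0, 1]."""
    total = sum(sizes)
    shares = [s / total for s in sizes]
    # accumulate() gives the right edge of each segment; stepping back by
    # half the segment's own share lands on its centre
    return [right - share / 2 for right, share in zip(accumulate(shares), shares)]


# Gender level from the question: 8 women, 13 men
GENDER_MIDPOINTS = label_midpoints([8, 13])
print(GENDER_MIDPOINTS)
```

The same call applied to the counts of any other level gives the positions for that row of labels.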
Edit
I think it actually works better with a gradient color for the SIR. To do that, you just need to replace fill=color with fill=vColor and fill with a gradient scale.
## Make plot with gradient color for SIR
library(ggplot2)
ggplot(out, aes(x=level, y=vSize, fill=vColor, group=interaction(level, label))) +
  geom_bar(stat='identity', position='fill') + # add another for black rectangles but not legend
  geom_bar(stat='identity', position='fill', color="black", show.legend=FALSE) +
  geom_text(data=out, aes(x=level, y=xlab, label=label), size=6, fontface=2,
            inherit.aes=FALSE) +
  coord_flip() +
  scale_fill_gradientn(colours = c("white", "red")) +
  theme_minimal() + # Then just some formatting
  xlab("") + ylab("") +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())
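To illustrate what a white-to-red gradient does with the SIR values, here is a rough Python sketch that maps a value to a hex color by plain linear interpolation in RGB. The function `white_to_red` is hypothetical, and this is an assumption-laden simplification: ggplot2's own scale may interpolate in a different color space, so the exact shades will not necessarily match the plot.

```python
def white_to_red(value, lo, hi):
    """Map value in [lo, hi] to a hex color between white and pure red."""
    t = (value - lo) / (hi - lo) if hi > lo else 0.0
    t = min(1.0, max(0.0, t))  # clamp values outside the range
    # Red channel stays at 255; green and blue fade from 255 down to 0
    level = round(255 * (1 - t))
    return f"#ff{level:02x}{level:02x}"


# Smallest and largest SIR at the finest level of the question's tables
LO, HI = 36.5, 302.2
for sir in (36.5, 95.8, 138.0, 302.2):
    print(sir, white_to_red(sir, LO, HI))
```

Because one segment (302.2) is far above the rest, most rectangles end up quite pale under a linear scale like this one.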