How to create a heatmap in R with proportions and numeric values
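I have a data frame containing the number of national postings for several areas of tech/biotech, along with the number of postings that coincide with other areas. I would like to create a heatmap that shows the intersection of these fields (as a number of postings) and the proportion of "duplicates" between them. The data frame itself looks similar to: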
df <- data.frame(matrix(nrow=4, byrow=TRUE, data=c(
    14000, 3300,  2500,   1000,
     3300, 3300,   700,    300,
     2500,  700, 95000,   7500,
     1000,  300,  7500, 108000)))
colnames(df) <- rownames(df) <- c("ML & Image", "Software Dev", "Cloud Dev", "Bioinformatics & Health")
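So, for example, the first row starts with the total number of ML & Image job postings, followed by the number of ML & Image postings that also qualify as Software Dev postings, then the number of ML & Image postings that also qualify as Cloud Dev postings, and so on.
I would like to make a heatmap that looks somewhat like the df table as you would see it in the R console: it keeps the numeric posting counts, but each cell is coloured by the proportion of overlap between the two fields. If there is very little overlap the cell would be red(ish), if the overlap is around 30-60% it would be yellow(ish), and if there is a lot of overlap it would be green(ish), with a colour bar on the side for reference.
Any help with this is much appreciated. Thanks!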
Not sure I fully understand your question, but the following might give you some ideas.
> library(ggplot2)
> library(reshape2)
# Set up the data
> df <- data.frame(matrix(nrow=4, byrow=TRUE, data=c(14000, 3300, 2500, 1000, 3300, 3300, 700, 300, 2500, 700, 95000,7500, 1000, 300, 7500, 108000)))
> colnames(df) <- rownames(df) <- c("ML & Image", "Software Dev", "Cloud Dev", "Bioinformatics & Health")
> df
                        ML & Image Software Dev Cloud Dev Bioinformatics & Health
ML & Image                   14000         3300      2500                    1000
Software Dev                  3300         3300       700                     300
Cloud Dev                     2500          700     95000                    7500
Bioinformatics & Health       1000          300      7500                  108000
# Convert df to matrix and divide each column by the diagonal value
> m <- data.matrix(df)
> m <- m / matrix(t(colSums(diag(4) * m)), nrow=4, ncol=4, byrow=TRUE)
> m
                        ML & Image Software Dev   Cloud Dev Bioinformatics & Health
ML & Image              1.00000000   1.00000000 0.026315789             0.009259259
Software Dev            0.23571429   1.00000000 0.007368421             0.002777778
Cloud Dev               0.17857143   0.21212121 1.000000000             0.069444444
Bioinformatics & Health 0.07142857   0.09090909 0.078947368             1.000000000
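The division step above can probably be written more directly with base R's sweep(), which applies an operation along one margin of a matrix. The sketch below is only an illustration: the names counts, fields, prop_by_col and prop_by_row are made up here and do not appear in the original answer. It also shows that dividing by rows instead of columns just gives the transpose, because the count matrix is symmetric.

```r
# Rebuild the count matrix from the question (names are illustrative)
counts <- matrix(c(14000, 3300,  2500,   1000,
                    3300, 3300,   700,    300,
                    2500,  700, 95000,   7500,
                    1000,  300,  7500, 108000),
                 nrow = 4, byrow = TRUE)
fields <- c("ML & Image", "Software Dev", "Cloud Dev", "Bioinformatics & Health")
dimnames(counts) <- list(fields, fields)

# Divide every column j by its diagonal entry counts[j, j],
# i.e. the same scaling as the matrix() construction above
prop_by_col <- sweep(counts, 2, diag(counts), "/")

# Dividing every row i by counts[i, i] instead yields the transpose,
# because counts is symmetric
prop_by_row <- sweep(counts, 1, diag(counts), "/")
```

Which of the two to plot depends on how a cell should be read: with prop_by_col, a cell is the share of the column field's postings that also fall under the row field.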
# Prepare the data for ggplot2 by melting the matrix into long format,
# then add the posting counts back in to be used as labels
> hm <- melt(m)
> hm$postings <- c(df[,1],df[,2],df[,3],df[,4])
> hm
                      Var1                    Var2       value postings
1               ML & Image              ML & Image 1.000000000    14000
2             Software Dev              ML & Image 0.235714286     3300
3                Cloud Dev              ML & Image 0.178571429     2500
4  Bioinformatics & Health              ML & Image 0.071428571     1000
5               ML & Image            Software Dev 1.000000000     3300
6             Software Dev            Software Dev 1.000000000     3300
7                Cloud Dev            Software Dev 0.212121212      700
8  Bioinformatics & Health            Software Dev 0.090909091      300
9               ML & Image               Cloud Dev 0.026315789     2500
10            Software Dev               Cloud Dev 0.007368421      700
11               Cloud Dev               Cloud Dev 1.000000000    95000
12 Bioinformatics & Health               Cloud Dev 0.078947368     7500
13              ML & Image Bioinformatics & Health 0.009259259     1000
14            Software Dev Bioinformatics & Health 0.002777778      300
15               Cloud Dev Bioinformatics & Health 0.069444444     7500
16 Bioinformatics & Health Bioinformatics & Health 1.000000000   108000
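If you would rather not depend on reshape2, or not hard-code the four columns when attaching the counts, a long table of the same shape can be built in base R. This is a sketch under assumptions rather than part of the original answer: counts, fields and prop are illustrative names, and it relies on as.table() and as.vector() both unrolling a matrix column by column.

```r
# Rebuild the counts and the column-wise proportions (illustrative names)
counts <- matrix(c(14000, 3300,  2500,   1000,
                    3300, 3300,   700,    300,
                    2500,  700, 95000,   7500,
                    1000,  300,  7500, 108000),
                 nrow = 4, byrow = TRUE)
fields <- c("ML & Image", "Software Dev", "Cloud Dev", "Bioinformatics & Health")
dimnames(counts) <- list(fields, fields)
prop <- sweep(counts, 2, diag(counts), "/")

# One row per cell: Var1 is the matrix row, Var2 the matrix column
hm <- as.data.frame(as.table(prop), responseName = "value")
names(hm)[1:2] <- c("Var1", "Var2")

# as.vector() walks the matrix in the same column-major order,
# so the counts line up with the proportions for any number of fields
hm$postings <- as.vector(counts)
```

The resulting hm has the columns Var1, Var2, value and postings, so it can be passed to the ggplot() call in the same way.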
# Plot it
> ggplot(hm, aes(x=Var1, y=Var2)) +
    geom_tile(aes(fill=value)) +
    scale_fill_gradientn(colours=c("red","yellow","green")) +
    geom_text(aes(label=postings))
This results in a heatmap whose tiles are filled by proportion and labelled with the posting counts.
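As a possible refinement of the plot above, the fill scale can be pinned to the full 0 to 1 range, so that yellow sits at roughly 50% overlap no matter which values happen to occur, and the axes can be arranged to mirror the console table. The following is a sketch, not the original answer's code: counts, fields, prop and p are illustrative names, and the legend title and label formatting are arbitrary choices.

```r
library(ggplot2)

# Rebuild the data in long format (illustrative names)
counts <- matrix(c(14000, 3300,  2500,   1000,
                    3300, 3300,   700,    300,
                    2500,  700, 95000,   7500,
                    1000,  300,  7500, 108000),
                 nrow = 4, byrow = TRUE)
fields <- c("ML & Image", "Software Dev", "Cloud Dev", "Bioinformatics & Health")
dimnames(counts) <- list(fields, fields)
prop <- sweep(counts, 2, diag(counts), "/")
hm <- as.data.frame(as.table(prop), responseName = "value")
names(hm)[1:2] <- c("Var1", "Var2")
hm$postings <- as.vector(counts)

# Columns on x, rows on y (reversed) so the layout matches the printed table
p <- ggplot(hm, aes(x = Var2, y = Var1)) +
  geom_tile(aes(fill = value), colour = "white") +
  geom_text(aes(label = format(postings, big.mark = ",", trim = TRUE))) +
  scale_fill_gradientn(colours = c("red", "yellow", "green"),
                       limits = c(0, 1), name = "Proportion") +
  scale_y_discrete(limits = rev(fields)) +
  labs(x = NULL, y = NULL)
print(p)
```

The continuous fill scale draws a colour bar legend by default, which covers the reference bar asked for in the question.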