Summarize distribution of factors in R data frame
Suppose I have a data.frame like this:
X1 X2 X3
1 A B A
2 A C B
3 B A B
4 A A C
I would like to count how many times A, B, C, etc. occur in each column, and return the result as
A_count B_count C_count
X1 3 1 0
X2 2 1 1
X3 1 2 1
I'm sure there are thousands of duplicates of this question, but I can't seem to find an answer that works for me :(
From running
apply(mydata, 2, table)
I get something like
$X1
B A
1 3
$X2
A C B
2 1 1
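But this is not what I want, and if I try to build it back into a data frame it doesn't work, because I don't get the same number of columns for each row (like $X1 above, which has no C).
What am I missing?
Thanks a lot!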
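You can re-factor each column so that it includes the factor levels common to all columns, then tabulate. I would also recommend using lapply() rather than apply(), since apply() is intended for matrices.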
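# stringsAsFactors = TRUE is needed on R >= 4.0.0, where read.table()
# no longer creates factors by default
df <- read.table(text = "X1 X2 X3
1 A B A
2 A C B
3 B A B
4 A A C", header = TRUE, stringsAsFactors = TRUE)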
do.call(
rbind,
lapply(df, function(x) table(factor(x, levels=levels(unlist(df)))))
)
# A B C
# X1 3 1 0
# X2 2 1 1
# X3 1 2 1
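The question asked for columns named A_count, B_count and C_count. A minimal sketch of getting there from the approach above, assuming every column is a factor; the names common_levels and counts are illustrative and not part of the original answer:

```r
# Example data from the question, built directly as factors
df <- data.frame(
  X1 = c("A", "A", "B", "A"),
  X2 = c("B", "C", "A", "A"),
  X3 = c("A", "B", "B", "C"),
  stringsAsFactors = TRUE
)

# Union of the levels across all columns, computed once
common_levels <- levels(unlist(df, use.names = FALSE))

# Tabulate each column against the shared levels so absent values count as 0
counts <- do.call(
  rbind,
  lapply(df, function(col) table(factor(col, levels = common_levels)))
)

# Rename the columns to match the layout requested in the question
colnames(counts) <- paste0(colnames(counts), "_count")
counts
```

The result is a matrix; wrap it in as.data.frame() if a data frame is needed.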
Assuming your data frame is x, I would simply do:
do.call(rbind, tapply(unlist(x, use.names = FALSE),
rep(1:ncol(x), each = nrow(x)),
table))
# A B C
#1 3 1 0
#2 2 1 1
#3 1 2 1
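The tapply() call above flattens the data frame into one long factor and tabulates it per column index, which is why the row names come out as 1, 2, 3. A small sketch of the same idea that keeps the column names, assuming all columns are factors; values, group and res are illustrative names:

```r
x <- data.frame(
  X1 = c("A", "A", "B", "A"),
  X2 = c("B", "C", "A", "A"),
  X3 = c("A", "B", "B", "C"),
  stringsAsFactors = TRUE
)

# One long factor, column after column, carrying the union of all levels
values <- unlist(x, use.names = FALSE)

# Column label for each value; explicit levels keep the original column order
group <- factor(rep(names(x), each = nrow(x)), levels = names(x))

# Tabulate within each group, then stack the per-column tables into a matrix
res <- do.call(rbind, tapply(values, group, table))
res
```

Because values already carries every level, each per-column table has the same width and rbind() lines them up without any re-factoring.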
Benchmark
# a function to generate toy data
# `k` factor levels
# `n` rows
# `p` columns
datsim <- function(n, p, k) {
as.data.frame(replicate(p, sample(LETTERS[1:k], n, TRUE), simplify = FALSE),
col.names = paste0("X",1:p), stringsAsFactors = TRUE)
}
# try `n = 100`, `p = 500` and `k = 3`
x <- datsim(100, 500, 3)
## DirtySockSniffer's answer
system.time(do.call(rbind, lapply(x, function(u) table(factor(u, levels=levels(unlist(x)))))))
# user system elapsed
# 21.240 0.068 21.365
## my answer
system.time(do.call(rbind, tapply(unlist(x, use.names = FALSE), rep(1:ncol(x), each = nrow(x)), table)))
# user system elapsed
# 0.108 0.000 0.111
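Dirty's answer can be improved by computing the common levels once, up front, rather than calling levels(unlist(x)) again for every column: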
## improved DirtySockSniffer's answer
system.time({clevels <- levels(unlist(x, use.names = FALSE));
do.call(rbind, lapply(x, function(u) table(factor(u, levels=clevels))))})
# user system elapsed
# 0.108 0.000 0.108
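As a sanity check that the speed-up does not change the result, here is a small sketch comparing the two variants on simulated data. The data size, the seed and the names cols, slow and fast are arbitrary choices for illustration:

```r
set.seed(1)

# 50 factor columns of 100 values drawn from A, B, C
cols <- replicate(50, sample(LETTERS[1:3], 100, TRUE), simplify = FALSE)
names(cols) <- paste0("X", seq_along(cols))
x <- as.data.frame(cols, stringsAsFactors = TRUE)

# Original variant: recomputes the common levels inside every iteration
slow <- do.call(rbind, lapply(x, function(u) {
  table(factor(u, levels = levels(unlist(x))))
}))

# Improved variant: common levels computed once, without building names
clevels <- levels(unlist(x, use.names = FALSE))
fast <- do.call(rbind, lapply(x, function(u) table(factor(u, levels = clevels))))

identical(slow, fast)
```

Only the amount of repeated work differs between the two; the count matrices are the same.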
Also consider user20650's answer:
## Let's try a large `n`, `p`, `k`
x <- datsim(200, 5000, 5)
system.time(t(table(stack(lapply(x, as.character)))))
# user system elapsed
# 0.592 0.052 0.646
while my answer takes:
system.time(do.call(rbind, tapply(unlist(x, use.names = FALSE), rep(1:ncol(x), each = nrow(x)), table)))
# user system elapsed
# 1.844 0.056 1.904
and the improved version of Dirty's answer:
system.time({clevels <- levels(unlist(x, use.names = FALSE));
do.call(rbind, lapply(x, function(u) table(factor(u, levels=clevels))))})
# user system elapsed
# 1.240 0.012 1.263
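For reference, the stack()/table() one-liner timed above can be tried on the small example from the question. This is a sketch under the assumption that the data frame has named columns; df and res are illustrative names:

```r
df <- data.frame(
  X1 = c("A", "A", "B", "A"),
  X2 = c("B", "C", "A", "A"),
  X3 = c("A", "B", "B", "C"),
  stringsAsFactors = TRUE
)

# stack() turns the list of columns into a long two-column data frame:
# `values` holds the entries and `ind` records which column each came from
long <- stack(lapply(df, as.character))

# Cross-tabulate values by column, then transpose so columns become rows
res <- t(table(long))
res
```

The cross-tabulation fills in zero counts automatically, so no explicit handling of shared factor levels is needed here.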