How do I compute the frequency/table of categorical variables by group with R data.table?
I have the following data.table in R:
library(data.table)
dt = data.table(ID = c("person1", "person1", "person1", "person2", "person2", "person2", "person2", "person2", ...), category = c("red", "red", "blue", "red", "red", "blue", "green", "green", ...))
dt
ID category
person1 red
person1 red
person1 blue
person2 red
person2 red
person2 blue
person2 green
person2 green
person3 blue
....
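I'm looking for a way to compute the "frequency" of the categorical values red, blue, and green for each unique ID, and then expand these values into columns that record the count of each. The resulting data.table would look like this: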
dt
ID red blue green
person1 2 1 0
person2 2 1 2
...
I mistakenly thought the right way to start with data.table would be to compute table() by group, e.g.
dt[, counts :=table(category), by=ID]
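But this appears to count the total number of categorical values by group ID. It also doesn't solve my problem of "expanding" the data.table.
What is the correct way to do this?
Something like this?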
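library(data.table)
library(dplyr)  # only needed for the %>% pipe
# Count rows per (ID, category), then cast the categories into columns
dt[, .N, by = .(ID, category)] %>% dcast(ID ~ category, value.var = "N")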
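The same two steps can also be written with data.table alone, without the pipe. The following is a minimal sketch on the toy data from the question (without the elided rows); the names long_counts and wide_counts are made up for illustration, and fill = 0L is used so that pairs that never occur show up as 0 rather than NA.

```r
library(data.table)

# Toy data copied from the question (without the elided rows)
dt <- data.table(
  ID = c("person1", "person1", "person1",
         "person2", "person2", "person2", "person2", "person2"),
  category = c("red", "red", "blue", "red", "red", "blue", "green", "green")
)

# Step 1: one row per (ID, category) pair, with its count in column N
long_counts <- dt[, .N, by = .(ID, category)]

# Step 2: spread the categories into columns; fill = 0L replaces the NA
# that would otherwise appear for pairs that never occur (person1/green)
wide_counts <- dcast(long_counts, ID ~ category, value.var = "N", fill = 0L)

# Optional: reorder the columns to match the layout asked for in the question
setcolorder(wide_counts, c("ID", "red", "blue", "green"))
print(wide_counts)
```

Keeping the long table as its own object makes it easy to inspect the counts before they are reshaped.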
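If you want to add these columns to the original data.table:
counts <- dt[, .N, by = .(ID, category)] %>% dcast(ID ~ category, value.var = "N")
counts[is.na(counts)] <- 0  # ID/category pairs that never occur come back as NA
output <- merge(dt, counts, by = "ID")  # repeats each ID's counts on every one of its rows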
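As an alternative to merge(), the count columns can be attached to dt in place with an update join. This is only a sketch on the question's toy data; count_cols is a name introduced here for illustration, and the counts table is built with dcast() and fun.aggregate = length so that missing pairs are already 0.

```r
library(data.table)

# Toy data copied from the question (without the elided rows)
dt <- data.table(
  ID = c("person1", "person1", "person1",
         "person2", "person2", "person2", "person2", "person2"),
  category = c("red", "red", "blue", "red", "red", "blue", "green", "green")
)

# Wide table of counts: one row per ID, one column per category
counts <- dcast(dt, ID ~ category, fun.aggregate = length, value.var = "category")

# Every column except the key is a count column
count_cols <- setdiff(names(counts), "ID")

# Update join: for each row of dt, look up the matching row of counts by ID
# and add its count columns by reference (dt keeps its original row order)
dt[counts, on = "ID", (count_cols) := mget(paste0("i.", count_cols))]
print(dt)
```

Unlike merge(), which returns a new table sorted by the key, this modifies dt itself, so take a copy() first if the original must stay untouched.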
This is done in an imperative style; there is probably a more concise, functional way to do it.
library(data.table)
library(dplyr)  # supplies %>% and filter(), which the loop below relies on
dt = data.table(ID = c("person1", "person1", "person1", "person2", "person2", "person2", "person2", "person2"),
                category = c("red", "red", "blue", "red", "red", "blue", "green", "green"))
ids <- unique(dt$ID)
categories <- unique(dt$category)
# Result matrix: one row per ID, one column per category
counts <- matrix(nrow = length(ids), ncol = length(categories))
rownames(counts) <- ids
colnames(counts) <- categories
for (i in seq_along(ids)) {
  for (j in seq_along(categories)) {
    # Count the rows that match this ID/category pair
    count <- dt %>%
      filter(ID == ids[i], category == categories[j]) %>%
      nrow()
    counts[i, j] <- count
  }
}
Then:
>counts
## red blue green
##person1 2 1 0
##person2 2 1 2
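One possible functional version of the nested loop uses two nested sapply() calls in base R. Treat it as a sketch rather than the most efficient option: it rescans the data once per ID/category pair, just as the loop does. The argument names ctg and id are arbitrary.

```r
library(data.table)

# Toy data copied from the question (without the elided rows)
dt <- data.table(
  ID = c("person1", "person1", "person1",
         "person2", "person2", "person2", "person2", "person2"),
  category = c("red", "red", "blue", "red", "red", "blue", "green", "green")
)

ids <- unique(dt$ID)
categories <- unique(dt$category)

# Outer sapply builds one column per category, inner sapply one entry per ID.
# Because both inputs are character vectors, sapply names the result with
# them, so the row and column names do not have to be set by hand.
counts <- sapply(categories, function(ctg) {
  sapply(ids, function(id) sum(dt$ID == id & dt$category == ctg))
})
print(counts)
```

With the two IDs used here the result is a matrix with IDs as rows and categories as columns, in order of first appearance.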
This takes a single line with the reshape2 library:
library(reshape2)
dcast(data=dt,
ID ~ category,
fun.aggregate = length,
value.var = "category")
ID blue green red
1 person1 1 0 2
2 person2 1 2 2
Also, if you only need a simple 2-way table, you can use R's built-in table function.
table(dt$ID,dt$category)
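If the result of table() needs to end up as a data.table again, one way is to go through as.data.frame.matrix(), which keeps the wide layout. This is a sketch on the question's toy data; tab and wide are illustrative names.

```r
library(data.table)

# Toy data copied from the question (without the elided rows)
dt <- data.table(
  ID = c("person1", "person1", "person1",
         "person2", "person2", "person2", "person2", "person2"),
  category = c("red", "red", "blue", "red", "red", "blue", "green", "green")
)

# 2-way contingency table: IDs as rows, categories as columns
tab <- table(dt$ID, dt$category)

# as.data.frame.matrix() preserves the wide shape (plain as.data.frame()
# would return long format); keep.rownames turns the row names into an ID column
wide <- as.data.table(as.data.frame.matrix(tab), keep.rownames = "ID")
print(wide)
```

The category columns come out in alphabetical order because table() sorts the values it tabulates; setcolorder() can rearrange them if a specific order is wanted.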