R rolling up rows to a single row (continuous & factor variables)
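I am trying to roll up a bunch of rows for one day into a single row. I would like to do it in dplyr if possible. I know my code is far from correct, but this is what I have so far: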
data %>%
group_by(DAY) %>%
summarise_each(funs(Sum = n()), SEX, GROUP, TOTAL)
Original:
DAY SEX GROUP TOTAL
7/1/14 FEMALE A 1
7/1/14 FEMALE B 1
7/1/14 FEMALE B 1
7/1/14 FEMALE A 1
7/1/14 MALE A 1
7/1/14 MALE B 2
New:
DAY FEMALE MALE GROUP_A GROUP_B TOTAL
7/1/14 4 2 3 3 7
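The total (a sum) and the other columns (a table of counts) are computed in quite different ways, so you will probably have to do those steps separately. Computing the total is easy. For the counts, I suggest using tidyr as follows: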
# required packages
require(dplyr)
require(tidyr)
# calculations
data %>%
  group_by(DAY) %>%                        # group by day
  mutate(TOTAL = sum(TOTAL)) %>%           # first calculate total
  gather(key, value, -DAY, -TOTAL) %>%     # collapse
  unite(group, key, value) %>%             # get sensible column names
  group_by(DAY, TOTAL) %>%                 # group by day and total
  do(as.data.frame(table(.$group))) %>%    # table
  spread(Var1, Freq)                       # spread out
## DAY TOTAL GROUP_A GROUP_B SEX_FEMALE SEX_MALE
## 1 7/1/14 7 3 3 4 2
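For what it's worth, gather() and spread() have since been superseded in tidyr by pivot_longer() and pivot_wider(). Below is a rough sketch of the same idea with the newer verbs. It assumes a recent dplyr (>= 1.0.0) and tidyr (>= 1.1.0), and it rebuilds the question's sample table with character columns (my assumption, to sidestep combining factors that have different levels). The names totals, counts and result are just illustrative.

```r
library(dplyr)
library(tidyr)

# Sample data copied from the question (character columns are an assumption)
data <- data.frame(
  DAY   = rep("7/1/14", 6),
  SEX   = c("FEMALE", "FEMALE", "FEMALE", "FEMALE", "MALE", "MALE"),
  GROUP = c("A", "B", "B", "A", "A", "B"),
  TOTAL = c(1L, 1L, 1L, 1L, 1L, 2L),
  stringsAsFactors = FALSE
)

# Step 1: sum TOTAL per day
totals <- data %>%
  group_by(DAY) %>%
  summarise(TOTAL = sum(TOTAL), .groups = "drop")

# Step 2: count every SEX / GROUP value per day and spread the counts out
counts <- data %>%
  select(-TOTAL) %>%
  pivot_longer(-DAY, names_to = "key", values_to = "value") %>%
  unite("group", key, value) %>%            # e.g. "SEX_FEMALE", "GROUP_A"
  count(DAY, group) %>%
  pivot_wider(names_from = group, values_from = n, values_fill = 0L)

# Step 3: put the two pieces back together, one row per day
result <- left_join(counts, totals, by = "DAY")
print(result)
```

Keeping the sum and the counts as two separate steps mirrors the point made above: they are different kinds of summaries, so joining them at the end tends to be simpler than forcing both through one summarise call.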
This may look a bit cryptic, but it is a short incantation:
dat %>% group_by(DAY) %>%
summarise_each(funs(ifelse(is.numeric(.), sum(.), list(table(.))))) -> res
data.frame(DAY=res$DAY, t(unlist(res[, 2:ncol(res)])))
# DAY SEX.FEMALE SEX.MALE GROUP.A GROUP.B TOTAL
# 1 7/1/14 4 2 3 3 7
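Here, each column is summarised as a table if it is not numeric, and summed if it is (for the TOTAL column). The summary has to be returned as a list because summarise_each expects a single value per group. The result is then expanded into a regular data.frame.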
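The "sum if numeric, table otherwise" rule described above can also be spelled out step by step in base R, which may make the one-liner easier to follow. This is only a sketch of the idea, not the code from the answer: summarise_col and roll_up are hypothetical helper names, and the sample data is the two-day data set from the end of this thread.

```r
# Two-day sample data, same values as the data posted at the end of the thread
data <- data.frame(
  DAY   = c("7/1/14", "7/1/14", "7/1/14", "8/1/14", "8/1/14", "8/1/14"),
  SEX   = factor(c("FEMALE", "FEMALE", "FEMALE", "FEMALE", "MALE", "MALE")),
  GROUP = factor(c("A", "B", "B", "A", "A", "B")),
  TOTAL = c(1L, 1L, 1L, 1L, 1L, 2L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: sum a numeric column, otherwise count each level.
# c() turns the 1-d table into a plain named integer vector.
summarise_col <- function(x) {
  if (is.numeric(x)) sum(x) else c(table(x))
}

# Hypothetical helper: roll one day's rows up into a single named vector.
# unlist() builds names such as "SEX.FEMALE" from the column and level names.
roll_up <- function(d) {
  unlist(lapply(d[setdiff(names(d), "DAY")], summarise_col))
}

# Apply the roll-up to each day and bind the results into one row per day
pieces <- lapply(split(data, data$DAY), roll_up)
result <- data.frame(DAY = names(pieces), do.call(rbind, pieces),
                     row.names = NULL, stringsAsFactors = FALSE)
print(result)
```

Because SEX and GROUP are factors, table() reports a zero for levels that do not occur on a given day, so every day yields a vector of the same length and the rows line up when bound together.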
One possible approach:
library(reshape2)
library(data.table)
cbind(dcast(df, DAY~SEX),
dcast(df, DAY~GROUP)[-1],
setDT(df)[,.(total=sum(TOTAL)),DAY][,-1,with=F])
# DAY FEMALE MALE A B total
#1 7/1/14 4 2 3 3 7
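If pulling in reshape2 and data.table is not an option, the same three pieces (counts by SEX, counts by GROUP, and the summed TOTAL) can be approximated with base R alone. This is a sketch under my own assumptions: it swaps dcast for table() plus as.data.frame.matrix() and the data.table sum for tapply(), and it rebuilds the question's one-day sample as df.

```r
# One-day sample data copied from the question
df <- data.frame(
  DAY   = rep("7/1/14", 6),
  SEX   = c("FEMALE", "FEMALE", "FEMALE", "FEMALE", "MALE", "MALE"),
  GROUP = c("A", "B", "B", "A", "A", "B"),
  TOTAL = c(1L, 1L, 1L, 1L, 1L, 2L),
  stringsAsFactors = FALSE
)

# Cross-tabulate each categorical column against DAY (one row per day)
sex_counts   <- table(df$DAY, df$SEX)
group_counts <- table(df$DAY, df$GROUP)

# Sum TOTAL within each day
totals <- tapply(df$TOTAL, df$DAY, sum)

# Bind the pieces side by side, indexing totals by day so the rows align
days <- rownames(sex_counts)
result <- cbind(
  DAY = days,
  as.data.frame.matrix(sex_counts),
  as.data.frame.matrix(group_counts),
  total = as.vector(totals[days])
)
rownames(result) <- NULL
print(result)
```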
Another way with data.table, tested on a data.frame with more than one day.
require(data.table)
setDT(data)[, as.list(c(table(SEX), table(GROUP), TOTAL=sum(TOTAL))), by=DAY]
# DAY FEMALE MALE A B TOTAL
#1: 7/1/14 3 0 1 2 3
#2: 8/1/14 1 2 2 1 4
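Edit: another, less manual option (you don't need to know which variables are factors and which are numeric), thanks to some help from @jangorecki and @David Arenburg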
wh_num <- sapply(data, is.numeric)[-1]
wh_fact <- sapply(data, is.factor)[-1]
setDT(data)[, as.list(c(lapply(.SD[, wh_fact, with = FALSE], table),
                        lapply(.SD[, wh_num, with = FALSE], sum),
                        recursive = TRUE)), by = DAY]
# DAY SEX.FEMALE SEX.MALE GROUP.A GROUP.B TOTAL
#1: 7/1/14 3 0 1 2 3
#2: 8/1/14 1 2 2 1 4
Data
data <- structure(list(DAY = c("7/1/14", "7/1/14", "7/1/14", "8/1/14",
"8/1/14", "8/1/14"), SEX = structure(c(1L, 1L, 1L, 1L, 2L, 2L
), .Label = c("FEMALE", "MALE"), class = "factor"), GROUP = structure(c(1L,
2L, 2L, 1L, 1L, 2L), .Label = c("A", "B"), class = "factor"),
TOTAL = c(1L, 1L, 1L, 1L, 1L, 2L)), .Names = c("DAY", "SEX",
"GROUP", "TOTAL"), row.names = c(NA, -6L), class = "data.frame")