Calculating columns using number strings in column names, data.table
I have a table (called money_table) in which many of the columns carry a fiscal-year suffix in their names:
ID LunchMoney_1213 DinnerMoney_1213 LunchMoney_1314 DinnerMoney_1314
01 12 24 17 18
02 234 12 43 44
03 14 19 2 12
I need to create new columns that add up the LunchMoney and DinnerMoney amounts for each year, and delete the old columns. The idea is to end up with this:
ID TotalMoney_1213 TotalMoney_1314
01 36 35
02 246 87
03 33 14
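I'm using data.table because the table is very large, and I can do what I want with the following code:
money_table[, `:=`(TotalMoney_1213 = LunchMoney_1213 + DinnerMoney_1213,
                   TotalMoney_1314 = LunchMoney_1314 + DinnerMoney_1314)][, c("LunchMoney_1213", "DinnerMoney_1213", "LunchMoney_1314", "DinnerMoney_1314") := NULL]
But there are many years, and writing it all out this way takes too long. I know there must be a way to use the numbers in the column names to do this more efficiently, but I haven't been able to work it out.
Any help is much appreciated.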
I'd suggest converting your data frame to a tidy format, where each column is a variable:
library(dplyr)
library(tidyr)
money_table %>%
  gather("key", "value", -ID) %>% # Wide -> long format
  separate(key, into = c("type", "year"), sep = "_") %>% # Split the former column names into type and year
  spread(type, value) %>% # DinnerMoney and LunchMoney are now two separate variables with values for each year
  group_by(ID, year) %>% # Group by ID and year
  summarize(DinnerMoney = sum(DinnerMoney), # Sum DinnerMoney and LunchMoney within each ID and year
            LunchMoney = sum(LunchMoney)) %>%
  mutate(total_value = DinnerMoney + LunchMoney) # Total value for each year
# A tibble: 6 x 5
# Groups: ID [3]
ID year DinnerMoney LunchMoney total_value
<int> <chr> <int> <int> <int>
1 1 1213 24 12 36
2 1 1314 18 17 35
3 2 1213 12 234 246
4 2 1314 44 43 87
5 3 1213 19 14 33
6 3 1314 12 2 14
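If you want each year's total as its own column, you can pivot the table like this (assuming the result of the pipeline above has been assigned back to money_table):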
money_table %>%
select(ID, year, total_value) %>%
spread(year, total_value)
# A tibble: 3 x 3
# Groups: ID [3]
ID `1213` `1314`
<int> <int> <int>
1 1 36 35
2 2 246 87
3 3 33 14
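In current tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(). The same idea can be sketched with those functions as follows. The sample data is rebuilt here so the block runs on its own, and the names totals and total are only illustrative, not from the original answer.

```r
library(dplyr)
library(tidyr)

# Sample data from the question, rebuilt as a tibble
money_table <- tibble(
  ID = 1:3,
  LunchMoney_1213 = c(12L, 234L, 14L),
  DinnerMoney_1213 = c(24L, 12L, 19L),
  LunchMoney_1314 = c(17L, 43L, 2L),
  DinnerMoney_1314 = c(18L, 44L, 12L)
)

totals <- money_table %>%
  # Split each column name at "_" into a type part and a year part
  pivot_longer(-ID, names_to = c("type", "year"), names_sep = "_") %>%
  # Sum lunch and dinner money within each ID and year
  group_by(ID, year) %>%
  summarise(total = sum(value), .groups = "drop") %>%
  # One column per year, prefixed the way the question asks for
  pivot_wider(names_from = year, values_from = total,
              names_prefix = "TotalMoney_")

totals
```

Summing directly on the long data skips the intermediate spread() into separate LunchMoney and DinnerMoney columns, so the same code should also cope with additional money types sharing the same suffix scheme.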
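I think pivoting (longer), summarizing, then pivoting (wider) works. (I wonder whether keeping the data in long form might be better in the long run; that's up to you.)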
library(data.table)
money_table <- setDT(structure(list(ID = 1:3, LunchMoney_1213 = c(12L, 234L, 14L), DinnerMoney_1213 = c(24L, 12L, 19L), LunchMoney_1314 = c(17L, 43L, 2L), DinnerMoney_1314 = c(18L, 44L, 12L)), row.names = c(NA, -3L), class = "data.frame"))
dcast(
  melt(money_table, id.vars = "ID"
  )[, yr := paste0("TotalMoney_", gsub(".*_", "", variable))
  ][, .(value = sum(value)), by = .(ID, yr)
  ],
  ID ~ yr, value.var = "value")
# ID TotalMoney_1213 TotalMoney_1314
# <int> <int> <int>
# 1: 1 36 35
# 2: 2 246 87
# 3: 3 33 14
If you already use magrittr for other things (with or without dplyr; I use it with data.table all the time), this may be slightly more readable:
library(magrittr)
melt(money_table, id.vars = "ID") %>%
.[, yr := paste0("TotalMoney_", gsub(".*_", "", variable))] %>%
.[, .(value = sum(value)), by = .(ID, yr)] %>%
dcast(., ID ~ yr, value.var = "value")
We can try data.table together with split.default:
money_table[, lapply(split.default(.SD, paste0("TotalMoney_", gsub(".*_", "", names(.SD)))), sum), ID]
ID TotalMoney_1213 TotalMoney_1314
1: 1 36 35
2: 2 246 87
3: 3 33 14
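Since the question asks for the new columns to be added and the old ones dropped, an alternative that stays in wide form and updates by reference could look like the sketch below. It is not taken from any of the answers; it assumes every money column ends in an underscore followed by the year suffix, and money_cols, years, cols and new_col are illustrative names.

```r
library(data.table)

# Sample data from the question
money_table <- data.table(
  ID = 1:3,
  LunchMoney_1213 = c(12L, 234L, 14L),
  DinnerMoney_1213 = c(24L, 12L, 19L),
  LunchMoney_1314 = c(17L, 43L, 2L),
  DinnerMoney_1314 = c(18L, 44L, 12L)
)

# Every column except ID is assumed to be a money column with a year suffix
money_cols <- setdiff(names(money_table), "ID")
years <- unique(sub(".*_", "", money_cols))

for (yr in years) {
  # Columns belonging to this fiscal year
  cols <- grep(paste0("_", yr, "$"), money_cols, value = TRUE)
  new_col <- paste0("TotalMoney_", yr)
  # Element-wise sum of those columns, added by reference
  money_table[, (new_col) := Reduce(`+`, .SD), .SDcols = cols]
}

# Drop the original columns by reference
money_table[, (money_cols) := NULL]

money_table
```

Because nothing is reshaped, this avoids building a long copy of a very large table, at the cost of one pass per year suffix.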
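With development version 1.14.3, data.table has gained a measure() function, which can be used to melt data where each column name encodes several distinct pieces of information (as requested by the OP).
In addition, the aggregation is done within the call to dcast(), which saves a separate aggregation step.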
library(data.table) # development version 1.14.3 used here
melt(money_table, measure.vars = measure(money, year, sep = "_"))[
, dcast(.SD, ID ~ paste0("TotalMoney_", year), sum)]
ID TotalMoney_1213 TotalMoney_1314
1: 1 36 35
2: 2 246 87
3: 3 33 14
The call to measure() tells melt() to split the column names into two parts, the name part (called money) and the year part:
melt(money_table, measure.vars = measure(money, year, sep = "_"))
ID money year value
1: 1 LunchMoney 1213 12
2: 2 LunchMoney 1213 234
3: 3 LunchMoney 1213 14
4: 1 DinnerMoney 1213 24
5: 2 DinnerMoney 1213 12
6: 3 DinnerMoney 1213 19
7: 1 LunchMoney 1314 17
8: 2 LunchMoney 1314 43
9: 3 LunchMoney 1314 2
10: 1 DinnerMoney 1314 18
11: 2 DinnerMoney 1314 44
12: 3 DinnerMoney 1314 12
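If measure() is not available in your installed data.table version, a similar long table can be produced with a plain melt() followed by tstrsplit(). This is a sketch rather than part of the original answer; long, wide and yr are illustrative names, and the sample data is rebuilt so the block runs on its own.

```r
library(data.table)

# Sample data from the question
money_table <- data.table(
  ID = 1:3,
  LunchMoney_1213 = c(12L, 234L, 14L),
  DinnerMoney_1213 = c(24L, 12L, 19L),
  LunchMoney_1314 = c(17L, 43L, 2L),
  DinnerMoney_1314 = c(18L, 44L, 12L)
)

# Melt to long form, keeping the old column names as character
long <- melt(money_table, id.vars = "ID", variable.factor = FALSE)

# Split the old column names at "_" into the name part and the year part
long[, c("money", "year") := tstrsplit(variable, "_", fixed = TRUE)]
long[, variable := NULL]

# Build the target column names, then aggregate inside dcast()
long[, yr := paste0("TotalMoney_", year)]
wide <- dcast(long, ID ~ yr, fun.aggregate = sum, value.var = "value")

wide
```

Here year stays a character column, which is sufficient for building the new column names.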
Data
library(data.table)
money_table <- fread("
ID LunchMoney_1213 DinnerMoney_1213 LunchMoney_1314 DinnerMoney_1314
01 12 24 17 18
02 234 12 43 44
03 14 19 2 12")