R data.table - 是否有更有效的方法来融合多种方式并合并结果?
R data.table - Is there a more efficient way of melting multiple ways and merging results?
我正在尝试以两种不同的方式 melt
data.table
。然后我最终不得不合并结果 - 这很尴尬,因为我有不同的 measure.vars
并且如果我有更宽的 table / 更复杂的列名将无法扩展。
我从这个开始 data.table:
id p1 p2 p1_pos p2_pos
1: 1 A F 0.70644404 0.75523969
2: 2 B G 0.96798381 0.26280453
3: 3 C H 0.35558517 0.45418777
4: 4 D I 0.14662296 0.01969177
5: 5 E J 0.45155647 0.41373110
6: 6 A F 0.81074292 0.19421395
7: 7 B G 0.49014540 0.02094569
8: 8 C H 0.01445689 0.20199638
9: 9 D I 0.80327645 0.73982715
10: 10 E J 0.17625955 0.88250913
接下来我继续融化两次然后合并:
dat = data.table(id = as.character(rep(1:10)),
p1 = rep(c("A", "B", "C", "D", "E"), 2),
p2 = rep(c("F", "G", "H", "I", "J"), 2),
p1_pos = runif(10),
p2_pos = runif(10))
first_melt = melt(dat, id.vars = "id",
measure.vars = c("p1", "p2"),
variable.name = "loc",
value.name = "name",
value.factor = F,
variable.factor = F)
second_melt = melt(dat,
id.vars = "id",
measure.vars = c("p1_pos", "p2_pos"),
variable.name = "loc",
value.name = "pos",
value.factor = F,
variable.factor = F)
second_melt[, loc := substr(loc, 1,2)]
result = merge(first_melt, second_melt, by = c("id", "loc"))
result[order(id)]
尴尬来自不同的“measure.vars”,然后需要合并。
这会产生预期的结果:
id loc name pos
1: 1 p1 A 0.70644404
2: 1 p2 F 0.75523969
3: 10 p1 E 0.17625955
4: 10 p2 J 0.88250913
5: 2 p1 B 0.96798381
6: 2 p2 G 0.26280453
7: 3 p1 C 0.35558517
8: 3 p2 H 0.45418777
9: 4 p1 D 0.14662296
10: 4 p2 I 0.01969177
11: 5 p1 E 0.45155647
12: 5 p2 J 0.41373110
13: 6 p1 A 0.81074292
14: 6 p2 F 0.19421395
15: 7 p1 B 0.49014540
16: 7 p2 G 0.02094569
17: 8 p1 C 0.01445689
18: 8 p2 H 0.20199638
19: 9 p1 D 0.80327645
20: 9 p2 I 0.73982715
我的问题是这些是否是更有效的方法(即在单个 melt
命令中)?还是我尽可能地接近这个?
这是通过一个相同的列名合并两个数据框的简单方法
B = merge(df,otherdataname,by.x=1,by.y=1)
这个怎么样?
data.table::melt(
dat, id.vars = "id", measure.vars = patterns("p[0-9]+$", "p[0-9]+_pos"),
variable.name = "loc", value.name = c("name", "pos")
)[, loc := paste0("p", loc)][]
# id loc name pos
# <int> <char> <char> <num>
# 1: 1 p1 A 0.70644404
# 2: 2 p1 B 0.96798381
# 3: 3 p1 C 0.35558517
# 4: 4 p1 D 0.14662296
# 5: 5 p1 E 0.45155647
# 6: 6 p1 A 0.81074292
# 7: 7 p1 B 0.49014540
# 8: 8 p1 C 0.01445689
# 9: 9 p1 D 0.80327645
# 10: 10 p1 E 0.17625955
# 11: 1 p2 F 0.75523969
# 12: 2 p2 G 0.26280453
# 13: 3 p2 H 0.45418777
# 14: 4 p2 I 0.01969177
# 15: 5 p2 J 0.41373110
# 16: 6 p2 F 0.19421395
# 17: 7 p2 G 0.02094569
# 18: 8 p2 H 0.20199638
# 19: 9 p2 I 0.73982715
# 20: 10 p2 J 0.88250913
# id loc name pos
数据
library(data.table)
dat <- setDT(structure(list(id = 1:10, p1 = c("A", "B", "C", "D", "E", "A", "B", "C", "D", "E"), p2 = c("F", "G", "H", "I", "J", "F", "G", "H", "I", "J"), p1_pos = c(0.70644404, 0.96798381, 0.35558517, 0.14662296, 0.45155647, 0.81074292, 0.4901454, 0.01445689, 0.80327645, 0.17625955), p2_pos = c(0.75523969, 0.26280453, 0.45418777, 0.01969177, 0.4137311, 0.19421395, 0.02094569, 0.20199638, 0.73982715, 0.88250913)), row.names = c(NA, -10L), class = c("data.table", "data.frame"))_
这是一个 data.table 方法,与 r2evans 的回答略有不同。
无论 p1-p2 的编号如何,它都会起作用...它会很容易处理不同的(或非数字的)名称。
# get different p's
p.v <- grep("^p[0-9]$", names(dat), value = TRUE)
# now melt
dat.melt <- melt(dat,
id.vars = "id",
measure.vars = patterns(name = "^p[0-9]$", pos = "^p[0-9]_pos$"),
variable.name = "loc")
# set loc-attributes
setattr(dat.melt$loc, "levels", p.v)
# id loc name pos
# 1: 1 p1 A 0.59099882
# 2: 2 p1 B 0.79727305
# 3: 3 p1 C 0.04180905
# 4: 4 p1 D 0.53533198
# 5: 5 p1 E 0.75851590
# 6: 6 p1 A 0.47344565
# 7: 7 p1 B 0.47035125
# 8: 8 p1 C 0.88675906
# 9: 9 p1 D 0.18159266
#10: 10 p1 E 0.97808083
#11: 1 p2 F 0.84751133
#12: 2 p2 G 0.13917469
#13: 3 p2 H 0.57787425
#14: 4 p2 I 0.20052178
#15: 5 p2 J 0.49654451
#16: 6 p2 F 0.62705394
#17: 7 p2 G 0.04015590
#18: 8 p2 H 0.29792342
#19: 9 p2 I 0.72457705
#20: 10 p2 J 0.05694427
我正在尝试以两种不同的方式 melt
data.table
。然后我最终不得不合并结果 - 这很尴尬,因为我有不同的 measure.vars
并且如果我有更宽的 table / 更复杂的列名将无法扩展。
我从这个开始 data.table:
id p1 p2 p1_pos p2_pos
1: 1 A F 0.70644404 0.75523969
2: 2 B G 0.96798381 0.26280453
3: 3 C H 0.35558517 0.45418777
4: 4 D I 0.14662296 0.01969177
5: 5 E J 0.45155647 0.41373110
6: 6 A F 0.81074292 0.19421395
7: 7 B G 0.49014540 0.02094569
8: 8 C H 0.01445689 0.20199638
9: 9 D I 0.80327645 0.73982715
10: 10 E J 0.17625955 0.88250913
接下来我继续融化两次然后合并:
dat = data.table(id = as.character(rep(1:10)),
p1 = rep(c("A", "B", "C", "D", "E"), 2),
p2 = rep(c("F", "G", "H", "I", "J"), 2),
p1_pos = runif(10),
p2_pos = runif(10))
first_melt = melt(dat, id.vars = "id",
measure.vars = c("p1", "p2"),
variable.name = "loc",
value.name = "name",
value.factor = F,
variable.factor = F)
second_melt = melt(dat,
id.vars = "id",
measure.vars = c("p1_pos", "p2_pos"),
variable.name = "loc",
value.name = "pos",
value.factor = F,
variable.factor = F)
second_melt[, loc := substr(loc, 1,2)]
result = merge(first_melt, second_melt, by = c("id", "loc"))
result[order(id)]
尴尬来自不同的“measure.vars”,然后需要合并。
这会产生预期的结果:
id loc name pos
1: 1 p1 A 0.70644404
2: 1 p2 F 0.75523969
3: 10 p1 E 0.17625955
4: 10 p2 J 0.88250913
5: 2 p1 B 0.96798381
6: 2 p2 G 0.26280453
7: 3 p1 C 0.35558517
8: 3 p2 H 0.45418777
9: 4 p1 D 0.14662296
10: 4 p2 I 0.01969177
11: 5 p1 E 0.45155647
12: 5 p2 J 0.41373110
13: 6 p1 A 0.81074292
14: 6 p2 F 0.19421395
15: 7 p1 B 0.49014540
16: 7 p2 G 0.02094569
17: 8 p1 C 0.01445689
18: 8 p2 H 0.20199638
19: 9 p1 D 0.80327645
20: 9 p2 I 0.73982715
我的问题是这些是否是更有效的方法(即在单个 melt
命令中)?还是我尽可能地接近这个?
这是通过一个相同的列名合并两个数据框的简单方法
B = merge(df,otherdataname,by.x=1,by.y=1)
这个怎么样?
data.table::melt(
dat, id.vars = "id", measure.vars = patterns("p[0-9]+$", "p[0-9]+_pos"),
variable.name = "loc", value.name = c("name", "pos")
)[, loc := paste0("p", loc)][]
# id loc name pos
# <int> <char> <char> <num>
# 1: 1 p1 A 0.70644404
# 2: 2 p1 B 0.96798381
# 3: 3 p1 C 0.35558517
# 4: 4 p1 D 0.14662296
# 5: 5 p1 E 0.45155647
# 6: 6 p1 A 0.81074292
# 7: 7 p1 B 0.49014540
# 8: 8 p1 C 0.01445689
# 9: 9 p1 D 0.80327645
# 10: 10 p1 E 0.17625955
# 11: 1 p2 F 0.75523969
# 12: 2 p2 G 0.26280453
# 13: 3 p2 H 0.45418777
# 14: 4 p2 I 0.01969177
# 15: 5 p2 J 0.41373110
# 16: 6 p2 F 0.19421395
# 17: 7 p2 G 0.02094569
# 18: 8 p2 H 0.20199638
# 19: 9 p2 I 0.73982715
# 20: 10 p2 J 0.88250913
# id loc name pos
数据
library(data.table)
dat <- setDT(structure(list(id = 1:10, p1 = c("A", "B", "C", "D", "E", "A", "B", "C", "D", "E"), p2 = c("F", "G", "H", "I", "J", "F", "G", "H", "I", "J"), p1_pos = c(0.70644404, 0.96798381, 0.35558517, 0.14662296, 0.45155647, 0.81074292, 0.4901454, 0.01445689, 0.80327645, 0.17625955), p2_pos = c(0.75523969, 0.26280453, 0.45418777, 0.01969177, 0.4137311, 0.19421395, 0.02094569, 0.20199638, 0.73982715, 0.88250913)), row.names = c(NA, -10L), class = c("data.table", "data.frame"))_
这是一个 data.table 方法,与 r2evans 的回答略有不同。 无论 p1-p2 的编号如何,它都会起作用...它会很容易处理不同的(或非数字的)名称。
# get different p's
p.v <- grep("^p[0-9]$", names(dat), value = TRUE)
# now melt
dat.melt <- melt(dat,
id.vars = "id",
measure.vars = patterns(name = "^p[0-9]$", pos = "^p[0-9]_pos$"),
variable.name = "loc")
# set loc-attributes
setattr(dat.melt$loc, "levels", p.v)
# id loc name pos
# 1: 1 p1 A 0.59099882
# 2: 2 p1 B 0.79727305
# 3: 3 p1 C 0.04180905
# 4: 4 p1 D 0.53533198
# 5: 5 p1 E 0.75851590
# 6: 6 p1 A 0.47344565
# 7: 7 p1 B 0.47035125
# 8: 8 p1 C 0.88675906
# 9: 9 p1 D 0.18159266
#10: 10 p1 E 0.97808083
#11: 1 p2 F 0.84751133
#12: 2 p2 G 0.13917469
#13: 3 p2 H 0.57787425
#14: 4 p2 I 0.20052178
#15: 5 p2 J 0.49654451
#16: 6 p2 F 0.62705394
#17: 7 p2 G 0.04015590
#18: 8 p2 H 0.29792342
#19: 9 p2 I 0.72457705
#20: 10 p2 J 0.05694427