Conditional merging with many variables
I am wondering whether there is a faster, more elegant data.table solution to the following problem.
Suppose we have two datasets:
set.seed(1)
library(data.table)
DT1 <- data.table(income = runif(10, 0, 1),
                  ID = 1:10)
DT2 <- data.table(height = runif(20, 0, 1),
                  weight = runif(20, 0, 1),
                  V1 = runif(20, 0, 1),
                  V2 = runif(20, 0, 2),
                  type = rep(c("Parents", "Children"), 10))
DT2 <- DT2[order(type)][, ID := rep(1:10,2)]
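The first dataset is a "family"-level dataset, with the family identifier given by ID, which takes the values 1:10.
There is also a second dataset, DT2, which holds four variables for both the parent and the child of each family ID. What I want to do is merge all of the variables (height, weight, V1, V2) for Parents and for Children into each row/observation of DT1. So there are eight variables to merge: four for parents and four for children.
To do this, I could simply write the following: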
DT1[DT2[type == "Parents"],
    c("height_parents", "weight_parents", "V1_parents", "V2_parents") :=
      list(i.height, i.weight, i.V1, i.V2),
    on = c(ID = "ID")]
DT1[DT2[type == "Children"],
    c("height_children", "weight_children", "V1_children", "V2_children") :=
      list(i.height, i.weight, i.V1, i.V2),
    on = c(ID = "ID")]
The output looks like this:
income ID height_parents weight_parents V1_parents V2_parents height_children
1: 0.70647001 1 0.49163534 0.164385214 0.6806198 1.72937701 0.04655907
2: 0.07658058 2 0.06776809 0.234182275 0.4456820 1.76814822 0.45665042
3: 0.49770601 3 0.23255515 0.709256017 0.3514867 1.83387012 0.24395311
4: 0.51944306 4 0.30555999 0.974742471 0.0529102 0.06094086 0.22356168
5: 0.23075737 5 0.51104028 0.007269433 0.4157508 1.00207079 0.98915308
6: 0.86449990 6 0.75420198 0.211342425 0.9837331 0.03520897 0.86080818
weight_children V1_children V2_children
1: 0.3398447 0.7454582 0.6761706
2: 0.8475106 0.4716267 1.5231691
3: 0.8895790 0.1561395 1.8721056
4: 0.3503219 0.2663775 0.2408758
5: 0.3902352 0.9332958 1.3532260
6: 0.7648748 0.6969372 1.3289579
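Note that if I have many different "types" (here only two) and/or many different variables to merge (here only 4), the above quickly becomes laborious and verbose. The dataset I am working with has many different variables and types, so I would like to do this more efficiently. In particular, I would like to be able to define a vector: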
merge_variables = c("height", "weight", "V1", "V2")
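and then, in a data.table way, merge all of these variables for both parents and children. I would like the new variable names to have an underscore followed by the type name (e.g. height_parents and height_children).
I hope I have communicated my request clearly.
Thanks!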
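We can use mget to do this: build the column names as strings with paste0, fetch those columns with mget while joining on 'DT2', and assign (:=) the result (mget returns a list) to create the new columns in 'DT1'.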
merge_variables = c("height", "weight", "V1", "V2")
parent_nm <- paste0(merge_variables, "_parents")
child_nm <- paste0(merge_variables, "_children")
DT1[DT2[type == "Parents"], (parent_nm) := mget(paste0("i.", merge_variables)), on = .(ID)]
DT1[DT2[type == "Children"], (child_nm) := mget(paste0("i.", merge_variables)), on = .(ID)]
-output
> DT1
income ID height_parents weight_parents V1_parents V2_parents height_children weight_children V1_children V2_children
1: 0.26550866 1 0.20597457 0.4820801 0.47761962 0.6781459 0.1765568 0.5995658 0.86120948 1.6788807
2: 0.37212390 2 0.68702285 0.4935413 0.43809711 0.6933670 0.3841037 0.1862176 0.24479728 0.6675499
3: 0.57285336 3 0.76984142 0.8273733 0.07067905 0.9527025 0.4976992 0.6684667 0.09946616 1.7843967
4: 0.90820779 4 0.71761851 0.7942399 0.31627171 1.7286789 0.9919061 0.1079436 0.51863426 0.7799791
5: 0.20168193 5 0.38003518 0.7237109 0.66200508 1.5546414 0.7774452 0.4112744 0.40683019 1.9212360
6: 0.89838968 6 0.93470523 0.8209463 0.91287592 0.8693190 0.2121425 0.6470602 0.29360337 1.4250294
7: 0.94467527 7 0.65167377 0.7829328 0.45906573 0.7999887 0.1255551 0.5530363 0.33239467 0.6507043
8: 0.66079779 8 0.26722067 0.5297196 0.65087047 1.5141743 0.3861141 0.7893562 0.25801678 0.4053845
9: 0.62911404 9 0.01339033 0.0233312 0.47854525 1.4222424 0.3823880 0.4772301 0.76631067 0.2433838
10: 0.06178627 10 0.86969085 0.7323137 0.08424691 0.4909770 0.3403490 0.6927316 0.87532133 0.2866088
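Since the question mentions many types, the two calls above could be folded into a loop over the values of type. The following is only a sketch under stated assumptions, not part of the original answer: it assumes each new column suffix is the lowercased type value, and the names tp and new_nm are illustrative. It reuses the same mget pattern shown above.

```r
library(data.table)

# Rebuild the example data from the question
set.seed(1)
DT1 <- data.table(income = runif(10, 0, 1), ID = 1:10)
DT2 <- data.table(height = runif(20, 0, 1),
                  weight = runif(20, 0, 1),
                  V1 = runif(20, 0, 1),
                  V2 = runif(20, 0, 2),
                  type = rep(c("Parents", "Children"), 10))
DT2 <- DT2[order(type)][, ID := rep(1:10, 2)]

merge_variables <- c("height", "weight", "V1", "V2")

# One update join per type found in DT2; DT1 is modified by reference.
# Assumption: the suffix is the lowercased type (e.g. "Parents" -> "_parents").
for (tp in unique(DT2$type)) {
  new_nm <- paste0(merge_variables, "_", tolower(tp))
  DT1[DT2[type == tp],
      (new_nm) := mget(paste0("i.", merge_variables)),
      on = .(ID)]
}
```

With this form, adding a new type to DT2 or a new name to merge_variables needs no further changes to the join code. The new columns appear in the order in which the types are encountered in DT2.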
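Another option is to use dcast to reshape 'DT2' to 'wide' format and then join. Note that nothing is assigned with := here, so the original DT1 is not updated, and the new column names carry the type values exactly as they appear in DT2 (e.g. height_Children).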
DT1[dcast(DT2, ID ~ type, value.var = merge_variables), on = .(ID)]
income ID height_Children height_Parents weight_Children weight_Parents V1_Children V1_Parents V2_Children V2_Parents
1: 0.26550866 1 0.1765568 0.20597457 0.5995658 0.4820801 0.86120948 0.47761962 1.6788807 0.6781459
2: 0.37212390 2 0.3841037 0.68702285 0.1862176 0.4935413 0.24479728 0.43809711 0.6675499 0.6933670
3: 0.57285336 3 0.4976992 0.76984142 0.6684667 0.8273733 0.09946616 0.07067905 1.7843967 0.9527025
4: 0.90820779 4 0.9919061 0.71761851 0.1079436 0.7942399 0.51863426 0.31627171 0.7799791 1.7286789
5: 0.20168193 5 0.7774452 0.38003518 0.4112744 0.7237109 0.40683019 0.66200508 1.9212360 1.5546414
6: 0.89838968 6 0.2121425 0.93470523 0.6470602 0.8209463 0.29360337 0.91287592 1.4250294 0.8693190
7: 0.94467527 7 0.1255551 0.65167377 0.5530363 0.7829328 0.33239467 0.45906573 0.6507043 0.7999887
8: 0.66079779 8 0.3861141 0.26722067 0.7893562 0.5297196 0.25801678 0.65087047 0.4053845 1.5141743
9: 0.62911404 9 0.3823880 0.01339033 0.4772301 0.0233312 0.76631067 0.47854525 0.2433838 1.4222424
10: 0.06178627 10 0.3403490 0.86969085 0.6927316 0.7323137 0.87532133 0.08424691 0.2866088 0.4909770
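If the lowercase names from the question (height_parents, height_children, ...) and an in-place update of DT1 are both wanted, the dcast route might be adapted as sketched below. This is an assumption-laden variant rather than the answer's own code: DT2_lc, wide and new_nm are illustrative names, and lowercasing type before reshaping is just one way to get the requested suffixes.

```r
library(data.table)

# Rebuild the example data from the question
set.seed(1)
DT1 <- data.table(income = runif(10, 0, 1), ID = 1:10)
DT2 <- data.table(height = runif(20, 0, 1),
                  weight = runif(20, 0, 1),
                  V1 = runif(20, 0, 1),
                  V2 = runif(20, 0, 2),
                  type = rep(c("Parents", "Children"), 10))
DT2 <- DT2[order(type)][, ID := rep(1:10, 2)]

merge_variables <- c("height", "weight", "V1", "V2")

# Work on a copy so DT2 itself keeps its original type labels
DT2_lc <- copy(DT2)[, type := tolower(type)]

# Reshape to one row per ID; columns are named <variable>_<type>
wide <- dcast(DT2_lc, ID ~ type, value.var = merge_variables)

# Every column except the key is a new column for DT1
new_nm <- setdiff(names(wide), "ID")

# Update join: add the wide columns to DT1 by reference
DT1[wide, (new_nm) := mget(paste0("i.", new_nm)), on = .(ID)]
```

Compared with one join per type, this does a single reshape and a single join, which may be more convenient when there are many types. It relies on each ID/type pair occurring only once in DT2, as in the example data.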