有条件地合并多个变量

Conditional merging with many variables

我想知道是否有更快、更优雅的 data.table 解决我的以下问题。

假设我们有两个数据集

set.seed(1)
library(data.table)

DT1 <- data.table(income = runif(10, 0,1),
                  ID = 1:10)
DT2 <- data.table(height = runif(20,0,1),
                  weight = runif(20,0,1),
                  V1 = runif(20,0,1),
                  V2 = runif(20,0,2),
                  type = rep(c("Parents", "Children"),10))
DT2 <- DT2[order(type)][, ID := rep(1:10,2)]

第一个数据集是“家庭”级别的数据集,其家庭标识符由 ID 给出,来自 1:10。

还有第二个数据集,DT2,每个家庭 ID 有四个变量 parent 和 child。我想要做的是将 Parents 和 Children 的所有变量(身高、体重、V1、V2)合并到 DT1 的每个 row/observation 中。所以我们将有八个变量要合并,四个用于 parents,四个用于 children.

为此,我可以简单地编写以下内容:


DT1[DT2[type == "Parents"], c("height_parents", "weight_parents",
                              "V1_parents", "V2_parents") := list(i.height, i.weight,
                                                                  i.V1, i.V2), on = c(ID = "ID")]
DT1[DT2[type == "Children"], c("height_children", "weight_children",
                               "V1_children", "V2_children") := list(i.height, i.weight,
                                                                     i.V1, i.V2), on = c(ID = "ID")]

输出如下:

 income ID height_parents weight_parents V1_parents V2_parents height_children
1: 0.70647001  1     0.49163534    0.164385214  0.6806198 1.72937701      0.04655907
2: 0.07658058  2     0.06776809    0.234182275  0.4456820 1.76814822      0.45665042
3: 0.49770601  3     0.23255515    0.709256017  0.3514867 1.83387012      0.24395311
4: 0.51944306  4     0.30555999    0.974742471  0.0529102 0.06094086      0.22356168
5: 0.23075737  5     0.51104028    0.007269433  0.4157508 1.00207079      0.98915308
6: 0.86449990  6     0.75420198    0.211342425  0.9837331 0.03520897      0.86080818
   weight_children V1_children V2_Children
1:       0.3398447   0.7454582   0.6761706
2:       0.8475106   0.4716267   1.5231691
3:       0.8895790   0.1561395   1.8721056
4:       0.3503219   0.2663775   0.2408758
5:       0.3902352   0.9332958   1.3532260
6:       0.7648748   0.6969372   1.3289579

请注意,如果我有许多不同的“类型”(这里只有两个)and/or 要合并许多不同的变量(这里只有 4 个),上面的内容很快就会变得费力和冗长。我使用的数据集有许多不同的变量和类型。因此,我希望以更有效的方式做到这一点。特别是,我希望能够定义一个向量:

merge_variables = c("height", "weight", "V1", "V2") 

并以 data.table 的方式,将所有这些变量合并到 parents 和 children 中。我希望新变量在类型名称中有一个下划线(例如 height_parentsheight_children)。

我希望我已经清楚地传达了我的要求。

谢谢!

我们可以使用 mget 来做到这一点,即使用 paste 创建对象名称作为字符串,然后在对 'DT2' 并分配列的输出(即 mget returns 一个 list 向量)以在 'DT1'

中创建新列
merge_variables = c("height", "weight", "V1", "V2") 
parent_nm <- paste0(merge_variables, "_parents")
child_nm <- paste0(merge_variables, "_children")

DT1[DT2[type == "Parents"], (parent_nm) := mget(paste0("i.", merge_variables)), on = .(ID)]
DT1[DT2[type == "Children"], (child_nm) := mget(paste0("i.", merge_variables)), on = .(ID)]

-输出

> DT1
        income ID height_parents weight_parents V1_parents V2_parents height_children weight_children V1_children V2_children
 1: 0.26550866  1     0.20597457      0.4820801 0.47761962  0.6781459       0.1765568       0.5995658  0.86120948   1.6788807
 2: 0.37212390  2     0.68702285      0.4935413 0.43809711  0.6933670       0.3841037       0.1862176  0.24479728   0.6675499
 3: 0.57285336  3     0.76984142      0.8273733 0.07067905  0.9527025       0.4976992       0.6684667  0.09946616   1.7843967
 4: 0.90820779  4     0.71761851      0.7942399 0.31627171  1.7286789       0.9919061       0.1079436  0.51863426   0.7799791
 5: 0.20168193  5     0.38003518      0.7237109 0.66200508  1.5546414       0.7774452       0.4112744  0.40683019   1.9212360
 6: 0.89838968  6     0.93470523      0.8209463 0.91287592  0.8693190       0.2121425       0.6470602  0.29360337   1.4250294
 7: 0.94467527  7     0.65167377      0.7829328 0.45906573  0.7999887       0.1255551       0.5530363  0.33239467   0.6507043
 8: 0.66079779  8     0.26722067      0.5297196 0.65087047  1.5141743       0.3861141       0.7893562  0.25801678   0.4053845
 9: 0.62911404  9     0.01339033      0.0233312 0.47854525  1.4222424       0.3823880       0.4772301  0.76631067   0.2433838
10: 0.06178627 10     0.86969085      0.7323137 0.08424691  0.4909770       0.3403490       0.6927316  0.87532133   0.2866088

或者另一种选择是 dcast 将 'DT2' 格式转换为 'wide' 格式并进行连接(但是,这里我们没有分配 (:=),因此它不会更新原来的 DT1)

DT1[dcast(DT2, ID ~ type, value.var = merge_variables), on = .(ID)]
        income ID height_Children height_Parents weight_Children weight_Parents V1_Children V1_Parents V2_Children V2_Parents
 1: 0.26550866  1       0.1765568     0.20597457       0.5995658      0.4820801  0.86120948 0.47761962   1.6788807  0.6781459
 2: 0.37212390  2       0.3841037     0.68702285       0.1862176      0.4935413  0.24479728 0.43809711   0.6675499  0.6933670
 3: 0.57285336  3       0.4976992     0.76984142       0.6684667      0.8273733  0.09946616 0.07067905   1.7843967  0.9527025
 4: 0.90820779  4       0.9919061     0.71761851       0.1079436      0.7942399  0.51863426 0.31627171   0.7799791  1.7286789
 5: 0.20168193  5       0.7774452     0.38003518       0.4112744      0.7237109  0.40683019 0.66200508   1.9212360  1.5546414
 6: 0.89838968  6       0.2121425     0.93470523       0.6470602      0.8209463  0.29360337 0.91287592   1.4250294  0.8693190
 7: 0.94467527  7       0.1255551     0.65167377       0.5530363      0.7829328  0.33239467 0.45906573   0.6507043  0.7999887
 8: 0.66079779  8       0.3861141     0.26722067       0.7893562      0.5297196  0.25801678 0.65087047   0.4053845  1.5141743
 9: 0.62911404  9       0.3823880     0.01339033       0.4772301      0.0233312  0.76631067 0.47854525   0.2433838  1.4222424
10: 0.06178627 10       0.3403490     0.86969085       0.6927316      0.7323137  0.87532133 0.08424691   0.2866088  0.4909770