Approximate join with random selection in reference table

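I have a dataset N that I want to join to a reference table REF. The problem is that the dataset has no suitable primary key. My idea is to use a workaround and accept its shortcomings: I use a numeric variable to find an approximate match and join the reference data on it. I tried Merging two datasets on approximate values and attempted to adapt it, but failed. The tricky part seems to be that many rows in the reference table share the same value (1), and that one of them has to be picked at random: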

library(data.table)

N <- data.table(NR   = rep("999", 15),
                year = rep("2012", 15),
                los  = rep(1, 15))

REF <- data.table(nr  =c( "A60D", "A91Z", "B70H", "B78C", "E64D", "F49F", "I66E", "I68E", "J68Z", "K63C", "L70A", "L70B", "L71Z", "O64B", "P60A", "P60C", "R65A", "R65B", "S60Z", "U60A", "U60B", "W60Z", "Y63Z"),
     alos = c(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.5, 1.4, 1.0, 1.0, 1.0, 1.0, 1.0, 1.3, 1.0))

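The attempt below inevitably produces more rows than N has. I cannot get the selection right and, most importantly, I cannot find a way to pick at random among the reference rows with value 1.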

REF[, los := alos]
setkey(N, los)
setkey(REF, alos)
NEW <- N[REF, roll='nearest']

Desired output, one row per row in N:

NR    year  los    nr   alos
999   2012   1     A60D   1.0
999   2012   1     A91Z   1.0
999   2012   1     A91Z   1.0
999   2012   1     W60Z   1.3
999   2012   1     P60C   1.4
999   2012   1     A91Z   1.0
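A small check of why the attempt returns more rows than N, written as a sketch that rebuilds N and REF from the definitions above. In data.table, N[REF] is driven by the rows of REF, and every REF row with alos equal to 1 matches all rows of N with los equal to 1. Swapping the roles and keeping a single match (the ONE table below is only an illustrative name) does give one row per row of N, but the choice among tied reference rows is deterministic rather than random.

```r
library(data.table)

N <- data.table(NR = rep("999", 15), year = rep("2012", 15), los = rep(1, 15))
REF <- data.table(
  nr   = c("A60D", "A91Z", "B70H", "B78C", "E64D", "F49F", "I66E", "I68E",
           "J68Z", "K63C", "L70A", "L70B", "L71Z", "O64B", "P60A", "P60C",
           "R65A", "R65B", "S60Z", "U60A", "U60B", "W60Z", "Y63Z"),
  alos = c(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
           1.0, 1.0, 1.5, 1.4, 1.0, 1.0, 1.0, 1.0, 1.0, 1.3, 1.0)
)
setkey(N, los)
setkey(REF, alos)

# The attempt from the question: one or more result rows per row of REF
NEW <- N[REF, roll = "nearest"]
nrow(NEW) > nrow(N)

# Roles swapped, one match kept: one row per row of N,
# but every row of N receives the same reference entry
ONE <- REF[N, roll = "nearest", mult = "first"]
nrow(ONE) == nrow(N)
unique(ONE$nr)
```

This is why a rolling join alone does not cover the random part of the requirement.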

This might work for you. I tried using a rolling join, but I don't think you can get the random behaviour that way:

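setkey(REF, alos)

# Row id, so that each row of N is handled as its own group
N[, row := .I]

# Distance from each los to the closest alos in REF
N[, dif := min(abs(los - REF[, alos])), by = row]

set.seed(123)
# For each row, sample one nr among the REF rows at that distance
N[, nr := REF[J(los - dif, los + dif), list(sample(nr, 1))], by = row]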
N

     NR year los row dif   nr
 1: 999 2012   1   1   0 F49F
 2: 999 2012   1   2   0 R65B
 3: 999 2012   1   3   0 J68Z
 4: 999 2012   1   4   0 U60A
 5: 999 2012   1   5   0 U60B
 6: 999 2012   1   6   0 A60D
 7: 999 2012   1   7   0 L70A
 8: 999 2012   1   8   0 U60A
 9: 999 2012   1   9   0 L70B
10: 999 2012   1  10   0 K63C
11: 999 2012   1  11   0 Y63Z
12: 999 2012   1  12   0 K63C
13: 999 2012   1  13   0 O64B
14: 999 2012   1  14   0 L70B
15: 999 2012   1  15   0 B70H

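What this code does is find which values of alos in REF are closest to the key value in N, and then sample at random from the nr entries that have that value. I have left row and dif in, but you can drop them separately.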
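The same idea can also be sketched without a keyed join, which may be easier to follow when the nearest value can lie on either side of los. This is only an illustrative variant, not the code above: the tables N2 and REF2 hold made-up values, and pick_nearest is a hypothetical helper name.

```r
library(data.table)

# Small illustrative tables (made-up values, not the data from the question)
N2   <- data.table(id = 1:4, los = c(1, 1, 1.2, 2))
REF2 <- data.table(nr = c("A", "B", "C", "D"), alos = c(1, 1, 1.3, 1.5))

# Hypothetical helper: for one value x, return a random nr among the rows of
# ref whose alos is closest to x (ties on either side of x are all candidates)
pick_nearest <- function(x, ref) {
  d <- abs(ref$alos - x)
  candidates <- ref$nr[d == min(d)]
  # Sample an index rather than the values themselves
  candidates[sample.int(length(candidates), 1L)]
}

set.seed(123)
N2[, nr := vapply(los, pick_nearest, character(1), ref = REF2)]

# Attach the matched alos through an update join on nr
N2[REF2, alos := i.alos, on = "nr"]
N2
```

Here the first two rows receive either A or B at random, while the rows with los 1.2 and 2 receive C and D, their single nearest entries. In the answer's own code, the helper columns could be removed afterwards with N[, c("row", "dif") := NULL].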