Approximate join with random selection in reference table
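I have a dataset N that I want to join with a reference table REF. The problem is that the dataset has no proper primary key. My idea is to use a solution that accepts this shortcoming: I will use a numeric variable to find an approximate match and join that to the dataset.
I tried Merging two datasets on approximate values and attempted to adapt it, but failed. The tricky part seems to be that many rows in the reference table share the same value of 1, combined with the random selection: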
library(data.table)

# 15 identical rows: no usable primary key, only the numeric variable los
N <- data.table(NR   = rep("999", 15),
                year = rep("2012", 15),
                los  = rep(1, 15))
# Reference table: most codes share alos == 1.0
REF <- data.table(nr   = c("A60D", "A91Z", "B70H", "B78C", "E64D", "F49F", "I66E", "I68E", "J68Z", "K63C", "L70A", "L70B", "L71Z", "O64B", "P60A", "P60C", "R65A", "R65B", "S60Z", "U60A", "U60B", "W60Z", "Y63Z"),
                  alos = c(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.5, 1.4, 1.0, 1.0, 1.0, 1.0, 1.0, 1.3, 1.0))
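The attempt below inevitably produces more rows than N has. I could not get the selection right and, above all, could not find a way to pick at random among the references with a value of 1.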
REF[, los := alos]
setkey(N, los)
setkey(REF, alos)
NEW <- N[REF, roll='nearest']
Desired output one row per row in N:
NR year los nr alos
999 2012 1 A60D 1.0
999 2012 1 A91Z 1.0
999 2012 1 A91Z 1.0
999 2012 1 W60Z 1.3
999 2012 1 P60C 1.4
999 2012 1 A91Z 1.0
This might work for you. I tried using a rolling join, but I don't think you can get the random behaviour that way:
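setkey(REF, alos)
N[, row := .I]                                   # row id, so every row of N is its own group
N[, dif := min(abs(los - REF$alos)), by = row]   # distance from los to the nearest alos
set.seed(123)
# look up the REF rows at that distance and draw one nr at random
N[, nr := REF[J(los - dif, los + dif), list(sample(nr, 1))], by = row]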
N
NR year los row dif nr
1: 999 2012 1 1 0 F49F
2: 999 2012 1 2 0 R65B
3: 999 2012 1 3 0 J68Z
4: 999 2012 1 4 0 U60A
5: 999 2012 1 5 0 U60B
6: 999 2012 1 6 0 A60D
7: 999 2012 1 7 0 L70A
8: 999 2012 1 8 0 U60A
9: 999 2012 1 9 0 L70B
10: 999 2012 1 10 0 K63C
11: 999 2012 1 11 0 Y63Z
12: 999 2012 1 12 0 K63C
13: 999 2012 1 13 0 O64B
14: 999 2012 1 14 0 L70B
15: 999 2012 1 15 0 B70H
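All this code does is find which values of alos in REF are closest to the key value (los) in N, and then sample at random from the nr values at that alos. I have left row and dif in, but you can remove them afterwards.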
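For cases where los does not coincide exactly with any alos, or where two alos values are equally close, the same idea can be written without the keyed lookup. What follows is only a sketch under some assumptions: `pick_nearest` and its `tol` argument are hypothetical names introduced for illustration, and the small toy tables stand in for N and REF from the question.

```r
library(data.table)

# Toy data with the same shape as the question's tables (assumed values).
N   <- data.table(NR = "999", year = "2012", los = c(1, 1, 1, 1.2, 1.45))
REF <- data.table(nr   = c("A60D", "A91Z", "B70H", "W60Z", "P60C", "P60A"),
                  alos = c(1.0,    1.0,    1.0,    1.3,    1.4,    1.5))

# Hypothetical helper: draw one nr at random among the REF rows nearest to x.
pick_nearest <- function(x, ref, tol = 1e-8) {
  d <- abs(ref$alos - x)
  candidates <- which(d <= min(d) + tol)   # every row tied at the minimum distance
  # Draw a position with sample.int(): sample(v, 1) on a single number v
  # would sample from 1:v instead of returning v.
  ref$nr[candidates[sample.int(length(candidates), 1L)]]
}

set.seed(123)
N[, nr := vapply(los, pick_nearest, character(1), ref = REF)]
N[REF, alos := i.alos, on = "nr"]   # carry the matched alos over to N
```

This compares every row of N against every row of REF, which should be fine for tables of this size but may be slow for large ones. The tolerance makes near-ties such as 1.4 and 1.5 around 1.45 count as equally close despite floating-point rounding.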