How to do a special type of lookup join in R data.table?

Suppose I have the following two tables in R:

library(data.table)

dt1 <- data.table(a = c("p", "q", "r"),
                  b = c("1,2", "1,2,3", "4,5"))

dt2 <- data.table(code = 1:5,
                  desc = c("good", "better", "best", "bad", "worst"))

They look like this:

> dt1
   a     b
1: p   1,2
2: q 1,2,3
3: r   4,5
> dt2
   code   desc
1:    1   good
2:    2 better
3:    3   best
4:    4    bad
5:    5  worst

The goal is to join dt1 and dt2 so that result looks like this:

> result
   a     b             desc
1: p   1,2      good,better
2: q 1,2,3 good,better,best
3: r   4,5        bad,worst

Can anyone show how to do this type of join in R?

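The idea is to turn column b into a list of integer vectors, and then use them to subset column desc of dt2. (Here code is simply the row number, so dt2$desc[.x] would work as well; match covers the general case.)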

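library(purrr)
library(stringr)

# split each string on commas and convert the pieces to integers (b becomes a list column)
dt1[, b := map(b, ~str_split(.x, ",") %>% unlist() %>% as.integer())]
# look up the description for each code (desc is a list column as well)
dt1[, desc := map(b, ~dt2$desc[match(.x, dt2$code)])]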
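
The code above leaves desc as a list column. To get the comma-separated string shown in the question, each looked-up vector can be collapsed with paste. A minimal base-R sketch, assuming fresh copies of dt1 and dt2 (lookup_desc is a hypothetical helper name, not part of any package):

```r
library(data.table)

dt1 <- data.table(a = c("p", "q", "r"),
                  b = c("1,2", "1,2,3", "4,5"))
dt2 <- data.table(code = 1:5,
                  desc = c("good", "better", "best", "bad", "worst"))

# Hypothetical helper: split one "1,2,3" string, look up each code in dt2,
# and paste the matching descriptions back together.
lookup_desc <- function(x) {
  codes <- as.integer(strsplit(x, ",", fixed = TRUE)[[1]])
  paste(dt2$desc[match(codes, dt2$code)], collapse = ",")
}

# b stays untouched; desc becomes a plain character column
dt1[, desc := vapply(b, lookup_desc, character(1), USE.NAMES = FALSE)]
dt1[]
```

A code with no match in dt2 would show up as the literal text "NA" inside the pasted string, so real data may need an extra check.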
library(data.table)
library(magrittr)

dt1 <- data.table(a = c("p", "q", "r"),
                  b = c("1,2", "1,2,3", "4,5"))

dt2 <- data.table(code = 1:5,
                  desc = c("good", "better", "best", "bad", "worst"))

# one row per code, then convert the codes from character to integer
dt1 <- dt1[, list(b = unlist(strsplit(x = b, split = ","))), by = "a"] %>% 
  .[, b := type.convert(b, as.is = TRUE)]

dt2[dt1, on = c("code == b")] %>% 
  .[, lapply(.SD, toString), by = "a"]
#>    a    code               desc
#> 1: p    1, 2       good, better
#> 2: q 1, 2, 3 good, better, best
#> 3: r    4, 5         bad, worst

Created on 2021-07-27 by the reprex package (v2.0.0)

You can split the strings on commas and then do a join.

library(dplyr)
library(tidyr)

dt1 %>%
  separate_rows(b, sep = ',\\s*', convert = TRUE) %>%
  left_join(dt2, by = c('b' = 'code')) %>%
  group_by(a) %>%
  summarise(desc = toString(desc))

#   a     desc              
#  <chr> <chr>             
#1 p     good, better      
#2 q     good, better, best
#3 r     bad, worst        

This isn't really a join, but because dt1$b contains such convoluted values, here is my ugly hack:

dt2[, code := as.character(code)] 
dt1[, desc := b]
for (i in seq_along(dt2$code)) 
  dt1[, desc := stringr::str_replace_all(desc, dt2$code[i], dt2$desc[i])]
dt1[]
   a     b             desc
1: p   1,2      good,better
2: q 1,2,3 good,better,best
3: r   4,5        bad,worst

Edit:

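The replacements must run from the longest code to the shortest (by string length, i.e. number of characters), and desc must not contain any digits.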
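
To see why the order matters, here is a small sketch on a single made-up string, using base gsub with fixed = TRUE as a stand-in for str_replace_all: replacing the short code "1" first also hits the "1" inside "21".

```r
x <- "1,21"

# Short code first: the "1" inside "21" is replaced too.
wrong <- gsub("1", "good", x, fixed = TRUE)
wrong <- gsub("21", "best", wrong, fixed = TRUE)
wrong
#> [1] "good,2good"

# Long code first: "21" is consumed before "1" is looked at.
right <- gsub("21", "best", x, fixed = TRUE)
right <- gsub("1", "good", right, fixed = TRUE)
right
#> [1] "good,best"
```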

So add setorder(dt2, -code) to the code. With the new use case the OP provided in a comment:

dt1 <- data.table(a = c("p", "q", "r"), b = c("1,21", "23,11,36", "11,36"))
dt2 <- data.table(code = c(1,11,21,23,36), desc = c("good", "better", "best", "bad", "worst"))

setorder(dt2, -code) # set order first (descending numeric value)
dt2[, code := as.character(code)] # then convert to character
dt1[, desc := b]
for (i in seq_along(dt2$code)) 
  dt1[, desc := stringr::str_replace_all(desc, dt2$code[i], dt2$desc[i])]

dt1[]
   a        b             desc
1: p     1,21        good,best
2: q 23,11,36 bad,better,worst
3: r    11,36     better,worst

Edit 2:

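According to the OP, the ugly hack's requirement that desc contain no digits is not met in the production data. (This is what almost always happens when a quick-and-dirty solution meets real-world data :-))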
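
A small sketch of that failure mode, with invented descriptions (the real production values are not shown in this thread) and base gsub in place of str_replace_all: once a description contains a digit that is also a code, a later replacement pass rewrites it.

```r
codes <- c("2", "1")          # already ordered largest first
descs <- c("tier1", "good")   # invented: the description for code 2 contains a digit

x <- "1,2"
for (i in seq_along(codes)) {
  # each pass replaces one code everywhere, including inside earlier replacements
  x <- gsub(codes[i], descs[i], x, fixed = TRUE)
}
x
#> [1] "good,tiergood"
```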

So here is a concise data.table solution that does what all the other answers do: split column b, join or look up the matching desc, and recombine:

dt2[, code := as.character(code)][
  dt1[, strsplit(b, ","), by = .(a, b)], on = "code==V1"][
    , .(desc = paste(desc, collapse = ",")), by = .(a, b)]

With the OP's new use case:

   a        b             desc
1: p     1,21        good,best
2: q 23,11,36 bad,better,worst
3: r    11,36     better,worst

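Note that the grouping uses both columns a and b, for two reasons: 1) convenience (both columns are kept in the final result), and 2) it still works if a is not a unique identifier.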
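
The second reason can be illustrated with a small sketch. The data here is invented so that a repeats, and the intermediate names long, by_ab and by_a are only for illustration:

```r
library(data.table)

# Invented data: "p" appears twice in a, so a alone does not identify a row.
dt1 <- data.table(a = c("p", "p"), b = c("1,21", "11,36"))
dt2 <- data.table(code = c(1, 11, 21, 23, 36),
                  desc = c("good", "better", "best", "bad", "worst"))
dt2[, code := as.character(code)]

# One row per (a, b, single code), joined to the matching description
long <- dt2[dt1[, strsplit(b, ","), by = .(a, b)], on = "code==V1"]

# Grouping by a and b keeps the two original rows apart.
by_ab <- long[, .(desc = paste(desc, collapse = ",")), by = .(a, b)]

# Grouping by a alone merges them into a single row.
by_a <- long[, .(desc = paste(desc, collapse = ",")), by = a]
```

Here by_ab has two rows ("good,best" and "better,worst"), while by_a collapses everything for "p" into one string.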