How to do a special type of lookup join in R data.table?
Suppose there are two tables in R as follows:
library(data.table)
dt1 <- data.table(a = c("p", "q", "r"),
b = c("1,2", "1,2,3", "4,5"))
dt2 <- data.table(code = 1:5,
desc = c("good", "better", "best", "bad", "worst"))
They look like this:
> dt1
a b
1: p 1,2
2: q 1,2,3
3: r 4,5
> dt2
code desc
1: 1 good
2: 2 better
3: 3 best
4: 4 bad
5: 5 worst
The goal is to join dt1 and dt2 so that the result looks like this:
> result
a b desc
1: p 1,2 good,better
2: q 1,2,3 good,better,best
3: r 4,5 bad,worst
Can anyone show how to do this type of join in R?
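The idea is to turn column b into a list of integers and then subset column desc in dt2 (note that here code is just the row number; in the general case use the match function, as below).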
library(purrr)
library(stringr)
dt1[, b := map(b, ~str_split(.x, ",") %>% unlist() %>% as.integer())]
dt1[, desc := map(b, ~dt2$desc[match(.x, dt2$code)])]
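Note that the snippet above leaves desc as a list column rather than the comma-separated string shown in the question. A minimal sketch of the same match-based lookup that collapses the result into one string per row, using only base R and data.table, could look like this. The helper name lookup_desc is hypothetical and not part of the original answer.

```r
library(data.table)

dt1 <- data.table(a = c("p", "q", "r"),
                  b = c("1,2", "1,2,3", "4,5"))
dt2 <- data.table(code = 1:5,
                  desc = c("good", "better", "best", "bad", "worst"))

# Hypothetical helper: split one comma-separated string of codes,
# look each code up in dt2 and paste the matching descriptions together.
lookup_desc <- function(codes) {
  ids <- as.integer(strsplit(codes, ",", fixed = TRUE)[[1]])
  paste(dt2$desc[match(ids, dt2$code)], collapse = ",")
}

# vapply guarantees exactly one character string per row of dt1.
dt1[, desc := vapply(b, lookup_desc, character(1), USE.NAMES = FALSE)]
dt1[]
```

Column b stays untouched here, so the result keeps the a, b, desc layout asked for in the question.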
library(data.table)
library(magrittr)
dt1 <- data.table(a = c("p", "q", "r"),
b = c("1,2", "1,2,3", "4,5"))
dt2 <- data.table(code = 1:5,
desc = c("good", "better", "best", "bad", "worst"))
dt1 <- dt1[, list(b = unlist(strsplit(x = b, split = ","))), by = "a"] %>%
.[, b := type.convert(b, as.is = TRUE)]
dt2[dt1, on = c("code == b")] %>%
.[, lapply(.SD, toString), by = "a"]
#> a code desc
#> 1: p 1, 2 good, better
#> 2: q 1, 2, 3 good, better, best
#> 3: r 4, 5 bad, worst
Created on 2021-07-27 by the reprex package (v2.0.0)
You can split the strings on commas and then do a join.
library(dplyr)
library(tidyr)
dt1 %>%
separate_rows(b, sep = ',\\s*', convert = TRUE) %>%
left_join(dt2, by = c('b' = 'code')) %>%
group_by(a) %>%
summarise(desc = toString(desc))
# a desc
# <chr> <chr>
#1 p good, better
#2 q good, better, best
#3 r bad, worst
This is not a real join, but since dt1$b contains convoluted values, here is my ugly hack:
dt2[, code := as.character(code)]
dt1[, desc := b]
for (i in seq_along(dt2$code))
dt1[, desc := stringr::str_replace_all(desc, dt2$code[i], dt2$desc[i])]
dt1[]
a b desc
1: p 1,2 good,better
2: q 1,2,3 good,better,best
3: r 4,5 bad,worst
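Edit:
The replacement must go from the longest code to the shortest (by string length, i.e. number of characters), and desc must not contain any digits.
So, with setorder(dt2, -code) added to the code and the new use case provided by the OP in the comments: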
dt1 <- data.table(a = c("p", "q", "r"), b = c("1,21", "23,11,36", "11,36"))
dt2 <- data.table(code = c(1,11,21,23,36), desc = c("good", "better", "best", "bad", "worst"))
setorder(dt2, -code) # set order first (descending numeric value)
dt2[, code := as.character(code)] # then convert to character
dt1[, desc := b]
for (i in seq_along(dt2$code))
dt1[, desc := stringr::str_replace_all(desc, dt2$code[i], dt2$desc[i])]
dt1[]
a b desc
1: p 1,21 good,best
2: q 23,11,36 bad,better,worst
3: r 11,36 better,worst
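To see why the replacement order matters, here is a small base R sketch that applies the same literal replacements to the string "1,21" in both orders. The helper replace_in_order is hypothetical and uses gsub with fixed = TRUE instead of stringr, purely for illustration.

```r
codes <- c("1", "11", "21", "23", "36")
descs <- c("good", "better", "best", "bad", "worst")

# Hypothetical helper: apply the literal replacements one after another,
# visiting the codes in the order given by `ord`.
replace_in_order <- function(x, ord) {
  for (i in ord) x <- gsub(codes[i], descs[i], x, fixed = TRUE)
  x
}

# Shortest code first: the "1" inside "21" is replaced too early.
shortest_first <- replace_in_order("1,21", seq_along(codes))
# Longest code first: "21" is consumed before "1" is considered.
longest_first <- replace_in_order("1,21", rev(seq_along(codes)))

shortest_first  # "good,2good"
longest_first   # "good,best"
```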
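Edit 2:
The ugly hack's requirement that desc contains no digits is not met in the production data. (This is what almost always happens when a quick and dirty solution meets real-world data :-)).
So here is a concise data.table solution that does what all the other answers do: split column b, join or look up the matching desc, and recombine: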
dt2[, code := as.character(code)][
dt1[, strsplit(b, ","), by = .(a, b)], on = "code==V1"][
, .(desc = paste(desc, collapse = ",")), by = .(a, b)]
Using the OP's new use case:
a b desc
1: p 1,21 good,best
2: q 23,11,36 bad,better,worst
3: r 11,36 better,worst
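Note that the grouping uses both columns a and b for two reasons: 1) convenience (both columns are kept in the final result), and 2) in case a is not a unique identifier.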
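The second reason can be illustrated with a small sketch. The data below is made up for illustration (the value "p" appears twice in column a), and the intermediate names long, joined, by_a and by_a_b are hypothetical.

```r
library(data.table)

# Made-up data: "p" is not a unique identifier in column a.
dt1 <- data.table(a = c("p", "p", "r"), b = c("1,21", "23,11", "11,36"))
dt2 <- data.table(code = c("1", "11", "21", "23", "36"),
                  desc = c("good", "better", "best", "bad", "worst"))

# Split b into one row per code, keeping a and b as grouping columns.
long <- dt1[, .(code = unlist(strsplit(b, ",", fixed = TRUE))), by = .(a, b)]
joined <- dt2[long, on = "code"]

# Grouping by a alone merges the two "p" rows into a single row.
by_a <- joined[, .(desc = paste(desc, collapse = ",")), by = a]
# Grouping by a and b keeps one result row per original row of dt1.
by_a_b <- joined[, .(desc = paste(desc, collapse = ",")), by = .(a, b)]

nrow(by_a)    # 2
nrow(by_a_b)  # 3
```

If both a and b could repeat together, a row index would be a safer grouping key than the pair (a, b).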