Efficient way to cbind list by groups in data.table

Question

I have a data.frame

Data

data = structure(list(mystring = c("AASDAASADDLKJLKADDLKKLLKJLJADDLJLKJLADLKLADD", 
    "ASDSDFJSKADDKJSJKDFKSADDLKJFLAK"), class = c("cat", "dog")), .Names = c("mystring", 
    "class"), row.names = c(NA, -2L), class = "data.frame")

which looks like

#> data
#                                      mystring class
#1 AASDAASADDLKJLKADDLKKLLKJLJADDLJLKJLADLKLADD   cat
#2              ASDSDFJSKADDKJSJKDFKSADDLKJFLAK   dog

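I am searching for the start and end positions of the pattern "ADD" within the first 20 characters of each string in mystring, treating class as the grouping variable.

I am using str_locate_all from the stringr package to do this. Here is my attempt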

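library(data.table)
library(stringr)

# Start positions go in the first list, end positions in the second
setDT(data)[, 
cbind(list(str_locate_all(substr(as.character(mystring), 1, 20),"ADD")[[1]][,1]),
      list(str_locate_all(substr(as.character(mystring), 1, 20),"ADD")[[1]][,2])), 
      by = class]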

This gives my desired output

#   class V1 V2
#1:   cat  8 10
#2:   cat 16 18
#3:   dog 10 12

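Question: I would like to know whether this is the standard approach or whether it can be done more efficiently. str_locate_all returns the start and end positions of each match in separate columns, so I put them in separate lists and cbind them inside the data.table. Also, how can I specify colnames for the cbinded columns here?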

Answer 1

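I think you should first reduce the number of operations done per group, so I would start by creating the substring for all groups at once.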

library(data.table)
# substr() is already vectorised, so one call covers every row; .Internal() is not needed
setDT(data)[, submystring := substr(mystring, 1L, 20L)]

Then, using the stringi package (I don't like wrappers), you could do the following (though I can't vouch for its efficiency at the moment)

library(stringi)
data[, data.table(matrix(unlist(stri_locate_all_fixed(submystring, "ADD")), ncol = 2)), by = class]
#    class V1 V2
# 1:   cat  8 10
# 2:   cat 16 18
# 3:   dog 10 12
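On the column-naming part of the question, here is a minimal sketch that renames the default V1/V2 columns after the grouped call, using data.table's setnames(). It rebuilds the example data so it runs on its own; the names start and end are only illustrative labels, and plain substr() is assumed for the substring step.

```r
library(data.table)
library(stringi)

data <- data.table(
  mystring = c("AASDAASADDLKJLKADDLKKLLKJLJADDLJLKJLADLKLADD",
               "ASDSDFJSKADDKJSJKDFKSADDLKJFLAK"),
  class = c("cat", "dog")
)

# Take the substring once for all rows, outside the grouped call
data[, submystring := substr(mystring, 1L, 20L)]

# Same per-group call as above; the columns come back as V1 and V2
res <- data[, data.table(matrix(unlist(stri_locate_all_fixed(submystring, "ADD")),
                                ncol = 2)),
            by = class]

# Rename by reference (illustrative names)
setnames(res, c("V1", "V2"), c("start", "end"))
res
```

Because setnames() modifies res in place, no copy of the result is made just to relabel it.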

Alternatively, you can avoid calling matrix and data.table per group and instead spread the data after all the locations have been detected

res <- data[, unlist(stri_locate_all_fixed(submystring, "ADD")), by = class]
# Each group holds all starts followed by all ends, so the match index 1..(.N/2) repeats twice
res[, `:=`(varnames = rep(c("V1", "V2"), each = .N/2), MatchCount = rep(seq_len(.N/2), 2)), by = class]
dcast(res, class + MatchCount ~ varnames, value.var = "V1")
#    class MatchCount V1 V2
# 1:   cat          1  8 10
# 2:   cat          2 16 18
# 3:   dog          1 10 12

A third, similar option would be to run stri_locate_all_fixed over the whole data set first and only then unlist per group (instead of running both unlist and stri_locate_all_fixed per group)

res <- data[, .(stri_locate_all_fixed(submystring, "ADD"), class = class)]
res[, N := lengths(V1)/2L]
res2 <- res[, unlist(V1), by = "class,N"]
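# N is the number of matches in the group, so the match index 1..N repeats twice (starts, then ends)
res2[, `:=`(varnames = rep(c("V1", "V2"), each = N[1L]), MatchCount = rep(seq_len(N[1L]), 2)), by = class]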
dcast(res2, class + MatchCount ~ varnames, value.var = "V1")
#    class MatchCount V1 V2
# 1:   cat          1  8 10
# 2:   cat          2 16 18
# 3:   dog          1 10 12

Answer 2

We can convert the matrix output of str_locate_all to a data.frame and use rbindlist to create the columns.

  setDT(data)[,rbindlist(lapply(str_locate_all(substr(mystring, 1, 20),
               'ADD'), as.data.frame)) , class]
  #   class start end
  #1:   cat     8  10
  #2:   cat    16  18
  #3:   dog    10  12
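The rbindlist idea also carries over to classes that contain more than one string. The sketch below is a variant, not the answer's exact code: it swaps in stringi (stri_sub and stri_locate_all_fixed) for stringr, and it adds a hypothetical third row so that the cat class has two strings. The object names dt and res are illustrative.

```r
library(data.table)
library(stringi)

# Hypothetical data: "cat" now has two strings
dt <- data.table(
  mystring = c("AASDAASADDLKJLKADDLKKLLKJLJADDLJLKJLADLKLADD",
               "ASDSDFJSKADDKJSJKDFKSADDLKJFLAK",
               "ADDXXXXXXXXXXXXXXXXXXADD"),
  class = c("cat", "dog", "cat")
)

# One start/end matrix per string; rbindlist stacks them within each class.
# The matrices carry the column names "start" and "end", which as.data.table keeps.
res <- dt[, rbindlist(lapply(
                stri_locate_all_fixed(stri_sub(mystring, to = 20L), "ADD"),
                as.data.table)),
          by = class]
res
```

Since every string contributes its own two-column table, the start and end values stay paired however many strings a class holds. A string with no match contributes a single row of NA values.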

Answer 3

Here is how I would do it.

library(stringi)
library(dplyr)
library(magrittr)

data = structure(list(mystring = c("AASDAASADDLKJLKADDLKKLLKJLJADDLJLKJLADLKLADD", 
                                   "ASDSDFJSKADDKJSJKDFKSADDLKJFLAK"), class = c("cat", "dog")), .Names = c("mystring", 
                                                                                                            "class"), row.names = c(NA, -2L), class = "data.frame")

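# Locate "ADD" within the first 20 characters of one group's string
my_function = function(row)
  row$mystring %>% 
  stri_sub(to = 20) %>%
  stri_locate_all_fixed(pattern = "ADD") %>%
  extract2(1) %>%
  as_tibble   # as_data_frame is deprecated in favour of as_tibble

test = 
  data %>%
  group_by(mystring) %>%
  do(my_function(.)) %>%
  left_join(data, by = "mystring")   # bring the class column back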
