Getting rid of "Pattern not found" when selecting columns by regex in data.table

Question

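I have a list of several data.tables, each of which may have different columns. Some of the tables contain unwanted columns that I would like to remove. Suppose they are named "zRemoveThis1", "zRemoveThis2", "zRemoveThis3", and so on. Here is an example of my data.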

library(data.table)

dt_list <- list(
  item_1 = data.table(ID = paste(1:3, "item_1", sep = "_"),
                      Count_A = c(11:13),
                      Count_B = c(14:16),
                      zRemoveThis1 = c(14:16),
                      count_C = c(17:19),
                      zRemoveThis2 = c(24:26)),
  item_2 = data.table(ID = paste(1:3, "item_2", sep = "_"),
                      Count_A = c(1:3),
                      Count_B = c(4:6),
                      count_C = c(7:9))
)

I followed an earlier suggestion to use patterns(), but then ran into a new problem. When I apply patterns() with lapply to my list, it does not work.

lapply(dt_list, function(x) { x[, .SD, .SDcols = ! patterns("zRemoveThis*")] })
#> Error in do_patterns(colsub, names_x): Pattern not found: [zRemoveThis*]

When I apply the call to each item individually, it works for the first item of the list but not for the second.

#WORK
dt_list$item_1[, .SD, .SDcols = ! patterns("zRemoveThis*")]
#>          ID Count_A Count_B count_C
#> 1: 1_item_1      11      14      17
#> 2: 2_item_1      12      15      18
#> 3: 3_item_1      13      16      19

#DIDN'T WORK
dt_list$item_2[, .SD, .SDcols = ! patterns("zRemoveThis*")]
#> Error in do_patterns(colsub, names_x): Pattern not found: [zRemoveThis*]

I figured out that the call fails whenever no column matches the pattern. So I came up with this ugly if-else solution, which does work.

lapply(dt_list, function(x) {
  if (any(grepl("zRemoveThis", colnames(x)))) {
    return(x[, .SD, .SDcols = ! patterns("zRemoveThis*")])
  } else return(x)
})
#> $item_1
#>          ID Count_A Count_B count_C
#> 1: 1_item_1      11      14      17
#> 2: 2_item_1      12      15      18
#> 3: 3_item_1      13      16      19
#> 
#> $item_2
#>          ID Count_A Count_B count_C
#> 1: 1_item_2       1       4       7
#> 2: 2_item_2       2       5       8
#> 3: 3_item_2       3       6       9

My question is: is there a more sophisticated data.table solution to my problem? Any help would be appreciated. Thanks in advance!

Answer 1

You can use grepl, which returns a logical vector (all FALSE when nothing matches) instead of raising an error:

lapply(dt_list, function(x) {
  cols <- !grepl('zRemoveThis', names(x))
  x[, ..cols]
})

#$item_1
#         ID Count_A Count_B count_C
#1: 1_item_1      11      14      17
#2: 2_item_1      12      15      18
#3: 3_item_1      13      16      19

#$item_2
#         ID Count_A Count_B count_C
#1: 1_item_2       1       4       7
#2: 2_item_2       2       5       8
#3: 3_item_2       3       6       9
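The same idea can be wrapped in a small reusable function. This is only a sketch: `drop_cols_matching` is a hypothetical helper name, not part of data.table, and the data is a reduced version of the question's example. It also anchors the pattern with `^`, since the `*` in "zRemoveThis*" is a regex quantifier on the final `s` rather than a wildcard.

```r
library(data.table)

# `*` quantifies the preceding character, so "zRemoveThis*" also matches
# names that only contain "zRemoveThi"; "^zRemoveThis" anchors the prefix.
grepl("zRemoveThis*", "zRemoveThi_x")   # matches
grepl("^zRemoveThis", "zRemoveThi_x")   # does not match

# Hypothetical helper: keep every column whose name does NOT match `pattern`.
# grep(..., invert = TRUE) returns all names when nothing matches,
# so no error is raised for tables without the unwanted columns.
drop_cols_matching <- function(dt, pattern) {
  keep <- grep(pattern, names(dt), value = TRUE, invert = TRUE)
  dt[, .SD, .SDcols = keep]
}

# Reduced version of the question's data
dt_list <- list(
  item_1 = data.table(ID = 1:3, Count_A = 11:13, zRemoveThis1 = 14:16),
  item_2 = data.table(ID = 1:3, Count_A = 1:3)
)

cleaned <- lapply(dt_list, drop_cols_matching, pattern = "^zRemoveThis")
```

Passing the pattern as an argument keeps the lapply call short and makes it easy to reuse the helper for other unwanted prefixes.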

Answer 2

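Since the list elements are data.table objects, we can loop over the list with lapply and subset each one directly: specify the column selection in .SDcols and return .SD (the Subset of Data.table). Here startsWith() builds a logical vector, and the leading - tells .SDcols to negate the selection.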

library(data.table)
lapply(dt_list, function(x) x[, .SD,
       .SDcols = -startsWith(names(x), 'zRemoveThis')])
$item_1
         ID Count_A Count_B count_C
1: 1_item_1      11      14      17
2: 2_item_1      12      15      18
3: 3_item_1      13      16      19

$item_2
         ID Count_A Count_B count_C
1: 1_item_2       1       4       7
2: 2_item_2       2       5       8
3: 3_item_2       3       6       9
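If you would rather delete the columns than build a new subset, the columns can also be removed with `:= NULL`. The sketch below is an assumption-laden variant, not the answer's original method: it works on a `copy()` so the input list stays untouched, and it guards against the no-match case explicitly. The data is a reduced version of the question's example.

```r
library(data.table)

# Reduced version of the question's data
dt_list <- list(
  item_1 = data.table(ID = 1:3, Count_A = 11:13,
                      zRemoveThis1 = 14:16, zRemoveThis2 = 24:26),
  item_2 = data.table(ID = 1:3, Count_A = 1:3)
)

cleaned <- lapply(dt_list, function(x) {
  x <- copy(x)                                   # leave the original table alone
  drop <- grep("^zRemoveThis", names(x), value = TRUE)
  if (length(drop) > 0L) x[, (drop) := NULL]     # delete only when something matched
  x
})
```

Dropping the `copy()` call would modify the tables in place, which avoids the extra memory but changes `dt_list` itself.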

Or, with tidyverse, loop over the list with map and select the columns that do not (-) start with 'zRemoveThis':

library(dplyr)
library(purrr)
map(dt_list, ~ .x %>% 
          select(-starts_with('zRemoveThis')))
$item_1
         ID Count_A Count_B count_C
1: 1_item_1      11      14      17
2: 2_item_1      12      15      18
3: 3_item_1      13      16      19

$item_2
         ID Count_A Count_B count_C
1: 1_item_2       1       4       7
2: 2_item_2       2       5       8
3: 3_item_2       3       6       9
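When the unwanted names need a real regular expression rather than a fixed prefix, the tidyselect helper `matches()` can be used in the same position. A minimal sketch, assuming dplyr is installed and using a reduced version of the question's data; like `starts_with()`, `matches()` simply selects nothing when no column matches, so tables without the unwanted columns pass through unchanged.

```r
library(data.table)
library(dplyr)

# Reduced version of the question's data
dt_list <- list(
  item_1 = data.table(ID = 1:3, Count_A = 11:13, zRemoveThis1 = 14:16),
  item_2 = data.table(ID = 1:3, Count_A = 1:3)
)

# Drop every column whose name matches the regex "zRemoveThis" followed by digits
cleaned <- lapply(dt_list, function(x) select(x, -matches("^zRemoveThis[0-9]+$")))
```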
