Split a string and keep just one specific part

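In my data.table, I split the ValueId column with tstrsplit and use the keep= argument. In this case, though, I don't know in advance which value to pass to keep; I want to use the value in the Level column.

All my attempts have failed. Is this possible? Maybe not in data.table?

Here is a reprex: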

library(data.table)

foo <- data.table(Level = c(2,2,3,4,3),
                  ValueId = c("11983:1055521", "11983:1055521-5168:290668-198:100798", "11983:1055521-5168:290668-198:100798-92:91604-139:94569-135:94719-5161:290771-5162:290728-5166:290620",
                             "11983:1055521-5168:290668-198:100798-92:91604-139:94569-135:94719-5161:290771", " 11983:1055521-5168:290676-198:100794-92:91781-139:95090-135:95353"))

foo[, newvar := tstrsplit(ValueId, "-", fixed = TRUE, keep = 4)]

foo[, newvar := tstrsplit(ValueId, "-", fixed = TRUE, keep = Level)]

Thanks!!

You can use mapply with `[` to extract, from the substrings returned by strsplit, the one at the position given in foo$Level.

mapply(`[`, strsplit(foo$ValueId, "-", fixed = TRUE), foo$Level)
#[1] NA            "5168:290668" "198:100798"  "92:91604"    "198:100794" 
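
To store that result as a column, the same call can go inside the data.table assignment. This is a minimal sketch on a small made-up table (the values "a:1", "b:2" and so on are placeholders, not the data from the question):

```r
library(data.table)

# Hypothetical miniature of the table in the question
foo <- data.table(
  Level   = c(2, 2, 3),
  ValueId = c("a:1", "a:1-b:2-c:3", "a:1-b:2-c:3-d:4")
)

# Split every ValueId once, then take the Level-th piece of each row.
# Indexing past the end of a character vector yields NA, so short rows
# simply become NA instead of raising an error.
foo[, newvar := mapply(`[`, strsplit(ValueId, "-", fixed = TRUE), Level)]

print(foo)
```

Because mapply walks over the split pieces and Level in parallel, no grouping is needed here.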

There are a few problems. One of them is in the tstrsplit function itself, which is defined as:

function (x, ..., fill = NA, type.convert = FALSE, keep, names = FALSE) 
{
  if (!isTRUEorFALSE(names) && !is.character(names)) 
    stop("'names' must be TRUE/FALSE or a character vector.")
  ans = transpose(strsplit(as.character(x), ...), fill = fill, 
                  ignore.empty = FALSE)
  if (!missing(keep)) {
    keep = suppressWarnings(as.integer(keep))
    chk = min(keep) >= min(1L, length(ans)) & max(keep) <= 
      length(ans)
    if (!isTRUE(chk)) 
      stop("'keep' should contain integer values between ", 
           min(1L, length(ans)), " and ", length(ans), 
           ".")
    ans = ans[keep]
  }
  if (type.convert) 
    ans = lapply(ans, type.convert, as.is = TRUE)
  if (isFALSE(names)) 
    return(ans)
  else if (isTRUE(names)) 
    names = paste0("V", seq_along(ans))
  if (length(names) != length(ans)) {
    str = if (missing(keep)) 
      "ans"
    else "keep"
    stop("length(names) (= ", length(names), ") is not equal to length(", 
         str, ") (= ", length(ans), ").")
  }
  setattr(ans, "names", names)
  ans
}
<bytecode: 0x0000019bffd6da98>
  <environment: namespace:data.table>

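The important thing to note is the if block that checks whether your keep is within the range of what can be returned. In your example, the first row returns NA. The hard-coded example works because strsplit is vectorized: the NA row is processed together with the valid rows, so the check passes for the column as a whole. You can try this out by changing the 4 to 40, and you'll get the message Error in tstrsplit(ValueId, "-", fixed = TRUE, keep = 40) : 'keep' should contain integer values between 1 and 9. because in that case no row has a 40th piece.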
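
The behaviour of that check can be reproduced on a two-element vector. This is only an illustrative sketch with made-up strings; the exact wording of the error message may differ between data.table versions, so it is captured rather than assumed:

```r
library(data.table)

# Hypothetical input: one string with a single piece, one with four pieces
x <- c("a", "a-b-c-d")

# keep = 4 is valid because the longest string has four pieces;
# the short string is padded with NA by transpose(fill = NA).
ok <- tstrsplit(x, "-", fixed = TRUE, keep = 4)
print(ok)

# keep = 40 exceeds the number of pieces of every string, so the
# range check fails and tstrsplit stops with an error.
msg <- tryCatch(
  tstrsplit(x, "-", fixed = TRUE, keep = 40),
  error = function(e) conditionMessage(e)
)
print(msg)
```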

So what you need to do is redefine the tstrsplit function so that it returns NA instead of stopping:

tstrsplitNA<-function (x, ..., fill = NA, type.convert = FALSE, keep) 
{
  ans = transpose(strsplit(as.character(x), ...), fill = fill, 
                  ignore.empty = FALSE)
  if (!missing(keep)) {
    keep = suppressWarnings(as.integer(keep))
    chk = min(keep) >= min(1L, length(ans)) & max(keep) <= 
      length(ans)
    if (!isTRUE(chk)) 
      ans<-NA_character_
    ans = ans[keep]
  }
  if (type.convert) 
    ans = lapply(ans, type.convert, as.is = TRUE)
  ans
}

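That isn't enough on its own, though. Because strsplit is vectorized, foo[, newvar := tstrsplitNA(ValueId, split="-", fixed = TRUE, keep = Level)] doesn't run the function row by row. It feeds the whole ValueId column to strsplit and then transposes the result, which returns garbage relative to what you want.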
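
To see why a vector-valued keep does not act per row, it helps to look at what tstrsplit returns for a small input. In this sketch (made-up strings), keep selects whole columns of the transposed result, one per value in keep, for all rows at once:

```r
library(data.table)

# Hypothetical input with three pieces per string
x <- c("a-b-c", "d-e-f")

# keep = c(2, 3) returns two full columns: the 2nd pieces of every
# string and the 3rd pieces of every string.
parts <- tstrsplit(x, "-", fixed = TRUE, keep = c(2, 3))
print(parts)
```

Passing keep = Level therefore asks for columns 2, 2, 3, 4 and 3 of the split, each of them spanning every row, rather than one piece per row.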

You can tell data.table to do the operation row by row simply by using the by argument with Level and ValueId:

foo[, newvar := tstrsplitNA(ValueId, split="-", fixed = TRUE, keep = Level), by=c('Level','ValueId')]

foo
  Level                                                                                               ValueId      newvar
1:     2                                                                                         11983:1055521          NA
2:     2                                                                  11983:1055521-5168:290668-198:100798 5168:290668
3:     3 11983:1055521-5168:290668-198:100798-92:91604-139:94569-135:94719-5161:290771-5162:290728-5166:290620  198:100798
4:     4                         11983:1055521-5168:290668-198:100798-92:91604-139:94569-135:94719-5161:290771    92:91604
5:     3                                     11983:1055521-5168:290676-198:100794-92:91781-139:95090-135:95353  198:100794
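
As a variation on the grouped call above, the same row-by-row effect can be sketched without redefining tstrsplit by grouping on a row index and indexing the split directly. This is not the approach from the answer, just an alternative under the assumption that plain strsplit is acceptable; the table below is a made-up miniature:

```r
library(data.table)

# Hypothetical miniature of the table in the question
foo <- data.table(
  Level   = c(2, 2, 3),
  ValueId = c("a:1", "a:1-b:2-c:3", "a:1-b:2-c:3-d:4")
)

# Each group is a single row, so ValueId and Level are length-1 in j.
# Indexing past the last piece gives NA rather than an error.
foo[, newvar := strsplit(ValueId, "-", fixed = TRUE)[[1]][Level],
    by = seq_len(nrow(foo))]

print(foo)
```

Grouping by row index also keeps duplicated (Level, ValueId) pairs as separate groups, which makes no difference to the result here but can be slower on large tables.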