Fill NA values in a vector with last non-NA value plus the values in another vector in a rolling manner

I have a data frame that is already ordered, like this:

mydf <- data.frame(ID="A1", Level=c("domain", "kingdom", "phylum", "class", "order", "family", "genus", "species"), Taxonomy=c("D__Eukaryota","K__Chloroplastida",NA,"C__Mamiellophyceae",NA,NA,"G__Crustomastix","S__Crustomastix sp. MBIC10709"), Letter=c("D","K","P","C","O","F","G","S"))

  ID   Level                      Taxonomy Letter
1 A1  domain                  D__Eukaryota      D
2 A1 kingdom             K__Chloroplastida      K
3 A1  phylum                          <NA>      P
4 A1   class            C__Mamiellophyceae      C
5 A1   order                          <NA>      O
6 A1  family                          <NA>      F
7 A1   genus               G__Crustomastix      G
8 A1 species S__Crustomastix sp. MBIC10709      S

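What I want is to replace each NA value with the last non-NA value, prepending in a rolling manner all the letters that were "missed" along the way... see below for what I mean.

The goal is to get a data frame like this: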

  ID   Level                      Taxonomy Letter
1 A1  domain                  D__Eukaryota      D
2 A1 kingdom             K__Chloroplastida      K
3 A1  phylum          P__K__Chloroplastida      P
4 A1   class            C__Mamiellophyceae      C
5 A1   order         O__C__Mamiellophyceae      O
6 A1  family      F__O__C__Mamiellophyceae      F
7 A1   genus               G__Crustomastix      G
8 A1 species S__Crustomastix sp. MBIC10709      S

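Note the last two NAs: the last one has to carry the value of the previous one. See how the first one starts with O__C and the last one starts with F__O__C.

My best attempt so far is the following (thanks to Ajay Ohri):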

library(zoo)
mydf <- data.frame(ID="A1", Level=c("domain", "kingdom", "phylum", "class", "order", "family", "genus", "species"), Taxonomy=c("D__Eukaryota","K__Chloroplastida",NA,"C__Mamiellophyceae",NA,NA,"G__Crustomastix","S__Crustomastix sp. MBIC10709"), Letter=c("D","K","P","C","O","F","G","S"))
mydf <- data.frame(lapply(mydf, as.character), stringsAsFactors=FALSE)
mydf$Letter2 <- ifelse(is.na(mydf$Taxonomy),paste(mydf$Letter,'__',sep=''),"")
mydf
mydf$Taxonomy <- paste(mydf$Letter2, na.locf(mydf$Taxonomy), sep='')
mydf

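Note that I still cannot do it in a rolling manner (for the last NA I get F__C instead of F__O__C). Any help? Thanks!

PS: If this is still confusing, let me know and I will make another MWE with more consecutive NAs, so that what I need is more obvious.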

Step 1

I would first create a column with ifelse

data$colnew=ifelse(is.na(data$Taxonomy),"missed","")

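You can skip this step if you do not intend to paste the word missed

Step 2: take the last value

From Replacing NAs with latest non-NA value (see other approaches there)

Use the na.locf() function from the zoo package to carry the last observation forward and replace your NA values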
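As a minimal illustration of this step (assuming the zoo package is installed; the vector x is made up for the example), na.locf() carries each value forward over the NAs that follow it. Passing na.rm = FALSE keeps a leading NA in place, so the result has the same length as the input:

```r
library(zoo)

# Hypothetical example vector, not taken from the question
x <- c(NA, "a", NA, NA, "b")

# na.rm = FALSE keeps the leading NA, so the length is unchanged
filled <- na.locf(x, na.rm = FALSE)
filled
# NA "a" "a" "a" "b"
```

With the default na.rm = TRUE the leading NA is dropped and the result is one element shorter, which matters when the result is assigned back into a data frame column.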

New function

repeat_last = function(x, forward = TRUE, maxgap = Inf, na.rm = FALSE) {
    if (!forward) x = rev(x)           # reverse x twice if carrying backward
    ind = which(!is.na(x))             # get positions of nonmissing values
    if (is.na(x[1]) && !na.rm)         # if it begins with NA
        ind = c(1,ind)                 # add first pos
    rep_times = diff(                  # diffing the indices + length yields how often
        c(ind, length(x) + 1) )          # they need to be repeated
    if (maxgap < Inf) {
        exceed = rep_times - 1 > maxgap  # exceeding maxgap
        if (any(exceed)) {               # any exceed?
            ind = sort(c(ind[exceed] + 1, ind))      # add NA in gaps
            rep_times = diff(c(ind, length(x) + 1) ) # diff again
        }
    }
    x = rep(x[ind], times = rep_times) # repeat the values at these indices
    if (!forward) x = rev(x)           # second reversion
    x
}

The formr package (GitHub only) also has this function. https://github.com/rubenarslan/formr

Step 3

Use paste to concatenate the two columns (the new one and df$Letter) into a third column
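Pasting only each row's own letter onto the carried-forward value gives F__C for the last NA rather than F__O__C. One way to make the prefix accumulate is to fill row by row, so that each NA is built from the already-filled row above it. The following is a sketch in base R under that assumption; fill_rolling is a hypothetical helper name, not a function from zoo or any other package:

```r
taxonomy <- c("D__Eukaryota", "K__Chloroplastida", NA, "C__Mamiellophyceae",
              NA, NA, "G__Crustomastix", "S__Crustomastix sp. MBIC10709")
letter <- c("D", "K", "P", "C", "O", "F", "G", "S")

# Hypothetical helper: fill each NA from the (already filled) previous row,
# prepending the current row's letter. A leading NA is left as NA.
fill_rolling <- function(taxonomy, letter) {
  out <- as.character(taxonomy)
  for (i in seq_along(out)[-1]) {
    if (is.na(out[i]) && !is.na(out[i - 1])) {
      out[i] <- paste0(letter[i], "__", out[i - 1])
    }
  }
  out
}

filled <- fill_rolling(taxonomy, letter)
filled[c(3, 5, 6)]
# "P__K__Chloroplastida" "O__C__Mamiellophyceae" "F__O__C__Mamiellophyceae"
```

A plain loop like this is easy to follow but processes one row at a time and ignores the ID column, so it would need to be applied per ID on data with several groups.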

One way is to use Reduce with the accumulate = TRUE argument, i.e.

ind <- is.na(mydf$Taxonomy)
mydf$Taxonomy <- zoo::na.locf(mydf$Taxonomy)
mydf$Taxonomy[ind] <- paste0(with(mydf, ave(Level, Taxonomy, FUN = function(i) 
     Reduce(paste, toupper(substr(rev(i), 1, 1)), accumulate = TRUE)))[ind], '_', 
                                                      sub('.*_', '', mydf$Taxonomy[ind]))

mydf$Taxonomy <- gsub(' ', '_', mydf$Taxonomy)


mydf
#  ID   Level                      Taxonomy Letter
#1 A1  domain                  D__Eukaryota      D
#2 A1 kingdom             K__Chloroplastida      K
#3 A1  phylum            P_K_Chloroplastida      P
#4 A1   class            C__Mamiellophyceae      C
#5 A1   order           F_O_Mamiellophyceae      O
#6 A1  family         F_O_C_Mamiellophyceae      F
#7 A1   genus               G__Crustomastix      G
#8 A1 species S__Crustomastix_sp._MBIC10709      S
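To see what accumulate = TRUE contributes, here is a small stand-alone illustration on the letters of one run (class, order, family). The second call is a variant of my own, not part of the answer above: it prepends each new letter with "__" as separator, which yields the prefix format asked for in the question.

```r
letters_in_run <- c("C", "O", "F")   # class, order, family

# accumulate = TRUE returns every intermediate result of the fold
plain <- Reduce(paste, letters_in_run, accumulate = TRUE)
plain
# "C"     "C O"   "C O F"

# Variant (assumption): prepend each new letter, separated by "__"
prefixes <- Reduce(function(acc, l) paste(l, acc, sep = "__"),
                   letters_in_run, accumulate = TRUE)
prefixes
# "C"       "O__C"    "F__O__C"
```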

As the OP mentioned that memory consumption is crucial, here is a data.table approach that uses the na.locf() function from the zoo package:

library(data.table)   # CRAN version 1.10.4 used
# coerce to data.table, convert factors to characters
DT <- data.table(mydf)[, lapply(.SD, as.character)]
# set marker for NA rows 
DT[, na := is.na(Taxonomy)][]
# fill NA by Last Observation Carried Forward
DT[, Taxonomy := zoo::na.locf(Taxonomy)][]
# create list of Letters and unique row count within each group of missing taxonomies
DT[(na), `:=`(tmp = .(Letter), rn = seq_len(.N)), by = .(ID, Taxonomy)][]
# replace incomplete taxonomies
DT[(na), Taxonomy := paste(c(rev(unlist(tmp)[1:rn]), Taxonomy), collapse = "__"), 
   by = .(ID, Taxonomy, rn)][]
# clean up
DT[, c("na", "tmp", "rn") := NULL][]
   ID   Level                      Taxonomy Letter
1: A1  domain                  D__Eukaryota      D
2: A1 kingdom             K__Chloroplastida      K
3: A1  phylum          P__K__Chloroplastida      P
4: A1   class            C__Mamiellophyceae      C
5: A1   order         O__C__Mamiellophyceae      O
6: A1  family      F__O__C__Mamiellophyceae      F
7: A1   genus               G__Crustomastix      G
8: A1 species S__Crustomastix sp. MBIC10709      S

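I have avoided chaining the expressions so that the code can be executed step by step.

Note that data.table updates in place rather than copying the whole data set, which saves memory and time.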
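A small sketch of what updating in place means in practice (the table and the names alias and snapshot are made up for the illustration): assigning a data.table to a second name does not copy it, so a column added with := is visible through both names, while copy() produces an independent table.

```r
library(data.table)

DT <- data.table(x = 1:3)
alias <- DT            # no copy: both names refer to the same table
snapshot <- copy(DT)   # explicit, independent copy

DT[, y := x * 2L]      # adds column y by reference

names(alias)     # "x" "y"
names(snapshot)  # "x"
```

This is also why copy() is needed whenever the original table must stay untouched.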

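Prerequisites and additional notes

In a reply, the OP stated that the starting data frame is ordered and free of redundancy, and that ID + Level should be a unique key for the data frame.

However, since the solution above depends on these assumptions, it is worth adding some checks: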

# (1) ID + Level are unique keys: find duplicate Levels per ID
stopifnot(anyDuplicated(DT, by = c("ID", "Level")) == 0L)
# (2) rows missing: count rows per ID, there should be 8 Levels
DT[, .N, by = ID][, stopifnot(all(N == 8L))]
# (3) order, wrong Level names, and tests (1) and (2) as well
# create data.table with Level in proper order and a sequence number ln
levels <- data.table(
  ln = 1:8,
  Level = c("domain", "kingdom", "phylum", "class", "order", "family", "genus", "species")
)
# left inner join, i.e., keep only rows with matching Level, keep order of DT
# then check for consecutively ascending level sequence numbers
levels[DT, on = "Level", nomatch = 0][, stopifnot(all(diff(ln) == 1L)), by = ID]

In addition, it must be ensured that Taxonomy is specified at least for the top Level "domain". This can be double-checked by:

# count number of rows with missing Taxonomy on top level "domain"
stopifnot(nrow(DT[Level == "domain" & is.na(Taxonomy)]) == 0L)

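The grouping logic by = .(ID, Taxonomy) is used together with the selection on na, i.e., DT[(na), ..., so that the additional letters are prepended to Taxonomy only where Taxonomy was originally missing. During development of the solution, I introduced an additional helper column gn := rleid(ID, Taxonomy), which would cover the duplicates that were mentioned. In the end, I recognized that, because of the prerequisites, I could scrap this column.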
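For readers unfamiliar with rleid(), here is a minimal illustration of why such a helper column would handle repeated values (the vector is made up for the example). Grouping by the value itself would merge all rows with the same Taxonomy, whereas rleid() assigns a new id to every consecutive run:

```r
library(data.table)

# Hypothetical values: "A" appears in two separate runs
taxonomy <- c("A", "A", "B", "A")

run_id <- rleid(taxonomy)
run_id
# 1 1 2 3
```

Grouping by run_id keeps the last "A" separate from the first two, which grouping by taxonomy alone would not.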

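Since you mentioned memory and performance problems, you switched to the accepted data.table solution.

I am adding another data.table variant that does not depend on other packages such as zoo. It may be fast enough if the Taxonomy column does not contain overly long runs of NAs, because the longest run determines the number of iterations of the loop (two iterations for the example data):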

library(data.table)

mydf <- data.frame(ID="A1", Level=c("domain", "kingdom", "phylum", "class", "order", "family", "genus", "species"), Taxonomy=c("D__Eukaryota","K__Chloroplastida",NA,"C__Mamiellophyceae",NA,NA,"G__Crustomastix","S__Crustomastix sp. MBIC10709"), Letter=c("D","K","P","C","O","F","G","S"))

setDT(mydf)

# Fill NA value in "Taxonomy" with the value of the prev. row until no NAs occur anymore
prev.number.NAs <- 0    # required to stop the loop if no more NA values can be carried forward
repeat {
  number.NAs <- sum(is.na(mydf$Taxonomy))
  if (number.NAs == 0 || number.NAs == prev.number.NAs) break
  mydf[, filler := shift(Taxonomy), by = .(ID)]     # fill temporary working column with the value of the prev. row of the same group
  mydf[!is.na(filler) & is.na(Taxonomy), Taxonomy := paste0(Letter, "__", filler)]
  prev.number.NAs <- number.NAs
}


mydf[, filler := NULL]   # remove working column
mydf

Unfortunately, data.table's shift function does not offer a "last observation carried forward" option, so I had to use a while loop.
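A minimal illustration of the limitation (the vector is made up for the example): shift() only lags by a fixed number of positions, so the second NA in a run sees another NA rather than the last observed value, and one more pass is needed for every additional consecutive NA.

```r
library(data.table)

# Hypothetical example vector with a run of two NAs
x <- c("a", NA, NA, "b")

lagged <- shift(x)   # default: lag by one position, fill with NA
lagged
# NA  "a" NA  NA
```

Position 2 can be filled from lagged[2], but position 3 only becomes fillable after position 2 has been filled and shift() is applied again.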

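Update 1: As @UweBlock mentioned in his comment below, I have replaced the while loop with a repeat loop to avoid an infinite loop when the Taxonomy column has an NA value in the first row. Thanks for finding this!

Update 2: Carrying the last observation forward is now done only within the same group of data (defined by the ID column, as the OP pointed out in a comment). Thanks to @UweBlock for pointing out this issue!

A way to fill NA values that leaves an NA at the start as NA, and that also simplifies the logic when working with groups: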

forward_fill <- function (x) {
  if (length(x) == 0) return (vector(mode(x), 0))      

  xt  <- tail(x, -1)  
  x0  <- c(x[1], xt[!is.na(xt)])
  id0 <- c(TRUE,    !is.na(xt))
  y   <- x0[cumsum(id0)]
  
  return (y)
}
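The indexing trick inside forward_fill can be traced on a small made-up vector. The sketch below repeats the same logic inline so that it runs on its own, and then applies it per group with base R's ave(); the grouping vector g is an assumption for illustration:

```r
x <- c(NA, "a", NA, NA, "b", NA)

keep <- c(TRUE, !is.na(x[-1]))  # always keep position 1, even if it is NA
vals <- x[keep]                 # NA "a" "b"
filled <- vals[cumsum(keep)]    # cumsum(keep) is 1 2 2 2 3 3
filled
# NA  "a" "a" "a" "b" "b"

# Per-group filling; g is a made-up grouping vector
g <- c(1, 1, 1, 2, 2, 2)
by_group <- ave(x, g, FUN = function(v) {
  k <- c(TRUE, !is.na(v[-1]))
  v[k][cumsum(k)]
})
by_group
# NA  "a" "a" NA  "b" "b"
```

Because the first element of each group is always kept, a value never leaks from one group into the next.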