Fill NA values in a vector with last non-NA value plus the values in another vector in a rolling manner
I have a data frame that is already ordered, as shown below:
mydf <- data.frame(ID="A1", Level=c("domain", "kingdom", "phylum", "class", "order", "family", "genus", "species"), Taxonomy=c("D__Eukaryota","K__Chloroplastida",NA,"C__Mamiellophyceae",NA,NA,"G__Crustomastix","S__Crustomastix sp. MBIC10709"), Letter=c("D","K","P","C","O","F","G","S"))
  ID   Level                      Taxonomy Letter
1 A1  domain                  D__Eukaryota      D
2 A1 kingdom             K__Chloroplastida      K
3 A1  phylum                          <NA>      P
4 A1   class            C__Mamiellophyceae      C
5 A1   order                          <NA>      O
6 A1  family                          <NA>      F
7 A1   genus               G__Crustomastix      G
8 A1 species S__Crustomastix sp. MBIC10709      S
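What I want is to replace each NA value with the last non-NA value, prepending all the "missed" letters in a rolling manner... see below for what I mean.
The goal is to obtain a data frame like this: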
  ID   Level                      Taxonomy Letter
1 A1  domain                  D__Eukaryota      D
2 A1 kingdom             K__Chloroplastida      K
3 A1  phylum          P__K__Chloroplastida      P
4 A1   class            C__Mamiellophyceae      C
5 A1   order         O__C__Mamiellophyceae      O
6 A1  family      F__O__C__Mamiellophyceae      F
7 A1   genus               G__Crustomastix      G
8 A1 species S__Crustomastix sp. MBIC10709      S
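Note the last two NAs: the last one has to carry the value of the previous one. See how the first one starts with O__C and the last one starts with F__O__C.
My best attempt so far is the following (thanks to Ajay Ohri):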
library(zoo)
mydf <- data.frame(ID="A1", Level=c("domain", "kingdom", "phylum", "class", "order", "family", "genus", "species"), Taxonomy=c("D__Eukaryota","K__Chloroplastida",NA,"C__Mamiellophyceae",NA,NA,"G__Crustomastix","S__Crustomastix sp. MBIC10709"), Letter=c("D","K","P","C","O","F","G","S"))
mydf <- data.frame(lapply(mydf, as.character), stringsAsFactors=FALSE)
mydf$Letter2 <- ifelse(is.na(mydf$Taxonomy),paste(mydf$Letter,'__',sep=''),"")
mydf
mydf$Taxonomy <- paste(mydf$Letter2, na.locf(mydf$Taxonomy), sep='')
mydf
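Note that I still cannot do it in a rolling manner (for the last NA I get F__C instead of F__O__C). Any help? Thanks!
PS: Let me know if it is still confusing, and I will make another MWE with more consecutive NAs so that what I need becomes more obvious.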
Step 1
I would first create a column with ifelse:
data$colnew=ifelse(is.na(data$Taxonomy),"missed","")
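You can skip this step if you do not intend to paste the word "missed".
Step 2
Carry the last value forward.
From "Replacing NAs with latest non-NA value" (see there for other approaches):
Use the na.locf() function from the zoo package, which carries the last observation forward, to replace your NA values
or
a new function: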
repeat_last = function(x, forward = TRUE, maxgap = Inf, na.rm = FALSE) {
  if (!forward) x = rev(x)              # reverse x twice if carrying backward
  ind = which(!is.na(x))                # get positions of nonmissing values
  if (is.na(x[1]) && !na.rm)            # if it begins with NA
    ind = c(1, ind)                     # add first pos
  rep_times = diff(                     # diffing the indices + length yields how often
    c(ind, length(x) + 1))              # they need to be repeated
  if (maxgap < Inf) {
    exceed = rep_times - 1 > maxgap     # exceeding maxgap
    if (any(exceed)) {                  # any exceed?
      ind = sort(c(ind[exceed] + 1, ind))       # add NA in gaps
      rep_times = diff(c(ind, length(x) + 1))   # diff again
    }
  }
  x = rep(x[ind], times = rep_times)    # repeat the values at these indices
  if (!forward) x = rev(x)              # second reversion
  x
}
This function is also available in the formr package (GitHub only).
https://github.com/rubenarslan/formr
Step 3
Use paste to concatenate the two columns (the new one and df$Letter) into a third column.
One way is to use Reduce with the argument accumulate = TRUE, i.e.
ind <- is.na(mydf$Taxonomy)
mydf$Taxonomy <- zoo::na.locf(mydf$Taxonomy)
mydf$Taxonomy[ind] <- paste0(with(mydf, ave(Level, Taxonomy, FUN = function(i)
Reduce(paste, toupper(substr(rev(i), 1, 1)), accumulate = TRUE)))[ind], '_',
sub('.*_', '', mydf$Taxonomy[ind]))
mydf$Taxonomy <- gsub(' ', '_', mydf$Taxonomy)
mydf
# ID Level Taxonomy Letter
#1 A1 domain D__Eukaryota D
#2 A1 kingdom K__Chloroplastida K
#3 A1 phylum P_K_Chloroplastida P
#4 A1 class C__Mamiellophyceae C
#5 A1 order F_O_Mamiellophyceae O
#6 A1 family F_O_C_Mamiellophyceae F
#7 A1 genus G__Crustomastix G
#8 A1 species S__Crustomastix_sp._MBIC10709 S
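Note that the output above differs from the layout requested in the question: it uses single underscores, and the order row reads F_O rather than O__C. The following is a minimal base R sketch that keeps the Reduce/accumulate idea but walks over the rows, so that each filled value builds on the previous one. The helper name rolling_fill is hypothetical, and the sketch assumes a single ID whose rows are already in order.

```r
# Hypothetical helper, not part of the answer above: rolling fill with Reduce().
# Assumes the rows belong to one ID and are ordered from domain to species.
rolling_fill <- function(taxonomy, letter) {
  taxonomy <- as.character(taxonomy)
  letter <- as.character(letter)
  if (length(taxonomy) < 2L) return(taxonomy)
  step <- function(prev, i) {
    if (!is.na(taxonomy[i])) return(taxonomy[i])  # value present: keep it
    if (is.na(prev)) return(NA_character_)         # nothing to carry forward yet
    paste0(letter[i], "__", prev)                  # prepend this row's letter
  }
  # accumulate = TRUE returns every intermediate result, one per row
  Reduce(step, seq_along(taxonomy)[-1], accumulate = TRUE, init = taxonomy[1])
}

taxonomy <- c("D__Eukaryota", "K__Chloroplastida", NA, "C__Mamiellophyceae",
              NA, NA, "G__Crustomastix", "S__Crustomastix sp. MBIC10709")
letter <- c("D", "K", "P", "C", "O", "F", "G", "S")
filled <- rolling_fill(taxonomy, letter)
print(filled)
```

With several IDs the function would have to be applied per ID, for example after split().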
Since the OP mentioned that memory consumption is critical, here is a data.table approach that uses the na.locf() function from the zoo package:
library(data.table) # CRAN version 1.10.4 used
# coerce to data.table, convert factors to characters
DT <- data.table(mydf)[, lapply(.SD, as.character)]
# set marker for NA rows
DT[, na := is.na(Taxonomy)][]
# fill NA by Last Observation Carried Forward
DT[, Taxonomy := zoo::na.locf(Taxonomy)][]
# create list of Letters and unique row count within each group of missing taxonomies
DT[(na), `:=`(tmp = .(Letter), rn = seq_len(.N)), by = .(ID, Taxonomy)][]
# replace incomplete taxonomies
DT[(na), Taxonomy := paste(c(rev(unlist(tmp)[1:rn]), Taxonomy), collapse = "__"),
by = .(ID, Taxonomy, rn)][]
# clean up
DT[, c("na", "tmp", "rn") := NULL][]
   ID   Level                      Taxonomy Letter
1: A1  domain                  D__Eukaryota      D
2: A1 kingdom             K__Chloroplastida      K
3: A1  phylum          P__K__Chloroplastida      P
4: A1   class            C__Mamiellophyceae      C
5: A1   order         O__C__Mamiellophyceae      O
6: A1  family      F__O__C__Mamiellophyceae      F
7: A1   genus               G__Crustomastix      G
8: A1 species S__Crustomastix sp. MBIC10709      S
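I have avoided chaining the expressions so that the code can be executed step by step.
Note that data.table updates in place instead of copying the whole data set, which saves memory and time.
Prerequisites and additional notes
In a reply, the OP stated that the starting data frame is ordered and free of redundancy, and that ID + Level should be a unique key of the data frame.
However, since the solution above depends on these assumptions, it is worth adding some checks: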
# (1) ID + Level are unique keys: find duplicate Levels per ID
stopifnot(anyDuplicated(DT, by = c("ID", "Level")) == 0L)
# (2) rows missing: count rows per ID, there should be 8 Levels
DT[, .N, by = ID][, stopifnot(all(N == 8L))]
# (3) order, wrong Level names, and tests (1) and (2) as well
# create data.table with Level in proper order and a sequence number ln
levels <- data.table(
ln = 1:8,
Level = c("domain", "kingdom", "phylum", "class", "order", "family", "genus", "species")
)
# inner join, i.e., keep only rows with matching Level, keep order of DT
# then check for consecutively ascending level sequence numbers
levels[DT, on = "Level", nomatch = 0][, stopifnot(all(diff(ln) == 1L)), by = ID]
In addition, it must be ensured that Taxonomy is specified at least for the top Level "domain". This can be double-checked with:
# count number of rows with missing Taxonomy on top level "domain"
# (run this on the original data, before the NA values are filled)
stopifnot(nrow(DT[Level == "domain" & is.na(Taxonomy)]) == 0L)
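The grouping logic by = .(ID, Taxonomy) is used together with the selection on na, i.e., DT[(na), ..., so that the additional letters are prepended to Taxonomy only where Taxonomy was originally missing. While developing the solution I had introduced an additional helper column gn := rleid(ID, Taxonomy), which would have covered the duplicates that were mentioned. In the end I realized that, given the prerequisites, I could scrap this column.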
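To illustrate what such a run-based helper column does, here is a base R sketch of a run-length id. The function name run_id is hypothetical; it only mimics the idea behind data.table's rleid() for a single vector and assumes the input contains no NA.

```r
# Hypothetical base R stand-in for a run-length id on a single vector.
# Assumes x has no NA values (rle() treats every NA as a run of its own).
run_id <- function(x) {
  r <- rle(as.character(x))
  rep(seq_along(r$lengths), times = r$lengths)
}

# "A" occurs in two separate runs (rows 1-2 and row 6)
taxonomy <- c("A", "A", "B", "B", "B", "A")
ids <- run_id(taxonomy)
print(ids)
```

Grouping by the value alone would put rows 1, 2 and 6 into one group, whereas grouping by the run id keeps the two runs of "A" apart.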
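Since you mentioned memory and performance concerns, you switched to the accepted data.table solution.
I am adding another data.table variant that does not depend on other packages such as zoo. It may be fast enough as long as the Taxonomy column does not contain overly long runs of NA, because the longest run determines the number of loop iterations (e.g., two iterations for the sample data):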
library(data.table)
mydf <- data.frame(ID="A1", Level=c("domain", "kingdom", "phylum", "class", "order", "family", "genus", "species"), Taxonomy=c("D__Eukaryota","K__Chloroplastida",NA,"C__Mamiellophyceae",NA,NA,"G__Crustomastix","S__Crustomastix sp. MBIC10709"), Letter=c("D","K","P","C","O","F","G","S"), stringsAsFactors=FALSE)
setDT(mydf)
# Fill NA value in "Taxonomy" with the value of the prev. row until no NAs occur anymore
prev.number.NAs <- 0 # required to stop the loop if no more NA values can be carried forward
repeat {
  number.NAs <- sum(is.na(mydf$Taxonomy))
  if (number.NAs == 0 || number.NAs == prev.number.NAs) break
  mydf[, filler := shift(Taxonomy), by = .(ID)] # fill temporary working column with the value of the prev. row of the same group
  mydf[!is.na(filler) & is.na(Taxonomy), Taxonomy := paste0(Letter, "__", filler)]
  prev.number.NAs <- number.NAs
}
mydf[, filler := NULL] # remove working column
mydf
Unfortunately, data.table's shift function does not offer a "last observation carried forward" parameter, so I had to use a loop.
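The number of passes such a loop needs can be estimated up front from the longest run of consecutive NA values. A small base R sketch, in which the helper name longest_na_run is hypothetical and only a single ID is considered:

```r
# Hypothetical helper: length of the longest run of consecutive NA values.
longest_na_run <- function(x) {
  r <- rle(is.na(x))               # runs of TRUE (NA) and FALSE (non-NA)
  max(c(0L, r$lengths[r$values]))  # 0 if there is no NA at all
}

taxonomy <- c("D__Eukaryota", "K__Chloroplastida", NA, "C__Mamiellophyceae",
              NA, NA, "G__Crustomastix", "S__Crustomastix sp. MBIC10709")
n_passes <- longest_na_run(taxonomy)
print(n_passes)
```

For the sample data this gives 2, in line with the two iterations mentioned above. A run of NA values at the very start of an ID is counted as well, although the loop cannot fill it.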
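Update 1: As @UweBlock mentioned in his comment below, I have replaced the while loop with a repeat loop to avoid an infinite loop when the first row has NA in the Taxonomy column. Thanks for spotting this!
Update 2: The last observation is now carried forward only within the same group of data (defined by the ID column, as the OP pointed out in the comments). Thanks to @UweBlock for pointing out this issue!
A method that fills NA values while leaving leading NA values as NA, and that also simplifies the logic when working with groups: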
forward_fill <- function (x) {
  if (length(x) == 0) return (vector(mode(x), 0))
  xt <- tail(x, -1)
  x0 <- c(x[1], xt[!is.na(xt)])
  id0 <- c(TRUE, !is.na(xt))
  y <- x0[cumsum(id0)]
  return (y)
}
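A short usage sketch for the function above. forward_fill is repeated here so that the block runs on its own; the sample vectors and the use of ave() for per-group filling are illustrative assumptions, not part of the answer.

```r
# Copy of forward_fill from above, so the example is self-contained
forward_fill <- function(x) {
  if (length(x) == 0) return(vector(mode(x), 0))
  xt <- tail(x, -1)
  x0 <- c(x[1], xt[!is.na(xt)])   # first element plus all later non-NA values
  id0 <- c(TRUE, !is.na(xt))      # TRUE where a new value starts
  x0[cumsum(id0)]
}

# A leading NA stays NA, later NAs take the previous value
x <- c(NA, "a", NA, "b", NA)
filled <- forward_fill(x)
print(filled)

# Per-group filling: the NA at the start of group "A2" is not filled from "A1"
id <- c("A1", "A1", "A2", "A2")
value <- c("a", NA, NA, "b")
by_group <- ave(value, id, FUN = forward_fill)
print(by_group)
```

Because a leading NA is kept rather than dropped, the result always has the same length as the input, which is what makes the per-group call straightforward.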