Conditionally replace missing values depending on surrounding non-missing values
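I am trying to replace missing values (NA) in a vector. NAs between two equal numbers should be replaced by that number. NAs between two different values should stay NA. For example, given the vector "a", I would like it to become "b".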
a = c(1, NA, NA, NA, 1, NA, NA, NA, 2, NA, NA, 2, 3, NA, NA, 3)
b = c(1, 1, 1, 1, 1, NA, NA, NA, 2, 2, 2, 2, 3, 3, 3, 3)
As you can see, the second run of NAs, between the values 1 and 2, is not replaced.
Is there a vectorized way to do this?
You can create a function like this:
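fill_data <- function(vec) {
  # Loop over every distinct non-NA value
  for (l in unique(vec[!is.na(vec)])) {
    # Positions at which this value occurs
    g <- which(vec %in% l)
    # seq_len() gives an empty loop when the value occurs only once
    # (1:(length(g) - 1) would iterate over c(1, 0) and fail)
    for (i in seq_len(length(g) - 1)) {
      # Adjacent occurrences leave no gap to fill
      if (g[i + 1] - g[i] < 2) next
      gap <- (g[i] + 1):(g[i + 1] - 1)
      # Fill the gap only if it consists entirely of NAs
      if (all(is.na(vec[gap]))) {
        vec[gap] <- l
      }
    }
  }
  vec
}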
Running the function:
a = c(1, NA, NA, NA, 1, NA, NA, NA, 2, NA, NA, 2, 3, NA, NA, 3)
fill_data(a)
[1] 1 1 1 1 1 NA NA NA 2 2 2 2 3 3 3 3
It also works if the same value appears in several different places in the vector:
ab = c(1, NA, NA, NA, 1, NA, NA, NA, 1, NA, 2, NA, NA, NA, 2, NA , 1, NA, 1, 3, NA, NA, 3)
fill_data(ab)
[1] 1 1 1 1 1 1 1 1 1 NA 2 2 2 2 2 NA 1 1 1 3 3 3 3
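Explanation:
First, it finds the unique non-NA values.
Then, for each unique non-NA value, it gets the indexes where that value occurs and the positions lying between consecutive occurrences.
Finally, it tests whether the values at those positions are all NA and, if so, replaces them with that value.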
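To make these steps concrete, here is a small trace of what the loop sees for a single value. It is only an illustrative sketch: the variable names first_gap, third_gap and the two flags are made up for this example and do not appear in the function.

```r
ab <- c(1, NA, NA, NA, 1, NA, NA, NA, 1, NA, 2, NA, NA, NA, 2, NA, 1, NA, 1, 3, NA, NA, 3)

# Positions at which the value 1 occurs
g <- which(ab %in% 1)
g
# [1]  1  5  9 17 19

# Gap between the 1st and 2nd occurrence: positions 2 to 4, all NA, so it gets filled
first_gap <- (g[1] + 1):(g[2] - 1)
first_all_na <- all(is.na(ab[first_gap]))

# Gap between the 3rd and 4th occurrence: positions 10 to 16 contain 2s, so it is left alone
third_gap <- (g[3] + 1):(g[4] - 1)
third_all_na <- all(is.na(ab[third_gap]))
```

Here first_all_na is TRUE and third_all_na is FALSE, which is why positions 10 and 16 remain NA in the result above.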
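You can use convenience functions from the zoo package. Here, we replace the NAs in the original vector wherever the interpolated value (created by na.approx) equals the 'last observation carried forward' value (created by na.locf):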
library(zoo)
a_ap <- na.approx(a)
a_locf <- na.locf(a)
a[which(a_ap == a_locf)] <- a_ap[which(a_ap == a_locf)]
a
# [1] 1 1 1 1 1 NA NA NA 2 2 2 2 3 3 3 3
To account for leading and trailing NAs, add na.rm = FALSE:
a <- c(NA, 1, NA, NA, NA, 1, NA, NA, NA, 2, NA, NA, 2, 3, NA, NA, 3, NA)
a_ap <- na.approx(a, na.rm = FALSE)
a_locf <- na.locf(a, na.rm = FALSE)
a[which(a_ap == a_locf)] <- a_ap[which(a_ap == a_locf)]
a
# [1] NA 1 1 1 1 1 NA NA NA 2 2 2 2 3 3 3 3 NA
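To see why comparing the two helper vectors works, it may help to look at them side by side. The following is only a base R sketch of the same idea, not the zoo implementation: it assumes stats::approx() as a stand-in for na.approx and a cummax-based index trick as a stand-in for na.locf, and the names interp, locf, hit and filled are made up for illustration.

```r
a <- c(1, NA, NA, NA, 1, NA, NA, NA, 2, NA, NA, 2, 3, NA, NA, 3)
ok <- !is.na(a)

# Linear interpolation through the non-NA points (stand-in for na.approx).
# Between 1 and 2 this yields 1.25, 1.5, 1.75; between equal values it
# yields that value itself.
interp <- approx(x = which(ok), y = a[ok], xout = seq_along(a))$y

# Last observation carried forward (stand-in for na.locf):
# index of the most recent non-NA position, NA before the first one
idx <- cummax(seq_along(a) * ok)
idx[idx == 0] <- NA
locf <- a[idx]

# The two vectors agree exactly where a gap is bounded by equal values
hit <- which(interp == locf)
filled <- a
filled[hit] <- interp[hit]
filled
```

In the gap between 1 and 2 the interpolated values differ from the carried-forward 1, so those positions are not in hit and stay NA.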
The OP asked for a vectorized solution, so here is a possible vectorized base R solution (no for loops) that also handles leading/trailing NAs:
# Define a vector with leading/trailing NAs
a <- c(NA, NA, 1, NA, NA, NA, 1, NA, NA, NA, 2, NA, NA, 2, 3, NA, NA, 3, NA, NA)
# Save the logical vector, as we are going to reuse it a lot
na_vals <- is.na(a)
# For each NA, find the index (among the non-NAs) of the nearest non-NA to its left
ind <- findInterval(which(na_vals), which(!na_vals))
# Find the consecutive non-NA values that are equal
ind2 <- which(!diff(a[!na_vals]))
# Fill only the NAs that lie between equal consecutive values
a[na_vals] <- a[!na_vals][ind2[match(ind, ind2)]]
a
# [1] NA NA 1 1 1 1 1 NA NA NA 2 2 2 2 3 3 3 3 NA NA
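For reuse, the same steps could be wrapped in a function. This is a minimal sketch under a few assumptions: fill_between_equal is a hypothetical name, the early return for vectors with no NAs or only NAs is an added guard, and diff(vals) == 0 is used as an equivalent spelling of !diff(...).

```r
# fill_between_equal is a hypothetical name used only for this sketch
fill_between_equal <- function(x) {
  na_vals <- is.na(x)
  # Nothing to do if there are no NAs or no observed values at all
  if (!any(na_vals) || all(na_vals)) return(x)
  vals <- x[!na_vals]
  # For each NA: index (within vals) of the nearest non-NA to its left, 0 if none
  ind <- findInterval(which(na_vals), which(!na_vals))
  # Indices k for which vals[k] equals vals[k + 1]
  ind2 <- which(diff(vals) == 0)
  # match() returns NA for NAs not enclosed by equal values, so they stay NA
  x[na_vals] <- vals[ind2[match(ind, ind2)]]
  x
}

fill_between_equal(c(NA, 1, NA, 1, NA, 2, NA))
# [1] NA  1  1  1 NA  2 NA
```

In this small example ind is 0, 1, 2, 3 and ind2 is 1, so only the NA whose left neighbour is the first non-NA value gets filled. The leading NA (index 0) and the trailing NA (index 3, beyond the range of diff) never match.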
Some timing comparisons on a big vector:
# Create a big vector
set.seed(123)
a <- sample(c(NA, 1:5), 5e7, replace = TRUE)
############################################
##### Cainã Max Couto-Silva
fill_data <- function(vec) {
  for(l in unique(vec[!is.na(vec)])) {
    g <- which(vec %in% l)
    indexes <- list()
    for(i in 1:(length(g) - 1)) {
      indexes[[i]] <- (g[i]+1):(g[i+1]-1)
    }
    for(i in 1:(length(g) - 1)) {
      if(all(is.na(vec[indexes[[i]]]))) {
        vec[indexes[[i]]] <- l
      }
    }
  }
  return(vec)
}
system.time(res <- fill_data(a))
# user system elapsed
# 81.73 4.41 86.48
############################################
##### Henrik
system.time({
a_ap <- na.approx(a, na.rm = FALSE)
a_locf <- na.locf(a, na.rm = FALSE)
a[which(a_ap == a_locf)] <- a_ap[which(a_ap == a_locf)]
})
# user system elapsed
# 12.55 3.39 15.98
# Validate
identical(res, as.integer(a))
# [1] TRUE
############################################
##### David
## Recreate a, as it has been overwritten
set.seed(123)
a <- sample(c(NA, 1:5), 5e7, replace = TRUE)
system.time({
  # Save the logical vector, as we are going to reuse it a lot
  na_vals <- is.na(a)
  # For each NA, find the index (among the non-NAs) of the nearest non-NA to its left
  ind <- findInterval(which(na_vals), which(!na_vals))
  # Find the consecutive non-NA values that are equal
  ind2 <- which(!diff(a[!na_vals]))
  # Fill only the NAs that lie between equal consecutive values
  a[na_vals] <- a[!na_vals][ind2[match(ind, ind2)]]
})
# user system elapsed
# 3.39 0.71 4.13
# Validate
identical(res, a)
# [1] TRUE