Replace NA with previous occurrence

Question

Here is a sample of my CSV data. It contains about 10 columns.

    Product_id    Product_Weight    Product_Name    Shop_Name ...
[1]    A             10                xxxx            Walmart
[2]    B             12                yyyy            Target
[3]    C             11                zzzz            Target
[4]    A             NA                xxxx            Walmart
[5]    C             NA                zzzz            Target

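I want to fill the NAs in rows 4 and 5 with 10 and 11, respectively (since the product weights for A and C are already known from rows 1 and 3). I want the final data frame to look like this: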

    Product_id    Product_Weight    Product_Name    Shop_Name ...
[1]    A             10                xxxx            Walmart
[2]    B             12                yyyy            Target
[3]    C             11                zzzz            Target
[4]    A             10                xxxx            Walmart
[5]    C             11                zzzz            Target

What is the best way to do this in R?

Answer 1

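Although the question asks for the "previous occurrence", that approach has a drawback: if the first Product_Weight of any Product_id is NA, it cannot be filled in, even when the Product_Weight is known from a later row with the same Product_id. So instead of using the previous occurrence, we take the mean of all non-NA values having the same Product_id. Since these should all be the same, their mean is their common value.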
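To make the difference concrete, here is a minimal sketch comparing the two approaches on a single group whose first weight is missing. The vector x is hypothetical and not part of the question's data; it assumes the zoo package is installed.

```r
library(zoo)

# Hypothetical weights for one Product_id whose first value is missing
x <- c(NA, 10, NA)

# Carrying the last observation forward leaves the leading NA in place,
# because there is no earlier value to copy
carried <- na.locf(x, na.rm = FALSE)
carried
# NA 10 10

# Replacing NAs by the mean of the non-NA values fills every position
averaged <- na.aggregate(x)
averaged
# 10 10 10
```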

If you really do want the previous occurrence, use the Prev function, where:

Prev <- function(x) na.locf(x, na.rm = FALSE)

in place of na.aggregate in (1) and (3), and do not use (2).
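As a sketch of that substitution, the following applies Prev within each Product_id using ave. The data frame is rebuilt inline from the question's sample (with only two of its columns) so that the snippet runs on its own; it assumes the zoo package is installed.

```r
library(zoo)

# Sample data rebuilt from the question (only the two relevant columns)
DF <- data.frame(
  Product_id     = c("A", "B", "C", "A", "C"),
  Product_Weight = c(10, 12, 11, NA, NA)
)

# Carry the last non-NA value forward, keeping any leading NAs
Prev <- function(x) na.locf(x, na.rm = FALSE)

# ave() applies Prev separately to the weights of each Product_id
# and puts the results back in the original row order
filled_prev <- transform(DF, Product_Weight = ave(Product_Weight, Product_id, FUN = Prev))
filled_prev
```

Because transform returns a new data frame, DF itself still contains the NAs afterwards.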

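The following solutions all have these advantages:

they preserve the order of the input
they work even if the first Product_Weight in any Product_id is NA
they do not modify the input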

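The first solution has the additional advantage of being only one line of code (plus a library statement), and the second has the additional advantage of not using any packages.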

1) zoo::na.aggregate We use na.aggregate from the zoo package (which replaces all NAs with the mean of the non-NA values) and apply it to Product_Weight separately for each Product_id.

library(zoo)
transform(DF, Product_Weight = ave(Product_Weight, Product_id, FUN = na.aggregate))

giving:

  Product_id Product_Weight Product_Name Shop_Name
1          A             10         xxxx   Walmart
2          B             12         yyyy    Target
3          C             11         zzzz    Target
4          A             10         xxxx   Walmart
5          C             11         zzzz    Target

2) No packages Alternatively, use Mean in place of na.aggregate, where Mean is defined as:

Mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
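Plugged into the same ave call as in (1), this might look like the sketch below. The data frame is rebuilt inline from the question's sample (two columns only) so the snippet is self-contained and needs no packages.

```r
# Sample data rebuilt from the question (only the two relevant columns)
DF <- data.frame(
  Product_id     = c("A", "B", "C", "A", "C"),
  Product_Weight = c(10, 12, 11, NA, NA)
)

# Replace each NA by the mean of the non-NA values in the same vector.
# Note: if every value in a group is NA, the mean is NaN, so those
# entries become NaN rather than being filled.
Mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# Apply Mean within each Product_id, keeping the original row order
filled_mean <- transform(DF, Product_Weight = ave(Product_Weight, Product_id, FUN = Mean))
filled_mean
```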

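3) dplyr/zoo Add row numbers, group by Product_id, fill in the NAs using na.aggregate as shown below (or Mean), arrange back to the original order, and remove the row numbers: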

library(dplyr)
library(zoo)

DF %>% 
   mutate(row = row_number()) %>% 
   group_by(Product_id) %>% 
   mutate(Product_Weight = na.aggregate(Product_Weight)) %>% 
   ungroup() %>% 
   arrange(row) %>% 
   select(-row)

Note: This was used for the input DF:

Lines <- "    Product_id    Product_Weight    Product_Name    Shop_Name
    A             10                xxxx            Walmart
    B             12                yyyy            Target
    C             11                zzzz            Target
    A             NA                xxxx            Walmart
    C             NA                zzzz            Target"
DF <- read.table(text = Lines, header = TRUE)

Answer 2

Another option, using dplyr and tidyr:

library(dplyr); library(tidyr);
df %>% group_by(Product_id) %>% fill(Product_Weight)

Source: local data frame [5 x 4]
Groups: Product_id [3]

  Product_id Product_Weight Product_Name Shop_Name
      (fctr)          (int)       (fctr)    (fctr)
1          A             10         xxxx   Walmart
2          A             10         xxxx   Walmart
3          B             12         yyyy    Target
4          C             11         zzzz    Target
5          C             11         zzzz    Target

Note, though, that the result is sorted by Product_id.
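If the original row order matters, or if a group might start with an NA, one possible variation is sketched below. It borrows the row-number trick from Answer 1 and assumes a tidyr version whose fill accepts .direction = "downup" (tidyr 1.0.0 or later); the data frame is rebuilt inline from the question's sample.

```r
library(dplyr)
library(tidyr)

# Sample data rebuilt from the question (only the two relevant columns)
DF <- data.frame(
  Product_id     = c("A", "B", "C", "A", "C"),
  Product_Weight = c(10, 12, 11, NA, NA)
)

filled <- DF %>%
  mutate(row = row_number()) %>%                   # remember the input order
  group_by(Product_id) %>%
  fill(Product_Weight, .direction = "downup") %>%  # fill down first, then up for leading NAs
  ungroup() %>%
  arrange(row) %>%                                 # restore the input order
  select(-row)
filled
```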

Answer 3

Here is a solution using base R commands:

# Create a lookup table of the known Product_id / Product_Weight combinations
lookup <- unique(df[complete.cases(df[, 1:2]), ])

# Rows whose weight is missing and needs replacing
missing <- which(is.na(df$Product_Weight))

# Match each missing row's Product_id against the lookup table
# and copy the corresponding weight over
df$Product_Weight[missing] <- lookup$Product_Weight[match(df$Product_id[missing], lookup$Product_id)]

This is most likely less efficient than the dplyr/tidyr solutions mentioned above.
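For comparison, the same lookup idea can also be written with a named vector instead of a lookup data frame. This is only a sketch of one alternative, not part of the original answer; the names known, weights, and missing are illustrative, and the data frame is rebuilt inline from the question's sample.

```r
# Sample data rebuilt from the question (only the two relevant columns)
DF <- data.frame(
  Product_id     = c("A", "B", "C", "A", "C"),
  Product_Weight = c(10, 12, 11, NA, NA)
)

# Keep the first known weight for each Product_id
known <- DF[!is.na(DF$Product_Weight), ]
known <- known[!duplicated(known$Product_id), ]

# Named vector: names are product ids, values are weights
weights <- setNames(known$Product_Weight, as.character(known$Product_id))

# Look up the weight of every row that is missing one
missing <- is.na(DF$Product_Weight)
DF$Product_Weight[missing] <- unname(weights[as.character(DF$Product_id[missing])])
DF
```

Unlike the transform-based solutions in Answer 1, this modifies DF in place. A Product_id with no known weight at all stays NA.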
