Merge row data when there are many NAs in R

I have the following data.table:

library(data.table)
library(dplyr)  # provides %>%, mutate(), across() and na_if() used below
# Example table
table <- data.table(ID = c("Entity_A","Entity_A","Entity_B","Entity_B"),
                  Level = c("Individual_1","Individual_2","Individual_1","Individual_2"),
                  Amount1 = c("100","100","120","n.a."),
                  Amount2 = c("n.a.","40","30","30"),
                  Amount3 =c("20","n.a.","40","n.a."),
                  Amount4 =c("10","n.a.","n.a.","n.a.")
                  )
# Convert the "n.a." strings into real NA values
table <- table %>% mutate(across(where(is.character), ~na_if(., "n.a.")))
# Count the number of NAs in each row
table$na_count <- apply(table, 1, function(x) sum(is.na(x)))
# Show example table
table
         ID        Level Amount1 Amount2 Amount3 Amount4 na_count
1: Entity_A Individual_1     100    <NA>      20      10        1
2: Entity_A Individual_2     100      40    <NA>    <NA>        2
3: Entity_B Individual_1     120      30      40    <NA>        1
4: Entity_B Individual_2    <NA>      30    <NA>    <NA>        3

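For each entity (Entity A, Entity B, etc. in the "ID" column), I want to take the available values from the row with the most NAs (from the "Amount" columns) and merge that information into the corresponding row with the fewest NAs, provided there is actually something to merge. The resulting data frame would be: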

         ID        Level Amount1 Amount2 Amount3 Amount4
1: Entity_A Individual_1     100      40      20      10
2: Entity_B Individual_1     120      30      40    <NA>

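For example, for Entity A, Amount2 is not available in the first row (Individual_1, the row with the fewest NAs for Entity A), but it is available in the second row (Individual_2, the row with the most NAs for Entity A). So the code should fill the first row with what is available in the second row. For Entity B, row 4 has no additional information to merge, so the final row stays the same as row 3. Can anyone help?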

arrange the data by na_count, fill the NA values within each ID, and select the first row in each group.

library(dplyr)
library(tidyr)

table %>%
  arrange(ID, na_count) %>%
  group_by(ID) %>%
  fill(starts_with('Amount'), .direction = 'updown') %>%
  slice(1L) %>%
  ungroup %>% 
  dplyr::select(-na_count)

#  ID       Level        Amount1 Amount2 Amount3 Amount4
#  <chr>    <chr>        <chr>   <chr>   <chr>   <chr>  
#1 Entity_A Individual_1 100     40      20      10     
#2 Entity_B Individual_1 120     30      40      NA     
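
To see why the fill step is enough, here is a minimal sketch on a toy single-entity table (the `toy` and `filled` names are made up for illustration). It assumes the rows are already sorted so that the most complete row comes first. With `.direction = 'updown'`, tidyr fills upward first and then downward, so any NA in the first row is replaced by the nearest non-NA value below it in the same group.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy input: one entity, most complete row first
toy <- tibble(
  ID      = c("Entity_A", "Entity_A"),
  Amount2 = c(NA, "40"),
  Amount3 = c("20", NA)
)

# Fill within each ID: first upward, then downward
filled <- toy %>%
  group_by(ID) %>%
  fill(starts_with("Amount"), .direction = "updown") %>%
  ungroup()

filled
```

After the fill, every row of the group carries the combined values, which is why keeping only the first row with slice(1L) gives the merged record.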

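Since the input is a data.table, we can also use data.table methods. Note that this version converts the Amount columns to numeric for nafill and does not return the Level column: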

library(data.table)
table[order(na_count),lapply(.SD, function(x) 
   nafill(nafill(as.numeric(x), type = 'locf'), type = 'nocb')[1]),
      ID, .SDcols = startsWith(names(table), 'Amount')]
         ID Amount1 Amount2 Amount3 Amount4
1: Entity_A     100      40      20      10
2: Entity_B     120      30      40      NA
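
If the Level column and the original character type should be kept, one possible variant is sketched below. It is not part of the answer above: it swaps the double nafill for a small helper, `first_non_na` (a hypothetical name), that returns the first non-NA value of a column. After ordering by na_count this should give the same first-row result, because filling forward and then backward leaves the first element equal to the first non-NA value. The `dt`, `amount_cols` and `res` names are also assumptions made for this example.

```r
library(data.table)

# Same data as in the question, with "n.a." already replaced by NA
dt <- data.table(
  ID      = c("Entity_A", "Entity_A", "Entity_B", "Entity_B"),
  Level   = c("Individual_1", "Individual_2", "Individual_1", "Individual_2"),
  Amount1 = c("100", "100", "120", NA),
  Amount2 = c(NA, "40", "30", "30"),
  Amount3 = c("20", NA, "40", NA),
  Amount4 = c("10", NA, NA, NA)
)

amount_cols <- grep("^Amount", names(dt), value = TRUE)

# Count the NAs per row across the Amount columns
dt[, na_count := rowSums(is.na(.SD)), .SDcols = amount_cols]

# Hypothetical helper: first non-NA value, or NA if the column is all NA
first_non_na <- function(x) x[!is.na(x)][1]

# Most complete row first, then one combined row per ID
res <- dt[order(na_count),
          c(list(Level = Level[1]), lapply(.SD, first_non_na)),
          by = ID, .SDcols = amount_cols]

res
```

Here Level is taken from the row with the fewest NAs in each ID, and the Amount columns stay character.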