Merge row data in R when rows contain many NAs
I have the following data table:
library(data.table)
library(dplyr)  # provides %>%, mutate(), across() and na_if()
# Example table
table <- data.table(
  ID      = c("Entity_A", "Entity_A", "Entity_B", "Entity_B"),
  Level   = c("Individual_1", "Individual_2", "Individual_1", "Individual_2"),
  Amount1 = c("100", "100", "120", "n.a."),
  Amount2 = c("n.a.", "40", "30", "30"),
  Amount3 = c("20", "n.a.", "40", "n.a."),
  Amount4 = c("10", "n.a.", "n.a.", "n.a.")
)
# Convert the "n.a." strings into real NA values
table <- table %>% mutate(across(where(is.character), ~ na_if(., "n.a.")))
# Count the NAs in each row
table$na_count <- apply(table, 1, function(x) sum(is.na(x)))
# Show example table
table
         ID        Level Amount1 Amount2 Amount3 Amount4 na_count
1: Entity_A Individual_1     100    <NA>      20      10        1
2: Entity_A Individual_2     100      40    <NA>    <NA>        2
3: Entity_B Individual_1     120      30      40    <NA>        1
4: Entity_B Individual_2    <NA>      30    <NA>    <NA>        3
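For each entity (Entity A, Entity B, etc. in the "ID" column), I would like to take the available values from the row with the most NAs and merge them into the corresponding row with the fewest NAs (if there is in fact anything to merge).
The resulting data frame would be: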
         ID        Level Amount1 Amount2 Amount3 Amount4
1: Entity_A Individual_1     100      40      20      10
2: Entity_B Individual_1     120      30      40    <NA>
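For example, for Entity A, Amount2 is NA in the first row (Individual_1, the row with the fewest NAs for Entity A), but it is available in the second row (Individual_2, the row with the most NAs for Entity A). So the code should fill the first row with what is available in the second row. For Entity B, row 4 has no additional information to merge, so the final row would stay the same as row 3.
Can anyone help?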
arrange the data by na_count, fill the NA values within each ID, and select the first row in each group.
library(dplyr)
library(tidyr)
table %>%
  arrange(ID, na_count) %>%
  group_by(ID) %>%
  fill(starts_with('Amount'), .direction = 'updown') %>%
  slice(1L) %>%
  ungroup() %>%
  dplyr::select(-na_count)
# ID Level Amount1 Amount2 Amount3 Amount4
# <chr> <chr> <chr> <chr> <chr> <chr>
#1 Entity_A Individual_1 100 40 20 10
#2 Entity_B Individual_1 120 30 40 NA
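To make the fill step easier to follow, here is a minimal sketch on a single group. The `entity_a`, `filled` and `first_row` names are made up for illustration, and the data is a cut-down copy of the Entity_A rows, already sorted so that the row with the fewest NAs comes first.

```r
library(dplyr)
library(tidyr)

# Hypothetical one-group example: Entity_A only, fewest NAs first
entity_a <- tibble(
  ID      = c("Entity_A", "Entity_A"),
  Level   = c("Individual_1", "Individual_2"),
  Amount2 = c(NA, "40"),
  Amount3 = c("20", NA)
)

# 'updown' fills downwards first and then upwards, so every Amount column
# ends up populated wherever the group has at least one non-NA value
filled <- entity_a %>%
  fill(starts_with("Amount"), .direction = "updown")

# The first row keeps its own Level and picks up the values it was missing
first_row <- filled %>% slice(1L)
first_row
```

Because the sort puts the most complete row first, slice(1L) keeps that row's Level while the fill supplies the missing amounts from the other rows of the group.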
Since the input is a data.table, we can also use data.table methods
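library(data.table)
# Order by na_count, then within each ID fill every Amount column forward
# (locf) and backward (nocb) and keep the first value.
# nafill() needs numeric input, hence as.numeric(); Level is not returned here.
table[order(na_count),
      lapply(.SD, function(x)
        nafill(nafill(as.numeric(x), type = 'locf'), type = 'nocb')[1]),
      ID, .SDcols = startsWith(names(table), 'Amount')]
         ID Amount1 Amount2 Amount3 Amount4
1: Entity_A     100      40      20      10
2: Entity_B     120      30      40      NA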
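If the Level column and the original character type should be kept, one possible variant is to take the first non-NA value per column after ordering. This is only a sketch, not part of the answer above: `dt`, `amount_cols` and `first_non_na` are hypothetical names, and the example data is rebuilt with real NA values so the block runs on its own.

```r
library(data.table)

# Example data with real NAs (same values as in the question)
dt <- data.table(
  ID      = c("Entity_A", "Entity_A", "Entity_B", "Entity_B"),
  Level   = c("Individual_1", "Individual_2", "Individual_1", "Individual_2"),
  Amount1 = c("100", "100", "120", NA),
  Amount2 = c(NA, "40", "30", "30"),
  Amount3 = c("20", NA, "40", NA),
  Amount4 = c("10", NA, NA, NA)
)

amount_cols <- grep("^Amount", names(dt), value = TRUE)

# Count NAs per row across the Amount columns only
dt[, na_count := rowSums(is.na(.SD)), .SDcols = amount_cols]

# Hypothetical helper: first non-NA element, or NA if there is none
first_non_na <- function(x) x[!is.na(x)][1]

# Order so the most complete row comes first, then per ID keep that row's
# Level and the first available value of every Amount column
result <- dt[order(na_count),
             c(list(Level = Level[1]), lapply(.SD, first_non_na)),
             by = ID, .SDcols = amount_cols]
result
```

Taking the first non-NA value after ordering gives the same amounts as the forward and backward fill, without the as.numeric() conversion.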