Replacing NA values in a data frame with the first or last value of another column within each group

I have the following data frame:

Group <- c("A", "A", "A", "B", "B", "B")
Dates <- c("01-01-2000", "02-01-2000", "03-01-2000", "01-05-2020", "02-05-2020", "03-05-2020")
Departure <- c("01-01-2000", "01-01-2000", "01-01-2000", NA, NA, NA)
Arrival <- c(NA, NA, NA, "03-02-2020", "03-02-2020", "03-02-2020")
# Group has to be included, otherwise the data frame has no grouping column
Dates <- data.frame(Group, Dates, Departure, Arrival)
Dates

 Group  Dates      Departure    Arrival
     A  01-01-2000 01-01-2000       <NA>
     A  02-01-2000 01-01-2000       <NA>
     A  03-01-2000 01-01-2000       <NA>
     B  01-05-2020       <NA> 03-02-2020
     B  02-05-2020       <NA> 03-02-2020
     B  03-05-2020       <NA> 03-02-2020

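This is what I want to do: within each group, replace the NA values in Departure with the first value of Dates, and the NA values in Arrival with the last value of Dates.

I would then get the following data frame: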

 Group  Dates      Departure    Arrival
     A  01-01-2000 01-01-2000   03-01-2000
     A  02-01-2000 01-01-2000   03-01-2000
     A  03-01-2000 01-01-2000   03-01-2000
     B  01-05-2020 01-05-2020   03-02-2020
     B  02-05-2020 01-05-2020   03-02-2020
     B  03-05-2020 01-05-2020   03-02-2020

I was thinking of using a combination of dplyr's if_else and group_by, but beyond that I am stuck. Any suggestions would be appreciated!

One option, after grouping by 'Group', is to use replace_na (from tidyr) to replace the NA elements with the first or last value of the 'Dates' column:

library(dplyr)
library(tidyr)
df1 %>% 
   group_by(Group) %>% 
   mutate(Departure = replace_na(Departure, first(Dates)), 
          Arrival = replace_na(Arrival, last(Dates))) %>% 
   ungroup

Note: here we assume that 'Dates' is already ordered. If it is not, convert the columns to Date class and take the min/max instead:

library(lubridate)
df1 %>% 
   mutate(across(-Group, dmy)) %>%
   group_by(Group) %>% 
   mutate(Departure = replace_na(Departure, min(Dates)), 
          Arrival = replace_na(Arrival, max(Dates))) %>% 
   ungroup

A data.table option:

library(data.table)
setDT(Dates)[
  ,
  .(
    Dates = Dates,
    Departure = replace(Departure, is.na(Departure), min(Dates)),
    Arrival = replace(Arrival, is.na(Arrival), max(Dates))
  ),
  Group
]

which gives

   Group      Dates  Departure    Arrival
1:     A 01-01-2000 01-01-2000 03-01-2000
2:     A 02-01-2000 01-01-2000 03-01-2000
3:     A 03-01-2000 01-01-2000 03-01-2000
4:     B 01-05-2020 01-05-2020 03-02-2020
5:     B 02-05-2020 01-05-2020 03-02-2020
6:     B 03-05-2020 01-05-2020 03-02-2020

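The OP has asked to replace NA values in a data.frame.

One of data.table's strengths is the ability to update by reference, i.e., to replace values without copying the whole dataset.

In addition, data.table's fcoalesce() function is used here together with Map().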

library(data.table)
cols <- c("Departure", "Arrival")
setDT(df_Dates)[, (cols) := Map(fcoalesce, .SD, Dates[c(1L, .N)]), .SDcols = cols, by = Group]
df_Dates
   Group      Dates  Departure    Arrival
1:     A 01-01-2000 01-01-2000 03-01-2000
2:     A 02-01-2000 01-01-2000 03-01-2000
3:     A 03-01-2000 01-01-2000 03-01-2000
4:     B 01-05-2020 01-05-2020 03-02-2020
5:     B 02-05-2020 01-05-2020 03-02-2020
6:     B 03-05-2020 01-05-2020 03-02-2020

Map() calls fcoalesce() for the first column, Departure, with the first value of Dates in each group, Dates[1L], and for the second column, Arrival, with the last value, Dates[.N].
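To make the pairing done by Map() concrete, here is a minimal base R sketch for a single group. The helper fill_na is a hypothetical stand-in for fcoalesce(), and the vectors are toy values, not the data from the question.

```r
# Hypothetical stand-in for fcoalesce(): fill NA entries of `col` with `value`
fill_na <- function(col, value) ifelse(is.na(col), value, col)

# One group's Dates and the two columns to fill (toy values)
dates <- c("01-01-2000", "02-01-2000", "03-01-2000")
cols <- list(
  Departure = c("01-01-2000", NA, NA),
  Arrival   = c(NA, NA, "05-01-2000")
)

# Map() walks both inputs in parallel:
# Departure is paired with the first date, Arrival with the last date
filled <- Map(fill_na, cols, dates[c(1L, length(dates))])
filled
```

Within a grouped data.table call, .SD plays the role of cols and Dates[c(1L, .N)] supplies the two fill values.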

Note that the original dataset has been changed in place, which can be verified by calling address() before and after the update.
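A minimal sketch of that check, assuming the object is already a data.table before the first address() call (the toy table dt and the names addr_before and addr_after are made up for illustration):

```r
library(data.table)

# Toy data.table with the same shape as the question's data
dt <- data.table(
  Group     = c("A", "A", "B", "B"),
  Dates     = c("01-01-2000", "02-01-2000", "01-05-2020", "02-05-2020"),
  Departure = c("01-01-2000", "01-01-2000", NA, NA)
)

addr_before <- address(dt)

# Update by reference: NA in Departure gets the group's first Dates value
dt[, Departure := fcoalesce(Departure, Dates[1L]), by = Group]

addr_after <- address(dt)

# Should be TRUE if the table was modified in place rather than copied
identical(addr_before, addr_after)
```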

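Using min(Dates) and max(Dates) instead of first(Dates) and last(Dates), or Dates[1L] and Dates[.N], respectively, may lead to unexpected results with other datasets: Dates is given as character strings in DD-MM-YYYY format, which sort by day of month first rather than chronologically.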
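A small base R illustration of this pitfall, using made-up dates that are not part of the question's data:

```r
# Hypothetical DD-MM-YYYY strings spanning several months
x <- c("15-01-2000", "02-03-2000", "20-02-2000")

# On character vectors, min() and max() compare strings,
# so the day-of-month prefix decides the result
min(x)  # "02-03-2000"
max(x)  # "20-02-2000"

# Converting to Date first compares actual calendar dates
d <- as.Date(x, format = "%d-%m-%Y")
format(min(d), "%d-%m-%Y")  # "15-01-2000"
format(max(d), "%d-%m-%Y")  # "02-03-2000"
```

In the question's data all dates within a group fall in the same month, which is why the string comparison happens to return the expected values there.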

Data

df_Dates <- data.frame(
  Group = c("A", "A", "A", "B", "B", "B"), 
  Dates = c("01-01-2000", "02-01-2000", "03-01-2000", "01-05-2020", "02-05-2020", "03-05-2020"), 
  Departure = c("01-01-2000", "01-01-2000", "01-01-2000", NA, NA, NA), 
  Arrival = c(NA, NA, NA, "03-02-2020", "03-02-2020", "03-02-2020"))