R - Using Duplicate Keys to Replace NA's with Dates
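I have hundreds of duplicated primary keys, each with associated dates. The dates for a key may or may not have missing entries, but any entry that is missing needs to be replaced with max(date) for that key.
# Create proxy data frame
library(tibble)
df <- tibble(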
key = c("a", "a", "b", "b", "c", "c", "d", "d", "e", "e", "f", "f", "h", "h", "i","i", "j", "j", "k", "k", "l", "l", "m", "m"),
date1 = c("NA", "2017-02-13", "NA", "2017-04-14", "2017-05-18", "2017-05-18", "NA", "2018-01-07",
"2017-09-24", "2017-09-25", "NA", "2017-09-29", "NA", "2017-08-13", "NA", "2017-04-29",
"NA", "2018-01-28", "NA", "2017-10-08", "NA", "2017-01-10", "NA", "2017-11-01")
)
df$date1 <- as.Date(df$date1, format = "%Y-%m-%d")
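Notes
- key "a" is missing a date, which needs to be replaced with the only available date
- key "c" has no missing dates
- key "e" has two different dates, but the last date needs to be recorded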
df
# A tibble: 24 x 2
key date1
<chr> <date>
1 a NA
2 a 2017-02-13
3 b NA
4 b 2017-04-14
5 c 2017-05-18
6 c 2017-05-18
7 d NA
8 d 2018-01-07
9 e 2017-09-24
10 e 2017-09-25
# ... with 14 more rows
Solutions I have tried that did not work:
library(lubridate)
df$date <- with(df$date, as.Date(ifelse(is.na(df$date), orderDate, df$date), origin = "1970-01-01"))
library(dplyr)
df %>% group_by(key) %>%
mutate(date = (date, NA, df$date)) %>%
as.data.frame
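Any help would be greatly appreciated! Thanks!
Assuming you only want to replace date1 with the max() value of its group when date1 is NA, this will work. Note that you need to specify na.rm = TRUE, because max(NA, 1) returns NA, not 1.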
library(dplyr)
df %>% group_by(key) %>%
mutate(Date = case_when(
is.na(date1) ~ max(date1, na.rm = TRUE),
TRUE ~ date1)
)
# A tibble: 24 x 3
# Groups: key [12]
key date1 Date
<chr> <date> <date>
1 a NA 2017-02-13
2 a 2017-02-13 2017-02-13
3 b NA 2017-04-14
4 b 2017-04-14 2017-04-14
5 c 2017-05-18 2017-05-18
6 c 2017-05-18 2017-05-18
7 d NA 2018-01-07
8 d 2018-01-07 2018-01-07
9 e 2017-09-24 2017-09-24
10 e 2017-09-25 2017-09-25
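To make the na.rm point concrete, here is a minimal base R sketch on a two-element Date vector. The helper max_or_na() is a hypothetical name, not part of the answer above; it is only there to illustrate one way to handle a key whose dates are all missing, a case where, as far as I know, max(..., na.rm = TRUE) warns and yields an infinite value instead of NA.

```r
# A key with one missing and one known date, like key "a" in the question
x <- as.Date(c(NA, "2017-02-13"))

max(x)                # NA, because the missing value propagates
max(x, na.rm = TRUE)  # "2017-02-13"

# Hypothetical helper: return NA when a group has no known date at all,
# otherwise the latest known date.
max_or_na <- function(d) {
  if (all(is.na(d))) as.Date(NA) else max(d, na.rm = TRUE)
}

max_or_na(x)                     # latest known date
max_or_na(as.Date(c(NA, NA)))    # NA, without a warning
```

The sample data in the question always has at least one known date per key, so the guard only matters if real data contains keys with nothing but NA.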
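There is an alternative approach which is much faster than the dplyr version. It uses data.table and an update during a join to replace the NA values with max(date1) for each key group: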
library(data.table)
DT <- as.data.table(df)
tmp <- DT[, .(date1 = as.Date(NA), max(date1, na.rm = TRUE)), by = key]
DT[tmp, on = .(key, date1), date1 := V2][]
key date1
1: a 2017-02-13
2: a 2017-02-13
3: b 2017-04-14
4: b 2017-04-14
5: c 2017-05-18
6: c 2017-05-18
7: d 2018-01-07
8: d 2018-01-07
9: e 2017-09-24
10: e 2017-09-25
11: f 2017-09-29
12: f 2017-09-29
13: h 2017-08-13
14: h 2017-08-13
15: i 2017-04-29
16: i 2017-04-29
17: j 2018-01-28
18: j 2018-01-28
19: k 2017-10-08
20: k 2017-10-08
21: l 2017-01-10
22: l 2017-01-10
23: m 2017-11-01
24: m 2017-11-01
key date1
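Note that only the rows where date1 is NA are replaced, and they are replaced in place, i.e., without copying the whole data object.
tmp contains the replacement values for each key group: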
key date1 V2
1: a <NA> 2017-02-13
2: b <NA> 2017-04-14
3: c <NA> 2017-05-18
4: d <NA> 2018-01-07
5: e <NA> 2017-09-25
6: f <NA> 2017-09-29
7: h <NA> 2017-08-13
8: i <NA> 2017-04-29
9: j <NA> 2018-01-28
10: k <NA> 2017-10-08
11: l <NA> 2017-01-10
12: m <NA> 2017-11-01
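The same update join can be written with an explicitly named replacement column, which may be easier to read than the auto-generated V2. This is only a sketch on a four-row subset of the data; max_date is a name assumed here for illustration.

```r
library(data.table)

# Small subset: key "a" has a missing date, key "e" has none
DT <- data.table(
  key   = c("a", "a", "e", "e"),
  date1 = as.Date(c(NA, "2017-02-13", "2017-09-24", "2017-09-25"))
)

# One row per key: an NA date to join on, plus the group's latest date.
# max_date is a hypothetical column name used instead of V2.
tmp <- DT[, .(date1 = as.Date(NA), max_date = max(date1, na.rm = TRUE)), by = key]

# Only rows of DT whose (key, date1) pair matches a row of tmp,
# i.e. rows with an NA date, are updated by reference.
DT[tmp, on = .(key, date1), date1 := max_date]
print(DT)
```

Because the join condition includes date1 with an NA value, rows that already have a date never match, so both dates of key "e" are left untouched.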
Benchmark
Create the benchmark data:
library(dplyr)
library(data.table)
n_row <- 1e5L
n_key <- 500L
share_na <- 0.5
set.seed(123L)
DT0 <- data.table(
key1 = sprintf("%04i", sample.int(n_key, n_row, TRUE)),
date1 = as.Date("2017-01-01") + sample.int(n_key, n_row, TRUE)
)
# set NA values
DT0[sample.int(n_row, share_na * n_row), date1 := NA]
# coerce to tibble
df0 <- as_tibble(DT0)
Run the benchmark:
library(microbenchmark)
bm <- microbenchmark(
dplyr = {
copy(df0) %>% group_by(key1) %>%
mutate(date1 = case_when(
is.na(date1) ~ max(date1, na.rm = TRUE),
TRUE ~ date1)
)
},
dt = {
DT <- copy(DT0)
tmp <- DT[, .(date1 = as.Date(NA), max(date1, na.rm = TRUE)), by = key1]
DT[tmp, on = .(key1, date1), date1 := V2][]
},
times = 21L
)
print(bm)
Unit: milliseconds
expr min lq mean median uq max neval cld
dplyr 131.02040 136.81967 142.63845 137.78741 141.36084 191.37755 21 b
dt 18.14997 18.68349 19.65384 19.32424 19.54815 26.87965 21 a
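For the given problem size of 100,000 rows, 500 groups, and 50% NA values, the data.table approach is about 7 times faster than the dplyr version.
Note that fresh copies of DT0 and df0 are used for each repetition because DT is updated in place. The calls to copy() are included in the timings in both cases. The dplyr version has been modified to update date1 rather than creating a third column in the output.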
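Why copy() is needed in the benchmark can be seen in a small sketch. The object names alias and snapshot are made up for this illustration: plain assignment of a data.table does not copy it, so an update with := through either name changes the same underlying table.

```r
library(data.table)

DT0 <- data.table(x = c(1, NA))

alias    <- DT0        # no copy: both names refer to the same table
snapshot <- copy(DT0)  # independent copy of the data

# Update by reference through the alias
alias[is.na(x), x := 0]

anyNA(DT0$x)       # FALSE: DT0 was modified through alias
anyNA(snapshot$x)  # TRUE: the copy still holds the original NA
```

Without the copy() inside the timed expression, only the first repetition would find any NA values to replace, and the later repetitions would measure a different workload.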