Summarizing data with na.rm = TRUE
Consider the following example, which uses dplyr's summarise in a pipeline to summarize a data frame and find the minimum DATE associated with each CHAR:
library('tidyverse')
library('lubridate')
temp <- data.frame(
CHAR = c(
'A',
'B',
'C'
),
DATE = c(
'20090101',
'20100101',
NA
) %>% ymd(), # Turn character strings to dates
stringsAsFactors = FALSE
) %>% group_by(
CHAR
) %>% summarise(
DATE = min(DATE, na.rm = TRUE) # Extract minimum date
) %>% ungroup()
Using is.na, I test whether the minimum is NA:
temp %>% mutate(
DATE_lgl = DATE %>% is.na() # Identify dates that are missing/NA
)
The output:
# A tibble: 3 x 3
CHAR DATE DATE_lgl
<chr> <date> <lgl>
1 A 2009-01-01 FALSE
2 B 2010-01-01 FALSE
3 C NA FALSE
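This incorrectly shows DATE_lgl as FALSE where DATE is NA. Why is this?
Removing na.rm = TRUE fixes this case, but that is not an option for the following configuration, where na.rm = TRUE is needed to drop the missing entry: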
temp <- data.frame(
CHAR = c(
'A',
'B',
'C',
'C'
),
DATE = c(
'20090101',
'20100101',
NA,
'20110101'
) %>% ymd(), # Turn character strings to dates
stringsAsFactors = FALSE
) %>% group_by(
CHAR
) %>% summarise(
DATE = min(DATE, na.rm = TRUE) # Extract minimum date
) %>% ungroup()
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=English_Canada.1252 LC_CTYPE=English_Canada.1252 LC_MONETARY=English_Canada.1252
[4] LC_NUMERIC=C LC_TIME=English_Canada.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2.2 lubridate_1.7.4 forcats_0.3.0 stringr_1.3.1 dplyr_0.7.5 purrr_0.2.5
[7] readr_1.1.1 tidyr_0.8.1 tibble_1.4.2 ggplot2_2.2.1 tidyverse_1.2.1
loaded via a namespace (and not attached):
[1] Rcpp_0.12.17 cellranger_1.1.0 pillar_1.2.3 compiler_3.5.0 plyr_1.8.4 bindr_0.1.1
[7] tools_3.5.0 jsonlite_1.5 nlme_3.1-137 gtable_0.2.0 lattice_0.20-35 pkgconfig_2.0.1
[13] rlang_0.2.1 psych_1.8.4 cli_1.0.0 rstudioapi_0.7 yaml_2.1.19 parallel_3.5.0
[19] haven_1.1.1 xml2_1.2.0 httr_1.3.1 hms_0.4.2 grid_3.5.0 tidyselect_0.2.4
[25] glue_1.2.0 R6_2.2.2 readxl_1.1.0 foreign_0.8-70 modelr_0.1.2 reshape2_1.4.3
[31] magrittr_1.5 scales_0.5.0 rvest_0.3.2 assertthat_0.2.0 mnormt_1.5-5 colorspace_1.3-2
[37] utf8_1.1.4 stringi_1.1.7 lazyeval_0.2.1 munsell_0.4.3 broom_0.4.4 crayon_1.3.4
The problem is that you are evaluating
min(NA, na.rm=TRUE)
# Inf
for row 3. Once the only value is removed nothing is left, so min returns Inf and the row becomes
dput(temp$DATE[3])
# structure(Inf, class = "Date")
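The same behaviour can be reproduced without dplyr. The snippet below is a minimal base R sketch, assuming a single all-NA Date value that stands in for group C; the variable names d and m are only illustrative.

```r
# Base R only; mirrors group C, whose single DATE value is NA.
d <- as.Date(NA)

# After na.rm drops the NA nothing is left, so min() warns and returns Inf,
# and the result keeps the Date class.
m <- suppressWarnings(min(d, na.rm = TRUE))

class(m)        # "Date"
unclass(m)      # Inf
is.na(m)        # FALSE: Inf is not a missing value
is.infinite(m)  # TRUE
```

This is why is.na reports FALSE even though the value prints as NA.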
Add is.finite to your mutate:
temp %>%
mutate(DATE_lgl = is.finite(DATE) | is.na(DATE)) # FALSE only for the infinite date produced by min()
# A tibble: 3 x 3
# CHAR DATE DATE_lgl
# <chr> <date> <lgl>
# 1 A 2009-01-01 TRUE
# 2 B 2010-01-01 TRUE
# 3 C NA FALSE
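If the goal is for is.na to work afterwards, one option is to turn the infinite dates back into real NA values. This is a sketch on a standalone vector rather than part of the original answer; the vector dates is made up to match the three summarised rows.

```r
# Three dates matching the summarised rows: two real dates and one Inf.
dates <- as.Date(c(14245, 14610, Inf), origin = "1970-01-01")

# Replace the infinite entry with a genuine missing value.
dates[is.infinite(dates)] <- NA

is.na(dates)  # FALSE FALSE TRUE
```

Inside the pipeline the same idea could be written as a mutate step, for example with if_else(is.infinite(DATE), as.Date(NA), DATE), though that variant is untested here.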
That it prints as NA is probably a printing limitation of the Date class:
as.Date(Inf, origin="1970-01-01")
# NA
dput(as.Date(Inf, origin="1970-01-01"))
# structure(Inf, class = "Date")
A workaround is to convert the Date column to character and then test whether it is NA.
temp %>% mutate(
DATE_lgl = is.na(as.character(DATE))
)
# A tibble: 3 x 3
# CHAR DATE DATE_lgl
# <chr> <date> <lgl>
# 1 A 2009-01-01 FALSE
# 2 B 2010-01-01 FALSE
# 3 C NA TRUE
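Another way to sidestep the issue is to avoid creating Inf in the first place by checking whether a group has any non-missing dates before calling min. The helper below is a hypothetical sketch, not part of dplyr or lubridate; safe_min_date is an invented name.

```r
# Hypothetical helper: returns a missing Date when every value is NA,
# otherwise the minimum of the non-missing values.
safe_min_date <- function(x) {
  if (all(is.na(x))) as.Date(NA) else min(x, na.rm = TRUE)
}

# Group C in the first example: only an NA.
only_na <- as.Date(NA)
# Group C in the second example: an NA plus one real date.
mixed <- as.Date(c(NA, "2011-01-01"))

is.na(safe_min_date(only_na))  # TRUE
safe_min_date(mixed)           # "2011-01-01"
```

In the pipeline this would replace the min call, as in summarise(DATE = safe_min_date(DATE)), so that is.na behaves as expected in both configurations.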