Adding specified rows for missing data

Question

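I want to insert 'NA' values for specific plots, times, and dates; the gaps occur at random positions. I figured out how to do this manually with the add_row function, but my main concern is that I have a large amount of data, so doing it by hand is not practical. My data is in this format: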

Plot Date Time Canopyheight
B1 10/22/2019 22 50
B1 10/22/2019 1 80
B1 10/22/2019 4 9

So each plot has 4 timestamps (22, 1, 4, and 6), and sometimes a timestamp is missing, e.g. B1 10/22/2019 6 NA. I can add such rows with the code below:

  add_row(agg, Date = '10/21/2019', Plot = 'BG107B2', Time = 22,
          Canopyheight = NA, .before = 1)

But I have several dates and plots that need rows added. I tried the following code:

test <- agg %>%
  mutate(ID2 = as.integer(factor(Plot, levels = unique(.$Plot)))) %>%
  split(f = .$ID2) %>%
  map_if(.p = function(x) unique(x$ID2) != unique(last(.)$ID2),
         ~bind_rows(.x, tibble(Time = unique(.x$Time), Canopyheight = NA,
                               ID2 = unique(.x$ID2)))) %>%
  bind_rows() %>%
  select(-ID2)

However, I still can't get it to work. Is there a way to automate this instead of doing it manually?

Thanks, and have a nice day.

Answer 1

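One approach is to do a full join against the expected Plot/Date/Time combinations. This naturally introduces NA into the remaining columns. For example: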

library(dplyr)
library(tidyr)
agg <- read.table(header=TRUE, stringsAsFactors=FALSE, text="
Plot Date Time Canopyheight
B1 10/22/2019 22 50
B1 10/22/2019 1 80
B1 10/22/2019 4 9")
distinct(agg, Plot, Date) %>%
  crossing(Time = c(22L, 1L, 4L, 6L)) %>%
  full_join(agg, ., by = c("Plot", "Date", "Time"))
#   Plot       Date Time Canopyheight
# 1   B1 10/22/2019   22           50
# 2   B1 10/22/2019    1           80
# 3   B1 10/22/2019    4            9
# 4   B1 10/22/2019    6           NA

The first two lines of the pipe just produce every Plot/Date combination you expect to have times for, crossed (using tidyr::crossing) with every expected Time:

distinct(agg, Plot, Date) %>%
  crossing(Time = c(22L, 1L, 4L, 6L))
# # A tibble: 4 x 3
#   Plot  Date        Time
#   <chr> <chr>      <int>
# 1 B1    10/22/2019     1
# 2 B1    10/22/2019     4
# 3 B1    10/22/2019     6
# 4 B1    10/22/2019    22
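
To check which rows are actually missing before filling them in, the same expected grid can be passed to anti_join. This is only a minimal sketch: the second plot (B2) and the name missing_rows are made up for illustration, and it assumes the four expected times are always 22, 1, 4, and 6.

```r
library(dplyr)
library(tidyr)

# Toy data in the question's format; plot B2 is a hypothetical addition.
agg <- data.frame(
  Plot = c("B1", "B1", "B1", "B2", "B2", "B2"),
  Date = "10/22/2019",
  Time = c(22L, 1L, 4L, 22L, 6L, 4L),
  Canopyheight = c(50L, 80L, 9L, 50L, 80L, 9L),
  stringsAsFactors = FALSE
)

# Every Plot/Date pair that occurs in the data, crossed with the expected times.
expected <- distinct(agg, Plot, Date) %>%
  crossing(Time = c(22L, 1L, 4L, 6L))

# Rows of `expected` that have no match in `agg`, i.e. the missing readings.
missing_rows <- anti_join(expected, agg, by = c("Plot", "Date", "Time"))

# The full join appends those rows to the data with Canopyheight = NA.
filled <- full_join(agg, expected, by = c("Plot", "Date", "Time"))
```

Here missing_rows should hold one row per gap (Time 6 for B1 and Time 1 for B2), which is a quick sanity check before trusting the filled table.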

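If you haven't worked with them before, the concepts of join and merge on datasets may not be intuitive, and I recommend reading more about them elsewhere. If you plan to work with SQL databases, then (in my opinion) this becomes an even more important skill to develop. Here are some useful references (not all about R, but the concepts still apply):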

What is the difference between Left, Right, Outer and Inner Joins?
What is the difference between "INNER JOIN" and "OUTER JOIN"?
How to join (merge) data frames (inner, outer, left, right)
https://www.shanelynn.ie/merge-join-dataframes-python-pandas-index-1/
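
As a rough illustration of how the join types differ, here is a small sketch with dplyr. The tables readings and expected are hypothetical toy data, not the asker's.

```r
library(dplyr)

# Hypothetical toy tables: `readings` has no row for Time 6.
readings <- data.frame(Time = c(22L, 1L, 4L), Canopyheight = c(50L, 80L, 9L))
expected <- data.frame(Time = c(22L, 1L, 4L, 6L))

# inner_join keeps only keys present in both tables (Time 6 is dropped).
inner <- inner_join(readings, expected, by = "Time")

# left_join keeps every row of the left table, so the direction matters.
left_readings <- left_join(readings, expected, by = "Time")
left_expected <- left_join(expected, readings, by = "Time")

# full_join keeps every key from either table; Time 6 gets Canopyheight = NA.
full <- full_join(readings, expected, by = "Time")
```

Only the joins that keep all rows of expected (left_expected and full) end up with the Time 6 row, which is why a full join works for filling gaps.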

Answer 2

We can use complete from tidyr to fill in the missing Time combinations for each Plot (df is defined at the end of this answer).

tidyr::complete(df, Plot, Date, Time = c(22, 1, 4, 6))

#  Plot  Date        Time Canopyheight
#  <fct> <fct>      <dbl>        <int>
#1 B1    10/22/2019     1           80
#2 B1    10/22/2019     4            9
#3 B1    10/22/2019     6           NA
#4 B1    10/22/2019    22           50
#5 B2    10/22/2019     1           NA
#6 B2    10/22/2019     4            9
#7 B2    10/22/2019     6           80
#8 B2    10/22/2019    22           50
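
One caveat: listing Plot and Date separately makes complete expand every Plot by every Date, including pairs that were never measured. If that is not wanted, wrapping them in tidyr::nesting restricts the expansion to pairs that occur in the data. A minimal sketch, where df2 is made-up data with B1 measured on two dates and B2 on only one:

```r
library(tidyr)

# Hypothetical data: B1 appears on two dates, B2 on one.
df2 <- data.frame(
  Plot = c("B1", "B1", "B2"),
  Date = c("10/21/2019", "10/22/2019", "10/22/2019"),
  Time = c(22, 1, 4),
  Canopyheight = c(50L, 80L, 9L),
  stringsAsFactors = FALSE
)

# Without nesting(): 2 plots x 2 dates x 4 times = 16 rows,
# including B2 on 10/21/2019, which was never measured.
all_combos <- complete(df2, Plot, Date, Time = c(22, 1, 4, 6))

# With nesting(): only the 3 observed Plot/Date pairs x 4 times = 12 rows.
observed_only <- complete(df2, nesting(Plot, Date), Time = c(22, 1, 4, 6))
```

Which of the two is right depends on whether a plot with no readings at all on a given date should still get NA rows.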

Data

One more Plot is included to test the solution.

df <- structure(list(Plot = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("B1", 
"B2"), class = "factor"), Date = structure(c(1L, 1L, 1L, 1L, 
1L, 1L), .Label = "10/22/2019", class = "factor"), Time = c(22L, 
1L, 4L, 22L, 6L, 4L), Canopyheight = c(50L, 80L, 9L, 50L, 80L, 
9L)), class = "data.frame", row.names = c(NA, -6L))
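
If tidyr is not available, roughly the same result can be sketched in base R with merge. This is an alternative I am adding for illustration, not part of the approach above; the data is rebuilt with character columns so the block runs on its own, and pairs, expected, and filled are illustrative names.

```r
# Same values as df above, rebuilt with character columns for a standalone example.
df <- data.frame(
  Plot = c("B1", "B1", "B1", "B2", "B2", "B2"),
  Date = "10/22/2019",
  Time = c(22L, 1L, 4L, 22L, 6L, 4L),
  Canopyheight = c(50L, 80L, 9L, 50L, 80L, 9L),
  stringsAsFactors = FALSE
)

# Observed Plot/Date pairs.
pairs <- unique(df[c("Plot", "Date")])

# merge() on tables with no common columns returns their Cartesian product,
# giving every observed pair combined with every expected time.
expected <- merge(pairs, data.frame(Time = c(22L, 1L, 4L, 6L)))

# Keep all expected rows; readings that do not exist become NA.
filled <- merge(expected, df, all.x = TRUE)
```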
