R: complete/expand a dataset with a new column added

My dataset looks like this:

(Viewing the dataset below may help you understand the problem.)

original <- data.frame(
  ID = c(rep("John", 3), "Steve"),
  A = c(rep(3, 3), 1),
  B = c(rep(4, 3), 2),
  b = c(2, 3, 2, 2),
  detail = c(rep("GOOOOD", 4))
)

The values in variables A, B and b are all integers. Variable b is incomplete in this dataset: it should take every value from 1 to B.

I need to add a new variable a to complete this dataset. The completed dataset would look like this:

completed1 <- data.frame(
  ID = c(rep("John", 12), rep("Steve", 2)),
  A = c(rep(3, 12), rep(1, 2)),
  a = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(1, 2)),
  B = c(rep(4, 12), rep(2, 2)),
  b = c(rep(1:4, 3), 1, 2),
  detail = c(NA, "GOOOOD", "GOOOOD", NA, NA, "GOOOOD", rep(NA, 7), "GOOOOD")
)

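The values in a are also integers, running from 1 to the value of A. The values of b are nested within each value of a, and the values of a are nested within each level of ID.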
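As a quick check of the nesting rule, the size of the completed dataset can be worked out before building it: each ID contributes A * B rows. The sketch below assumes A and B are constant within an ID, which holds for the toy data; `keys` and `expected_rows` are names introduced here for illustration.

```r
# Toy data copied from above.
original <- data.frame(
  ID = c(rep("John", 3), "Steve"),
  A = c(rep(3, 3), 1),
  B = c(rep(4, 3), 2),
  b = c(2, 3, 2, 2),
  detail = c(rep("GOOOOD", 4))
)

# One row per ID with its A and B (assumes both are constant within an ID).
keys <- unique(original[, c("ID", "A", "B")])

# John: 3 * 4 = 12 rows, Steve: 1 * 2 = 2 rows.
expected_rows <- sum(keys$A * keys$B)
expected_rows
```

This gives 14, the number of rows in completed1.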

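I think the most relevant functions for completing the dataset this way are tidyr::complete() and tidyr::expand(), but as far as I can tell they only complete combinations of values in existing variables and cannot add a new column (variable).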

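I know the difficulty is that, because of the nesting, there is more than one place where the existing detail values can be assigned relative to the newly added a. For example, the completed dataset could equally look like this: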

completed2 <- data.frame(
  ID = c(rep("John", 12), rep("Steve", 2)),
  A = c(rep(3, 12), rep(1, 2)),
  a = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(1, 2)),
  B = c(rep(4, 12), rep(2, 2)),
  b = c(rep(1:4, 3), 1, 2),
  detail = c(NA, "GOOOOD", rep(NA, 4), "GOOOOD", NA, NA, "GOOOOD", rep(NA, 3), "GOOOOD")
)

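The position of the detail values in the completed dataset does not matter to me. My actual dataset has more than 40,000 rows, so I really need something that automates this.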

Is this possible? Many thanks!

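I wonder whether you could run complete twice, first for a and then for b. You can adjust the nesting, or the group_by, as needed.

Depending on whether the maximum of a should come from A within each ID group, you may need to adjust or remove the group_by (and likewise for b within each a group).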

library(dplyr)
library(tidyr)

original %>%
  dplyr::mutate(a = 1) %>%
  dplyr::group_by( ID ) %>%
  tidyr::complete( a = 1:max(A), nesting(ID, A, B, b), fill = list( detail = NA_character_)) %>%
  group_by( a ) %>%
  tidyr::complete( b = 1:max(B), nesting(ID, A, B, a), fill = list( detail = NA_character_)) %>%
  dplyr::ungroup()

This is rather messy with a for loop, and it places the GOOOOD values at random positions:

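library(dplyr)
library(tidyr)
library(tibble)  # for rownames_to_column()

# Full skeleton: every a in 1..A and every b in 1..B within each ID
comp_dummy <- original %>%
  group_by(ID) %>%
  expand(A = A, a = seq_len(max(A)), B = B, b = seq_len(max(B))) %>%
  ungroup()

# Number of observed rows per (ID, A, B, b) combination
counts <- original %>%
  group_by(ID, A, B, b) %>%
  summarise(n = n(), .groups = "drop")

vec <- rep(NA_character_, nrow(comp_dummy))

for (i in seq_len(nrow(counts))) {
  x <- counts[i, ]

  # Skeleton rows that share this combination (one candidate per value of a)
  y <- comp_dummy %>%
    rownames_to_column("row") %>%
    filter(ID == x$ID, A == x$A, B == x$B, b == x$b) %>%
    pull(row)
  # Pick n of the candidates at random
  z <- as.numeric(sample(y, x$n, replace = FALSE))
  print(z)
  vec[z] <- "GOOOOD"
}

comp_dummy$detail <- vec
comp_dummy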

   ID        A     a     B     b detail
   <chr> <dbl> <int> <dbl> <int> <chr> 
 1 John      3     1     4     1 NA    
 2 John      3     1     4     2 GOOOOD
 3 John      3     1     4     3 NA    
 4 John      3     1     4     4 NA    
 5 John      3     2     4     1 NA    
 6 John      3     2     4     2 NA    
 7 John      3     2     4     3 NA    
 8 John      3     2     4     4 NA    
 9 John      3     3     4     1 NA    
10 John      3     3     4     2 GOOOOD
11 John      3     3     4     3 GOOOOD
12 John      3     3     4     4 NA    
13 Steve     1     1     2     1 NA    
14 Steve     1     1     2     2 GOOOOD

A base R solution:

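do.call(
  rbind,
  by(original, list(original$ID), function(x) {
    # Grid of every (a, b) pair for this ID
    grid <- setNames(
      expand.grid(
        unique(x$ID),
        x$A[1],
        1:max(x$A),
        x$B[1],
        1:max(x$B)
      ),
      c("ID", "A", "a", "B", "b")
    )
    # Full outer join of the distinct observed rows onto the grid
    tmp <- merge(unique(x), grid, by = c("ID", "A", "B", "b"), all = TRUE)
    tmp[order(tmp$a, tmp$b), c("ID", "A", "a", "B", "b", "detail")]
  })
)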

which results in

           ID A a B b detail
John.1   John 3 1 4 1   <NA>
John.5   John 3 1 4 2 GOOOOD
John.8   John 3 1 4 3 GOOOOD
John.11  John 3 1 4 4   <NA>
John.2   John 3 2 4 1   <NA>
John.4   John 3 2 4 2 GOOOOD
John.9   John 3 2 4 3 GOOOOD
John.12  John 3 2 4 4   <NA>
John.3   John 3 3 4 1   <NA>
John.6   John 3 3 4 2 GOOOOD
John.7   John 3 3 4 3 GOOOOD
John.10  John 3 3 4 4   <NA>
Steve.1 Steve 1 1 2 1   <NA>
Steve.2 Steve 1 1 2 2 GOOOOD
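Whichever approach is used, it may be worth checking the result against the original data. The helper below is hypothetical (`check_completed` is not from any package or from the answers above). It tests three things that seem to follow from the question: the row count equals the sum of A * B over IDs, each (ID, a, b) triple appears once, and the number of non-missing detail values equals the number of rows in original.

```r
original <- data.frame(
  ID = c(rep("John", 3), "Steve"),
  A = c(rep(3, 3), 1),
  B = c(rep(4, 3), 2),
  b = c(2, 3, 2, 2),
  detail = c(rep("GOOOOD", 4))
)

# Target dataset from the question.
completed1 <- data.frame(
  ID = c(rep("John", 12), rep("Steve", 2)),
  A = c(rep(3, 12), rep(1, 2)),
  a = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(1, 2)),
  B = c(rep(4, 12), rep(2, 2)),
  b = c(rep(1:4, 3), 1, 2),
  detail = c(NA, "GOOOOD", "GOOOOD", NA, NA, "GOOOOD", rep(NA, 7), "GOOOOD")
)

# Hypothetical helper: TRUE when `completed` looks like a valid completion.
check_completed <- function(completed, original) {
  keys <- unique(original[, c("ID", "A", "B")])
  nrow(completed) == sum(keys$A * keys$B) &&
    anyDuplicated(completed[, c("ID", "a", "b")]) == 0 &&
    sum(!is.na(completed$detail)) == nrow(original)
}

check_completed(completed1, original)

# The detail column as printed in the merge() result above.
merged <- completed1
merged$detail <- c(rep(c(NA, "GOOOOD", "GOOOOD", NA), 3), NA, "GOOOOD")
check_completed(merged, original)
```

The first call returns TRUE and the second FALSE: the printed merge() result carries seven GOOOOD values against four rows in original, because each observed (ID, b) pair is matched to every value of a. If each original row should appear only once, that output needs an extra step.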