Programmatically factorize selected columns in a data frame, the tidy way?

Question

Here is a simplified example:

library(tidyverse)

frame <- tribble(
  ~a, ~b, ~c,
   1,  1,  2,
   5,  4,  7,
   2,  3,  4, 
   3,  1,  6
)

key <- tribble(
  ~col, ~name, ~type, ~labels,
     1,   "a",   "f",     c("one", "two", "three", "four", "five"),
     2,   "b",   "f",     c("uno", "dos", "tres", "cuatro"),
     3,   "c",   "f",     1:7
)

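Is there an elegant way to programmatically sweep over the columns in frame and convert each one to a factor with its own levels and labels, based on the parameters in key? The expected result would be: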

# A tibble: 4 x 3
       a      b      c
  <fctr> <fctr> <fctr>
1    one    uno      2
2   five cuatro      7
3    two   tres      4
4  three    uno      6

My best solution so far uses purrr's map2(), but IMO the assignment is not the most elegant:

frame[key$col] <- map2(key$col, key$labels, 
        function(x, y) factor(frame[[x]], levels = 1:length(y), labels = y))

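Does anyone have a tidier solution? Note that my original data frame has hundreds of columns, and I need to convert most of them to factors with different levels/labels, so the process must be automated.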

Answer 1

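I don't know whether this answer meets your requirements for tidiness, since it uses a plain old for loop. But it does the job, and in my opinion it is easy to read/understand and reasonably fast.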

library(tidyverse)
frame <- tribble(
 ~a, ~b, ~c,
 1,  1,  2,
 5,  4,  7,
 2,  3,  4, 
 3,  1,  6
)

key <- tribble(
 ~col, ~name, ~type, ~labels,
 1,   "a",   "f",     c("one", "two", "three", "four", "five"),
 2,   "b",   "f",     c("uno", "dos", "tres", "cuatro"),
 3,   "c",   "f",     1:7
)

for (i in 1:nrow(key)) {
 var <- key$name[[i]]
 x <- frame[[var]]
 labs <- key$labels[[i]]
 lvls <- seq_along(labs) # one level per label, so levels and labels always have the same length

 frame <- frame %>% mutate(!! var := factor(x, levels = lvls, labels = labs))
}

frame
#> # A tibble: 4 x 3
#>        a      b      c
#>   <fctr> <fctr> <fctr>
#> 1    one    uno      2
#> 2   five cuatro      7
#> 3    two   tres      4
#> 4  three    uno      6

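The typical tidy approach would be to reshape the data so that all variables sit in one column, apply a function to that column, and finally reshape it back to the original format. However, that does not work here, so we need to use other ways. Are factors even considered tidy?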
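For readers on dplyr 1.0.0 or later, a column-wise sketch with across() and cur_column() avoids both the reshaping and the explicit loop. This assumes those two functions are available in your installation. label_lookup and result are hypothetical names introduced only for this illustration.

```r
library(dplyr)
library(tibble)

frame <- tribble(
  ~a, ~b, ~c,
   1,  1,  2,
   5,  4,  7,
   2,  3,  4,
   3,  1,  6
)

key <- tribble(
  ~col, ~name, ~type, ~labels,
     1,   "a",   "f", c("one", "two", "three", "four", "five"),
     2,   "b",   "f", c("uno", "dos", "tres", "cuatro"),
     3,   "c",   "f", 1:7
)

# Hypothetical helper: a named list, so labels can be looked up by column name
label_lookup <- setNames(key$labels, key$name)

result <- frame %>%
  mutate(across(
    all_of(key$name),
    # cur_column() returns the name of the column currently being processed
    ~ factor(.x,
             levels = seq_along(label_lookup[[cur_column()]]),
             labels = label_lookup[[cur_column()]])
  ))

result
```

Because the lookup goes by column name, the order of the rows in key does not have to match the order of the columns in frame.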

Edit

I assumed the for loop would perform similarly to the map2 function, and I was wrong.

Here are some benchmarks:

library(microbenchmark)

frame1 <- frame
frame2 <- frame

microbenchmark(
 map2 = {
  frame1[key$col] <- map2(key$col, key$labels, 
                          function(x, y) factor(frame[[x]], 
                                                levels = 1:max(frame[[x]],
                                                               length(y)), 
                                                labels = y))
 },
 forloop = {
  for (i in 1:nrow(key)) {
   var <- key$name[[i]]
   x <- frame2[[var]]
   labs <- key$labels[[i]]
   lvls <- 1:max(length(x), length(labs))
   frame2 <- frame2 %>% mutate(!! var := factor(x, levels = lvls, labels = labs))
  }
 }
)

# Unit: microseconds
# expr         min         lq       mean    median         uq       max neval cld
# map2      375.53   416.5805   514.3126   450.825   484.2175  3601.636   100  a 
# forloop 11407.80 12110.0090 12816.6606 12564.176 13425.6840 16632.682   100   b

Answer 2

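I'm curious to see what other solutions come up for this. My only suggestion is to change the proposed solution slightly, so that it is clearer that frame is going to be modified in some way, rather than leaving it buried in the body of the function used by map2.

For example, pass frame as an additional argument in the call to map2: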

frame[key$col] <- map2(key$col, key$labels, 
                       function(x, y, z) factor(z[[x]], levels = 1:length(y), labels = y), 
                       frame)

Or do the same thing using the pipe operator %>%:

frame[key$col] <- frame %>%
  { map2(key$col, key$labels, 
         function(x, y, z) factor(z[[x]], levels = 1:length(y), labels = y), .) }
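The same idea of making the dependence on frame explicit can be taken one step further by wrapping the conversion in a function that takes both tables as arguments. This is only a minimal sketch: apply_key and recoded are hypothetical names, and it assumes that key$name holds the column names and that key$labels is a list column, as in the question.

```r
library(tibble)
library(purrr)

# Hypothetical helper: returns a copy of `data` in which the columns listed in
# key$name are converted to factors, using the matching entries of key$labels
apply_key <- function(data, key) {
  data[key$name] <- map2(
    data[key$name], key$labels,
    function(column, labels) {
      factor(column, levels = seq_along(labels), labels = labels)
    }
  )
  data
}

frame <- tribble(
  ~a, ~b, ~c,
   1,  1,  2,
   5,  4,  7,
   2,  3,  4,
   3,  1,  6
)

key <- tribble(
  ~col, ~name, ~type, ~labels,
     1,   "a",   "f", c("one", "two", "three", "four", "five"),
     2,   "b",   "f", c("uno", "dos", "tres", "cuatro"),
     3,   "c",   "f", 1:7
)

recoded <- apply_key(frame, key)
recoded
```

The original frame is left untouched, so the assignment that the question found inelegant happens once, inside the helper.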

Answer 3

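For this problem, you can use base R code (stringsAsFactors = TRUE is spelled out because it is no longer the default from R 4.0.0 on):

(A=`names<-`(data.frame(mapply(function(x,y)x[y],key$labels,frame),stringsAsFactors=TRUE),key$name))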
      a      b c
1   one    uno 2
2  five cuatro 7
3   two   tres 4
4 three    uno 6

 sapply(A,class)
   a        b        c 
"factor" "factor" "factor"
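One thing to keep in mind: as far as I can tell, indexing the labels by value gives factors whose levels are only the labels that actually occur in each column. If the full label set should be kept as levels, a base R sketch with Map() could look like this. frame_df and labels_by_column are hypothetical stand-ins for frame and key$labels.

```r
# Stand-in for `frame`, as a plain data.frame
frame_df <- data.frame(
  a = c(1, 5, 2, 3),
  b = c(1, 4, 3, 1),
  c = c(2, 7, 4, 6)
)

# Stand-in for key$labels, named by the column each entry belongs to
labels_by_column <- list(
  a = c("one", "two", "three", "four", "five"),
  b = c("uno", "dos", "tres", "cuatro"),
  c = 1:7
)

# Convert each listed column, declaring every label as a level
frame_df[names(labels_by_column)] <- Map(
  function(column, labs) factor(column, levels = seq_along(labs), labels = labs),
  frame_df[names(labels_by_column)],
  labels_by_column
)

levels(frame_df$a)
```

Here levels(frame_df$a) includes "four" even though the value 4 never occurs in column a.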

Answer 4

Here is another solution. I'm not sure how "elegant" it is. Hopefully someone can improve on it.

suppressPackageStartupMessages(library(tidyverse))

frame <- tribble(
  ~a, ~b, ~c,
  1,  1,  2,
  5,  4,  7,
  2,  3,  4, 
  3,  1,  6
)

key <- tribble(
  ~col, ~name, ~type, ~labels,
  1,   "a",   "f",     c("one", "two", "three", "four", "five"),
  2,   "b",   "f",     c("uno", "dos", "tres", "cuatro"),
  3,   "c",   "f",     1:7
)

colnames(frame) %>% 
  map(~ {
    factor(pull(frame, .x),
           levels = 1:length(pluck(key[key$name == .x, "labels"], 1, 1)),
           labels = pluck(key[key$name == .x, "labels"], 1, 1))
  }) %>% 
  set_names(colnames(frame)) %>% 
  as_tibble()
#> # A tibble: 4 x 3
#>        a      b      c
#>   <fctr> <fctr> <fctr>
#> 1    one    uno      2
#> 2   five cuatro      7
#> 3    two   tres      4
#> 4  three    uno      6
