Tried code in R with mutate_at and max() on my own data. Warning message: no non-missing arguments to max

Question

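I am currently learning R from a book and am trying out dplyr's mutate_at function. In this example, I want to rescale survey items to the range 0 to 1. To do that, we can divide each value by the (theoretical) maximum of the scale.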

The book's example, using stats_test from the "pradadata" package, works fine:

data(stats_test, package = "pradadata")
stats_test %>%
  drop_na() %>%
  mutate_at(.vars = vars(study_time, self_eval, interest),
            .funs = funs(prop = ./max(.))) %>%
  select(contains("_prop"))

Output:

study_time_prop self_eval_prop interest_prop
             <dbl>          <dbl>         <dbl>
 1             0.6            0.7         0.667
 2             0.8            0.8         0.833
 3             0.6            0.4         0.167
 4             0.8            0.7         0.833
 5             0.4            0.6         0.5  
 6             0.4            0.6         0.667
 7             0.8            0.6         0.5  
 8             0.2            0.7         0.667
 9             0.6            0.8         0.833
10             0.6            0.7         0.833
# ... with 1,617 more rows

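I tried the same code with my own data, but it does not work and I don't understand why. The variable RG04 in my data ranges from 1 to 5. I tried converting the variable from numeric to integer, because the variables in stats_test are also integers: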

df_literacy_2 <- transform(df_literacy, RG04 = as.integer(RG04))
df_literacy_2 <- tibble(df_literacy_2)


df_literacy_2 %>%
  drop_na() %>%
  mutate_at(.vars = vars(RG04),
            .funs = funs(prop = ./max(.))) %>%
  select(contains("_prop"))

Output:

# A tibble: 0 x 0
Warning messages:
1: Problem with `mutate()` input `prop`.
i no non-missing arguments to max; returning -Inf
i Input `prop` is `RG04/max(RG04)`. 
2: In base::max(x, ..., na.rm = na.rm) :
  no non-missing arguments to max; returning -Inf


str(df_literacy_2$RG04)
int [1:630] 2 4 2 1 2 2 1 3 1 3 ...

Why doesn't it work with my data?

Thanks for your help.

Edit with a sample of df_literacy:

> dput(head(df_literacy,20))
structure(list(CASE = c(40, 41, 44, 45, 48, 49, 54, 55, 56, 57, 
58, 61, 62, 63, 64, 65, 66, 67, 68, 69), SERIAL = c(NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA), REF = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA), QUESTNNR = c("base", "base", 
"base", "base", "base", "base", "base", "base", "base", "base", 
"base", "base", "base", "base", "base", "base", "base", "base", 
"base", "base"), MODE = c("interview", "interview", "interview", 
"interview", "interview", "interview", "interview", "interview", 
"interview", "interview", "interview", "interview", "interview", 
"interview", "interview", "interview", "interview", "interview", 
"interview", "interview"), STARTED = structure(c(1607290462, 
1607290608, 1607291086, 1607291118, 1607291265, 1607291793, 1607294071, 
1607294336, 1607294337, 1607294419, 1607294814, 1607296474, 1607301809, 
1607329348, 1607333933, 1607335996, 1607336207, 1607336378, 1607343194, 
1607343414), tzone = "UTC", class = c("POSIXct", "POSIXt")), 
    EI01 = structure(c(2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L), .Label = c("Ja", 
    "Nein", "Nicht beantwortet"), class = "factor"), EI02 = c(2, 
    2, 2, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 1, 1, 1, 1, 2, 3), 
    RF01 = c(4, 2, 4, 3, 4, 4, 1, 3, 2, 3, 4, 3, 2, 3, 2, 2, 
    4, 2, 5, 3), RF02 = c(1, 1, 1, 1, 2, 2, 1, 2, 1, 1, 2, 1, 
    1, 1, 2, 2, 2, 2, 2, 2), RF03 = c(1, 2, 2, 2, 1, 2, 1, 1, 
    1, 1, 2, 1, 1, 2, 2, 2, 1, 2, 1, 2), RG01 = c(2, 2, 2, 2, 
    2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2), RG02 = c(3, 
    3, 3, 3, 4, 3, 4, 2, 4, 2, 3, 4, 4, 2, 4, 3, 4, 3, 4, 4), 
    RG03 = c(3, 2, 2, 3, 3, 3, 1, 3, 1, 2, 3, 1, 2, 2, 1, 3, 
    2, 3, 2, 2), RG04 = c(2, 4, 2, 1, 2, 2, 1, 3, 1, 3, 2, 4, 
    1, 1, 1, 1, 1, 2, 4, 1), RG05 = c(1, 1, 1, 1, 1, 1, 1, 2, 
    1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1), SD01 = structure(c(2L, 
    1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 
    2L, 2L, 1L, 1L), .Label = c("weiblich", "männlich", "divers", 
    "nicht beantwortet"), class = "factor"), SD03 = c(4, 3, 2, 
    2, 1, 2, 4, 4, 1, 4, 3, 1, 2, 3, 2, 4, 2, 3, 1, 3), SD05_01 = c(23, 
    22, 22, 21, 18, 22, 21, 27, 17, 22, 17, 21, 21, 22, 50, 25, 
    23, 20, 23, 23), TIME001 = c(2, 3, 23, 73, 29, 2, 3, 3, 29, 7, 
    50, 55, 3, 2, 10, 2, 1, 5, 7, 35), TIME002 = c(2, 2, 16, 
    34, 12, 14, 2, 2, 21, 2, 30, 24, 21, 3, 3, 2, 3, 2, 3, 22
    ), TIME003 = c(34, 8, 12, 15, 13, 12, 12, 7, 13, 11, 16, 
    10, 11, 16, 8, 8, 7, 8, 11, 14), TIME004 = c(60, 33, 25, 
    31, 45, 25, 14, 13, 38, 35, 50, 50, 37, 32, 32, 25, 72, 55, 
    28, 29), TIME005 = c(84, 21, 29, 41, 54, 33, 30, 22, 32, 
    42, 44, 23, 65, 30, 28, 32, 51, 31, 27, 44), TIME006 = c(14, 
    9, 27, 11, 24, 8, 8, 9, 18, 12, 35, 33, 27, 46, 11, 15, 8, 
    14, 12, 14), TIME007 = c(3, 18, 3, 5, 6, 2, 9, 2, 3, 3, 6, 
    7, 3, 13, 4, 4, 378, 3, 4, 10), TIME_SUM = c(199, 94, 135, 
    142, 183, 96, 78, 58, 154, 112, 186, 152, 167, 142, 96, 88, 
    146, 118, 92, 168), MAILSENT = c(NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), 
    LASTDATA = structure(c(1607290661, 1607290702, 1607291221, 
    1607291328, 1607291448, 1607291889, 1607294149, 1607294394, 
    1607294491, 1607294531, 1607295045, 1607296676, 1607301976, 
    1607329490, 1607334030, 1607336084, 1607336727, 1607336496, 
    1607343286, 1607343582), tzone = "UTC", class = c("POSIXct", 
    "POSIXt")), FINISHED = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
    1, 1, 1, 1, 1, 1, 1, 1, 1), Q_VIEWER = c(0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), LASTPAGE = c(7, 
    7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7), 
    MAXPAGE = c(7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 
    7, 7, 7, 7, 7), MISSING = c(7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 
    7, 7, 7, 7, 7, 7, 0, 7, 7, 7), MISSREL = c(1, 1, 1, 1, 1, 
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1), TIME_RSI = c("46023", 
    "14246", "0.75", "0.63", "0.54", "12055", "17533", "30682", 
    "0.7", "44197", "0.45", "0.58", "0.83", "44378", "44501", 
    "18629", "46753", "46388", "44197", "0.57"), DEG_TIME = c(27, 
    27, 3, 1, 0, 23, 30, 42, 2, 17, 0, 2, 7, 18, 10, 27, 43, 
    18, 8, 0)), row.names = c(NA, -20L), class = c("tbl_df", 
"tbl", "data.frame"))

Edit with the TRUE and FALSE counts of is.na() per column:

> sapply(df_literacy, function(a) table(c(T,F,is.na(a)))-1)
      CASE SERIAL REF QUESTNNR MODE STARTED EI01 EI02 RF01 RF02 RF03 RG01 RG02 RG03 RG04 RG05 SD01 SD03 SD05_01 TE03_01 TIME001 TIME002 TIME003
FALSE  630      0   0      630  630     630  630  630  630  630  630  630  630  630  630  630  629  629     615      99     630     630     630
TRUE     0    630 630        0    0       0    0    0    0    0    0    0    0    0    0    0    1    1      15     531       0       0       0
      TIME004 TIME005 TIME006 TIME007 TIME_SUM MAILSENT LASTDATA FINISHED Q_VIEWER LASTPAGE MAXPAGE MISSING MISSREL TIME_RSI DEG_TIME
FALSE     630     630     629     625      630        0      630      630      630      630     630     630     630      630      630
TRUE        0       0       1       5        0      630        0        0        0        0       0       0       0        0        0

Answer 1

There are a few things to correct here.

drop_na() is removing all of your data.

drop_na(df_literacy)
# # A tibble: 0 x 37
# # ... with 37 variables: CASE <dbl>, SERIAL <lgl>, REF <lgl>, QUESTNNR <chr>,
# #   MODE <chr>, STARTED <dttm>, EI01 <fct>, EI02 <dbl>, RF01 <dbl>, RF02 <dbl>,
# #   RF03 <dbl>, RG01 <dbl>, RG02 <dbl>, RG03 <dbl>, RG04 <dbl>, RG05 <dbl>,
# #   SD01 <fct>, SD03 <dbl>, SD05_01 <dbl>, TIME001 <dbl>, TIME002 <dbl>,
# #   TIME003 <dbl>, TIME004 <dbl>, TIME005 <dbl>, TIME006 <dbl>, TIME007 <dbl>,
# #   TIME_SUM <dbl>, MAILSENT <lgl>, LASTDATA <dttm>, FINISHED <dbl>,
# #   Q_VIEWER <dbl>, LASTPAGE <dbl>, MAXPAGE <dbl>, MISSING <dbl>,
# #   MISSREL <dbl>, TIME_RSI <chr>, DEG_TIME <dbl>

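The problem is that you have several columns that are entirely NA, namely SERIAL, REF, and MAILSENT. Called without arguments, drop_na() drops every row that has an NA in any column, so no rows survive.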

sapply(df_literacy, function(a) table(c(T,F,is.na(a)))-1)
#       CASE SERIAL REF QUESTNNR MODE STARTED EI01 EI02 RF01 RF02 RF03 RG01 RG02
# FALSE   20      0   0       20   20      20   20   20   20   20   20   20   20
# TRUE     0     20  20        0    0       0    0    0    0    0    0    0    0
#       RG03 RG04 RG05 SD01 SD03 SD05_01 TIME001 TIME002 TIME003 TIME004 TIME005
# FALSE   20   20   20   20   20      20      20      20      20      20      20
# TRUE     0    0    0    0    0       0       0       0       0       0       0
#       TIME006 TIME007 TIME_SUM MAILSENT LASTDATA FINISHED Q_VIEWER LASTPAGE
# FALSE      20      20       20        0       20       20       20       20
# TRUE        0       0        0       20        0        0        0        0
#       MAXPAGE MISSING MISSREL TIME_RSI DEG_TIME
# FALSE      20      20      20       20       20
# TRUE        0       0       0        0        0
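The effect can be reproduced on a small scale. The following is a minimal sketch with a hypothetical toy tibble (`toy`, `score`, and `empty` are made-up names, not part of the asker's data): one all-NA column is enough for drop_na() to return zero rows, and max() on the resulting empty vector is what produces the "no non-missing arguments" warning.

```r
library(tibble)
library(tidyr)

# Hypothetical toy data: `score` is complete, `empty` is entirely NA.
toy <- tibble(score = c(2, 4, 1), empty = c(NA, NA, NA))

# With no column arguments, drop_na() checks every column,
# so every row is dropped because of `empty`.
n_all <- nrow(drop_na(toy))

# Restricting drop_na() to the column of interest keeps all rows.
n_score <- nrow(drop_na(toy, score))

# max() of a zero-length vector warns and returns -Inf;
# the warning is suppressed here only to keep the example quiet.
empty_max <- suppressWarnings(max(drop_na(toy)$score))

n_all
n_score
empty_max
```

Here `n_all` is 0, `n_score` is 3, and `empty_max` is `-Inf`, which mirrors the empty tibble and the warning in the question.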

Remove drop_na(), or at least use drop_na(-SERIAL, -REF, -MAILSENT).
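If the all-NA columns are not known in advance, one possible variation is to drop them programmatically before calling drop_na(). This is only a sketch on hypothetical toy data (`toy`, `id`, `score`, `empty` are made up), and it assumes a dplyr/tidyselect version recent enough to provide where() (dplyr 1.0.0 or later).

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy data: `empty` is entirely NA, `score` has one real gap.
toy <- tibble(id = 1:3, score = c(2, NA, 1), empty = c(NA, NA, NA))

cleaned <- toy %>%
  select(where(~ !all(is.na(.x)))) %>%  # drop columns that are entirely NA
  drop_na()                             # then drop only rows with a real gap

names(cleaned)
nrow(cleaned)
```

In this toy case the `empty` column is removed first, so only the single row with a missing `score` is dropped.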

Your code is using funs(), which has been deprecated since dplyr 0.8.0.

# Warning: `funs()` is deprecated as of dplyr 0.8.0.
# Please use a list of either functions or lambdas: 
#   # Simple named list: 
#   list(mean = mean, median = median)
#   # Auto named with `tibble::lst()`: 
#   tibble::lst(mean, median)
#   # Using lambdas
#   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))

While this does not cause an error, it does cause a warning (and it will likely stop working at some point). Change your mutate_at to:

  mutate_at(.vars = vars(RG04, RF02),
            .funs = list(prop = ~ . / max(.)))

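You are using one variable in .vars and one function in .funs, so the function name is not appended to the variable name: the new column is called just prop (the warning in the question refers to it as input `prop`), and select(contains("_prop")) finds no _prop column. From ?mutate_at: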

     The names of the new columns are derived from the names of the
     input variables and the names of the functions.

        • if there is only one unnamed function (i.e. if '.funs' is an
          unnamed list of length one), the names of the input variables
          are used to name the new columns;

        • for _at functions, if there is only one unnamed variable
          (i.e., if '.vars' is of the form 'vars(a_single_column)') and
          '.funs' has length greater than one, the names of the
          functions are used to name the new columns;

        • otherwise, the new names are created by concatenating the
          names of the input variables and the names of the functions,
          separated with an underscore '"_"'.

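If you don't intend to add more variables or functions, you need to name the variable yourself in the call, as in mutate_at(.vars = vars(RG04 = RG04), ...). Oddly, this causes it to produce RG04_prop.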

If we fix all of those issues, then it works.

df_literacy %>%
  drop_na(-SERIAL, -REF, -MAILSENT) %>%
  mutate_at(.vars = vars(RG04 = RG04),
            .funs = list(prop = ~ ./max(.))) %>%
  select(contains("_prop")) %>%
  head(3)
# A tibble: 3 x 1
#   RG04_prop
#       <dbl>
# 1       0.5
# 2       1  
# 3       0.5
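As an alternative to the scoped mutate_at(), the same rescaling can be sketched with across(), which is available from dplyr 1.0.0 and lets the output names be spelled out explicitly through `.names`. The toy tibble below is hypothetical and only borrows the column names RG04 and RF02 for illustration; it is not the asker's data.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for two survey items.
toy <- tibble(RG04 = c(2, 4, 1), RF02 = c(1, 2, 2))

toy_prop <- toy %>%
  # Divide each selected column by its observed maximum and
  # name the results "<column>_prop" regardless of how many columns are selected.
  mutate(across(c(RG04, RF02), ~ .x / max(.x), .names = "{.col}_prop")) %>%
  select(ends_with("_prop"))

names(toy_prop)
toy_prop$RG04_prop
```

Because the naming pattern is explicit, the `_prop` suffix is applied even when only a single column is selected, which avoids the naming surprise described above. Note that max(.x) is the observed maximum; to divide by the theoretical scale maximum mentioned in the question, a constant such as 5 could be used instead.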


r

max

dplyr