Distinct in dplyr does not work (sometimes)

Question

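I have the following data frame, which I obtained from a count. I used dput to make the data frame shareable and then edited the output by hand, which is why there is a duplicate A.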

df <- structure(list(Procedure = structure(c(4L, 1L, 2L, 3L), .Label = c("A", "A", "C", "D", "-1"), 
                                         class = "factor"), n = c(10717L, 4412L, 2058L, 1480L)), 
              class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L), .Names = c("Procedure", "n"))

print(df)

# A tibble: 4 x 2
  Procedure     n
  <fct>     <int>
1 D         10717
2 A          4412
3 A          2058
4 C          1480

Now I want to apply distinct to Procedure and keep only the first A.

df %>% 
  distinct(Procedure, .keep_all=TRUE)

# A tibble: 4 x 2
  Procedure     n
  <fct>     <int>
1 D         10717
2 A          4412
3 A          2058
4 C          1480

It does not work. Strange...

Answer 1

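You have duplicated values in the labels argument, .Label = c("A", "A", "C", "D", "-1"). That is the problem: the two A rows point to two different levels that merely share a label. By the way, the way you initialize the tibble seems odd (I don't know exactly what your goal is, but still).

Why not use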


library(tibble)

# Plain character column, so there are no factor levels to duplicate
df <- tibble(
    Procedure = c("D", "A", "A", "C"),
    n = c(10717L, 4412L, 2058L, 1480L)
)

Answer 2

If we print the Procedure column, we can see that it has duplicated levels, which is a problem for the distinct function.

df$Procedure
[1] D A A C
Levels: A A C D -1
Warning message:
In print.factor(x) : duplicated level [2] in factor
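
To see why the two A rows are treated as different, it can help to look at the integer codes stored in the factor. The sketch below rebuilds only the Procedure column from the question (the variable name procedure is just for illustration). It assumes, without having checked dplyr's internals, that distinct compares those codes rather than the printed labels, which would be consistent with the output in the question.

```r
# Rebuild only the factor column from the question, duplicate "A" label included.
# `procedure` is a name chosen for this illustration.
procedure <- structure(c(4L, 1L, 2L, 3L),
                       .Label = c("A", "A", "C", "D", "-1"),
                       class = "factor")

# The two rows that print as "A" carry different integer codes (1 and 2).
as.integer(procedure)            # 4 1 2 3

# The second level is flagged as a duplicate of the first.
duplicated(levels(procedure))    # FALSE  TRUE FALSE FALSE FALSE

# Once converted to character, both rows are the same string "A".
as.character(procedure)          # "D" "A" "A" "C"
```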

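One way to fix this is to drop the duplicated factor level. We can use the factor function to rebuild the factor, which collapses the two identical labels into a single level. Another way is to convert the Procedure column to character.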

df <- structure(list(Procedure = structure(c(4L, 1L, 2L, 3L), .Label = c("A", "A", "C", "D", "-1"), 
                                           class = "factor"), n = c(10717L, 4412L, 2058L, 1480L)), 
                class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L), .Names = c("Procedure", "n"))


library(tidyverse)

df %>% 
  mutate(Procedure = factor(Procedure)) %>%
  distinct(Procedure, .keep_all=TRUE)
# # A tibble: 3 x 2
#   Procedure     n
#   <fct>     <int>
# 1 D         10717
# 2 A          4412
# 3 C          1480
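
The character-conversion route mentioned above is not shown in the original code, so here is a minimal sketch of it. It reuses the data frame from the question and only assumes that dplyr is installed. The name result is introduced here for illustration.

```r
library(dplyr)

# Same data frame as in the question, with the duplicated "A" label.
df <- structure(list(Procedure = structure(c(4L, 1L, 2L, 3L), .Label = c("A", "A", "C", "D", "-1"),
                                           class = "factor"), n = c(10717L, 4412L, 2058L, 1480L)),
                class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L), .Names = c("Procedure", "n"))

# Converting to character discards the levels entirely, so distinct()
# compares the strings themselves and both "A" rows become equal.
result <- df %>%
  mutate(Procedure = as.character(Procedure)) %>%
  distinct(Procedure, .keep_all = TRUE)

result
```

This should leave three rows (D, A, C) and keep the first A, as in the factor-based version, except that Procedure ends up as a character column instead of a factor.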

