Why does dplyr::distinct behave like this for grouped data frames

Question

My question concerns the distinct function in dplyr.

First, set up the data:

library(dplyr)

set.seed(0)

df <- data.frame(
    x = sample(10, 100, replace = TRUE),
    y = sample(10, 100, replace = TRUE)
)

Consider the following two uses of distinct.

df %>%
    group_by(x) %>%
    distinct()

df %>%
    group_by(x) %>%
    distinct(y)

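The first produces a different result from the second. As far as I can tell, the first set of operations finds "all distinct values of x, and returns the first value of y", whereas the second finds "for each value of x, all distinct values of y".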

Why is this so, when

df %>%
    distinct(x, y)

df %>% distinct()

produce the same result?

EDIT: It looks like this is already a known bug: https://github.com/hadley/dplyr/issues/1110

Answer 1

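As far as I can tell, the answer is that distinct takes the grouping columns into account when determining distinctness, which seems to me inconsistent with how the rest of dplyr works.

Thus: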

df %>%
    group_by(x) %>%
    distinct()

groups by x and finds the values that are distinct in x (!). This seems to be a bug.

However:

df %>%
    group_by(x) %>%
    distinct(y)

groups by x and finds the values that are distinct in y given x. This is equivalent to either of the following:

df %>%
    distinct(x, y)

df %>% distinct()

both of which find the distinct values in x and y.
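One way to check the "distinct in y given x" reading without relying on grouped distinct at all is to count the distinct y values per x in two independent ways. This is only a sketch, assuming dplyr is installed; the names per_x_summarise and per_x_distinct are illustrative and not part of the original post.

```r
library(dplyr)

set.seed(0)
df <- data.frame(
    x = sample(10, 100, replace = TRUE),
    y = sample(10, 100, replace = TRUE)
)

# Number of distinct y values within each x, computed directly.
per_x_summarise <- df %>%
    group_by(x) %>%
    summarise(n = n_distinct(y))

# The same quantity via ungrouped distinct(x, y): keep each (x, y)
# pair once, then count how many rows remain for each x.
per_x_distinct <- df %>%
    distinct(x, y) %>%
    count(x)

# Both results are ordered by x, so the counts can be compared directly.
all(per_x_summarise$n == per_x_distinct$n)
```

If the two counts agree, distinct(x, y) on the ungrouped data frame is doing the "for each x, all distinct y" job described above.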

The take-home message seems to be: don't combine grouping with distinct. Just pass the relevant column names as arguments to distinct.
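The advice above can be sketched as follows. This is a minimal check, assuming dplyr is installed; the names by_cols and by_all are illustrative, and base R's unique() is used here only as an independent reference, not as something the answer itself relies on.

```r
library(dplyr)

set.seed(0)
df <- data.frame(
    x = sample(10, 100, replace = TRUE),
    y = sample(10, 100, replace = TRUE)
)

# Name the columns explicitly instead of relying on group_by().
by_cols <- df %>% distinct(x, y)

# With no arguments, distinct() considers every column of the
# ungrouped data frame, which here means x and y.
by_all <- df %>% distinct()

# Both should keep as many rows as base R's unique() does.
nrow(by_cols) == nrow(unique(df))
nrow(by_all) == nrow(unique(df))

# No (x, y) pair should appear more than once in the result.
anyDuplicated(by_cols) == 0
```

Because neither call involves a grouped data frame, the result should not depend on how a particular dplyr version treats grouping columns in distinct.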

r

dplyr