Issue with case_when statement using & in dplyr?

Question

I am trying to create an additional column in my dataset that buckets percentiles. Ideally, I would write logic like this:

CASE 
  WHEN  percentile >=  75 AND  percentile < 90  THEN "75%-89% Percentile"
  WHEN percentile >=  50 AND  percentile < 75  THEN "50%-75% Percentile"

END

The dplyr code I tried is:

  mutate(Bucket = case_when(as.double(percentile) >= 90 ~ "90%-100% Percentile",
                            as.double(percentile) >=  75 & as.double(percentile) < 90  ~ "75%-89% Percentile",
                            as.double(percentile) <  75 & as.double(percentile) >= 50  ~ "50%-75% Percentile",
                            as.double(percentile) <  50 & as.double(percentile) >= 25  ~ "25%-50% Percentile",
                            as.double(percentile) <  25 & as.double(percentile) >= 0  ~ "0%-25% Percentile"))

But it does not bucket correctly; see the sample results in the screenshot. The bucket flag for these percentiles should be "75%-89% Percentile":

Answer 1

The percentile column is a factor. We need to convert it to character first and then to numeric:

library(dplyr)
df1 %>%
    # go through character so the real values are recovered, not the level codes
    mutate(percentile = as.numeric(as.character(percentile))) %>%
    ...

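What happens is that when a factor is coerced directly to numeric/integer, the result is the integer codes used for storage (each value's position among the levels), not the actual values: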

v1 <- factor(c(81.9, 82.7, 81.9, 82.5))
as.numeric(v1)
#[1] 1 3 1 2

which differs from

as.numeric(as.character(v1))
#[1] 81.9 82.7 81.9 82.5
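
To see where `1 3 1 2` comes from, it may help to look at the factor's levels next to its integer codes. This is only a small illustration that reuses `v1` from above; the levels are the sorted unique values stored as strings, and each element is an index into them.

```r
v1 <- factor(c(81.9, 82.7, 81.9, 82.5))

# The levels are the sorted unique values, kept as character strings
levels(v1)
#[1] "81.9" "82.5" "82.7"

# Each element is stored as an integer position in levels(v1)
as.integer(v1)
#[1] 1 3 1 2
```

So 82.7 becomes 3 because "82.7" is the third level, which is why comparisons such as `>= 75` fail on the coerced factor.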

Alternatively, index the levels by the factor, which may be faster:

as.numeric(levels(v1)[v1])
#[1] 81.9 82.7 81.9 82.5
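
Putting the pieces together, here is a minimal end-to-end sketch. The data frame `df1` and its values are made up for illustration (the original data is not shown in the question); the conditions are the ones from the question, only without the repeated `as.double()` calls, since the column is converted once up front.

```r
library(dplyr)

# Hypothetical data: percentile stored as a factor, as in the question
df1 <- data.frame(percentile = factor(c(81.9, 82.7, 95, 60.5, 30, 10)))

df1 <- df1 %>%
  mutate(
    # Recover the actual values before doing any comparison
    percentile = as.numeric(as.character(percentile)),
    # The & conditions behave as expected once the column is numeric
    Bucket = case_when(
      percentile >= 90                   ~ "90%-100% Percentile",
      percentile >= 75 & percentile < 90 ~ "75%-89% Percentile",
      percentile <  75 & percentile >= 50 ~ "50%-75% Percentile",
      percentile <  50 & percentile >= 25 ~ "25%-50% Percentile",
      percentile <  25 & percentile >= 0  ~ "0%-25% Percentile"
    )
  )

df1
```

With this conversion in place, 81.9 and 82.7 both land in "75%-89% Percentile", so the `&` in `case_when` was not the source of the problem.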
